Understanding Python's "for" statement

Fredrik Lundh | November 2006 | Originally posted to online.effbot.org

One of the things I noticed when skimming through the various reactions to my recent “with”-article is that some people seem to have a somewhat fuzzy understanding of Python’s other block statement, the good old for-in loop statement. The with statement didn’t introduce code blocks in Python; they’ve always been there. To rectify this, for-in probably deserves its own article, so here we go (but be warned that the following is a bit rough; I reserve the right to tweak it a little over the next few days).

On the surface, Python’s for-in statement is taken straight from Python’s predecessor ABC, where it’s described as:

   FOR name,... IN train:
       commands
           Take each element of train in turn

In ABC, what Python calls statements are known as commands, and sequences are known as trains. (The whole language is like that, by the way; lots of common mechanisms described using less-common names. Maybe they thought that renaming everything would make it easier for people to pick up the subtle details of the language, instead of assuming that everything worked exactly as in other seemingly similar languages, or maybe it only makes sense if you’re Dutch.)

Anyway, to take each element (item) from a train (sequence) in turn, we can simply do (using a pseudo-Python syntax):

name = train[0]
do something with name
name = train[1]
do something with name
name = train[2]
do something with name
... etc ...

and keep doing that until we run out of items. When we do, we’ll get an IndexError exception, which tells us that it’s time to stop.

And in its simplest and original form, this is exactly what the for-in statement does; when you write

for name in train:
    do something with name

the interpreter will simply fetch train[0] and assign it to name, and then execute the code block. It’ll then fetch train[1], train[2], and so on, until it gets an IndexError.
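
To make the fetch-until-IndexError behaviour concrete, here is a minimal sketch that emulates the original protocol by hand. The helper name for_each and the idea of passing the loop body as a function are made up for this illustration; the interpreter does all of this internally.

```python
def for_each(train, body):
    """Emulate the original for-in protocol using plain indexing."""
    index = 0  # the hidden counter kept by the interpreter
    while True:
        try:
            name = train[index]  # same as train.__getitem__(index)
        except IndexError:
            break  # ran out of items, so the loop ends
        body(name)  # the "do something with name" block
        index = index + 1

seen = []
for_each(["a", "b", "c"], seen.append)
print(seen)  # ['a', 'b', 'c']
```

Note that only the item fetch sits inside the try block, so an IndexError raised by the body itself is not mistaken for the end of the train.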

The code inside the for-in loop is executed in the same scope as the surrounding code; in the following example:

train = 1, 2, 3
for name in train:
    value = name * 10
    print value

the variables train, name, and value all live in the same namespace.
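
One way to see this is to look at the names after the loop has finished. This small sketch reuses the example above, minus the print statement inside the loop:

```python
train = 1, 2, 3
for name in train:
    value = name * 10

# Both names are still bound after the loop, holding the last values assigned.
print(name)   # 3
print(value)  # 30
```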

This is pretty straightforward, of course, but it immediately gets a bit more interesting once you realize that you can use custom objects as trains. Just implement the __getitem__ method, and you can control how the loop behaves. The following code:

class MyTrain:
    def __getitem__(self, index):
        if not condition:
            raise IndexError("that's enough!")
        value = fetch item identified by index
        return value # hand control back to the block

for name in MyTrain():
    do something with name

will run the loop as long as the given condition is true, with values provided by the custom train. In other words, the do something part is turned into a block of code that’s being executed under the control of the custom sequence object. The above is equivalent to:

index = 0
while True: # run forever
    if not condition:
        break
    name = fetch item identified by index
    do something with name
    index = index + 1

except that index is a hidden variable, and the controlling code is placed in a separate object.
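
As a runnable variant of the pseudo-code above, here is a sketch where the condition is simply "the index is still below a limit". The Countdown class and its start argument are invented for this example.

```python
class Countdown:
    """Hypothetical train that produces start, start-1, ..., 1."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        if index >= self.start:  # the condition is no longer true
            raise IndexError("that's enough!")
        return self.start - index  # hand control back to the block

result = []
for name in Countdown(3):
    result.append(name)
print(result)  # [3, 2, 1]
```

The class never stores any items; each value is calculated when the loop asks for it.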

You can use this mechanism for everything from generating sequence elements on the fly (like xrange):

class MySequence:
   def __getitem__(self, index):
       if index > 10:
          raise IndexError("that's enough!")
       return index * 10 # returns 0, 10, 20, ..., 100

and fetching data from an external source:

class MyTable:
   def __getitem__(self, index):
       value = fetch item index from database table
       if value not found:
           raise IndexError("not found")
       return value

or from a stream:

class MyFileIterator:
   def __getitem__(self, index):
      text = get next line from file
      if end of file:
          raise IndexError("end of file")
      return text

to fetching data from some other source:

class MyEventSource:
   def __getitem__(self, index):
      event = get next event
      if event == terminate:
          raise IndexError
      return event

for event in MyEventSource():
   process event

It’s more explicit in the latter examples, but in all these examples, the code in __getitem__ is basically treating the block of code inside the for-in loop as an in-lined callback.

Also note how the last two examples don’t even bother to look at the index; they just keep calling the for-in block until they run out of data. Or, less obviously, until they run out of bits in the internal index variable.
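
Here is a runnable sketch of an object that ignores the index completely. The EventSource class, the deque of pending events, and the "terminate" marker string are all assumptions made for this illustration.

```python
from collections import deque

class EventSource:
    """Hypothetical event source that hands out queued events in order."""

    def __init__(self, events):
        self.pending = deque(events)

    def __getitem__(self, index):
        # The index is ignored; we just hand out whatever comes next.
        if not self.pending:
            raise IndexError
        event = self.pending.popleft()
        if event == "terminate":
            raise IndexError
        return event

processed = []
for event in EventSource(["open", "click", "terminate", "never seen"]):
    processed.append(event)
print(processed)  # ['open', 'click']
```

Asking such an object for item 0 twice gives two different answers, which is exactly the kind of thing that looks like a sequence but doesn’t behave like one.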

To deal with this limit on the index, and also to avoid the issue of having objects that look a lot like sequences but don’t support random access, the for-in statement was redesigned in Python 2.2. Instead of using the __getitem__ interface, for-in now starts by looking for an __iter__ hook. If present, this method is called, and the resulting object is then used to fetch items, one by one. This new protocol behaves like this:

obj = train.__iter__()
name = obj.next()
do something with name
name = obj.next()
do something with name
...

where obj is an internal variable, and the next method indicates end of data by raising the StopIteration exception, instead of IndexError. Using a custom object can look something like:

class MyTrain:
    def __iter__(self):
        return self
    def next(self):
        if not condition:
            raise StopIteration
        value = calculate next value
        return value # hand control over to the block

for name in MyTrain():
    do something with name

(Here, the MyTrain object returns itself, which means that the for-in statement will call MyTrain’s own next method to do the actual work. In some cases, it makes more sense to use an independent object for the iteration.)
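
A minimal sketch of the independent-object variant could look like this. The Train and TrainIterator names are made up, and the __next__ alias is only there because Python 3 spells the next method that way; with the alias, the code should run under both spellings.

```python
class TrainIterator:
    """Keeps track of the position for one loop over a Train."""

    def __init__(self, items):
        self.items = items
        self.position = 0

    def __iter__(self):
        return self

    def next(self):
        if self.position >= len(self.items):
            raise StopIteration
        value = self.items[self.position]
        self.position += 1
        return value

    __next__ = next  # Python 3 spelling of the same method

class Train:
    def __init__(self, items):
        self.items = items

    def __iter__(self):
        return TrainIterator(self.items)  # a fresh iterator for every loop

train = Train(["a", "b"])
pairs = [(x, y) for x in train for y in train]
print(pairs)  # [('a', 'a'), ('a', 'b'), ('b', 'a'), ('b', 'b')]
```

Since every loop gets its own position counter, nested loops over the same train don’t interfere with each other. An object that returns itself from __iter__ would have been used up by the inner loop.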

Using this mechanism, we can now rewrite the file iterator from above as:

class MyFileIterator:
    def __iter__(self):
        return self # use myself
    def next(self):
        text = get next line from file
        if end of file:
            raise StopIteration()
        return text

and, with very little work, get an object that doesn’t support normal indexing, and doesn’t break down if used on a file with more than 2 billion lines.
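
For the curious, here is a runnable sketch of the same idea. The LineIterator name is made up, an in-memory io.StringIO object stands in for a real file, and the __next__ alias is there for Python 3, which uses that name for the next method.

```python
import io

class LineIterator:
    """Hypothetical line iterator wrapping any object with a readline method."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        return self  # use myself

    def next(self):
        text = self.stream.readline()
        if not text:  # readline returns an empty string at end of file
            raise StopIteration
        return text

    __next__ = next  # Python 3 spelling of the same method

stream = io.StringIO(u"first\nsecond\n")  # stand-in for a real file
lines = [line for line in LineIterator(stream)]
print(lines)
```

There is no index anywhere in this class, so there is nothing that can overflow.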

But what about ordinary sequences, you ask? That’s of course easily handled by a wrapper object that keeps an internal counter, and maps next calls to __getitem__ calls, in exactly the same way as the original for-in statement did. Python provides a standard implementation of such an object, available through the built-in iter function, which is used automatically if __iter__ doesn’t exist.
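
Such a wrapper could be sketched as follows. This is an illustration of the idea, not the interpreter’s actual implementation; the SequenceIterator and OldStyleTrain names are made up, and the __next__ alias is there for Python 3.

```python
class SequenceIterator:
    """Sketch of a wrapper that maps next calls to __getitem__ calls."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.index = 0  # the internal counter

    def __iter__(self):
        return self

    def next(self):
        try:
            value = self.sequence[self.index]
        except IndexError:
            raise StopIteration  # translate the old signal into the new one
        self.index += 1
        return value

    __next__ = next  # Python 3 spelling of the same method

class OldStyleTrain:
    """Only implements __getitem__, like the early examples."""

    def __getitem__(self, index):
        if index > 2:
            raise IndexError
        return index * 10

wrapped = list(SequenceIterator(OldStyleTrain()))
builtin = list(iter(OldStyleTrain()))
print(wrapped)  # [0, 10, 20]
print(builtin)  # [0, 10, 20]
```

The hand-written wrapper and the built-in iter function produce the same items for this object.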

This wasn’t very difficult, was it?


Footnote: In Python 2.2 and later, several non-sequence objects have been extended to support the new protocol. For example, you can loop over both text files and dictionaries; the former return lines of text, the latter dictionary keys.

for line in open("file.txt"):
    do something with line

for key in my_dict:
    do something with key
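
Here is a small runnable version of the two loops. To keep it self-contained, an in-memory io.StringIO object stands in for open("file.txt"), and the dictionary contents are made up.

```python
import io

my_dict = {"spam": 1}
keys = [key for key in my_dict]  # looping over a dictionary gives its keys
print(keys)  # ['spam']

stream = io.StringIO(u"one\ntwo\n")  # stand-in for open("file.txt")
lines = [line for line in stream]  # each item is a line, newline included
print(lines)
```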