Incremental Parsing

The iterparse function builds an element tree with data from a file or a file-like object, but also lets you inspect the tree during the build process.

This is similar to parsing a file and then iterating over it with iter (or getiterator, in 1.2 and earlier), but the parsing and the iteration are done in parallel. This has several advantages: you can, for example, remove parts of the tree that you don’t need, stop parsing when you find what you’re looking for, or just gain a little performance by parsing XML from a remote site as it arrives over the wire.

For example, here’s how to print all item links from an RSS 2.0 file:

# ElementTree ships in the standard library as xml.etree.ElementTree
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("blog.rss"):
    if elem.tag == "item":
        print(repr(elem.findtext("link")))
        elem.clear() # won't need this again

And here’s how to stop parsing as soon as the first title has been found:

for event, elem in ET.iterparse("blog.rss"):
    if elem.tag == "title":
        print(repr(elem.text))
        break # we're done
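
The two loops above can be tried without a feed on disk. The following is a minimal sketch that assumes a small in-memory document; the RSS constant and the helper names item_links and first_title are made up for this illustration and are not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed contents, invented for this illustration.
RSS = b"""<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def item_links(source):
    """Collect the <link> text of every <item>, clearing items as we go."""
    links = []
    for event, elem in ET.iterparse(source):  # only "end" events by default
        if elem.tag == "item":
            links.append(elem.findtext("link"))
            elem.clear()  # won't need this again
    return links

def first_title(source):
    """Return the text of the first <title> that is completed, then stop."""
    for event, elem in ET.iterparse(source):
        if elem.tag == "title":
            return elem.text  # leaving the loop abandons the rest of the parse
    return None

links = item_links(io.BytesIO(RSS))
title = first_title(io.BytesIO(RSS))
print(links)
print(title)
```

Note that the first title to be completed is the channel title, since it closes before any item does.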

The events option specifies which events you want to see (available events in this release are “start”, “end”, “start-ns”, and “end-ns”, where the “ns” events are used to get detailed namespace information). If the option is omitted, only “end” events are returned.
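
To see what the events option changes, the sketch below records the (event, tag) pairs for a tiny document, once with both “start” and “end” requested and once with the default. The document and the variable names are invented for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
XML = b"<doc><a><b/></a><c/></doc>"

# Ask for both "start" and "end" events and record the order they arrive in.
trace = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(XML), events=("start", "end"))
]

# With the option omitted, only "end" events are reported.
default_tags = [elem.tag for event, elem in ET.iterparse(io.BytesIO(XML))]

for entry in trace:
    print(entry)
print(default_tags)
```

With the default setting, an element is reported only once it is complete, so children always show up before their parent.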

Note: The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a “start” event for an element, the builder may already have filled that element with content. You cannot rely on this, though; a “start” event can only be used to inspect the attributes, not the element content. For more details, see this message.
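
One way to stay on the safe side of this rule is to read attributes at “start” and text at “end”. The following is a sketch under that assumption; the log document and the variable names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
XML = (b'<log><entry level="warn">disk almost full</entry>'
       b'<entry level="info">backup done</entry></log>')

levels = []    # filled from "start" events: attributes only
messages = []  # filled from "end" events: element content

for event, elem in ET.iterparse(io.BytesIO(XML), events=("start", "end")):
    if elem.tag != "entry":
        continue
    if event == "start":
        # Attributes are available as soon as the start tag has been seen.
        levels.append(elem.get("level"))
    else:
        # The text is only guaranteed to be complete at the "end" event.
        messages.append(elem.text)

print(list(zip(levels, messages)))
```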

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()

The above pattern has one drawback: it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()

(Future releases will make it easier to access the root element from within the loop.)
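
The difference between the two clearing patterns can be checked by looking at how many children the root still holds once parsing is done. This is a rough sketch on a three-record document; the data and the function names clear_elements_only and clear_root_too are invented for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data: three small <record> elements under one root.
XML = (
    "<data>"
    + "".join("<record><id>%d</id></record>" % i for i in range(3))
    + "</data>"
).encode("ascii")

def clear_elements_only(source):
    """Clear each record after use; the root keeps the empty shells."""
    ids = []
    root = None
    for event, elem in ET.iterparse(source):
        if elem.tag == "record":
            ids.append(elem.findtext("id"))
            elem.clear()
        root = elem  # the last "end" event belongs to the root element
    return ids, root

def clear_root_too(source):
    """Grab the root from the first "start" event and clear it as we go."""
    ids = []
    context = iter(ET.iterparse(source, events=("start", "end")))
    event, root = next(context)  # the first event is the root's "start"
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            ids.append(elem.findtext("id"))
            root.clear()  # drop the references to processed records
    return ids, root

ids_a, root_a = clear_elements_only(io.BytesIO(XML))
ids_b, root_b = clear_root_too(io.BytesIO(XML))
print(len(root_a), len(root_b))
```

Both functions see the same records; only the first leaves one empty child per record attached to the root.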

Namespace Events

The namespace events contain information about namespace scopes in the source document. This can be used to keep track of active namespace prefixes, which are otherwise discarded by the parser. Here’s how you can emulate the namespaces attribute in the old FancyTreeBuilder class:

events = ("end", "start-ns", "end-ns")
namespaces = []
for event, elem in iterparse(source, events=events):
    if event == "start-ns":
        namespaces.insert(0, elem)
    elif event == "end-ns":
        namespaces.pop(0)
    else:
        ...

The namespaces variable in this example will contain a stack of (prefix, uri) tuples.

(Note how iterparse lets you replace instance variables with local variables. The code is not only easier to write, it is also a lot more efficient.)

For better performance, you can append and remove items at the right end of the list instead, and loop backwards when looking for prefix mappings:

events = ("end", "start-ns", "end-ns")
namespaces = []
for event, elem in iterparse(source, events=events):
    if event == "start-ns":
        namespaces.append(elem)
    elif event == "end-ns":
        namespaces.pop(-1)
    else:
        ...
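
The backwards lookup mentioned above might look something like the following sketch. The lookup helper, the sample document, and the bound dictionary are assumptions made for this illustration; they are not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document in which the prefix "a" is rebound in a nested scope.
XML = b"""<root xmlns:a="http://example.com/outer">
  <inner xmlns:a="http://example.com/inner" xmlns:b="http://example.com/b">
    <a:leaf/>
  </inner>
  <a:tail/>
</root>"""

def lookup(namespaces, prefix):
    """Return the URI currently bound to prefix, newest mapping first."""
    for known_prefix, uri in reversed(namespaces):
        if known_prefix == prefix:
            return uri
    return None

namespaces = []  # stack of (prefix, uri) tuples, innermost scope last
bound = {}       # tag -> URI that prefix "a" mapped to when the element ended

events = ("end", "start-ns", "end-ns")
for event, elem in ET.iterparse(io.BytesIO(XML), events=events):
    if event == "start-ns":
        namespaces.append(elem)  # elem is a (prefix, uri) tuple here
    elif event == "end-ns":
        namespaces.pop()         # elem is None here
    else:
        bound[elem.tag] = lookup(namespaces, "a")

for tag, uri in bound.items():
    print(tag, uri)
```

Searching from the right end means an inner declaration shadows an outer one for the same prefix, which matches XML namespace scoping.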
 
