Element Library Functions

March 22, 2004 | Fredrik Lundh

ElementTree 1.3 may include a new module, ElementLib, with a number of convenient helper functions.

The exact contents are yet to be determined; here are some of the current proposals (from various sources, in no specific order):

Helpers to add subelements, with a nicer syntax.

Wrappers to access elements via attributes (templating).

copy #

copy/deepcopy: Copy element structures. [the copy module is supposed to work /F]
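
As a quick check of the note above, here is a minimal sketch using the standard copy module. It assumes the xml.etree.ElementTree module bundled with current Python rather than the standalone elementtree package, and the document content is made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><item id='1'>text</item></doc>")

shallow = copy.copy(root)      # new root element, same child objects
deep = copy.deepcopy(root)     # fully independent tree

# a change to the original child shows through the shallow copy only
root[0].set("id", "2")
print(shallow[0].get("id"))    # child is shared with root
print(deep[0].get("id"))       # child is an independent copy
```

A shallow copy is enough when only the top-level element is going to change; use deepcopy when the whole structure must be independent.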

flatten #

flatten: Recursively extract text content.

def flatten(elem, include_tail=0):
    text = elem.text or ""
    for e in elem:
        text += flatten(e, 1)
    if include_tail and elem.tail: text += elem.tail
    return text

To remove all subelements of a given element and keep just the text, you can do:

elem.text = flatten(elem); del elem[:]
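
A small usage sketch of the two steps above, assuming the xml.etree.ElementTree module from the standard library. The flatten function is repeated so the snippet runs on its own; the sample paragraph is invented.

```python
import xml.etree.ElementTree as ET

def flatten(elem, include_tail=0):
    # same function as above, repeated so the snippet is self-contained
    text = elem.text or ""
    for e in elem:
        text += flatten(e, 1)
    if include_tail and elem.tail:
        text += elem.tail
    return text

para = ET.fromstring("<p>Hello <b>big</b> world</p>")
flat = flatten(para)
print(flat)                    # Hello big world

# keep the text, drop the subelements
para.text = flat
del para[:]
print(ET.tostring(para, encoding="unicode"))   # <p>Hello big world</p>
```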

append #

append: Like elem.append, but accepts either an element or a string. A string is added to the tail of the last subelement, or to the element's own text if it has no subelements.

def append(elem, item):
    if isinstance(item, str):  # use basestring on Python 2
        if len(elem):
            elem[-1].tail = (elem[-1].tail or "") + item
        else:
            elem.text = (elem.text or "") + item
    else:
        elem.append(item)
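
To see where each string ends up, here is a short usage sketch. It assumes Python 3 (so str is used for the string check) and the standard-library xml.etree.ElementTree; the element names are illustrative only.

```python
import xml.etree.ElementTree as ET

def append(elem, item):
    # same logic as above, using str for the string check (Python 3)
    if isinstance(item, str):
        if len(elem):
            elem[-1].tail = (elem[-1].tail or "") + item
        else:
            elem.text = (elem.text or "") + item
    else:
        elem.append(item)

p = ET.Element("p")
append(p, "Hello ")            # no subelements yet: goes into p.text
b = ET.Element("b")
b.text = "big"
append(p, b)                   # an element: appended as usual
append(p, " world")            # goes into the tail of <b>

print(ET.tostring(p, encoding="unicode"))   # <p>Hello <b>big</b> world</p>
```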

walk #

walk: A generator that walks a tree in depth-first order. I think this is the same as “getiterator” but the docs are confusing. [the docs say “document order”, which is the order elements are stored in an XML document. same as depth-first, in other words /F]

reverse_walk: Like walk but in the reverse order.

walkaround: Walks around the outside of a tree. Each non-terminal node is visited twice. Each node should have an attribute whose value can be one of NONE, DONE, FIRST, SECOND, and LEAF.
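
A possible sketch of walk and reverse_walk as plain generators, assuming Python 3 and the standard-library xml.etree.ElementTree. Reading "reverse order" as the exact reverse of document order is an assumption; the proposal does not spell it out. walkaround is left out because its state attribute is not specified in enough detail here.

```python
import xml.etree.ElementTree as ET

def walk(elem):
    # depth-first, document order: the element first, then each subtree
    yield elem
    for child in elem:
        yield from walk(child)

def reverse_walk(elem):
    # assumed meaning: exact reverse of walk (last subtree first, element last)
    for child in reversed(list(elem)):
        yield from reverse_walk(child)
    yield elem

tree = ET.fromstring("<a><b><c/></b><d/></a>")
forward = [e.tag for e in walk(tree)]
backward = [e.tag for e in reverse_walk(tree)]
print(forward)     # ['a', 'b', 'c', 'd']
print(backward)    # ['d', 'c', 'b', 'a']
```

In the standard library, walk visits elements in the same order as Element.iter(), the current name for getiterator.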

kill #

kill/hoist: Removes a node from a tree. It is replaced by its children.
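
One way this might look, as a rough sketch only. The name hoist, the (parent, child) signature, and the handling of text and tail are all assumptions: elements do not know their parent, so it has to be passed in, and the proposal does not say what should happen to the removed node's text. This version keeps the text in place by merging it into the neighbouring text or tail.

```python
import xml.etree.ElementTree as ET

def hoist(parent, child):
    # hypothetical helper: replace child with its own children
    index = list(parent).index(child)
    children = list(child)

    # text before the first grandchild joins whatever preceded the child
    if child.text:
        if index:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + child.text
        else:
            parent.text = (parent.text or "") + child.text

    # swap the child for its children, preserving order
    parent.remove(child)
    for offset, grandchild in enumerate(children):
        parent.insert(index + offset, grandchild)

    # the child's tail follows the last hoisted element, if there is one
    if child.tail:
        last = index + len(children) - 1
        if last >= 0:
            prev = parent[last]
            prev.tail = (prev.tail or "") + child.tail
        else:
            parent.text = (parent.text or "") + child.tail

p = ET.fromstring("<p>one <b>two <i>three</i> four</b> five</p>")
hoist(p, p[0])
print(ET.tostring(p, encoding="unicode"))
```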

prettyprint #

prettyprint: Prints a tree with each node indented according to its depth. This is done by first indenting the tree (see below), and then serializing it as usual.

indent: Adds whitespace to the tree, so that saving it as usual results in a prettyprinted tree.

# in-place prettyprint formatter

def indent(elem, level=0):
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
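        for elem in elem:  # deliberately rebinds elem to each child in turn
            indent(elem, level+1)
        # elem is now the last child; dedent its tail so the end tag lines up
        if not elem.tail or not elem.tail.strip():
            elem.tail = i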
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
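
The prettyprint helper described above could then be a thin wrapper around indent. This is a sketch under assumptions: the signature prettyprint(elem, out) is invented here, and the standard-library xml.etree.ElementTree is used for serialization. The indent function is repeated so the snippet runs on its own.

```python
import sys
import xml.etree.ElementTree as ET

def indent(elem, level=0):
    # in-place formatter, same as above
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def prettyprint(elem, out=sys.stdout):
    # hypothetical wrapper: indent the tree in place, then serialize as usual
    indent(elem)
    out.write(ET.tostring(elem, encoding="unicode"))

prettyprint(ET.fromstring("<a><b><c /></b><b /></a>"))
```

Note that the tree is modified in place; copy it first if the original whitespace matters.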

tostringlist, fromstringlist #

tostringlist and fromstringlist: Serialize to and from lists of string fragments. This can improve performance considerably when you don't need the entire document as a single string:

# builds one large string, then writes it
out.write(tostring(elem))

# writes the fragments without joining them first
out.writelines(tostringlist(elem))

class XMLGenerator:
    def __init__(self, elem):
        self.iter = iter(tostringlist(elem))
    def more(self):
        try:
            return next(self.iter)  # self.iter.next() on old Python 2 versions
        except StopIteration:
            return None
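
Functions with these names are available in the standard-library xml.etree.ElementTree, so the round trip can be tried directly. A minimal sketch, assuming Python 3; the sample document is invented.

```python
import io
import xml.etree.ElementTree as ET

elem = ET.fromstring("<doc><item>one</item><item>two</item></doc>")

# a list of str fragments rather than one joined string
fragments = ET.tostringlist(elem, encoding="unicode")

out = io.StringIO()
out.writelines(fragments)      # write the pieces as they are
print(out.getvalue())

# the fragments parse back into an equivalent tree
again = ET.fromstringlist(fragments)
print([item.text for item in again])
```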

Namespace helpers.

 
class NS:
    def __init__(self, uri):
        self.uri = uri
    def __getattr__(self, tag):
        return self.uri + tag
    def __call__(self, path):
        return "/".join(getattr(self, tag) for tag in path.split("/"))

XHTML = NS("{http://www.w3.org/1999/xhtml}")

for elem in tree.findall(XHTML("ul/li")):
    ...
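
For a runnable version of the loop above, here is a small sketch against an inline XHTML fragment. It assumes Python 3 and the standard-library xml.etree.ElementTree; the NS class is repeated so the snippet is self-contained, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET

class NS:
    def __init__(self, uri):
        self.uri = uri
    def __getattr__(self, tag):
        # only called for names not found normally, so self.uri is unaffected
        return self.uri + tag
    def __call__(self, path):
        # qualify every step of a simple path
        return "/".join(getattr(self, tag) for tag in path.split("/"))

XHTML = NS("{http://www.w3.org/1999/xhtml}")

print(XHTML.body)          # a single qualified tag name
print(XHTML("ul/li"))      # a qualified path

doc = ET.fromstring(
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    '<ul><li>one</li><li>two</li></ul></body>'
)
items = [li.text for li in doc.findall(XHTML("ul/li"))]
print(items)
```

The call form only handles plain tag steps separated by slashes; wildcards and predicates would be prefixed with the namespace too, which is probably not what you want.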

See also

Bits and Pieces

Notes #

Suggestions (included above) from Ed Jones, and additional comments by François Pinard.
 

Comment:

The indent function doesn't behave exactly the way I desired (or expected); it's unclear whether the behavior is a bug, or by design using excessively clever code.

I replaced this:

        for elem in elem:       # clever, or a bug?
            elementtree_indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i

with this:

        for child in elem:
            elementtree_indent(child, level+1)
        if not child.tail or not child.tail.strip():
            child.tail = i
        if not elem.tail or not elem.tail.strip():
            elem.tail = i

and now sibling elements that contain children have whitespace between them. My minimal test case is '<a><b><c /></b><b /></a>':

 <a>
   <b>
     <c />
   </b><b />
 </a>

vs

 <a>
   <b>
     <c />
   </b>
   <b />
 </a>

Posted by Paul Du Bois (2007-04-12)

Comment:

Also see: http://infix.se/2007/02/06/gentlemen-indent-your-xml

Posted by Fredrik (2007-05-31)
