Using Non-Standard Encodings in cElementTree

Updated December 15 | December 1, 2005 | Fredrik Lundh

Update 2005-12-04: Changed to use codecs.open instead of plain open, to avoid problems with variable-width encodings. Thanks to “mark_m”.

Update 2005-12-15: This has been fixed in cElementTree 1.0.5, which supports all 8-bit encodings provided by Python’s Unicode implementation.

Older versions of cElementTree (1.0.4 and earlier) only support the encodings provided by the expat library itself:

  • UTF-8
  • UTF-16
  • US-ASCII
  • ISO-8859-1

Support for more encodings will be added to a future release.

To work around this in the current version, you can use the XMLParser class directly, and “recode” the data stream in Python:

import cElementTree as ET
import codecs

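def myparser(file, encoding):
    # decode the file in Python; codecs.open copes with variable-width encodings
    f = codecs.open(file, "r", encoding)
    # tell the parser to expect UTF-8, whatever the XML declaration says
    p = ET.XMLParser(encoding="utf-8")
    while 1:
        s = f.read(65536)
        if not s:
            break
        # recode each decoded chunk to UTF-8 before handing it to expat
        p.feed(s.encode("utf-8"))
    f.close()
    return ET.ElementTree(p.close())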

tree = myparser("example.xml", "windows-1252")
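For readers on a current Python, the same recoding idea can be sketched with the standard library's xml.etree.ElementTree, used here as an assumed stand-in for cElementTree. The names parse_recoded and demo, and the sample document, are hypothetical and only serve to make the sketch self-contained:

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # assumption: stand-in for cElementTree


def parse_recoded(path, encoding, chunk_size=65536):
    """Decode the file in Python and feed the parser UTF-8 bytes."""
    # encoding="utf-8" overrides the encoding named in the XML declaration.
    parser = ET.XMLParser(encoding="utf-8")
    # Text mode decodes incrementally, so multi-byte sequences are never
    # split across chunk boundaries.
    with open(path, "r", encoding=encoding, newline="") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk.encode("utf-8"))
    return ET.ElementTree(parser.close())


def demo():
    """Write a small windows-1252 file, parse it, and return the root text."""
    # Hypothetical sample document containing two non-ASCII characters.
    text = '<?xml version="1.0" encoding="windows-1252"?><doc>caf\xe9 \u20ac</doc>'
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(text.encode("windows-1252"))
        return parse_recoded(path, "windows-1252").getroot().text
    finally:
        os.remove(path)
```

Calling demo() should return the original text with the accented letter and the euro sign intact, which is a quick way to confirm that the recoding step did its job.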

To determine the encoding used in the file, you can use something like Paul Prescod’s Auto-detect XML encoding recipe.
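As a rough illustration of what such detection involves, here is a much simplified sketch. It is not Paul Prescod's recipe: it only looks for a byte order mark and then for an encoding name in an ASCII-compatible XML declaration, so UTF-16 without a BOM, EBCDIC, and similar cases are not handled. The function name sniff_xml_encoding and the UTF-8 fallback are assumptions made for this example:

```python
import codecs
import re

# BOM signatures, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data (ASCII-compatible encodings only).
_DECL = re.compile(
    br'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head, default="utf-8"):
    """Guess the encoding from the first bytes of an XML document."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: fall back to the assumed default.
    return default
```

In practice you would read the first kilobyte or so of the file in binary mode, pass it to sniff_xml_encoding, and hand the result to myparser as the encoding argument.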