ElementTree Tidy HTML Tree Builder
July 6, 2003 | Fredrik Lundh
The TidyHTMLTreeBuilder parser can read (almost) arbitrary HTML files, and turn them into well-formed element trees. This parser uses a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).
Note: If you don’t want to (or cannot) install binary Python extensions, you can use the TidyTools module in the standard ElementTree distribution. That module uses the command-line version of Tidy, which is available for many different platforms.
This tree builder requires the _elementtidy extension, which is based on the tidylib library. Note that this extension is not included in the current elementtree releases, but you can download a separate elementtidy package from the effbot.org downloads site.
Usage
Loading HTML Files
To load an HTML file into an XHTML tree, import the TidyHTMLTreeBuilder module and call the parse method:
from elementtidy import TidyHTMLTreeBuilder
tree = TidyHTMLTreeBuilder.parse("myfile.htm")
Note: In the experimental alpha releases, the tree builder is installed in the elementtidy package. If you’re using a version shipped with the ElementTree library, import the module from the elementtree package instead.
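The two import locations mentioned in the note can be handled with a fallback import. The following is a minimal sketch, assuming only the two package names given in this article; the helper name load_html is hypothetical and is not part of either package.

```python
# Try the standalone elementtidy package first, then the copy that may ship
# inside the elementtree package. Neither location is guaranteed to exist.
try:
    from elementtidy import TidyHTMLTreeBuilder
except ImportError:
    try:
        from elementtree import TidyHTMLTreeBuilder
    except ImportError:
        TidyHTMLTreeBuilder = None  # the tree builder is not installed


def load_html(path):
    """Hypothetical helper: parse an HTML file into an XHTML element tree."""
    if TidyHTMLTreeBuilder is None:
        raise ImportError("TidyHTMLTreeBuilder is not installed")
    return TidyHTMLTreeBuilder.parse(path)
```

Calling load_html("myfile.htm") then behaves like the parse call shown earlier, wherever the module happens to be installed.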
Converting XHTML to HTML
The ElementTree interfaces convert the HTML to the XML version of HTML, called XHTML. In this format, all HTML tags live in the {http://www.w3.org/1999/xhtml} namespace. The following code snippet shows how to ‘normalize’ the tree, turning it into standard HTML:
XHTML = "{http://www.w3.org/1999/xhtml}"
# strip the XHTML namespace prefix from every tag in the tree
for elem in tree.getiterator():
    if elem.tag.startswith(XHTML):
        elem.tag = elem.tag[len(XHTML):]

Saving HTML Files
To save a plain HTML file, just write out the tree.
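tree.write("outfile.htm")
This works well, as long as the file doesn’t contain any embedded SCRIPT or STYLE tags. The XML writer escapes characters such as < and & in their text, but HTML browsers read script and style content literally.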
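The normalization and save steps can be tried end to end without Tidy installed. The following is a sketch under stated assumptions: it uses the modern standard-library xml.etree.ElementTree rather than the elementtree package this article describes, it replaces the Tidy output with a small hand-written XHTML string, and it uses iter() because getiterator() was removed from the standard library in Python 3.9. The method="html" option belongs to the standard-library serializer and is not something the original article relies on.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Stand-in for the tree that TidyHTMLTreeBuilder.parse() would return.
source = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><script>if (a &lt; b) run();</script></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)
root = ET.fromstring(source)

# Strip the XHTML namespace from every tag, as in the normalization snippet.
for elem in root.iter():
    if elem.tag.startswith(XHTML):
        elem.tag = elem.tag[len(XHTML):]

# Default XML serialization: the "<" inside the script element is escaped.
as_xml = ET.tostring(root, encoding="unicode")

# HTML serialization: script and style text is written without escaping.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

Comparing the two printed strings shows the SCRIPT problem directly: the XML form contains an escaped comparison operator, while the HTML form keeps the script text as written.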
If you want, you can add a DTD reference to the beginning of the file:
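# DTD is not defined in the original snippet; the HTML 4.01 Transitional doctype is used here as an example
DTD = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">'
file = open("outfile.htm", "w")
file.write(DTD + "\n")
tree.write(file)
file.close()

Saving XHTML Files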
If you save an XHTML file (where each tag lives in the XHTML namespace), the write method will add a namespace declaration to the html element, and place every tag in an explicit namespace. Some browsers can’t handle this, and may fail to render your document properly.
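The effect of the explicit namespace is easy to see on a small tree. The following is a sketch, again assuming the modern standard-library xml.etree.ElementTree and a hand-written XHTML string in place of real Tidy output. The default_namespace argument is a feature of the standard-library write method; whether the elementtree release described here supports it is not something this article states.

```python
import io
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for an XHTML tree produced by the tree builder.
root = ET.fromstring(
    '<html xmlns="%s"><body><p>Hello</p></body></html>' % XHTML_NS
)

# Default serialization: every tag carries an explicit namespace prefix.
prefixed = ET.tostring(root, encoding="unicode")

# Alternative: declare XHTML as the default namespace so tags stay unprefixed.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, default_namespace=XHTML_NS)
unprefixed = buffer.getvalue().decode("ascii")

print(prefixed)
print(unprefixed)
```

The second form is closer to what browsers expect, but the standard-library writer raises ValueError when default_namespace is used on a tree that contains names outside any namespace, so it only suits trees that are fully in the XHTML namespace.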
Comment:
is there a good way to get elementtidy to use cElementTree instead of ElementTree?
Posted by phil z (2007-06-29)
One way is to monkey patch the TidyHTMLTreeBuilder module. Import it, and then do "TidyHTMLTreeBuilder.ElementTree = cElementTree" before you start parsing. /F

Comment:
ElementTidy isn't currently available from CheeseShop (there's an entry, but no download links). It's easy, however, to create a Python Egg for easy deployment, just edit its setup.py and change:
from distutils.core import setup, Extension
to:
from setuptools import setup, Extension
Then simply run
$ python setup.py bdist_egg
and find your egg in the dist/ directory. Perhaps someone will care to upload some pre-built ones?
Posted by k3rni (2007-05-08)