The HTMLParser Module

(New in 2.2) This module provides an improved HTML parser, which can replace sgmllib in many cases.

Like the other parsers in the standard library, this parser implements the standard feed/close consumer protocol, and calls methods on itself to handle the various parts of the HTML document. To use the parser, create a subclass where you override the methods you’re interested in.
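
The feed/close protocol and the handler callbacks can be sketched as follows. This is only an illustration: the EventLogger class and its events list are hypothetical names, and the try/except import is an assumption added so that the sketch also runs on Python 3, where the same class lives in html.parser.

```python
try:
    from HTMLParser import HTMLParser  # Python 2, the module described here
except ImportError:
    from html.parser import HTMLParser  # Python 3 location of the same class


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every callback the parser makes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


p = EventLogger()
# feed may be called repeatedly; a tag split across two chunks is
# buffered until the rest of it arrives
p.feed('<p>Hello, <a hre')
p.feed('f="http://www.python.org">Python</a></p>')
p.close()  # tell the parser that no more data is coming

for event in p.events:
    print(event)
```

Note that the second feed call completes a tag that the first call left unfinished, so the caller does not have to align chunks with tag boundaries.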

This example extracts anchor links from an HTML document:

import HTMLParser

class AnchorParser(HTMLParser.HTMLParser):

    def __init__(self):
        # let the base class set up (and reset) its parsing state
        HTMLParser.HTMLParser.__init__(self)
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)
                    break

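# read the whole document and push it through the parser in one go
f = open("sample.html")

p = AnchorParser()
p.feed(f.read())
p.close() # flush any buffered data
f.close()

print p.anchors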

Here’s an alternate driver that reads the source in 16k blocks, and lets you iterate over the anchors as the parser finds them:

class AnchorParser:
    ... # same as in the previous example

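# on Python 2.2, generators also require "from __future__ import generators"
# at the top of the module
def getanchors(file):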
    p = AnchorParser()
    while 1:
        # get some data from the source
        s = file.read(16384)
        if s:
            p.feed(s)
        else:
            p.close()
        # return anchors to caller
        for anchor in p.anchors:
            yield anchor
        if not s:
            break
        p.anchors[:] = [] # reset the list

# read from a file
for anchor in getanchors(open("index.html")):
    print anchor

# read from a remote site
from urllib import urlopen
for anchor in getanchors(urlopen("http://www.python.org")):
    print anchor
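
For reference, here is a self-contained sketch of the same incremental driver that can be run without a file or a network connection. It makes a few assumptions not found in the listing above: the blocksize parameter is a hypothetical addition used to force small chunks, the input comes from an in-memory io.StringIO object, and the import falls back to html.parser so that the sketch also runs on Python 3.

```python
import io

try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)
                    break


def getanchors(file, blocksize=16384):
    # blocksize is a hypothetical parameter, added to make chunking visible
    p = AnchorParser()
    while True:
        s = file.read(blocksize)
        if s:
            p.feed(s)
        else:
            p.close()
        # hand over whatever the parser has found so far
        for anchor in p.anchors:
            yield anchor
        if not s:
            break
        del p.anchors[:]  # reset the list


html = u'<a href="/one">1</a> text <a href="/two">2</a> <a name="x">no href</a>'

# a tiny block size, so that tags are split across several feed calls
found = list(getanchors(io.StringIO(html), blocksize=10))
print(found)
```

The third anchor has no href attribute, so it is skipped. On Python 3, urlopen returns bytes rather than text, so a remote document would have to be decoded before being passed to feed.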
 
