Extracting plain text from HTML

Fredrik Lundh | August 2003 | Originally posted to online.effbot.org

As some readers may have noticed, my RSS feed no longer includes full articles; instead, each item contains the first 50-100 words from the corresponding article, as plain unstyled text. I may switch back again when I stop posting “standard python library” articles…

If you want to use something similar in your feeds, here’s the code that does the work. Tweak as necessary:

def textify(html_snippet, maxwords=50):

    import formatter, htmllib, StringIO, string

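    class Parser(htmllib.HTMLParser):
        # don't append "[1]"-style footnote markers after links
        def anchor_end(self):
            self.anchor = None

    class Formatter(formatter.AbstractFormatter):
        pass

    class Writer(formatter.DumbWriter):
        # DumbWriter drops list item labels; write them out as text
        def send_label_data(self, data):
            self.send_flowing_data(data)
            self.send_flowing_data(" ")

    o = StringIO.StringIO()
    p = Parser(Formatter(Writer(o)))
    p.feed(html_snippet)
    p.close()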

    words = o.getvalue().split()

    if len(words) <= 2*maxwords:
        return string.join(words)

    return string.join(words[:maxwords]) + " ..."

The HTMLParser subclass disables anchor footnotes (the "[1]" markers that htmllib otherwise appends after each link); the DumbWriter subclass makes sure that HTML list items have proper labels. In other words, it works around a bug in the standard library: DumbWriter silently drops the labels.
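
The htmllib module no longer exists in Python 3, and the formatter module was removed in 3.10, so the code above only runs on Python 2. Here's a rough sketch of the same idea built on html.parser instead. The names textify3, TextExtractor, and BLOCK_TAGS are made up for this example, the set of tags treated as word breaks is my own guess at a reasonable list, and unlike the original this version does not emit list item labels. Tweak as necessary:

```python
from html.parser import HTMLParser

# Tags treated as word boundaries (an assumption; extend as needed).
BLOCK_TAGS = {
    "p", "br", "div", "li", "ul", "ol", "tr", "td", "th",
    "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
}


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def textify3(html_snippet, maxwords=50):
    parser = TextExtractor()
    parser.feed(html_snippet)
    parser.close()

    words = "".join(parser.chunks).split()

    # same rule as textify: short texts are returned whole,
    # longer ones are cut to the first maxwords words
    if len(words) <= 2 * maxwords:
        return " ".join(words)

    return " ".join(words[:maxwords]) + " ..."
```

Calling textify3("<p>Hello <b>world</b>!</p>") returns "Hello world!". Inline tags such as b are deliberately not treated as word breaks, so markup inside a word doesn't split it.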


