
EffNews Addenda, Frequently Asked Questions, and Other Assorted Notes

FAQ: I don’t really care about Python. Can I get a prebuilt version somewhere?

January 6, 2003 | Fredrik Lundh

A version of effnews is shipped with the effbot.exe platform. It runs on Windows NT/2000/XP, and probably also on Windows 98.

FAQ: Where’s the Code Archive?

October 7, 2002 | Fredrik Lundh

You can get a snapshot of the effnews #4 code base from the effbot.org downloads page.

For later additions, feel free to copy and paste from the articles (to select an entire script, triple-clicking on the first line of the script works fine in Internet Explorer).

FAQ: Where’s the Next Article?

October 10, 2002 | Fredrik Lundh

Working on it.

Note: Adding Entity Support

October 7, 2002 | Fredrik Lundh

Many RSS feeds embed HTML character entities that XML doesn't define in the description and title fields. This is allowed by the original 0.9 and 0.91 standards, but it's unclear whether later standards really support it. Not that the standards matter much here; feeds of all kinds use these entities, so we have to deal with them anyway.

The xmllib parser uses an entitydefs dictionary to translate entities to character strings. If an entity is not defined in this dictionary, the parser calls the unknown_entityref method instead. The following addition to our rss_parser class adds all standard HTML entities to the entitydefs dictionary the first time the method is called, and replaces any entity that's still unknown with an empty string.

class rss_parser(xmllib.XMLParser):

    ...

    htmlentitydefs = None

    def unknown_entityref(self, entity):
        if not self.htmlentitydefs:
            # lazy loading of entitydefs table
            import htmlentitydefs
            # make sure we don't overwrite entities already present in
            # the entitydefs dictionary (doing so will confuse xmllib)
            entitydefs = htmlentitydefs.entitydefs.copy()
            entitydefs.update(self.entitydefs)
            self.entitydefs = self.htmlentitydefs = entitydefs
        self.handle_data(self.entitydefs.get(entity, ""))

    ...
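
To see the method in action outside effnews, here's a minimal, self-contained sketch. The demo_parser class and the sample snippet are invented for illustration only; the unknown_entityref body is the one shown above.

import xmllib

class demo_parser(xmllib.XMLParser):
    # collect character data, using the lazy entity loading
    # shown above to cope with HTML entities

    htmlentitydefs = None

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def unknown_entityref(self, entity):
        if not self.htmlentitydefs:
            # lazy loading of entitydefs table
            import htmlentitydefs
            entitydefs = htmlentitydefs.entitydefs.copy()
            entitydefs.update(self.entitydefs)
            self.entitydefs = self.htmlentitydefs = entitydefs
        self.handle_data(self.entitydefs.get(entity, ""))

parser = demo_parser()
parser.feed("<title>caf&eacute; &copy; 2002 &bogus; example</title>")
parser.close()

print repr("".join(parser.text))
# should print something like 'caf\xe9 \xa9 2002  example'
# (known HTML entities are translated, the unknown one is dropped)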

Note: Feed Statistics

October 5, 2002 | Fredrik Lundh

Using a list of feeds from Syndic8.com, I’ve tried the current version of the RSS parser (including the entity support) on just over 2000 RSS feeds. The result isn’t very encouraging:

2010 feeds checked

137 feeds (6.8%) successfully read:

    rss 0.9: 17 feeds
    rss 0.91: 84 feeds
    rss 0.91fn: 2 feeds
    rss 0.92: 20 feeds
    rss 1.0: 10 feeds
    rss 2.0: 4 feeds

As it turns out, the problem isn't so much the parser as the protocol layer; the current code only accepts responses served with the text/xml content type. Here's a breakdown of the feeds that returned a valid HTTP response, listing the HTTP status code (200=OK), the reported content type, and the number of feeds:

200 'text/plain; charset=utf-8': 1 feed
301 'text/html; charset=iso-8859-1': 1 feed
200 'text/html;charset=iso-8859-1': 1 feed
200 'text/xml; charset=utf-8': 1 feed
403 'text/html; charset=iso-8859-1': 1 feed
200 'text/XML': 1 feed
302 'text/html; charset=ISO-8859-1': 1 feed
200 'application/x-cdf': 1 feed
200 'application/unknown': 1 feed
200 'httpd/unix-directory': 2 feeds
200 'text/rdf': 2 feeds
200 'application/rss+xml': 2 feeds
200 'text/xml; charset=ISO-8859-1': 2 feeds
404 'text/html; charset=iso-8859-1': 3 feeds
200 'application/sgml': 3 feeds
302 'text/html; charset=iso-8859-1': 4 feeds
200 'text/html; charset=iso-8859-1': 4 feeds
200 'application/x-netcdf': 5 feeds
200 'text/plain; charset=ISO-8859-1': 7 feeds
200 'text/plain; charset=iso-8859-1': 8 feeds
200 'application/octet-stream': 10 feeds
200 'application/xml': 18 feeds
200 'text/html': 42 feeds
200 'text/xml': 191 feeds
200 'text/plain': 1660 feeds

Most feeds are returned as text/plain, and many use little-known (or unregistered) content types. The charset parameter is also somewhat common.
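
If you'd rather loosen the check than drop it entirely, one option is to split off the charset parameter with the standard cgi.parse_header function and compare the main type against a small allow list. This is a sketch only; it's not the effnews http_rss_parser code, and the accept list below is made up for illustration.

import cgi

# content types we're willing to treat as RSS; text/plain and
# text/html are included only because so many feeds are served
# that way (this list is an assumption, not the effnews one)
ACCEPTED_TYPES = (
    "text/xml", "application/xml", "application/rss+xml",
    "text/rdf", "text/plain", "text/html",
)

def looks_like_rss(content_type_header):
    # cgi.parse_header("text/XML; charset=utf-8") returns
    # ("text/XML", {"charset": "utf-8"})
    content_type, params = cgi.parse_header(content_type_header)
    return content_type.lower() in ACCEPTED_TYPES

print looks_like_rss("text/XML")                  # true
print looks_like_rss("text/plain; charset=utf-8") # true
print looks_like_rss("application/octet-stream")  # false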

If we remove the check for content type from the http_rss_parser class, we get the following result:

1746 feeds (86.9%) successfully read:

    rss unknown: 1 feed
    rss 0.9: 55 feeds
    rss 0.91: 1623 feeds
    rss 0.91fn: 2 feeds
    rss 0.92: 22 feeds
    rss 1.0: 39 feeds
    rss 2.0: 4 feeds

There’s still 264 feeds that cannot be read by the current parser. To figure out what (if anything) is wrong with the parser, we need to be able to extract more status information from the parser.