
In Progress: EffNews Part 5: Odds and Ends

July 2003 | Fredrik Lundh

This is the fifth article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.

This article is being edited.

Improving the RSS Support

Supporting Non-XML Character Entities

Many RSS feeds embed non-XML character entities in the description and title fields. This is allowed by the original 0.9 and 0.91 standards, but it’s unclear whether later standards really support this. Not that the standards matter here; feeds of all kinds use the entities, so we have to deal with them anyway.

The xmllib parser uses an entitydefs dictionary to translate entities to character strings. If an entity is not defined by this dictionary, the parser calls the unknown_entityref method. The following addition to our rss_parser class adds all standard HTML entities to the entitydefs dictionary the first time an unknown entity is encountered, and replaces any entity that is still unknown with an empty string.

class rss_parser(xmllib.XMLParser):

    ...

    htmlentitydefs = None

    def unknown_entityref(self, entity):
        if not self.htmlentitydefs:
            # lazy loading of entitydefs table
            import htmlentitydefs
            # make sure we don't overwrite entities already present in
            # the entitydefs dictionary (doing so will confuse xmllib)
            entitydefs = htmlentitydefs.entitydefs.copy()
            entitydefs.update(self.entitydefs)
            self.entitydefs = self.htmlentitydefs = entitydefs
        self.handle_data(self.entitydefs.get(entity, ""))

    ...
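
The merge order is the important detail here: the HTML table is copied first and the parser's own table is applied on top, so the built-in XML entities keep their original definitions. As a rough sketch of just that step, assuming a current Python 3 (where htmlentitydefs is available as html.entities), and using a hypothetical xml_entitydefs table as a stand-in for the parser's built-in dictionary:

```python
from html.entities import entitydefs as html_entitydefs

# Stand-in for the parser's built-in table (an assumption for this sketch):
# the five predefined XML entities, written as character references.
xml_entitydefs = {
    "lt": "&#60;",
    "gt": "&#62;",
    "amp": "&#38;",
    "quot": "&#34;",
    "apos": "&#39;",
}

def merge_entitydefs(parser_defs):
    """Return the HTML entities, with the parser's own entries taking priority."""
    merged = dict(html_entitydefs)   # start from the HTML table
    merged.update(parser_defs)       # parser entries overwrite HTML ones
    return merged

def resolve_entity(table, name):
    """Return the replacement text, or an empty string for unknown names."""
    return table.get(name, "")

table = merge_entitydefs(xml_entitydefs)
```

With this table, resolve_entity(table, "eacute") gives the accented character, while a made-up name simply disappears from the output.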

Handling Non-ASCII Character Sets

Handling Windows CP1252 Gremlins

Improving the HTTP Support

Dealing With Different Content Types

Using a list of feeds from Syndic8.com, I’ve tried the current RSS parser (including the entity support) on just over 2000 RSS feeds. The result isn’t very encouraging:

2010 feeds checked

137 feeds (6.8%) successfully read:

    rss 0.9: 17 feeds
    rss 0.91: 84 feeds
    rss 0.91fn: 2 feeds
    rss 0.92: 20 feeds
    rss 1.0: 10 feeds
    rss 2.0: 4 feeds

As it turns out, the problem isn’t so much the parser as the protocol layer; the current code only accepts responses if they’re using the text/xml content type. Here’s a breakdown of the feeds that returned a valid HTTP response. The following list shows the HTTP status code (200=OK) and the specified content type:

200 'text/plain; charset=utf-8': 1 feed
301 'text/html; charset=iso-8859-1': 1 feed
200 'text/html;charset=iso-8859-1': 1 feed
200 'text/xml; charset=utf-8': 1 feed
403 'text/html; charset=iso-8859-1': 1 feed
200 'text/XML': 1 feed
302 'text/html; charset=ISO-8859-1': 1 feed
200 'application/x-cdf': 1 feed
200 'application/unknown': 1 feed
200 'httpd/unix-directory': 2 feeds
200 'text/rdf': 2 feeds
200 'application/rss+xml': 2 feeds
200 'text/xml; charset=ISO-8859-1': 2 feeds
404 'text/html; charset=iso-8859-1': 3 feeds
200 'application/sgml': 3 feeds
302 'text/html; charset=iso-8859-1': 4 feeds
200 'text/html; charset=iso-8859-1': 4 feeds
200 'application/x-netcdf': 5 feeds
200 'text/plain; charset=ISO-8859-1': 7 feeds
200 'text/plain; charset=iso-8859-1': 8 feeds
200 'application/octet-stream': 10 feeds
200 'application/xml': 18 feeds
200 'text/html': 42 feeds
200 'text/xml': 191 feeds
200 'text/plain': 1660 feeds

Most feeds are returned as text/plain, and many use little-known (or unregistered) content types. The charset parameter is also somewhat common.
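
The list also shows the same type written in several ways ('text/XML', 'text/html;charset=iso-8859-1', and so on), so any code that looks at the content type has to normalize it first. A minimal sketch of such a helper, assuming a current Python 3; the name split_content_type is made up for this example and is not part of the effnews code:

```python
def split_content_type(value):
    """Split a Content-Type header value into (media_type, params).

    The media type and the parameter names are lowercased; parameter
    values are left as they were sent, apart from surrounding quotes.
    """
    parts = value.split(";")
    media_type = parts[0].strip().lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, val = part.split("=", 1)
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params
```

For example, split_content_type("text/XML") and split_content_type("text/xml; charset=utf-8") both report the media type as "text/xml".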

If we remove the content type check from the http_rss_parser class, so that it only looks at the status code:

class http_rss_parser(rss_parser.rss_parser):
    ...
    def http_header(self, client):
        if client.status[1] != "200":
            raise http_client.CloseConnection

we get the following result:

1746 feeds (86.9%) successfully read:

    rss unknown: 1 feed
    rss 0.9: 55 feeds
    rss 0.91: 1623 feeds
    rss 0.91fn: 2 feeds
    rss 0.92: 22 feeds
    rss 1.0: 39 feeds
    rss 2.0: 4 feeds

Handling Redirection

class http_rss_parser(rss_parser.rss_parser):
    ...
    def http_header(self, client):
        if client.status[1].startswith("3"):
            ... redirect ...
            location = client.header["location"]
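
The redirect step itself is left out above. Two things it would probably need are resolving the Location value against the current URL (servers sometimes send relative locations) and a limit on how many redirects to follow. The following is only a sketch of that logic, assuming a current Python 3; next_url, MAX_REDIRECTS and REDIRECT_CODES are hypothetical names, and the headers argument is a plain dictionary rather than the http_client object used in this series. It is also narrower than the startswith("3") test, since a status such as 304 does not carry a Location header.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # assumed limit, to avoid redirect loops
REDIRECT_CODES = ("301", "302", "303", "307")  # codes that carry a Location

def next_url(current_url, status, headers, hops):
    """Return the URL to fetch next, or None if this is not a redirect.

    status is the status code as a string, headers maps lowercase
    header names to values, and hops counts redirects followed so far.
    """
    if status not in REDIRECT_CODES:
        return None
    if hops >= MAX_REDIRECTS:
        raise ValueError("too many redirects")
    location = headers.get("location")
    if not location:
        raise ValueError("redirect without a Location header")
    # urljoin leaves absolute URLs alone and resolves relative ones
    return urljoin(current_url, location)
```

The caller would fetch the returned URL with hops increased by one, and stop when next_url returns None.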

Handling Other Status Codes

class http_rss_parser(rss_parser.rss_parser):
    ...
    def http_header(self, client):
        status = client.status[1]
        status_category = status[:1]
        if status_category == "3":
            ... redirect ...
            location = client.header["location"]
        elif status_category == "2":
            ... accept ...
        else:
            ...

Using Conditional Fetch

Fetching Compressed Data