EffNews Addenda, Frequently Asked Questions, and Other Assorted Notes
FAQ: I don’t really care about Python. Can I get a prebuilt version somewhere?
January 6, 2003 | Fredrik Lundh
A version of effnews is shipped with the effbot.exe platform. It runs on Windows NT/2000/XP, and probably also on Windows 98.
FAQ: Where’s the Code Archive?
October 7, 2002 | Fredrik Lundh
You can get a snapshot of the effnews #4 code base from the effbot.org downloads page.
For later additions, feel free to copy and paste from the articles (to select an entire script, triple-clicking on the first line of the script works fine in Internet Explorer).
FAQ: Where’s the Next Article?
October 10, 2002 | Fredrik Lundh
Note: Adding Entity Support
October 7, 2002 | Fredrik Lundh
Many RSS feeds embed non-XML character entities in the description and title fields. This is allowed by the original 0.9 and 0.91 standards, but it’s unclear whether later standards really support this. Not that the standards matter here; feeds of all kinds use the entities, so we have to deal with them anyway.
The xmllib parser uses an entitydefs dictionary to translate entities to character strings. If an entity is not defined in this dictionary, the parser calls the unknown_entityref method. The following addition to our rss_parser class adds all standard HTML entities to the entitydefs dictionary the first time that method is called, and replaces any entity that is still unknown with an empty string.
    class rss_parser(xmllib.XMLParser):

        ...

        htmlentitydefs = None

        def unknown_entityref(self, entity):
            if not self.htmlentitydefs:
                # lazy loading of entitydefs table
                import htmlentitydefs
                # make sure we don't overwrite entities already present in
                # the entitydefs dictionary (doing so will confuse xmllib)
                entitydefs = htmlentitydefs.entitydefs.copy()
                entitydefs.update(self.entitydefs)
                self.entitydefs = self.htmlentitydefs = entitydefs
            self.handle_data(self.entitydefs.get(entity, ""))

        ...
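For readers on Python 3, where the xmllib and htmlentitydefs modules no longer exist, the same merge-and-fall-back logic can be sketched without the parser class. This is only an illustration under a few assumptions: build_entity_table and resolve_entity are hypothetical names that do not appear in the effnews code, and the table maps entity names directly to characters rather than using xmllib's own internal format.

```python
from html.entities import name2codepoint

# The five entities predefined by XML itself (assumed starting table).
XML_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def build_entity_table(existing):
    """Return the HTML entity table, with `existing` entries taking precedence."""
    table = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}
    # Entries already known to the parser must not be overwritten,
    # so they are applied last.
    table.update(existing)
    return table


def resolve_entity(name, table):
    """Translate an entity name to text; unknown entities become ''."""
    return table.get(name, "")


if __name__ == "__main__":
    table = build_entity_table(XML_ENTITIES)
    print(repr(resolve_entity("eacute", table)))
    print(repr(resolve_entity("nosuchentity", table)))
```

As in the class above, the order of the two dictionary operations is what matters: the HTML table is copied first and the parser's existing entries are applied on top of it.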
Note: Feed Statistics
October 5, 2002 | Fredrik Lundh
Using a list of feeds from Syndic8.com, I’ve tried the current version of the RSS parser (including the entity support) on just over 2000 RSS feeds. The result isn’t very encouraging:
2010 feeds checked
137 feeds (6.8%) successfully read:
rss 0.9: 17 feeds
rss 0.91: 84 feeds
rss 0.91fn: 2 feeds
rss 0.92: 20 feeds
rss 1.0: 10 feeds
rss 2.0: 4 feeds
As it turns out, the problem isn’t so much the parser as the protocol layer; the current code only accepts responses if they’re using the text/xml content type. Here’s a breakdown of the feeds that returned a valid HTTP response. The following list shows the HTTP status code (200=OK) and the specified content type:
200 'text/plain; charset=utf-8': 1 feed
301 'text/html; charset=iso-8859-1': 1 feed
200 'text/html;charset=iso-8859-1': 1 feed
200 'text/xml; charset=utf-8': 1 feed
403 'text/html; charset=iso-8859-1': 1 feed
200 'text/XML': 1 feed
302 'text/html; charset=ISO-8859-1': 1 feed
200 'application/x-cdf': 1 feed
200 'application/unknown': 1 feed
200 'httpd/unix-directory': 2 feeds
200 'text/rdf': 2 feeds
200 'application/rss+xml': 2 feeds
200 'text/xml; charset=ISO-8859-1': 2 feeds
404 'text/html; charset=iso-8859-1': 3 feeds
200 'application/sgml': 3 feeds
302 'text/html; charset=iso-8859-1': 4 feeds
200 'text/html; charset=iso-8859-1': 4 feeds
200 'application/x-netcdf': 5 feeds
200 'text/plain; charset=ISO-8859-1': 7 feeds
200 'text/plain; charset=iso-8859-1': 8 feeds
200 'application/octet-stream': 10 feeds
200 'application/xml': 18 feeds
200 'text/html': 42 feeds
200 'text/xml': 191 feeds
200 'text/plain': 1660 feeds
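A breakdown like the one above can be produced by counting (status, content type) pairs. The following is a minimal sketch, not the script actually used for these statistics: tally_responses and format_line are hypothetical helpers, and the sample data is made up for illustration.

```python
from collections import Counter


def tally_responses(responses):
    """Count (status, content_type) pairs, least common first."""
    counts = Counter(responses)
    return sorted(counts.items(), key=lambda item: item[1])


def format_line(status, content_type, count):
    """Format one line in the style of the list above."""
    unit = "feed" if count == 1 else "feeds"
    return "%d %r: %d %s" % (status, content_type, count, unit)


if __name__ == "__main__":
    # Illustrative sample only; not the real survey data.
    sample = (
        [(200, "text/plain")] * 3
        + [(200, "text/xml")] * 2
        + [(404, "text/html; charset=iso-8859-1")]
    )
    for (status, content_type), count in tally_responses(sample):
        print(format_line(status, content_type, count))
```

Because the content type is counted exactly as the server sent it, values that differ only in case or spacing (such as 'text/XML' and 'text/xml') end up on separate lines, just as they do in the list.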
Most feeds are returned as text/plain, and many use little-known (or unregistered) content types. The charset parameter is also somewhat common.
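If a content type check is kept at all, the header value has to be normalized first, since the list contains variants that differ only in case, spacing, or the charset parameter. Here is a rough sketch of such a normalization; split_content_type is a hypothetical helper, not part of the http_rss_parser class.

```python
def split_content_type(value):
    """Split a Content-Type header value into (media_type, charset).

    The media type is lowercased and stripped of parameters; charset is
    None if the header doesn't specify one.
    """
    media_type, _, params = value.partition(";")
    media_type = media_type.strip().lower()
    charset = None
    for param in params.split(";"):
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset":
            charset = val.strip().strip('"').lower()
    return media_type, charset


if __name__ == "__main__":
    for header in ("text/xml; charset=ISO-8859-1",
                   "text/html;charset=iso-8859-1",
                   "text/XML"):
        print(split_content_type(header))
```

Even with this normalization, a check against text/xml alone would still reject the large majority of the feeds listed above, since most of them are served as text/plain.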
If we remove the check for content type from the http_rss_parser class, we get the following result:
1746 feeds (86.9%) successfully read:
rss unknown: 1 feed
rss 0.9: 55 feeds
rss 0.91: 1623 feeds
rss 0.91fn: 2 feeds
rss 0.92: 22 feeds
rss 1.0: 39 feeds
rss 2.0: 4 feeds
There are still 264 feeds that cannot be read by the current parser. To figure out what (if anything) is wrong with the parser, we need to be able to extract more status information from it.
