September 10, 2002 | Fredrik Lundh
This is the second article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.
Intermission: Did Anyone Spot The Error Message?
As some of you may have noticed, if you add the last code snippet from the previous article to the test program, a couple of strange-looking lines of text appear among the ok/failed messages:
online.effbot.org done
www.bbc.co.uk done
www.example.com failed
error: uncaptured python exception, closing channel <async_http connected at 8eb07c> (exceptions.AttributeError:file_consumer instance has no attribute 'file' [C:\py21\lib\asyncore.py|poll|95] [C:\py21\lib\asyncore.py|handle_read_event|383] [http_client.py|handle_read|77] [my-test-program.py|feed|15])
www.scripting.com done
(Directory names and line numbers may vary.)
The "error: uncaptured python exception" message is generated by asyncore’s default error handler when a callback raises a Python exception. This message is actually a compact rendition of a standard Python traceback, printed on a single line. Here’s the deciphered version:
www.bbc.co.uk done
www.example.com
Traceback (most recent call last):
  File C:\py21\lib\asyncore.py, line 95, in poll:
  File C:\py21\lib\asyncore.py, line 383, in handle_read_event:
  File http_client.py, line 77, in handle_read:
  File my-test-program.py, line 15, in feed:
AttributeError:file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done
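If you run into these one-liners often, the deciphering can be automated. Here is a minimal sketch in present-day Python, assuming the compact format is always a run of [file|function|line] entries preceded by the exception text. The expand_compact_traceback helper is a made-up name, not something asyncore provides:

```python
import re

# Matches one "[file|function|line]" entry of the compact rendition.
FRAME = re.compile(r"\[([^|\]]+)\|([^|\]]+)\|(\d+)\]")

def expand_compact_traceback(message):
    """Expand a one-line report into traceback-style lines.

    Hypothetical helper for illustration; not part of asyncore.
    """
    lines = ["Traceback (most recent call last):"]
    for filename, function, lineno in FRAME.findall(message):
        lines.append('  File "%s", line %s, in %s' % (filename, lineno, function))
    # the exception text sits between the first "(" and the first "["
    match = re.search(r"\((.*?)\s*\[", message)
    if match:
        lines.append(match.group(1))
    return "\n".join(lines)

message = (
    "error: uncaptured python exception, closing channel "
    "<async_http connected at 8eb07c> "
    "(exceptions.AttributeError:file_consumer instance has no attribute 'file' "
    r"[C:\py21\lib\asyncore.py|poll|95] "
    r"[C:\py21\lib\asyncore.py|handle_read_event|383] "
    "[http_client.py|handle_read|77] [my-test-program.py|feed|15])"
)

print(expand_compact_traceback(message))
```

The frames come out in the same order as in the compact message, outermost call first.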
So what’s causing this error?
Note that the AttributeError occurs in the feed method, which appears to be called despite the fact that the consumer did close the socket in the http_header method.
The http_client code is supposed to deal with this, by checking the connected flag attribute after calling the http_header consumer method. That flag was cleared by the close method in earlier versions of asyncore, but that was changed somewhere on the way from Python 1.5.2 to Python 2.1.
(And the reason I didn’t notice was sloppy testing: my test script contained enough debugging print statements to make me miss the error message. Sorry for that.)
Closing the Channel From the Consumer, Revisited
The obvious workaround is of course to explicitly clear the attribute in the consumer’s http_header method:
class file_consumer:
    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None
    ...
However, the connected flag is undocumented, and may (in theory) disappear in future versions of asyncore.
To make your code more future-proof, it’s better to use a return value or an exception to indicate that the channel should be closed.
The following example uses a custom CloseConnection exception for this purpose:
class file_consumer:
    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            raise http_client.CloseConnection
        self.host = client.host
        self.file = None
Here are the necessary additions to the http_client module:
class CloseConnection(Exception):
    pass

...

try:
    self.consumer.http_header(self)
except CloseConnection:
    self.close()
    return
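To see the exception-based protocol in isolation, here is a small sketch in present-day Python that runs without a network connection. The toy_client and collecting_consumer classes are stand-ins invented for this example; only CloseConnection and the http_header and feed method names come from the code above:

```python
class CloseConnection(Exception):
    """Raised by a consumer to ask the client to close the channel."""

class toy_client:
    # Hypothetical stand-in for the async_http class; it only models
    # the header callback and the body data that follows it.
    def __init__(self, consumer, status, content_type):
        self.consumer = consumer
        self.status = ("HTTP/1.0", status, "")
        self.header = {"content-type": content_type}
        self.closed = False

    def close(self):
        self.closed = True

    def handle_response(self, body):
        try:
            self.consumer.http_header(self)
        except CloseConnection:
            self.close()
            return  # never call feed() on a consumer that bailed out
        self.consumer.feed(body)

class collecting_consumer:
    def __init__(self):
        self.chunks = []

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise CloseConnection

    def feed(self, data):
        self.chunks.append(data)

good = collecting_consumer()
toy_client(good, "200", "text/xml").handle_response("<rss/>")

bad = collecting_consumer()
rejected = toy_client(bad, "404", "text/html")
rejected.handle_response("<html/>")

print(good.chunks, bad.chunks, rejected.closed)
```

The point of the design is that the consumer never touches the client's internals: it only signals its intent, and the client decides how to shut down.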
Overriding Asyncore’s Error Handling
The error message is printed by a method called handle_error. To change the look of the error message, you can override this in your dispatcher subclass. For example, here’s a version that prints a traditional traceback:
import traceback

class my_channel(asyncore.dispatcher_with_send):

    ...

    def handle_error(self):
        traceback.print_exc()
        self.close()

    ...
With the above lines added to the async_http class, you’ll get the following message instead:
www.bbc.co.uk done
www.example.com failed
Traceback (most recent call last):
  File "C:\py21\lib\asyncore.py", line 95, in poll
    obj.handle_read_event()
  File "C:\py21\lib\asyncore.py", line 383, in handle_read_event
    self.handle_read()
  File "http_client.py", line 77, in handle_read
    self.consumer.feed(data)
  File "my-test-program.py", line 15, in feed
    if self.file is None:
AttributeError: file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done
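The override pattern itself does not depend on sockets. The sketch below is written for present-day Python, where asyncore is no longer part of the standard library, so it uses a made-up toy_dispatcher class that mimics only one thing: routing an exception raised in a handler to handle_error. Treat it as an illustration of the idea, not as asyncore's actual implementation:

```python
import io
import traceback

class toy_dispatcher:
    # Hypothetical stand-in for asyncore.dispatcher: it calls a handler
    # and routes any exception to handle_error.
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def handle_read_event(self):
        try:
            self.handle_read()
        except Exception:
            self.handle_error()

    def handle_error(self):
        # placeholder for the default, compact one-line report
        print("error: uncaptured python exception")
        self.close()

class my_channel(toy_dispatcher):
    def __init__(self, log):
        toy_dispatcher.__init__(self)
        self.log = log

    def handle_read(self):
        raise AttributeError("file_consumer instance has no attribute 'file'")

    def handle_error(self):
        # write a traditional multi-line traceback instead
        traceback.print_exc(file=self.log)
        self.close()

log = io.StringIO()
channel = my_channel(log)
channel.handle_read_event()
print(log.getvalue())
```

Writing to a file-like object instead of standard error makes it easy to send the tracebacks to a log file later on.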
Parsing RSS Files
As shown in the first article, an RSS file contains summary information about a (portion of a) site, including a list of current news items.
For both the channel itself and the items, the RSS file can contain a title, a link to an HTML page, and a description field:
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam filters watching fredrik@pythonware.com has</description>
    </item>
    ...
  </channel>
</rss>
Note that the item elements are stored as child elements of the channel element. Both the channel element and the individual item elements may contain additional subelements, including the language element present in this example. We’ll look at some additional elements in a later article; for now, we’re only interested in the three basic elements.
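To make the nesting concrete, here is a short sketch that pulls the three basic elements out of an abridged copy of the sample above. It uses the xml.etree.ElementTree module from present-day Python rather than the parser used in the rest of this article, and the basic_fields helper is a name made up for the example:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample document (description shortened).
SAMPLE = """\
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002</description>
    </item>
  </channel>
</rss>
"""

BASIC = ("title", "link", "description")

def basic_fields(element):
    # pick out only the three basic child elements; others are ignored
    return {tag: element.findtext(tag) for tag in BASIC}

root = ET.fromstring(SAMPLE)
channel = root.find("channel")
channel_info = basic_fields(channel)
items = [basic_fields(item) for item in channel.findall("item")]

print(root.get("version"), channel_info["title"])
for item in items:
    print(item["title"], item["link"])
```

Since findtext only looks at direct children, the channel title and the item title do not get mixed up, and the language element is simply skipped.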
XML Parsers
To parse an XML-based format like RSS, you need an XML parser. Python provides several ways to parse XML data, including the standard xmllib module which is a simple event-driven XML parser, the pyexpat parser and other components provided in the standard xml package, the PyXML extension library, and many others.
For the first version of the RSS parser, we’ll use the xmllib parser. You can plug in another parser if you need more features or better performance (and as you’ll see, chances are that you will need more, or at least different, features; more on this in a later article).
The xmllib parser works pretty much like the asyncore dispatcher; the module provides a parser base class that processes incoming data, and calls methods for different “XML events”. To handle the events, you should subclass the parser class, and implement methods for the events you need to deal with.
For the RSS parser, you need to implement the following methods:
start_TAG is called when the start tag (<TAG …>) for an element called TAG is found. The handler is called with a single argument, which is a dictionary containing the element attributes, if any.
end_TAG is called when the end tag (</TAG>) for an element called TAG is found.
handle_data is called for text between the elements (so-called character data). This handler is called with a single argument, a string containing the text. This method may be called more than once for any given character data segment.
For example, when parsing this XML fragment…
"<title>Odds &amp; Ends</title>\n"
…the xmllib parser will call the following methods:
self.start_title({})
self.handle_data("Odds ")
self.handle_data("&")
self.handle_data(" Ends")
self.end_title()
self.handle_data("\n")
Note that standard XML character entities like &amp; are decoded by the parser, and are passed to the handle_data method as ordinary character data.
If start or end handlers are missing for elements that appear in the XML document, the corresponding start or end tags are silently ignored by the parser (but character data inside the element is still passed to handle_data).
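If you want to watch this kind of event stream on a present-day Python, where xmllib is no longer available, the expat parser in xml.parsers.expat can serve as a rough stand-in. This is a sketch under that assumption; expat uses one generic handler per event type instead of per-tag methods, and how the character data is split into chunks is up to the parser:

```python
import xml.parsers.expat

events = []

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda tag, attrs: events.append(("start", tag, attrs))
parser.EndElementHandler = lambda tag: events.append(("end", tag))
parser.CharacterDataHandler = lambda data: events.append(("data", data))

# second argument: true means "this is the last piece of the document"
parser.Parse("<title>Odds &amp; Ends</title>\n", True)

for event in events:
    print(event)

# the data events may arrive in several pieces, so join them up
text = "".join(event[1] for event in events if event[0] == "data")
print(repr(text))
```

As with xmllib, the entity arrives already decoded, and a handler that wants the full text has to collect the pieces itself.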
Here’s a minimal test program that implements a character data handler, and start and end tag handlers for the three RSS elements we’re interested in:
import xmllib

class rss_parser(xmllib.XMLParser):

    data = ""

    def start_title(self, attr):
        self.data = ""

    def end_title(self):
        print "TITLE", repr(self.data)

    def start_link(self, attr):
        self.data = ""

    def end_link(self):
        print "LINK", repr(self.data)

    def start_description(self, attr):
        self.data = ""

    def end_description(self):
        print "DESCRIPTION", repr(self.data)

    def handle_data(self, data):
        self.data = self.data + data

import sys

file = open(sys.argv[1])

parser = rss_parser()
parser.feed(file.read())
parser.close()
Note that the start methods set the data member to an empty string, the handle_data method adds text to that string, and the end handlers print out the string.
Also note that you pass the raw RSS data to the parser’s feed method, and call the close method when you’re done.
Here’s some sample output from this script (using the BBC newsfeed we downloaded earlier):
$ python rss-test.py www.bbc.co.uk.rss
TITLE 'BBC News | Front Page'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/default.stm'
DESCRIPTION 'Updated every minute of every day'
TITLE 'BBC News Online'
LINK 'http://news.bbc.co.uk'
TITLE 'Blair and Bush talk tough on Iraq\r\n'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243684.stm
DESCRIPTION 'British PM Tony Blair says he has a "shared strategy" ...
TITLE "Al-Qaeda 'plotted nuclear attacks'"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2244146.stm
DESCRIPTION 'Two alleged masterminds of the 11 September attacks ...
TITLE "Rix: 'Scum' will profit from Tube"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2244076.stm'
DESCRIPTION 'Train drivers\' union leader Mick Rix says profits ...
TITLE 'Ex-arms inspector defends Baghdad'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243627.stm
DESCRIPTION 'Scott Ritter\xb8 once head of UN inspectors in Iraq\xb8 ...
TITLE 'Police warning as flash floods hit city'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/scotland/2244003.stm'
DESCRIPTION 'People are advised not to travel to Inverness after ...
The first title/link/description combination contains information about the site, the others contain information about individual items.
(Note that there are extra title and link values in the first section. If you look in the source RSS file, you’ll notice that they come from an extra image element, which we can safely ignore for the moment.)
To get a usable RSS parser, all you have to do is to add some logic that checks where in the file we are, and adds element values to the right data structure.
In the following example, the element handlers update a common current dictionary attribute, which is set to point to either the channel information dictionary, or a dictionary for each item (stored in the items list). This version also does some very basic syntax checking.
import xmllib

class ParseError(Exception):
    pass

class rss_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.rss_version = None
        self.channel = None
        self.current = None
        self.data_tag = None
        self.data = None
        self.items = []

    # stuff to deal with text elements

    def _start_data(self, tag):
        if self.current is None:
            raise ParseError("%s tag not in channel or item element" % tag)
        self.data_tag = tag
        self.data = ""

    def handle_data(self, data):
        if self.data is not None:
            self.data = self.data + data

    # cdata sections are handled as any other character data
    handle_cdata = handle_data

    def _end_data(self):
        if self.data_tag:
            self.current[self.data_tag] = self.data or ""

    # main rss structure

    def start_rss(self, attr):
        self.rss_version = attr.get("version")

    def start_channel(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.channel = self.current

    def start_item(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.items.append(self.current)

    # content elements

    def start_title(self, attr):
        self._start_data("title")
    end_title = _end_data

    def start_link(self, attr):
        self._start_data("link")
    end_link = _end_data

    def start_description(self, attr):
        self._start_data("description")
    end_description = _end_data
The _start_data and _end_data methods are used to switch on and off character data processing in handle_data.
Here’s a test script, which prints each item to standard output (via the end_item method).
import rss_parser, string, sys

class my_rss_parser(rss_parser.rss_parser):

    def end_item(self):
        item = self.items[-1]
        print string.strip(item.get("title") or "")
        print item.get("link")
        print item.get("description")
        print

for filename in sys.argv[1:]:
    file = open(filename)
    try:
        parser = my_rss_parser()
        parser.feed(file.read())
        parser.close()
    except:
        print "=== cannot parse %s:" % filename
        print "===", sys.exc_type, sys.exc_value
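For readers on a present-day Python, where xmllib is gone, the "current dictionary" technique can be sketched on top of expat. This is a condensed illustration of the same idea and not a port of the class above: the rss_collector name is invented, the version check and ParseError are left out, and expat's generic start and end handlers replace the per-tag methods.

```python
import xml.parsers.expat

TEXT_TAGS = ("title", "link", "description")

class rss_collector:
    # Hypothetical sketch built on expat; the article's parser
    # subclasses xmllib.XMLParser instead.
    def __init__(self):
        self.channel = None
        self.items = []
        self.current = None   # dictionary that text elements go into
        self.data = None      # None means "ignore character data"
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end
        self._parser.CharacterDataHandler = self._data

    def _start(self, tag, attrs):
        if tag == "channel":
            self.current = self.channel = {}
        elif tag == "item":
            self.current = {}
            self.items.append(self.current)
        elif tag in TEXT_TAGS and self.current is not None:
            self.data = ""    # switch character data collection on

    def _data(self, data):
        if self.data is not None:
            self.data = self.data + data

    def _end(self, tag):
        if tag in TEXT_TAGS and self.data is not None:
            self.current[tag] = self.data
            self.data = None  # switch collection off again

    def feed(self, text):
        self._parser.Parse(text, False)

    def close(self):
        self._parser.Parse("", True)

parser = rss_collector()
parser.feed('<rss version="0.91"><channel><title>example channel</title>'
            '<item><title>First</title><link>http://example.com/1</link></item>'
            '</channel></rss>')
parser.close()

print(parser.channel)
print(parser.items)
```

The document fed to the parser here is a made-up miniature, just large enough to show that channel-level and item-level values end up in different dictionaries.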
Incremental parsing
The above example reads the entire XML document from disk, and passes it to the parser in one go. The xmllib library also supports incremental parsing, allowing you to pass in XML fragments as you receive them. Just keep calling the feed method, and make sure to call close when you’re done. The parser framework will take care of the rest.
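The feed-then-close pattern can be tried out without a network by slicing a document into small pieces. The sketch below targets present-day Python and therefore uses expat, whose Parse(data, False) and Parse("", True) calls play the roles of feed and close; the document and the 16-character chunk size are arbitrary choices for the example:

```python
import xml.parsers.expat

DOCUMENT = ('<rss version="0.91"><channel>'
            '<item><title>first</title></item>'
            '<item><title>second</title></item>'
            '</channel></rss>')

titles = []              # titles of completed elements, in arrival order
state = {"text": None}   # None means "not inside a title element"

def start(tag, attrs):
    if tag == "title":
        state["text"] = ""

def data(text):
    if state["text"] is not None:
        state["text"] += text

def end(tag):
    if tag == "title":
        titles.append(state["text"])
        state["text"] = None

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.EndElementHandler = end
parser.CharacterDataHandler = data

seen = []
for i in range(0, len(DOCUMENT), 16):
    parser.Parse(DOCUMENT[i:i+16], False)   # plays the role of feed()
    seen.append(len(titles))                # titles known so far
parser.Parse("", True)                      # plays the role of close()

print(titles)
print(seen)
```

The seen list shows how many titles were available after each chunk. How quickly events are delivered depends on the parser's internal buffering, so the only safe assumption is that everything has been reported once the final call returns.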
This feature is of course a perfect match for the http_client class we developed in the first article; by plugging in a parser instance as the consumer, you can parse RSS items as they arrive over the network.
The following script provides an http_rss_parser class that adds the required http_header and http_failed methods to the parser, and uses an end_item handler to print incoming items:
import http_client, rss_parser, string

class http_rss_parser(rss_parser.rss_parser):

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection
        self.host = client.host

    def http_failed(self, client):
        pass

    def end_item(self):
        item = self.items[-1]
        print " ", string.strip(item.get("title") or ""),
        print "[%s]" % self.host
        print " ", string.strip(item.get("link") or "")
        print
        print item.get("description")
        print
Here’s a driver script that reads a list of URLs from a text file named channels.txt, and fires up one asynchronous client for each channel:
import asyncore, http_client

file = open("channels.txt")
for url in file.readlines():
    url = url.strip()
    if url:
        http_client.do_request(url, http_rss_parser())

asyncore.loop()
The output is a list of titles, links, and descriptions. Here’s an excerpt:
Blair defiant over Iraq [www.bbc.co.uk]
http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2247366.stm
Prime Minister Tony Blair confronts his trade union critics ...
arrgh! [online.effbot.org]
http://online.effbot.org#85432883
"Kom kom nu hit min vän, för glädjen blir större när man delar ...
Buffet killers [www.kottke.org]
http://www.kottke.org/02/09/020910buffet_kille.html
We're in Las Vegas and it's buffet time. It's always buffet ...
Note: As I write this, the www.scripting.com channel has just switched to something that appears to be an experimental version of Dave Winer’s RSS 2.0, which moves all RSS tags into a default namespace. The xmllib parser always takes the namespace into account, so it won’t find a single thing in that channel. Hopefully, this will be fixed in the not too distant future.
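The effect of a default namespace is easy to reproduce with any namespace-aware parser. The sketch below uses xml.etree.ElementTree from present-day Python, not xmllib, and the namespace URI is made up for the example (it is not the URI used by that channel). The local_name helper is likewise just an illustration of one possible workaround:

```python
import xml.etree.ElementTree as ET

# The namespace URI below is hypothetical.
NAMESPACED = ('<rss version="2.0" xmlns="http://example.com/hypothetical-rss">'
              '<channel><title>example</title></channel></rss>')

def local_name(tag):
    # "{uri}title" -> "title"; tags without a namespace pass through
    return tag.rsplit("}", 1)[-1]

root = ET.fromstring(NAMESPACED)
tags = [element.tag for element in root.iter()]

print(tags)                                  # namespace-qualified names
print([local_name(tag) for tag in tags])     # plain names
print(root.find("channel"))                  # a plain name no longer matches
```

A parser that looks for a plain "title" or "channel" sees none of these elements, which is why the channel comes out empty.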
That’s all for today.
In the next article, we’ll look at what happens if you add dozens or hundreds of channels to the channels.txt file, and discuss how to deal with that. We’ll also build a simple RSS viewer using the Tkinter library.
In the meantime, if you’re running Unix, and are using a modern mail client that highlights URLs embedded in text mails, you can mail yourself the output from this program and let your mail reader do the rest:
$ python getchannels.py | mail -s effnews yourself
