Sockets: Usenet Support

This is a really old draft from 1997.

Pulling Documents and Images off Usenet

Another source of information and images is the part of the Internet called Usenet, or News. Usenet is a distributed bulletin board, where messages can be read from, and posted to, special news servers. Messages posted to a given news server are propagated to other servers, but as with the Web, you have to connect to a server to be able to read the messages.

The protocol used to fetch messages (“articles”) from a news server is called Network News Transfer Protocol (NNTP), and is defined in RFC 977. Here’s a typical session, in which the client application connects, reads the standard headers for new messages in the newsgroup called comp.lang.python, downloads one of them, and then posts a message to the server (possibly in response to the other message):

Client: connects

Server: 200 news.spam.egg PyNNTP 1.0 ready (posting ok)

Client: GROUP comp.lang.python

Server: 211 367 13887 14268 comp.lang.python

Client: XOVER 14211-14268

Server: 224 data follows

Server: (sends overview information for articles 14211 to 14268)

Server: .

Client: ARTICLE 14220

Server: 220 14220 <5qj8v5$8dd@news.spam.egg> article

Server: (sends message)

Server: .

Client: POST

Server: 340 OK

Client: (sends message)

Client: .

Server: 240 Article posted

Client: QUIT

Client: disconnects

(Now, this is a real chat protocol!)

Note that each command from the client starts with a command keyword, and each reply from the server starts with a status code. Messages and listings are terminated with a line containing a single dot.
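The framing rules above can be sketched in a few lines of present-day Python 3. This is only an illustration: parse_status and read_block are hypothetical helper names, and the input is a canned list of lines rather than a live socket. Following RFC 977, a text line that itself begins with a dot is sent with the dot doubled, so the reader strips one leading dot from data lines.

```python
def parse_status(line):
    """Split a server reply into (numeric status code, rest of line)."""
    return int(line[:3]), line[4:]

def read_block(lines):
    """Collect lines up to the terminating single dot.

    `lines` is any iterator over lines without line endings.  A leading
    dot is stripped from data lines, since the sender doubles it.
    """
    block = []
    for line in lines:
        if line == ".":
            break
        if line.startswith("."):
            line = line[1:]
        block.append(line)
    return block

# canned reply, standing in for what a server might send
reply = iter([
    "220 14220 <5qj8v5$8dd@news.spam.egg> article",
    "Subject: Re: Where's the bacon?",
    "",
    "..and a line that started with a dot",
    ".",
])

code, text = parse_status(next(reply))
body = read_block(reply)
print(code, len(body))
```

The status code is kept as an integer so that a caller can compare it against the codes it is prepared to accept.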

The server assigns a serial number to each message (in this case, the comp.lang.python newsgroup currently contains 367 messages, with numbers between 13887 and 14268), and it’s usually up to the client to keep track of which messages it has already seen.
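As a rough Python 3 sketch of that bookkeeping, the reply to GROUP can be split into its fields and combined with the last message number the client remembers. The names parse_group_reply and unseen_range are made up for this example, and the last-seen value 14210 is an assumption chosen to match the session above.

```python
def parse_group_reply(line):
    """Parse a '211 count first last name' reply to the GROUP command."""
    code, count, first, last, name = line.split()[:5]
    if code != "211":
        raise ValueError("unexpected reply: " + line)
    return int(count), int(first), int(last), name

def unseen_range(first, last, last_seen=None):
    """Return (lo, hi) for messages not yet seen, or None if there are none."""
    lo = first if last_seen is None else max(first, last_seen + 1)
    if lo > last:
        return None
    return lo, last

count, first, last, name = parse_group_reply(
    "211 367 13887 14268 comp.lang.python")

# assume the client has already seen everything up to 14210
print(unseen_range(first, last, last_seen=14210))   # (14211, 14268)
```

Note that the message count (367) is smaller than the size of the number range, since numbers are not reused when messages expire or are cancelled.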

News Message Format

We’ll implement an NNTP client class in a moment, but before we do that, let’s see what the news messages look like. Here’s a simple example:

Path: news.myisp.se!newsfeed.internetmci.com!news.spam.egg
From: erbath@spam.egg
Newsgroups: comp.lang.python
Subject: Re: Where's the bacon?
Date: 17 Jul 1999 09:25:53 -0400
Lines: 12
Sender: erbath@spam.egg
Message-ID: <lqsoxd95em.ach@news.spam.egg>
References: <199907152100.RAA14304@foobar.spam.egg>
Xref: news.spam.egg comp.lang.python:14304

Fredrik wrote:
> Haven't got a clue. Maybe someone else knows more.

You could check the list of contributed software at
www.python.org.

...

As in HTTP, the message starts with a list of headers, followed by an empty line, and the message body itself. Python’s standard library contains a module designed to represent messages like this. This module is named rfc822, after the Internet specification with the same name (the full name of which is Standard for the Format of ARPA Internet Text Messages, by the way).

RFC822 only specifies the general layout of the message; another specification, RFC1036, defines what headers to use in a news message.

<FIXME: header field summary: From, Date, Newsgroups, Subject, Message-ID, and Path>

The Message class defined in the rfc822 module takes a file handle, extracts the header fields, and leaves the file pointer positioned on the first line in the message, after the empty line. Basically, an instance of the Message class behaves like a dictionary of header fields, but also provides a set of utility functions and members.

The following code snippet reads a message from a file, and dumps the header dictionary to the screen:

import rfc822

fp = open("sample.news")

msg = rfc822.Message(fp)

for k, v in msg.items():
    print k, "=", v

If applied to the above example, this script prints something like:

path = news.myisp.se!newsfeed.internetmci.com!news.spam.egg
newsgroups = comp.lang.python
from = erbath@spam.egg
sender = erbath@spam.egg
xref = news.spam.egg comp.lang.python:14304
date = 17 Jul 1999 09:25:53 -0400
references = <199907152100.RAA14304@foobar.spam.egg>
lines = 12
message-id = <lqsoxd95em.ach@news.spam.egg>
subject = Re: Where's the bacon?
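The rfc822 module is no longer part of Python 3. For readers on a current interpreter, roughly the same thing can be done with the standard email package, as in the following sketch. The message text is an abridged copy of the example above, held in a string instead of a file. Unlike the listing above, items() here keeps the original header order and capitalisation, so the names are lowercased by hand.

```python
import email

raw = """\
Path: news.myisp.se!newsfeed.internetmci.com!news.spam.egg
From: erbath@spam.egg
Newsgroups: comp.lang.python
Subject: Re: Where's the bacon?
Lines: 12
Message-ID: <lqsoxd95em.ach@news.spam.egg>

Fredrik wrote:
> Haven't got a clue. Maybe someone else knows more.
"""

msg = email.message_from_string(raw)

# header lookup is case-insensitive, as with rfc822.Message
for name, value in msg.items():
    print(name.lower(), "=", value)

# for a plain (non-multipart) message, the payload is the body text
body = msg.get_payload()
```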

Sending Binary Data via News

The RFC822 specification (published in 1982) explicitly specifies that only 7-bit US ASCII characters can be used in news messages (it also applies to mail, something we will discuss later in this chapter). Nevertheless, binary files can be posted anyway, by first encoding them using one of the following methods:

  1. Use the Unix uuencode utility to encode the data.
  2. Use the Multipurpose Internet Mail Extensions (MIME) encoding standard. The base64 encoding scheme in particular is becoming popular as a slightly more convenient alternative to uuencode.
  3. [FIXME: Use the yEnc format]

In both uuencode and base64, each group of 3 data bytes is converted to 4 ASCII characters, storing 6 bits of original data in each character. While uuencode stores each 6-bit value as chr(value+32), the base64 encoding uses a character table designed to minimize the risk of errors if the message is to be converted to other character sets. Python’s standard library supports both formats, via the uu and base64 modules, and a low-level support module called binascii.
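To make the 3-to-4 conversion concrete, here is a small worked example in Python 3. It splits the three bytes of the arbitrary input b"Cat" into four 6-bit values by hand, maps them with both schemes, and then checks the result against the binascii and base64 modules. The helper name sixbit_groups is invented for this sketch. (Many uuencoders write the value 0 as a backtick instead of a space; that case does not occur in this input.)

```python
import base64
import binascii

B64_TABLE = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "abcdefghijklmnopqrstuvwxyz"
             "0123456789+/")

def sixbit_groups(three_bytes):
    """Split exactly 3 bytes into four 6-bit values."""
    n = int.from_bytes(three_bytes, "big")
    return [(n >> shift) & 63 for shift in (18, 12, 6, 0)]

values = sixbit_groups(b"Cat")
uu_chars = "".join(chr(v + 32) for v in values)
b64_chars = "".join(B64_TABLE[v] for v in values)

print(values, uu_chars, b64_chars)   # [16, 54, 5, 52] 0V%T Q2F0

# the library encoders agree; b2a_uu adds a length character in
# front ("#" for 3 bytes) and a newline at the end
assert binascii.b2a_uu(b"Cat") == b"#" + uu_chars.encode("ascii") + b"\n"
assert base64.b64encode(b"Cat") == b64_chars.encode("ascii")
```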

The uuencode format is line-oriented, and the encoded data starts with a begin line, which also contains the Unix file mode (in octal), and the original filename. Then follow the encoded lines (the first character gives the number of bytes encoded on the rest of the line, and is usually an “M” for a full line of 45 binary bytes), and the encoded block ends with a line containing the word end. Here’s an example:

begin 600 can.jpg
M_]C_X `02D9)1@`!``$`4P!3``#__@`752U,96%D(%-Y<W1E;7,L($EN8RX`
M_]L`A `#`@("`@(#`@("`P,#`P0(!00$! 0)!P<%" L*# P+"@L+# X2#PP-
M$0T+"Q 5$!$3$Q04% P/%A@6%!@2%!03`0,#`P0$! D%!0D3#0L-$Q,3$Q,3
... typically a few hundred similar lines ...
M?E3;Y52UNG1$5E2,`A1QT_7W]SZFL8?"O4N"3C)LBTHEW ?YL<#=SCGMZ=!^
M50M-*NH_*Y3##&WC'TQT_P#U53BN9JQ7*K19J:ZB0PV3Q*(RZ$ML&,G*GM]?
=Y#L*S)I9$E9%D8!20,GIS6>'2:5T;Q24I6\@_]FB
`
end
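The two pieces of bookkeeping in this format, the begin line and the length character, can be examined with a short Python 3 sketch. The names parse_begin and line_info are hypothetical, and the regular expression is an assumption about what a begin line looks like, not a complete validator.

```python
import re

BEGIN = re.compile(r"begin ([0-7]+) (.+)")

def parse_begin(line):
    """Return (mode, filename) for a uuencode begin line, else None."""
    m = BEGIN.match(line)
    if not m:
        return None
    return int(m.group(1), 8), m.group(2)

def line_info(line):
    """Return (data bytes, encoded characters needed) for a uuencoded line."""
    nbytes = (ord(line[0]) - 32) & 63
    nchars = (nbytes + 2) // 3 * 4     # every 3 bytes become 4 characters
    return nbytes, nchars

mode, filename = parse_begin("begin 600 can.jpg")
print(oct(mode), filename)

# "M" marks a full line, "=" is the short last line in the example,
# and a lone backtick is a line that carries no data at all
print(line_info("M"), line_info("="), line_info("`"))
```

For the example above, this reports 45 bytes in 60 characters for the “M” lines and 29 bytes in 40 characters for the line starting with “=”, which matches the length of that line.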

The MIME format is a bit different; it uses special message headers to indicate what the message contains, and how it is encoded. If the message header contains a field named MIME-Version, the document is encoded using the MIME specification. We’ll get back to MIME and base64-encoding later in this chapter, when we take a closer look at how to send and receive images and other documents via electronic mail.

Decoding uuencoded messages

To figure out if a message contains uuencoded data, we need to scan the message body for a line starting with begin, followed by a number and a filename. We can then use the binascii module to convert each line to a chunk of binary data, and write it to a file, or, as in the following example, store it in a list. The getuubody function shown below also returns the filename. If the message is not encoded, this function sets the filename to None, and returns the message body as is.

Example: extract uuencoded data (from messageutils.py)
import binascii, regex, string

begin = regex.compile("begin [0-9]+ \(.*\)")

def getuubody(msg):
    "Given a uuencoded message, extract and decode the message body"

    msg.rewindbody()

    while 1:

        s = msg.fp.readline()
        if not s:
            break

        if begin.match(s) > 0:

            # decode uuencoded message body

            body = []
            file = begin.group(1)

            for s in msg.fp.readlines():
                if s[:3] == "end":
                    break
                try:
                    body.append(binascii.a2b_uu(s))
                except binascii.Error:
                    # workaround for broken encoders: keep the length
                    # character plus the characters needed for the data
                    bytes = (((ord(s[0])-32) & 63) * 4 + 5) / 3
                    body.append(binascii.a2b_uu(s[:bytes]))

            return file, string.join(body, "")

    msg.rewindbody()

    return None, msg.fp.read()

Note that some encoders add extra padding characters to lines containing fewer than 45 bytes of binary data. In earlier versions of Python, the binascii module raises an exception if it stumbles upon such a line; the above try/except clause works around this problem by explicitly truncating the line to the appropriate length.
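The same idea can be tried on its own in Python 3, where a2b_uu works on bytes. The sketch below is not the code from messageutils.py; decode_uu_line is a hypothetical helper, and the test input is the last data line of the can.jpg listing shown earlier. The function first tries the line as is, and only falls back to truncation if binascii complains, so it gives the same result whether or not the decoder accepts the line.

```python
import binascii

def decode_uu_line(line):
    """Decode one uuencoded line (bytes), tolerating trailing padding."""
    try:
        return binascii.a2b_uu(line)
    except binascii.Error:
        # keep the length character, plus only as many characters
        # as are needed to hold the announced number of bytes
        nbytes = (line[0] - 32) & 63
        keep = 1 + (nbytes * 4 + 2) // 3
        return binascii.a2b_uu(line[:keep])

# last data line of the can.jpg example (note the escaped backslash)
last = b"=Y#L*S)I9$E9%D8!20,GIS6>'2:5T;Q24I6\\@_]FB\n"

data = decode_uu_line(last)
print(len(data), data[-2:])   # 29 b'\xff\xd9'
```

The two final bytes, FF D9, are the marker that ends a JPEG file, which is a good sign that the line was decoded correctly.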

[FIXME: explain why uu.py cannot be used: it assumes that the file is already positioned on the begin line, and it doesn’t handle offending encoders well either (this will probably be fixed in binascii in 1.5 final)]

An NNTP Client Library

Creating a client library for the NNTP protocol is a straightforward task. Again, the SimpleClient class takes care of the socket configuration issues, and provides getline and putline primitives.

The code shown here includes only a minimal set of commands: list to get a list of newsgroups available on the server, group to select which group to read, overview to get an overview of all or some messages in a group, and retrieve to read a given message. The overview method uses an NNTP command called XOVER, which is an extension to the original NNTP protocol. Virtually every modern news server supports this command, though, and some news clients won’t work without it. The retrieve method uses either HEAD, BODY, or ARTICLE to read parts or all of a message. The default is ARTICLE, which reads both headers and body in a single call.

Example: File: NNTPClient.py

from string import *
import SimpleClient

ARTICLE, HEAD, BODY = tuple(range(3))

class NNTPClient(SimpleClient.SimpleClient):

    def __init__(self, host, port = 119):

        # connect
        SimpleClient.SimpleClient.__init__(self, host, port)

        s, self.welcome = self.getstatus()
        if s not in [200, 201, 205]:
            raise IOError, (s, "NNTP connection error", self.welcome)

        self.may_post = (s == 200)
        self.must_login = (s == 205)

    def close(self):
        "Quit."
        try:
            stat = self.command(None, "QUIT")
        except IOError:
            pass
        # self.destroy()

    def command(self, ok, *args):
        self.putline(join(args))
        s, m = self.getstatus()
        if ok and s not in ok:
            raise IOError, (s, args[0]+" command failed", m)
        return m

    def getstatus(self):
        info = self.getline()
        return atoi(info[:3]), info

    def getmessage(self, newline = ""):
        text = []
        while 1:
            s = self.getline()
            if s[:1] == ".":
                s = s[1:]
                if not s:
                    break
            text.append(s + newline)

        return text

    def _range(self, lo, hi):
        if hi is None:
            return str(lo)
        return "%s-%s" % (lo, hi)

    #
    # NNTP commands (subset)

    def group(self, group):
        "Select group.  Returns number of messages, range, and group name."
        m = split(self.command([211], "group", group))
        self.groupinfo = group, (atoi(m[2]), atoi(m[3]))
        return (atoi(m[1]),                     # number of messages (est.)
                atoi(m[2]), atoi(m[3]),         # message number range
                m[4])                           # group name

    def list(self):
        "List groups.  Returns list of (group, lo, hi, may_post) tuples"
        self.command([215], "LIST")
        data = []
        for s in self.getmessage():
            s = split(s)
            data.append((s[0],                  # group name
                         atoi(s[1]), atoi(s[2]),# message number range
                         s[3] in "yY"))         # may post
        return data

    def overview(self, lo, hi = None):
        "Get message overview (extension)."
        self.command([224], "XOVER", self._range(lo, hi))
        data = []
        for s in self.getmessage():
            s = split(s, "\t")
            data.append((atoi(s[0]),            # message number
                         s[1],                  # subject
                         s[2],                  # from
                         s[3],                  # date
                         s[4],                  # message id
                         tuple(split(s[5])),    # references
                         atoi(s[6]),            # byte count
                         atoi(s[7])))           # line count
        return data

    def retrieve(self, msgid, mode = ARTICLE):
        "Get article (mode argument controls which part)"
        if mode == HEAD:
            self.command([221], "HEAD", str(msgid))
        elif mode == BODY:
            self.command([222], "BODY", str(msgid))
        else:
            self.command([220], "ARTICLE", str(msgid))
        return self.getmessage("\n")

Messages are returned as a list of strings, where each string ends with a newline. In this way, messages obtained via retrieve look like messages read from a file using readlines.
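The tuples returned by the overview method are positional, which is easy to get wrong in calling code. As one possible alternative, sketched here in Python 3, each tab-separated XOVER line could be turned into a named tuple with the same eight fields in the same order. The Overview type and parse_overview_line are assumptions for illustration, not part of NNTPClient.py, and the byte count in the sample line is invented.

```python
from collections import namedtuple

Overview = namedtuple(
    "Overview",
    "number subject sender date message_id references bytes lines")

def parse_overview_line(line):
    """Parse one tab-separated XOVER line into an Overview tuple."""
    f = line.split("\t")
    return Overview(
        number=int(f[0]),
        subject=f[1],
        sender=f[2],
        date=f[3],
        message_id=f[4],
        references=tuple(f[5].split()),
        bytes=int(f[6]),
        lines=int(f[7]),
    )

# sample line built from the example message; the size is made up
sample = "\t".join([
    "14304", "Re: Where's the bacon?", "erbath@spam.egg",
    "17 Jul 1999 09:25:53 -0400", "<lqsoxd95em.ach@news.spam.egg>",
    "<199907152100.RAA14304@foobar.spam.egg>", "1024", "12",
])

info = parse_overview_line(sample)
print(info.number, info.subject, info.bytes)
```

Because a named tuple is still a tuple, code that unpacks the eight fields positionally keeps working.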

An NNTP Robot

The following example uses the NNTPClient module to download messages from a news server. It fetches overview information from the server (including the From and Subject header fields, and size information), passes that information to a user-defined filter function, and downloads messages as indicated by the filter. The messages are stored in files named group-serial.news. [FIXME: redesign NNTPClient so it returns Article instances, and move the processing into that class.]

Example: File: newsrobot.py

#
# user configuration

HOST  = "news.spam.egg"
GROUP = "alt.binaries.pictures.bacon"

def messagefilter(info):

    serial, subject, _from, date, msgid, ref, bytes, lines = info

    # assume everything larger than 10k is an image, but don't
    # download things larger than 60k

    return 10000 <= bytes <= 60000

#
# main program

import NNTPClient
import string

nntp = NNTPClient.NNTPClient(HOST)

count, lo, hi, name = nntp.group(GROUP)

# get last message number, if saved

try:
    fp = open(GROUP + ".last")
    lo = max(lo, string.atoi(fp.readline())+1)
    fp.close()
except (IOError, ValueError):
    pass # scan whole group

# loop over new messages

serial = lo - 1 # last message seen, in case there are no new ones

for info in nntp.overview(lo, hi):

    serial = info[0]

    if messagefilter(info):

        print "fetching", info[2], "(%d bytes)" % info[6]

        message = nntp.retrieve(serial)

        fp = open("%s-%d.news" % (GROUP, serial), "w")
        fp.writelines(message)
        fp.close()

nntp.close()

# store last message number

try:
    fp = open(GROUP + ".last", "w")
    fp.write(str(serial) + "\n")
    fp.close()
except IOError:
    pass

Note that we store the last message number seen in a file named group.last, to avoid downloading the same messages over and over again. To start all over again, for example if you change the filter, simply remove that file.
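The bookkeeping around the .last file can also be isolated in two small functions. The following Python 3 sketch uses the hypothetical names load_last and save_last. Writing to a temporary file and renaming it into place is a design choice of this sketch, not something the robot above does; it is meant to avoid leaving a half-written file behind if the program is interrupted.

```python
import os
import tempfile

def load_last(path, default):
    """Return the saved message number, or `default` if none is usable."""
    try:
        with open(path) as fp:
            return int(fp.readline())
    except (OSError, ValueError):
        return default

def save_last(path, serial):
    """Write to a temporary file first, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fp:
        fp.write("%d\n" % serial)
    os.replace(tmp, path)

# demonstration in a scratch directory
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "comp.lang.python.last")
    before = load_last(path, 13886)    # no file yet: default is used
    save_last(path, 14268)
    after = load_last(path, 13886)

print(before, after)   # 13886 14268
```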

[FIXME: instead of storing the raw message to disk, this code should call the getuubody method and store the message body in the “incoming” directory]