Sandbox: SourceForge Tools (In Progress)

Fredrik Lundh | April 2006

Note: The sourceforge page layout was changed slightly after the first version of these tools was released. We’re working on a new version, but if you want to use the tools to experiment with older tracker snapshots, you need version 200604. See below.

The sourceforge sandbox contains a set of simple tools to download and process sourceforge tracker items.

You need either Python 2.4 and the ElementTree library (cElementTree is recommended), or Python 2.5 (which ships with cElementTree).
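
The requirement above can be handled with a fallback import. This is a minimal sketch, not taken from the tools themselves; the alias ET is only a convention, and the last branch is there so the snippet also runs on Python 3, where the separate C module no longer exists.

```python
# Try the fastest available ElementTree implementation first.
try:
    import cElementTree as ET  # standalone C module (add-on for Python 2.4)
except ImportError:
    try:
        import xml.etree.cElementTree as ET  # bundled with Python 2.5
    except ImportError:
        import xml.etree.ElementTree as ET  # same API under the Python 3 name

# Quick check that the chosen module parses a small document.
root = ET.fromstring("<item id='1'><summary>example</summary></item>")
print(root.findtext("summary"))
```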

To run the download tools, you also need the tidy utility.

Current Version (200608, Work in Progress)

 

To download the current version of the tools, use Subversion:

$ svn co http://svn.effbot.python-hosting.com/stuff/sandbox/sourceforge

Previous Version (200604)

This version is compatible with the sourceforge tracker layout used in April 2006.

$ svn co http://svn.effbot.python-hosting.com/tags/sourceforge-200604/

A snapshot of the Python tracker data from April 2006 can be downloaded here:

Tracker Datasets

 

Tracker data is represented as a set of files in a tracker directory. For each tracker item, there are at least two files:

    tracker-TTT/item-NNN.xml (index information, created by getindex.py)
    tracker-TTT/item-NNN-page.xml (xhtml pages, created by getpages.py)

where TTT is the tracker identifier, and NNN is the item identifier.

 

For items that have attached files, there are also one or more

    tracker-TTT/item-NNN-data-MMM.dat (data files, created by getfiles.py)

files, where MMM is a file identifier (referred to by the page files). Each data file consists of a copy of the HTTP header (which includes content-type and content-disposition headers), followed by an empty line and the actual data.
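
A data file laid out this way can be split back into headers and payload. The following is a rough sketch in Python 3, not code from the tools. The function name split_data_file is made up for this example, and it assumes that the header block ends at the first blank line and that line endings may be either CRLF or LF.

```python
def split_data_file(raw):
    """Split the bytes of a data file into (headers, body)."""
    # Assumption: the header block ends at the first blank line, written
    # with either CRLF or bare LF line endings.
    hits = [(raw.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    hits = [(pos, sep) for pos, sep in hits if pos >= 0]
    if not hits:
        return {}, raw  # no blank line found: treat everything as data
    pos, sep = min(hits)  # earliest separator wins
    head, body = raw[:pos], raw[pos + len(sep):]
    headers = {}
    for line in head.splitlines():
        name, colon, value = line.partition(b":")
        if colon:  # lines without a colon (e.g. a status line) are skipped
            key = name.strip().lower().decode("latin-1")
            headers[key] = value.strip().decode("latin-1")
    return headers, body


# Hypothetical sample; real files come from getfiles.py.
sample = (b"Content-Type: text/plain\r\n"
          b"Content-Disposition: attachment; filename=\"x.patch\"\r\n"
          b"\r\n"
          b"hello\n\nworld")
headers, body = split_data_file(sample)
```

Header names are lower-cased so that lookups such as headers["content-type"] do not depend on how the server spelled them.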

Note that the datasets contain complete HTML pages. This lets you fix bugs in the extraction tools without having to reload everything again (or download large existing datasets).

Processing Tracker Datasets

To process tracker datasets, use the extract module to extract relevant information from item-NNN-page.xml files. See the export scripts for examples:

csv-export.py

A simple dataset to CSV exporter.

xml-export.py

A simple dataset to XML exporter. The resulting XML file contains all data from the tracker dataset, including attached files (stored as BASE64-encoded blocks).
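
To illustrate the idea of storing attached files as BASE64-encoded blocks, here is a small sketch in Python 3. It does not reproduce the format written by xml-export.py; the element name "file", the attributes "id" and "encoding", and both function names are assumptions chosen for the example.

```python
import base64
import xml.etree.ElementTree as ET


def attach_file(item_elem, file_id, raw):
    """Add raw bytes to an item element as a BASE64 text block."""
    # Hypothetical layout: <file id="..." encoding="base64">...</file>
    elem = ET.SubElement(item_elem, "file", id=str(file_id), encoding="base64")
    elem.text = base64.b64encode(raw).decode("ascii")
    return elem


def read_file(file_elem):
    """Decode the bytes stored by attach_file."""
    return base64.b64decode(file_elem.text)


item = ET.Element("item", id="1")
attach_file(item, 7, b"\x00\x01 binary patch data")
serialized = ET.tostring(item)  # bytes that are safe to write to an XML file
restored = read_file(ET.fromstring(serialized).find("file"))
```

Encoding the data this way keeps arbitrary binary attachments from breaking the XML, at the cost of roughly a third more space.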

 

More export scripts, bug fixes, and other contributions are welcome.

Downloading and Updating Tracker Datasets

To download tracker datasets, run 'init' to set things up, and use the
getindex/getpages/getfiles scripts to download items.

* init

The 'init' script is used to select which tracker to download.  It asks
for a tracker "group id".  To get the group id for your project, check
the URL for the tracker homepage.  If you press return, the group id
defaults to 5470, which is the group id for the Python tracker.

The 'init' script downloads the tracker homepage, and creates tracker
directories for the individual trackers used by the given project.

    $ python init.py

    enter sourceforge tracker group id [5470]: 1234

    --- create tracker-123456

You only have to run the 'init' script once for each project.
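
If you prefer to pick the group id out of the tracker URL programmatically, something like the following Python 3 sketch would do. It is not part of init.py. It assumes that the id appears as a group_id query parameter, and the function name and sample URL are made up for the example.

```python
from urllib.parse import urlparse, parse_qs


def group_id_from_url(url, default="5470"):
    """Return the group_id query parameter, or the default if absent."""
    # Assumption: the tracker homepage URL carries group_id=NNN in its query.
    values = parse_qs(urlparse(url).query).get("group_id")
    return values[0] if values else default


print(group_id_from_url("http://sourceforge.net/tracker/?group_id=1234&atid=123456"))
```

The default mirrors the behaviour described above, where pressing return selects the Python tracker.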

* getindex

The 'getindex' script parses the tracker index, and creates item
files that contain overview information from the index pages.
Usage:

    $ python getindex.py tracker-123456 [offset]

If the offset is omitted, the parser starts at offset 0, and keeps
going until it gets an index page for which all items have already
been downloaded.  If an offset is given, the parser keeps going until
it cannot find any more items.
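
The two stopping rules can be sketched as a loop. This is an illustration in Python 3, not the code in getindex.py: fetch_page, have_item and save_item are hypothetical callables, and the sketch assumes that the offset advances by the number of items on each page and that only new items are written.

```python
def crawl_index(fetch_page, have_item, save_item, offset=None):
    """Walk index pages and save new items; return how many were saved."""
    stop_when_known = offset is None  # no offset: stop at a fully known page
    pos = offset or 0
    saved = 0
    while True:
        items = fetch_page(pos)  # list of dicts with at least an "id" key
        if not items:
            break  # ran out of index pages
        new = [item for item in items if not have_item(item["id"])]
        if stop_when_known and not new:
            break  # every item on this page was already downloaded
        for item in new:
            save_item(item)
            saved += 1
        pos += len(items)  # assumption: offsets count items, not pages
    return saved
```

With no offset, a second run over an unchanged tracker stops after the first page. With an explicit offset, the loop keeps going until a page comes back empty.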

You can use the output from 'getindex' to generate tracker statistics.
To get more information about the items, use the 'getpages' and
'getfiles' scripts.

* getpages

The 'getpages' script looks for item files, and downloads missing page
files.

    $ python getpages.py tracker-123456

To refresh the page files, remove them from the tracker directory, and
run the 'getpages' script again.

    $ rm tracker-123456/*-page.xml
    $ python getpages.py tracker-123456
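
To see which items would be fetched before running 'getpages', you can compare the file names in the tracker directory. The sketch below is in Python 3 and is not taken from getpages.py; the function name is made up, and it assumes that item identifiers are numeric.

```python
import re

ITEM_RE = re.compile(r"^item-(\d+)\.xml$")  # assumption: NNN is numeric


def missing_pages(names):
    """Return ids of items that have an index file but no page file."""
    names = set(names)
    missing = []
    for name in sorted(names):
        match = ITEM_RE.match(name)
        if match and "item-%s-page.xml" % match.group(1) not in names:
            missing.append(match.group(1))
    return missing


listing = ["item-1.xml", "item-1-page.xml", "item-2.xml", "item-3.xml"]
print(missing_pages(listing))
```

In practice the list of names would come from something like os.listdir("tracker-123456").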

* getfiles

The 'getfiles' script, finally, looks for download links in the
page files, and downloads missing data files.

    $ python getfiles.py tracker-123456

* status

The 'status' script can be used to get a download status summary:

    $ python status.py
    tracker-123456
        6682 items
        6682 pages (100%)
        1912 files
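
A summary like the one above can be derived from the file names alone. This is a sketch in Python 3 rather than the code in status.py; the function name and patterns are assumptions based on the file naming described earlier, and the percentage is reported as 0 for an empty tracker to avoid dividing by zero.

```python
import re

ITEM = re.compile(r"^item-\d+\.xml$")
PAGE = re.compile(r"^item-\d+-page\.xml$")
DATA = re.compile(r"^item-\d+-data-.+\.dat$")


def summarize(names):
    """Return (items, pages, files, percent of items that have a page)."""
    items = pages = files = 0
    for name in names:
        if ITEM.match(name):
            items += 1
        elif PAGE.match(name):
            pages += 1
        elif DATA.match(name):
            files += 1
    percent = 100 * pages // items if items else 0  # guard empty trackers
    return items, pages, files, percent


print(summarize(["item-1.xml", "item-1-page.xml", "item-2.xml",
                 "item-1-data-9.dat", "README"]))
```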

Comment:

These tools proved very useful to me. I did have to make some changes to get them to work with SourceForge as of today. Here are the diffs. This isn't the best way to pass along diffs, so I'll break them into separate comments, one for each file, along with the rationale for the change.

Posted by Robert (2007-02-14)

Comment:

The current comments are formatted as a set of headers followed by a blank line and then the text of the comment. This change to getcomment() sets description to just the body portion.

Changed getchangenote() to strip the whitespace from the fields.

The current version of SourceForge embeds a form to add a comment within the description table data tag. I added a removeForm function to remove that from the description.

The description isn't the only table data item that has colspan=2. The first row has a monitor button that also has a colspan=2. I added a test to make sure we weren't using the first row of the table for the description.

I cleaned up the whitespace handling for the description.

The current portion of the table with Changes and Followups has a p tag between the td tag and the table tag so I changed the pattern to allow an optional p tag.

--- extract.py	(revision 453)
+++ extract.py	(working copy)
@@ -15,11 +15,13 @@
 
 def getcomment(elem):
     text = gettext(elem).strip()
-    data = dict(description=text)
+    data = dict()
     sender = user_id = None
+    desc = ""
+    in_body = 0
     for line in text.split("\n"):
         if not line:
-            break
+            in_body = 1
         if line.startswith("Date:") and not data.has_key('date'):
             data["date"] = line[5:].strip() # Only get the first
                                             # date. There's at least
@@ -31,14 +33,17 @@
             data["sender"] = line[7:].strip()
         elif line.startswith("user_id="):
             data["sender_user_id"] = line[8:].strip()
+        elif in_body:
+            desc += line + "\n"
+    data["description"] = desc.strip()
     return data
 
 def getchangenote(elem):
     c = elem.getchildren()
-    return {'field':gettext(c[0]),
-           'oldvalue':gettext(c[1]),
-           'date':gettext(c[2]),
-           'change_by':gettext(c[3])}
+    return {'field':gettext(c[0]).strip(),
+           'oldvalue':gettext(c[1]).strip(),
+           'date':gettext(c[2]).strip(),
+           'change_by':gettext(c[3]).strip()}
 
 
 KEYMAP = {
@@ -64,6 +69,20 @@
     "Summary:": "summary",
 }
 
+def removeForm(elem):
+    out = []
+    for e in elem:
+        if e.tag == "form":
+            if e.tail:
+                if out:
+                    out[-1].tail += e.tail
+                else:
+                    elem.text += e.tail
+        else:
+            removeForm(e)
+            out.append(e)
+    elem[:] = out
+
 ##
 # Extracts information for a tracker item, based on the contents of the
 # 'page' file.
@@ -95,19 +114,19 @@
     table = elem.find("table")
 
     # locate the description
-    for tr in table:
-        if len(tr) == 1 and tr[0].get("colspan") == "2":
+    for i, tr in enumerate(table):
+        if len(tr) == 1 and tr[0].get("colspan") == "2" and i > 0:
             # map <br> to newlines
             for br in tr.findall(".//br"):
                 br.text = chr(0) # temporarily use NULL as line terminator
                 if br.tail and br.tail.startswith("\n"):
                     br.tail = br.tail[1:] # trip extra newlines
+            removeForm(tr)
             text = gettext(tr)
-            if text.startswith("\n\n\t\t\t"):
-                text = text[5:]
+            text = text.replace("\r", "")
             text = text.replace("\n", " ")
             text = text.replace(chr(0), "\n")
-            text = text.rstrip()
+            text = text.strip()
             result["description"] = text
             tr.clear()
             break
@@ -128,12 +147,12 @@
         elif td and td[0].tag == "h3":
             key = gettext(td[0]).strip()
             if key == "Followups:":
-                for i, e in enumerate(td.findall("table/tr/td")):
+                for i, e in enumerate(td.findall(".//table/tr/td")):
                     if i:
                         data = getcomment(e)
                         result.setdefault("comments", []).append(data)
             elif key == "Changes:":
-                for i, e in enumerate(td.findall("table/tr")[1:]):
+                for i, e in enumerate(td.findall(".//table/tr")[1:]):
                     data = getchangenote(e)
                     result.setdefault("changes", []).append(data)
             # nuke table contents

Posted by Robert (2007-02-14)
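
One thing to watch for in a helper like removeForm: in ElementTree, the text and tail attributes are None when there is no text, so appending to them with += can raise a TypeError. The following is a None-safe variant, written as a standalone Python 3 sketch under the name remove_forms. It is a suggestion, not part of the posted patch.

```python
import xml.etree.ElementTree as ET


def remove_forms(elem):
    """Drop <form> children recursively, keeping the text that follows them."""
    kept = []
    for child in list(elem):
        if child.tag == "form":
            if child.tail:
                if kept:
                    # attach the orphaned tail to the previous kept sibling
                    kept[-1].tail = (kept[-1].tail or "") + child.tail
                else:
                    # no previous sibling: the tail becomes parent text
                    elem.text = (elem.text or "") + child.tail
        else:
            remove_forms(child)
            kept.append(child)
    elem[:] = kept


# Hypothetical fragment; real input comes from the page files.
td = ET.fromstring("<td><form>x</form>after<b>bold</b><form>y</form> end</td>")
remove_forms(td)
print(ET.tostring(td))
```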

Comment:

I'm using Python 2.4 and got an error without the change below.

Index: init.py
===================================================================
--- init.py	(revision 453)
+++ init.py	(working copy)
@@ -25,7 +25,7 @@
 trackers = []
 for elem in page.getiterator("a"):
     href = elem.get("href")
-    if "func=browse" in href:
+    if "func=browse" in str(href):
         m = re.search("atid=(\d+)", href)
         if m:
             trackers.append((m.group(1), htmlload.gettext(elem).strip()))

Posted by Robert (2007-02-14)
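
The error comes from anchors without an href attribute, for which elem.get("href") returns None. An explicit check is an alternative to the str() call. The sketch below is standalone Python 3, not the code in init.py; the function name and sample links are made up for the example.

```python
import re


def tracker_ids(hrefs):
    """Collect atid values from browse links, skipping missing hrefs."""
    ids = []
    for href in hrefs:
        if href and "func=browse" in href:  # href is None for bare anchors
            match = re.search(r"atid=(\d+)", href)
            if match:
                ids.append(match.group(1))
    return ids


links = [None, "/tracker/?func=browse&group_id=5470&atid=105470", "/other"]
print(tracker_ids(links))
```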

Comment:

I fixed a divide by zero error when there are no tracker items.

Index: status.py
===================================================================
--- status.py   (revision 453)
+++ status.py   (working copy)
@@ -21,7 +21,11 @@
             else:
                 pass # print file, "?"
         print "   ", ni, "items"
-        print "   ", np, "pages", "(%d%%)" % (100 * np / ni)
+        if ni != 0:
+            pct = (100 * np / ni)
+        else:
+            pct = 0
+        print "   ", np, "pages", "(%d%%)" % pct
         print "   ", nf, "files"
     else:
         pass # full status to be added

Posted by Robert (2007-02-14)

Comment:

SourceForge is embedding comments related to the Google ads inside the data, so I added code to strip out any comments.

Some of the data fields have HTML-escaped characters like &quot; for quote, etc. So I call unescape on the return value from gettext().

Index: htmlload.py
===================================================================
--- htmlload.py	(revision 453)
+++ htmlload.py	(working copy)
@@ -34,13 +34,14 @@
 
 def load(file):
     def emit(soup):
-        if isinstance(soup, BS.NavigableString):
-            bob.data(unescape(soup))
-        else:
-            bob.start(soup.name, dict((k, unescape(v)) for k, v in soup.attrs))
-            for s in soup:
-                emit(s)
-            bob.end(soup.name)
+        if not isinstance(soup, BS.Comment):
+            if isinstance(soup, BS.NavigableString):
+                bob.data(unescape(soup))
+            else:
+                bob.start(soup.name, dict((k, unescape(v)) for k, v in soup.attrs))
+                for s in soup:
+                    emit(s)
+                bob.end(soup.name)
     # determine encoding (the document charset is not reliable)
     text = open(file).read()
     try:
@@ -67,7 +68,7 @@
         text += gettext(e)
         if e.tail:
             text += e.tail
-    return text
+    return unescape(text)
 
 ##
 # Download URL.

Posted by Robert (2007-02-14)

Comment:

The timestamp is preceded by an asterisk if the tracker item is more than 45 days old, so I added code to strip the asterisk as well as the whitespace.

The whitespace wasn't being stripped from the priority.

Index: getindex.py
===================================================================
--- getindex.py	(revision 453)
+++ getindex.py	(working copy)
@@ -44,8 +44,8 @@
             id = row[0].text.strip(),
             link = row[1][0].get("href").strip(),
             description = gettext(row[1][0]).strip(),
-            timestamp = gettext(row[2]).strip(),
-            priority = row[3].text,
+            timestamp = gettext(row[2]).strip("* \r\n\f\t\v"),
+            priority = row[3].text.strip(),
             status = row[4].text,
             assigned_to = gettext(row[5]),
             submitted_by = gettext(row[6]),

Posted by Robert (2007-02-14)
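
For reference, passing a set of characters to strip() removes any of them from both ends of the string and leaves the interior alone. Here is a small Python 3 sketch of the timestamp cleanup; the function name and the sample value are invented, and the real timestamp format may differ.

```python
def clean_timestamp(text):
    """Strip surrounding whitespace and the leading 'old item' asterisk."""
    # Characters are removed from both ends only; inner spaces survive.
    return text.strip("* \r\n\f\t\v")


print(clean_timestamp(" * 2006-04-01 12:34\n"))
```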
