Some Notes on Tim Bray's Wide Finder Benchmark

Fredrik Lundh | Updated October 12, 2007 | Originally posted October 6, 2007

The Problem #

Tim Bray recently posted about his experiences from using Erlang to do some straightforward parsing of a large log file, inspired by a chapter he wrote for the book Beautiful Code. As it turned out, Erlang isn’t exactly optimized for tasks like this. After trying to parse a 1,000,000-line log file, Tim notes:

My first cut in Erlang, based on the per-process dictionary, took around eight minutes of CPU, and kept one of my MacBook’s Core Duo processors pegged at 97% while it was running. Ouch!

That’s less than a half megabyte per second. Not very impressive. Let’s see if we can come up with something better in Python.

A Single-Threaded Python Solution #

Santiago Gala followed up on Tim’s original post with a nice map/reduce-based implementation in Python:

http://memojo.com/~sgala/blog/2007/09/29/Python-Erlang-Map-Reduce

Santiago’s script uses a series of nested generators to do filtering and mapping, and then uses a for-in-loop to reduce the mapped stream into a dictionary.

To benchmark the script, I created a sample by concatenating 100 copies of Tim’s original 10,000-line sample file. With that file, Santiago’s script needs about 6.7 seconds wall-time to parse 200 megabytes of log data on my Core Duo laptop (using Windows XP, warmed-up disk caches, and the final print statement replaced with a pass).

 

Tim’s 1.67 GHz Core Duo L2400 MacBook should match my 1.66 GHz Core Duo T2300 HP notebook pretty well, so that’s about 70 times faster than his Erlang program, and about twice as fast as his Ruby version. Not too shabby.

But we can speed things up a bit more, of course.

Compiling the RE #

Python’s RE engine caches compiled expressions, but it’s usually a good idea to move the cache lookup out of the inner loop anyway. And while we’re at it, we can move the method lookup out of the loop as well:

pat = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "
search = re.compile(pat).search

matches = (search(line) for line in file("o10k.ap"))

With these changes, the script finishes in 4.1 seconds.

Skipping lines that cannot match #

Somewhat less obvious is the fact that we can use Python’s in operator to filter out lines that cannot match:

matches = (search(line) for line in file("o10k.ap")
    if "GET /ongoing/When" in line)
 

The RE engine does indeed use special code for literal prefixes, but the sublinear substring search algorithm that was introduced in 2.5 is a lot faster in cases like this, so this simple change gives a noticable speedup; the script now runs in 2.9 seconds.

Reading files in binary mode (Windows) #

On Windows (and in theory, on other platforms that distinguish between text files and binary files), data read via the standard file object are scanned for Windows-style line endings (“\r\n”). Any such character combination is then translated to a single newline, for consistency.

This is of course very convenient, since it allows you to treat text files in the same way no matter what platform you’re on, but on files this large, the performance penality is starting to get noticable.

We can turn this off simply by passing in the “rb” flag (read binary) to the open function.

matches = (search(line) for line in file("o10k.ap", "rb")
    if "GET /ongoing/When" in line)

The file object will still break things up in lines, and our code doesn’t look at the line endings, so we still get the same result. Just a bit quicker.

The Code #

Here’s the final version of Santiago’s script:

import re
from collections import defaultdict

FILE = "o1000k.ap"

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

search = pat.search

# map
matches = (search(line) for line in file(FILE, "rb") if "GET /ongoing/When" in line)
mapp    = (match.group(1) for match in matches if match)

# reduce
count = defaultdict(int)
for page in mapp:
    count[page] +=1

for key in sorted(count, key=count.get)[:10]:
    print "%40s = %s" % (key, count[key])

To get a version that’s set up for benchmarking, get the wf-2.py file from this directory:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

This version of the script finishes in 1.9 seconds. This is a 3.5x speedup over Santiago’s version, and over 250x faster than Tim’s Erlang version. Pretty good for a short single-threaded script, don’t you think?

But I’m running this on a Core Duo machine. Two CPU cores, that is. What about using them both for this task?

A Multi-Threaded Python Solution #

To run multiple subtasks in parallel, we need to split the task up in some way. Since the program reads a single text file, the easiest way to do that is to split the file into multiple pieces on the way in. Here’s a simple function that rushes through the file, splitting it up in 1 megabyte chunks, and returns chunk offsets and sizes:

def getchunks(file, size=1024*1024):
    f = open(file)
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            break

By default, this splits the file in megabyte-sized chunks:

>>> for chunk in getchunks("o1000k.ap"):
...     print chunk
(0L, 1048637L)
(1048637L, 1048810L)
(2097447L, 1048793L)
(3146240L, 1048603L)

Note the use of readline to make sure that each chunk ends at a newline character. (Without this, there’s a small chance that we’ll miss some entries here and there. This is probably not much of a problem in practice, but let’s stick to the exact solution for now.)

So, given a list of chunks, we need something that takes a chunk, and produces a partial result. Here’s a first attempt, where the map and reduce steps are combined into a single loop:

pat = re.compile(...)

def process(file, chunk):
    f = open(file)
    f.seek(chunk[0])
    d = defaultdict(int)
    search = pat.search
    for line in f.read(chunk[1]).splitlines():
        if "GET /ongoing/When" in line:
            m = search(line)
            if m:
                d[m.group(1)] += 1
    return d

Note that we cannot loop over the file itself, since we need to stop when we reach the end of it. The above version solves this by reading the entire chunk, and then splitting it into lines.

To test this code, we can loop over the chunks and feed them to the process function, one by one, and combine the result:

count = defaultdict(int)
for chunk in getchunks(file):
    for key, value in process(file, chunk).items():
        count[key] += value

This version is a bit slower than the non-chunked version on my machine; one pass over the 200 megabyte file takes about 2.6 seconds.

However, since a chunk is guaranteed to contain a full set of lines, we can speed things up a bit more by looking for matches in the chunk itself instead of splitting it into lines:

def process(file, chunk):
    f = open(file)
    f.seek(chunk[0])
    d = defaultdict(int)
    for page in pat.findall(f.read(chunk[1])):
        d[page] += 1
    return d

With this change, the time drops to 1.8 seconds (3.7x faster than the original version).

The next step is to set things up so we can do the processing in parallel. First, we’ll call the process function from a standard “worker thread” wrapper:

import threading, Queue

# job queue
queue = Queue.Queue()

# result queue
result = []

class Worker(threading.Thread):
    def run(self):
        while 1:
            args = queue.get()
            if args is None:
                break
            result.append(process(*args))
            queue.task_done()

This uses the standard “worker thread” pattern, with a thread-safe Queue for pending jobs, and a plain list object to collect the results (list.append is an atomic operation in CPython).

To finish the script, just create a bunch of workers, give them something to do (via the queue), and collect the results into a single dictionary:

for i in range(4):
    w = Worker()
    w.setDaemon(1)
    w.start()

for chunk in getchunks(file):
    queue.put((file, chunk))

queue.join()

count = defaultdict(int)
for item in result:
    for key, value in item.items():
        count[key] += value

With a single thread, this runs in about 1.8 seconds (same as the non-threaded version). When we increase the number of threads, things are improved:

  • Two threads: 1.9 seconds
  • Three: 1.7 seconds
  • Four to eight: 1.6 seconds

For this specific test, the ideal number appears to be three threads per CPU. With fewer threads, the CPU:s will occasionally get stuck waiting for I/O.

Or perhaps they’re waiting for the interpreter itself; Python uses a global interpreter lock to protect the interpreter internals from simultaneous access, so there’s probably some fighting over the interpreter going on as well. To get even more performance out of this, we need to get around the lock in some way.

Luckily, for this kind of problem, the solution is straightforward.

A Multi-Processor Python Solution #

To fully get around the interpreter lock, we need to run each subtask in a separate process. An easy way to do that is to let each worker thread start an associated process, send it a chunk, and read back the result. To make things really simple, and also portable, we’ll use the script itself as the subprocess, and use a special option to enter “subprocess” mode.

Here’s the updated worker thread:

import subprocess, sys

executable = [sys.executable]
if sys.platform == "win32":
    executable.append("-u") # use raw mode on windows

class Worker(threading.Thread):
    def run(self):
        process = subprocess.Popen(
            executable + [sys.argv[0], "--process"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE
            )
        stdin = process.stdin
        stdout = process.stdout
        while 1:
            cmd = queue.get()
            if cmd is None:
                putobject(stdin, None)
                break
            putobject(stdin, cmd)
            result.append(getobject(stdout))
            queue.task_done()

where the getobject and putobject helpers are defined as:

import marshal, struct

def putobject(file, object):
    data = marshal.dumps(object)
    file.write(struct.pack("I", len(data)))
    file.write(data)
    file.flush()

def getobject(file):
    try:
        n = struct.unpack("I", file.read(4))[0]
    except struct.error:
        return None
    return marshal.loads(file.read(n))

The worker thread runs a copy of the script itself, and passes in the “—process” option. To enter subprocess mode, we need to look for that before we do anything else:

if "--process" in sys.argv:
    stdin = sys.stdin
    stdout = sys.stdout
    while 1:
        args = getobject(stdin)
        if args is None:
            sys.exit(0) # done
        result = process(*args)
        putobject(stdout, result)
else:
    ... create worker threads ...

With this approach, the processing time drops to 1.2 seconds, when using two threads/processes (one per CPU). But that’s about as good as it gets; adding more processes doesn’t really improve things on this machine.

Memory Mapping #

So, is this the best we can get? Not quite. We can speed up the file access as well, by switching to memory mapping:

 
import mmap

filemap = None

def process(file, chunk):
    global filemap, fileobj
    if filemap is None or fileobj.name != file:
        fileobj = open(file)
        filemap = mmap.mmap(
            fileobj.fileno(),
            os.path.getsize(file),
            access=mmap.ACCESS_READ
        )
    d = defaultdict(int)
    for file in pat.findall(filemap, chunk[0], chunk[0]+chunk[1]):
        d[file] += 1
    return d

Note that findall can be applied directly to the mapped region, thanks to Python’s internal memory buffer interface (dead link). Also note that the mmap module doesn’t support windowing, so the code needs to map the entire file in each subprocess. This can result in overly excessive use of virtual memory on some platforms (running this on your own log files if you’re on a shared web server is not necessarily a good idea. Yes, I’ve tried ;-).

Anyway, this gets the job done in 0.9 seconds, with the original chunk size. But since we’re mapping the entire file anyway in each subprocess, we can increase the chunk size to reduce the process communication overhead. With 50 megabyte chunks, the script runs in just under 0.8 seconds.

Summary #

In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:

  • Pre-compiled RE patterns
  • Fast filtering of candidate lines
  • Chunked reading
  • Multiple processes
  • Memory mapping, combined with support for RE operations on mapped buffers

This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.

However, it should be noticed that the benchmark I’ve been using focuses on processing speed, not disk speed. The code will most likely behave differently on cold caches (and will definitely take longer to run), on machines with different disk systems, and of course also on machines with additional cores.

If you have some time to spare and some interesting hardware to run it on, feel free to grab the code and take it on a ride:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

(see the README.txt file for details.)

Addenda #

2007-10-07: Stanley Seibert has adapted the code to use the processing library, which provides multiprocess functionality with a lot less (user) code; see Parallel Processing in Python with processing (dead link) for details.

2007-10-07: Bioinformatics veteran and fellow Python string-type hacker Andrew Dalke points out, via mail, that it’s possible to shave off a few more cycles by extracting all URL:s that start with “/ongoing/When/” (which we’re looking for anyway), and then removing bogus URL:s during post-processing. Andrew has also written a custom parser based on mxTextTools, which is a quite a bit faster than the RE solution. Hopefully, he’ll turn his findings into a blog post, so I can link to his work ;-) See More notes on Wide Finder for the full story (which is more about fast “narrow finding” than “wide finding”, though).

2007-10-07: Bill de hÓra has some code too.

2007-10-07: And Steve Vinoski has tried the code from this article on some big iron: “I ran his wf-6.py on an 8-core 2.33 GHz Intel Xeon Linux box with 8GB of RAM, and it ran best at 5 processes, clocking in at 0.336 sec. Another process-based approach, wf-5.py, executed best with 8 processes, presumably one per core, in 0.358 sec. The multithreaded approach, wf-4.py, ran best with 5 threads, at 1.402 sec (but also got the same result with 19 threads, go figure). Using the same dataset, I get 11.8 sec from my best Erlang effort, which is obviously considerably slower.

2007-10-08: Paul Boddie provides code and results (dead link) using a different parallelization library, pprocess.

2007-10-08: Tim Bray summarizes recent developments.

2007-10-12: Updated the article to use binary mode on Windows. This makes the chunk calculations a bit more reliable (tell can misbehave on text files), and speeds things up quite a bit, since the I/O layer no longer needs to convert line endings.

2007-10-31: Tim Bray has tested a bunch of implementations on a multicore Solaris box. When I write this, Python’s in the lead ;-)

 

A Django site. rendered by a django application. hosted by webfaction.