Common Log Format
March 2004 | Fredrik Lundh
Here’s a simple regular expression that can be used to parse server log files in the Common Log Format.

import re

p = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*)'
)

# file is an already-open log file object
for line in file:
    m = p.match(line)
    if not m:
        continue  # skip lines that are not in Common Log Format
    host, ignore, user, date, request, status, size = m.groups()
    ...

Here’s a variation that parses the Extended Common Log Format (what Apache calls the Combined Log Format), which contains additional referrer and user-agent fields.

p = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*)'
    r' "([^"]*)" "([^"]*)"'  # extensions
)

for line in file:
    m = p.match(line)
    if not m:
        continue
    # parentheses let the unpacking target span two lines
    (host, ignore, user, date, request, status, size,
     referer, agent) = m.groups()
    ...

Comment:
Might be interesting to look at Perl's Apache::LogRegex module (in CPAN) - has a pretty nice approach whereby you pass it your Apache custom log format and it generates a regex for parsing lines from it, returning a dict with the format identifiers as keys. Not sure if it's capable of handling everything but one or two bugs seem to have been slayed in 5 releases.
Posted by Harry Fuecks (2006-12-05)
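
The idea described in this comment, turning a log format string into a regex and returning each line as a dict keyed by the format identifiers, can be sketched in a few lines of Python. This is only an illustration of the approach, not the API of Apache::LogRegex or of any Python port: the names `format_to_regex` and `parse_line` are hypothetical, and the sketch assumes the format is a single-space-separated list of identifiers in which no identifier contains a space.

```python
import re

def format_to_regex(log_format):
    """Build (compiled pattern, field names) from a space-separated format string."""
    names = []
    parts = []
    for token in log_format.split(" "):
        # A token wrapped in double quotes marks a quoted field in the log line.
        quoted = len(token) >= 2 and token.startswith('"') and token.endswith('"')
        name = token[1:-1] if quoted else token
        names.append(name)
        if quoted:
            parts.append(r'"([^"]*)"')
        elif name == "%t":
            # The timestamp is written in square brackets and contains a space.
            parts.append(r"\[([^\]]*)\]")
        else:
            parts.append(r"(\S*)")
    return re.compile(" ".join(parts)), names

def parse_line(pattern, names, line):
    """Return a dict mapping format identifiers to values, or None if no match."""
    m = pattern.match(line)
    if not m:
        return None
    return dict(zip(names, m.groups()))

pattern, names = format_to_regex('%h %l %u %t "%r" %>s %b')
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
fields = parse_line(pattern, names, sample)
```

With this format string, `fields["%h"]` holds the host and `fields["%r"]` holds the request line. Escaped quotes and identifiers that contain spaces are not handled.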
Comment:
Now implemented: http://webtuesday.ch/~harryf/code/apachelog/ - either just download or $ bzr get http://webtuesday.ch/~harryf/code/apachelog/
Posted by Harry Fuecks (2006-12-08)
Comment:
I'm using the regular expression from http://effbot.org/zone/re-common-log-format.htm. I have a problem with it, though: I have some lines in my Apache log which contain something of the form "GET /\" HTTP/1.1". So they have a " which is kind of 'escaped' by a \. I was trying to change the regex to take this into account, but I've come up dry... Here are some things I tried: Some friends proposed this: But it's both monstrous and fails in the case of more \"\" things... I was wondering if you have any suggestions? (And kind of reporting a 'bug', I guess.)
Posted by Manuzhai (2006-06-20)
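
One possible way to handle the escaped quotes described in this comment is to let the quoted group accept either an ordinary character or a backslash followed by any character. The sketch below is not from the original article. It assumes that a backslash always escapes the character after it inside the quoted request field, and the sample line and the name `QUOTED` are made up for illustration.

```python
import re

# Inside the quotes, accept either a character that is neither a quote nor a
# backslash, or a backslash followed by any single character (such as \").
QUOTED = r'"((?:[^"\\]|\\.)*)"'

pattern = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^\]]*)\] ' + QUOTED + r' ([^ ]*) ([^ ]*)'
)

# Hypothetical log line whose request contains an escaped double quote.
sample = r'1.2.3.4 - - [20/Jun/2006:10:00:00 +0200] "GET /\" HTTP/1.1" 404 208'

m = pattern.match(sample)
host, ignore, user, date, request, status, size = m.groups()
```

Here `request` keeps the backslash and the quote as they appear in the log, and `status` and `size` line up with the fields that follow the closing quote. Because the two alternatives inside the group cannot match the same character, the pattern also copes with several escaped quotes in one request.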