We're back after a server migration that caused effbot.org to fall over a bit harder than expected. Expect some glitches.

Using Regular Expressions for Lexical Analysis

February 04, 2002 | Fredrik Lundh

This note discusses how to use the re module in Python 2.0 and later for lexical analysis.

A Simple RE Scanner

A scanner reads an input string (e.g. program code) and groups the characters into lexical units, such as keywords and integer literals. Scanners are also known as lexical analysers, or tokenizers.

For example, here’s a simple expression:

b = 2 + a*10

In the above expression, we can identify three token types:

  • Symbols (or identifiers). In the example, all symbols consist of a single character, but we can easily accept any string starting with a letter or an underscore, and followed by more letters, digits, or underscores.
  • Integer literals, consisting of one or more digits.
  • Operators (=, +, and *). To simplify, let’s treat any other non-whitespace character as an operator as well.

Regular expressions provide an easy and quite powerful tool for writing simple scanners. For example, you could use the findall method to split the expression into a list of tokens:

# File: xml-scanner-example-1.py

import re

expr = "b = 2 + a*10"

print re.findall("\s*(\d+|\w+|.)", expr)

Note that the pattern skips any leading whitespace. The token patterns are all placed in the same group, so findall will return a straightforward list of token strings:

$ python xml-scanner-example-1.py
['b', '=', '2', '+', 'a', '*', '10']

The next step would be to pass the token list on to a function that can interpret the expression (or perhaps compile it into some kind of executable code).

A problem here is that while findall returns a list of tokens, the program will have to inspect each token to see what it really is. It could for example look at the first character; if it’s a digit, the token represents an integer literal. If it’s a letter or an underscore, the token represents a symbol. All other tokens are potential operators.

This extra check feels a bit unnecessary; after all, the regular expression engine knows what subpattern it picked. One way to get this information is to modify the pattern so that every token subpattern is in its own group.

# File: xml-scanner-example-2.py

import re

expr = "b = 2 + a*10"

for item in re.findall("\s*(?:(\d+)|(\w+)|(.))", expr):
    print item

In this case, when there are more than one group in a pattern, findall returns a tuple with empty strings for groups that didn’t match:

$ python xml-scanner-example-2.py
('', 'b', '')
('', '', '=')
('2', '', '')
('', '', '+')
('', 'a', '')
('', '', '*')
('10', '', '')

To find the token type, all we have to do is to find the first non-empty item in each tuple, e.g:

if item[0]:
    print "integer", item[0]
elif item[1]:
    print "symbol", item[1]
else:
    print "operator", item[2]

This solution works pretty well as long as we don’t have too many different token types, but it quickly breaks down when the number increases.

There’s actually a way to get at the group index in Python 2.0 and later. Whenever the regular expression engine completes a group, it assigns the group number to an internal register called lastindex. After a successful match, this register is available as an attribute on the match object (the kind of object returned by a successful call to match or search).

The only problem here is that findall returns strings (or tuples), not match objects. Oh well. We’ll have to write our own scanning loop:

# File: xml-scanner-example-3.py

import re

expr = "b = 2 + a*10"

pos = 0

pattern = re.compile("\s*(?:(\d+)|(\w+)|(.))")

while 1:
    m = pattern.match(expr, pos)
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))
    pos = m.end()
$ python xml-scanner-example-3.py
2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'

Here, the lastindex attribute is 1 for integers, 2 for symbols, and code 3 for operators.

Note: In recent versions, you can use finditer instead of findall. finditer returns a lazily constructed sequence of match objects, using an internal scanner object (see below).

The 2.0 engine provides another (undocumented) feature that can be used to optimize this even further. The scanner method is used to create a scanner object and attach it to a string. This object keeps track of the current position, and moves forward after each successful match. Instead of keeping track of the position yourself, just call the scanner over and over again.

After replacing the pos variable with a scanner object, our example now looks like this:

# File: xml-scanner-example-4.py

import re

expr = "b = 2 + a*10"

pattern = re.compile("\s*(?:(\d+)|(\w+)|(.))")

scan = pattern.scanner(expr)

while 1:
    m = scan.match()
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))

A Slightly More Complex RE Scanner

Let’s look at a somewhat larger example: a fast but somewhat sloppy XML parser.

An XML document consists of tags, entities, and character data sections. Tags are relatively complex objects; a start tag consists of an angle bracket, an identifier (the element name), optional attributes, and an ending angle bracket. Each attribute consists of an identifier followed by an equal sign, a quotation mark (single or double), text and entities, and a closing quotation mark. An entity reference consists of an ampersand, an identifier (or a number sign followed by an integer), and a semi-colon. An end tag looks like a start tag, but the identifier is preceded by a slash, and attributes are not allowed. And so on…

However, with a little creativity, we can boil it down to the following five token types:

  1. Tags, consisting of an angle bracket (<) followed by an optional slash (/), and an identifier.
  2. Entity and character references, consisting of an ampersand (&), followed by an optional number sign (#), and a string of letters and digits.
  3. Text strings, consisting of letters and digits and other non-special characters (but not whitespace).
  4. Whitespace.
  5. Special characters (<, >, &, ‘, “, and =).

How the tokens are interpreted depend on whether the scanner is inside a tag, or inside character data. In the latter case, identifiers, whitespace segments, and special characters are all interpreted as character data. Inside tags, they’re used for attributes (both attribute names and their values).

Here’s a first implementation of the XML tokenizer:

# File: xml-scanner-example-5.py

import re

xml = re.compile(r"""
    <([/?!]?\w+)     # 1. tags
    |&(\#?\w+);      # 2. entities
    |([^<>&'\"=\s]+) # 3. text strings (no special characters)
    |(\s+)           # 4. whitespace
    |(.)             # 5. special characters
    """, re.VERBOSE)

document = """\
<body class='default'>
here&apos;s some text&#33;
</body>
"""

# bind a scanner to the target string
scan = xml.scanner(document)

# print all tokens
while 1:
    m = scan.match()
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))
$ python xml-scanner-example-5.py
1 'body'
4 ' '
3 'class'
5 '='
5 "'"
3 'default'
5 "'"
5 '>'
4 '\n'
3 'here'
2 'apos'
3 's'
4 ' '
3 'some'
4 ' '
3 'text'
2 '#33'
4 '\n'
1 '/body'
5 '>'
4 '\n'

Here, the lastindex attribute is 1 for tags, 2 for entity and character references, 3 for identifiers, 4 for whitespace segments, and 5 for special characters.

Turning the token stream into start tags, end tags, and character data sections is relatively straightforward. The following implementation uses a few tricks to speed things up (most notably, it creates a temporary function object called gettoken which returns the next token, possibly skipping over whitespace on the way).

# File: xmlscanner.py
# a simple xml scanner

import re, string

# xml tokenizer pattern
xml = re.compile("<([/?!]?\w+)|&(#?\w+);|([^<>&'\"=\s]+)|(\s+)|(.)")

def scan(str, target):
    # split string into xml elements

    # create a scanner function for this string
    def gettoken(space=0, scan=xml.scanner(str).match):
        # scanner function (bound to the string)
        try:
            while 1:
                m = scan()
                code = m.lastindex
                text = m.group(m.lastindex)
                if not space or code != 4:
                    return code, text
        except AttributeError:
            raise EOFError

    # token categories
    TAG = 1; ENTITY = 2; STRING = 3; WHITESPACE = 4; SEPARATOR = 5

    start = target.start; end = target.end; data = target.data

    try:
        while 1:
            code, text = gettoken()
            if code == TAG:
                # deal with tags
                type = text[:1]
                if type == "/":
                    # end tag
                    end(text[1:])
                    code, text = gettoken(1)
                    if text != ">":
                        raise SyntaxError, "malformed end tag"
                elif type == "!":
                    # document type declaration (incomplete)
                    value = []
                    while 1:
                        # parse start tag contents
                        code, text = gettoken(1)
                        if text == ">":
                            break
                        value.append(text)
                    value = string.join(value, "")
                else:
                    # start tag or procesing instruction
                    tag = text
                    attrib = {}
                    while 1:
                        # parse start tag contents
                        code, text = gettoken(1)
                        if text == ">":
                            start(tag, attrib)
                            break
                        if text == "/":
                            start(tag, attrib)
                            end(tag)
                            break
                        if text == "?":
                            if type != text:
                                raise SyntaxError, "unexpected quotation mark"
                            code, text = gettoken(1)
                            if text != ">":
                                raise SyntaxError, "expected end tag"
                            try:
                                target.xml(attrib)
                            except AttributeError:
                                pass
                            break
                        if code == STRING:
                            # attribute
                            key = text
                            code, text = gettoken(1)
                            if text != "=":
                                raise SyntaxError, "expected equal sign"
                            code, quote = gettoken(1)
                            if quote != "'" and quote != '"':
                                raise SyntaxError, "expected quote"
                            value = []
                            while 1:
                                code, text = gettoken()
                                if text == quote:
                                    break
                                if code == ENTITY:
                                    try:
                                        text = fixentity(text)
                                    except ValueError:
                                        text = target.resolve_entity(text)
                                value.append(text)
                            attrib[key] = string.join(value, "")
            elif code == ENTITY:
                # text entities
                try:
                    text = fixentity(text)
                except ValueError:
                    text = target.resolve_entity(text)
                data(text)
            else:
                # other text (whitespace, string, or separator)
                data(text)
    except EOFError:
        pass
    except SyntaxError, v:
        # generate nicer error message
        raise

xml_entities = {"amp": "&", "apos": "'", "gt": ">", "lt": "<", "quot": '"'}

def fixentity(entity):
    # map entity name to built-in entity
    try:
        return xml_entities[entity]
    except KeyError:
        pass
    # assume numeric entity (raises ValueError if malformed)
    if entity[:2] == "#x":
        value = int(entity[2:], 16)
    else:
        value = int(entity[1:])
    if value > 127:
        return unichr(value)
    return chr(value)

The scanner calls methods on a target object, as shown in the following example:

# File: xml-scanner-example-6.py

import xmlscanner

class echo:
    def xml(self, attrib):
        print "XML", attrib
    def start(self, tag, attrib):
        print "START", tag, attrib
    def end(self, tag):
        print "END"
    def data(self, text):
        print "DATA", repr(text)

text = """\
<?xml version='1.0'?>
<body class='default'>
here&apos;s some text&#33;
</body>
"""

xmlscanner.scan(text, echo())
$ python xml-scanner-example-6.py
XML {'version': '1.0'}
DATA '\n'
START body {'class': 'default'}
DATA '\n'
DATA 'here'
DATA "'"
DATA 's'
DATA ' '
DATA 'some'
DATA ' '
DATA 'text'
DATA '!'
DATA '\n'
END
DATA '\n'

Under Python 2.1, this parser is about 20% faster than xmllib.py.

Summary

This note described how to write simple scanners (or tokenizers) with Python’s new regular expression engine. Such scanners work well for simple languages, but can be made to work also with more complex syntaxes, such as XML.