Python Hash Algorithms

July 11, 2002 | Fredrik Lundh

This note describes how Python calculates hash values for some internal data types.

Strings #

Strings (both 8-bit and Unicode) use the following hash function:

class string:
    def __hash__(self):
        if not self:
            return 0 # empty
        value = ord(self[0]) << 7
        for char in self:
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash algorithm generates this value, we simply use -2 instead.

The c_mul function in this example is an ordinary C multiplication, using long (usually 32-bit) arguments. In C, the result simply wraps around when the result gets too large (only the low 32 bits are kept), which is exactly what we want in this case.

Getting the same behaviour from Python is a bit tricker; Python’s multiplication operator either gives an overflow error, or in later versions, happily returns a Python long large enough to hold the entire result. And given that we’re multiplying the hash with a million for each character, that’s a really really large number, for anything but the shortest strings.

Anyway, here’s a rather ugly Python implementation:

def c_mul(a, b):
    return eval(hex((long(a) * b) & 0xFFFFFFFFL)[:-1])

Note that you cannot use int instead of eval here; the latter converts 0xFFFFFFFF to -1, the former throws an exception in that case.

(I’m sure there’s some better way to simular a C multiplication in Python, but I’ll leave that for another day.)

Integers #

For ordinary integers, the hash value is simply the integer itself (unless it’s -1).

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value == -2
        return value

For long Python integers, things are a little tricker. For now, let’s just say that the hash algorithm is designed to make sure that an ordinary integer and a long integer with the same value will hash to the same value.

(alright, calculating the hash value for a positive long integer isn’t that hard: take the integer, add the low 15 bits to the hash, shift it 15 bits to the right, and keep doing that until you run out of bits. Dealing with negative numbers, or really large positive integers is a bit tricker, unless you’re implementing the algorithm in C).

Tuples #

The tuple hash function is similar to that used for strings, but instead of character values, it’s using hash values for the individual members.

class tuple:
    def __hash__(self):
        value = 0x345678
        for item in self:
            value = c_mul(1000003, value) ^ hash(item)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

(In Python 2.4 and later, the algorithm is slightly different.)

Instead of “seeding” the hash with the first item, this function use a fixed seed (and from the look of it, a lot of research went into finding the right value ;-).

Another, less obvious difference is that this hash function may fail, if the tuple contains something that doesn’t have a hash value (like a list or dictionary). In that case, the hash(item) call will raise a TypeError exception.

 

A Django site. rendered by a django application. hosted by webfaction.