Python Unicode Objects
Some Observations on Working With Non-ASCII Character Sets
This note provides some brief information on best practices for working with non-ASCII data in Python 2.0 and later. Like everything else on this site, this is a work in progress.
Updated June 21, 2004 | February 11, 2002 | Fredrik Lundh
Python’s Unicode string type stores characters from the Unicode character set. In this set, each distinct character has its own number, the code point. Unicode supports more than one million code points. Unicode characters don’t have an encoding; each character is represented by its code point. How the Unicode string type stores the characters internally is an implementation detail; in your Python code, Unicode strings simply appear as sequences of characters, just like 8-bit strings appear as sequences of bytes.
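As a small illustration of the "sequence of characters" view, the sketch below looks at the code points of a short Unicode string. The sample text is made up for this example, and the code is written so that it should run on both Python 2.6+ and Python 3.

```python
import unicodedata

# A hypothetical sample string: "cafe" with an acute accent on the last letter.
text = u"caf\xe9"

# Each item in a Unicode string is a character; ord() gives its code point.
code_points = [ord(ch) for ch in text]

# The Unicode database knows the official name of each code point.
last_name = unicodedata.name(text[-1])
```

Here len(text) counts characters, not bytes; how many bytes the text needs only becomes a question once you pick an encoding.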
Observations:
Text files always contain encoded text, not characters. Each character in the text is encoded as one or more bytes in the file.
Most popular encodings (UTF-8, ISO-8859-X, etc) are supersets of ASCII. This means that the first 128 characters have the usual meaning, and that the usual characters are used for line endings. In other words, readline() will work just fine.
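A quick way to see both points (one or more bytes per character, and ASCII compatibility) is to encode the same text with a few encodings. This is only a sketch; the sample strings and the list of encodings are chosen for illustration.

```python
# Pure ASCII text, including a line ending.
ascii_text = u"plain text\n"

# A few ASCII-compatible encodings (an illustrative selection).
encodings = ["ascii", "utf-8", "iso-8859-1", "iso-8859-15"]
encoded = [ascii_text.encode(name) for name in encodings]

# For ASCII-only text, all of them produce exactly the same bytes.
all_same = all(item == encoded[0] for item in encoded)

# A non-ASCII character is where the encodings start to differ.
accented = u"\xe9"
latin1_bytes = accented.encode("iso-8859-1")  # a single byte
utf8_bytes = accented.encode("utf-8")         # two bytes
```

Because the newline is the same single byte in all of these encodings, line-oriented reading works on the encoded data without knowing the exact encoding.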
You can mix Python Unicode strings with 8-bit Python strings, as long as the 8-bit string only contains ASCII characters. A Unicode-aware library may choose to use 8-bit strings for text that only contains ASCII, to save space and time.
If you read a line of text from a file, you get bytes, not characters.
To decode an encoded string into a string of well-defined characters, you have to know what encoding it uses.
To decode a string, use the decode() method on the input string, and pass it the name of the encoding:
    fileencoding = "iso-8859-1"
    raw = file.readline()
    txt = raw.decode(fileencoding)
(the result is a Python Unicode string).
The decode method was added in Python 2.2. In earlier versions (or if you think it reads better), use the unicode constructor instead:
txt = unicode(raw, fileencoding)
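The read-then-decode steps can be tried without a real file. The sketch below uses io.BytesIO as a stand-in for a file opened in binary mode; the sample bytes and the assumption that they are ISO-8859-1 are made up for the example. It should run on both Python 2.6+ and Python 3.

```python
import io

fileencoding = "iso-8859-1"  # assumed to be known in advance

# Stand-in for a file opened in binary mode; 0xE9 is "e with acute" in ISO-8859-1.
stream = io.BytesIO(b"caf\xe9 au lait\nsecond line\n")

raw = stream.readline()         # bytes, exactly as stored in the "file"
txt = raw.decode(fileencoding)  # characters, as a Unicode string
```

In this single-byte encoding raw and txt have the same length; with a multi-byte encoding such as UTF-8 the byte count and the character count would differ.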
Python’s regular expression engine supports Unicode. You can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the “(?u)” flag prefix, or the re.UNICODE flag:
    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)
To write a Unicode string to a file or other device, you have to convert it to the encoding used by the file. The encode method converts from Unicode to an encoded string.
    out = txt.encode(encoding)
If the string contains characters that cannot be represented in the given encoding, Python raises an exception. You can change this by passing in a second argument to encode:
    # skip bad chars
    out = txt.encode(encoding, "ignore")
    # replace bad chars with "?"
    out = txt.encode(encoding, "replace")
For more on string encoding, see Converting Unicode Strings to 8-bit Strings.
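To make the three error behaviours concrete, here is a small sketch that encodes a string containing a character ISO-8859-1 cannot represent. The sample text is made up for the example; the code should run on both Python 2.6+ and Python 3.

```python
# Hypothetical sample: i-diaeresis fits in ISO-8859-1, the euro sign does not.
txt = u"na\xefve \u20ac5"

# Default ("strict") behaviour: an exception is raised.
try:
    txt.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "ignore" silently drops the character that cannot be encoded.
ignored = txt.encode("iso-8859-1", "ignore")

# "replace" substitutes a question mark for it.
replaced = txt.encode("iso-8859-1", "replace")

# An encoding that covers all of Unicode needs no error handling here.
utf8 = txt.encode("utf-8")
```

Both "ignore" and "replace" lose information, so they are best kept for output where a readable approximation is acceptable.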
To print a Unicode string to your output device, you have to convert it to the encoding used by your terminal. The encode() method converts from Unicode back to an encoded string. You can use the locale.getdefaultlocale() function to get the current output encoding.
    import locale
    language, output_encoding = locale.getdefaultlocale()
    print txt.encode(output_encoding)
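A slightly more defensive variant is sketched below. It is not the method shown above: it swaps in locale.getpreferredencoding() instead of getdefaultlocale(), falls back to ASCII if no encoding is reported, and uses "replace" so that printing never fails on an unencodable character. The helper name encode_for_output is hypothetical.

```python
import locale

def encode_for_output(txt, encoding=None):
    """Encode a Unicode string for an output device without raising."""
    if encoding is None:
        # Ask the locale module for the user's preferred encoding; fall back
        # to ASCII if nothing usable comes back (an assumption of this sketch).
        encoding = locale.getpreferredencoding(False) or "ascii"
    # "replace" turns unencodable characters into "?" instead of failing.
    return txt.encode(encoding, "replace")

# Passing the encoding explicitly makes the behaviour easy to see.
data = encode_for_output(u"caf\xe9", "ascii")
```

The result is an encoded byte string, ready to be written to the terminal or to any other byte-oriented device.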
There are lots of shortcuts in Python, including coded streams, using default locales for pattern matching, ISO-8859-1 as a subset of Unicode, etc, but that’s outside the scope of this note. At least for the moment.
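Going back to the observation on regular expressions, here is a small sketch of the two equivalent ways to ask for Unicode character classes. The pattern and the sample text are made up for the example; the code should run on both Python 2.6+ and Python 3 (where Unicode matching is already the default for Unicode patterns, so the flag is harmless).

```python
import re

# Hypothetical sample text with accented letters.
sample = u"na\xefve caf\xe9, d\xe9j\xe0 vu"

# Variant 1: the "(?u)" flag prefix inside the pattern.
prefix_pattern = re.compile(u"(?u)\\w+")

# Variant 2: the re.UNICODE flag passed to compile().
flag_pattern = re.compile(u"\\w+", re.UNICODE)

words_prefix = prefix_pattern.findall(sample)
words_flag = flag_pattern.findall(sample)
```

With either variant, \w matches the accented letters, so each word comes back in one piece.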
Comment:
This is a 5-star article. It led me by the hand thru the thicket of character encodings!
Posted by metaperl (2007-03-15)
Comment:
Stumbled across this when searching for tips on handling encodings in Python. Great stuff. Thanks a lot!
Posted by Florian (2007-03-25)
Comment:
It's a nice tutorial. If it's allowed, I want to ask a question. I have a variable that contains '\xe9' (é) and I can't change it to Unicode. If I write it as u'\xe9' it represents the same thing (é). I have tried everything I know but I just can't convert it. Any suggestions? Or do you have a link to a forum where I can get support? Thanks.
Posted by Roy Sebastianus (2007-07-03)
'\xe9' is an encoded string. u'\xe9' is a Unicode string that contains the unicode character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). To convert from the former to the latter, you need to know the encoding. In this case, it's probably latin-1, so you can do "text = '\xe9'.decode('latin-1')" to convert it. For help with Python issues in general, go to www.python.org, click "ABOUT" and then "Help", and check the resources listed under "Got a Python problem or question?" /F
Comment:
This came up when I was looking to go from HTML entities to Unicode, which is not supported by .decode(). The answer seems to be unicode() on a BeautifulStoneSoup object, as explained here: http://laniels.org/weblog/tech/free_software/python/encoding_and_decoding_html_entities_with_python.html
Posted by Andy Fundinger (2007-07-03)

Comment:
Very good and simple explanation! Tks.
Posted by Jairo (2006-11-21)