Python 2.x’s support for Unicode was a messy problem for me when I started learning Python. Today I read through the official tutorial and want to take some notes on it.
Note: Python 3 changed how strings and Unicode work; this post is based on Python 2 only.
Brief history
First, we need to understand that this is all about using numbers to represent characters. With that in mind, here is a brief history.
In 1968, ASCII was standardized. It was an American-developed standard that assigns numeric values from 0 to 127 to various characters. For example, the lowercase letter ‘a’ is assigned 97 as its code value. However, ASCII cannot represent characters from other languages, such as accented French characters.
So in the 1980s, people started developing the Unicode standard. Unicode started out using 16-bit characters instead of 8-bit characters.
Encoding and decoding in Python 2
The Unicode standard describes how characters are represented by code points. For example, the code point for ‘a’ is 0x0061. In the standard, a code point is written using the notation U+12ca, meaning the character with value 0x12ca (4810 decimal). Here is a Unicode codepoint chart.
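In Python 2 this mapping can be inspected directly. A minimal sketch using the built-ins `ord()` and `unichr()` (the sample string is just for illustration):

```python
# -*- coding: utf-8 -*-
# Python 2: mapping between characters and code points.

s = u'a\u12ca'                 # u'a' is U+0061; u'\u12ca' is U+12CA

for ch in s:
    # ord() returns the numeric code point of a character;
    # repr() avoids terminal-encoding issues when printing.
    print repr(ch), hex(ord(ch))

# unichr() goes the other way: code point -> character.
print repr(unichr(0x61))       # u'a'
```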
A Unicode string needs to be represented in memory as a sequence of bytes (that is, values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
In Python 2.x, a Python string (the str type) is a sequence of bytes. A byte string is always in some encoding, while a unicode string is not. So you convert a unicode string to a byte string with .encode(encoding), and a byte string to a unicode string with .decode(encoding). A code example is given below.
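A minimal Python 2 sketch of this round trip (the sample string and the choice of UTF-8 as the encoding are just for illustration):

```python
# -*- coding: utf-8 -*-
# Python 2: converting between unicode strings and byte strings.

u = u'caf\xe9'                   # unicode string 'café' (é is U+00E9)

# unicode -> bytes: .encode(encoding)
b = u.encode('utf-8')            # str of bytes: 'caf\xc3\xa9'
print type(b), repr(b)           # <type 'str'> 'caf\xc3\xa9'

# bytes -> unicode: .decode(encoding) -- you must know the encoding used
u2 = b.decode('utf-8')
print type(u2), repr(u2)         # <type 'unicode'> u'caf\xe9'
assert u2 == u

# Decoding with the wrong encoding silently gives the wrong characters
# (or raises UnicodeDecodeError, depending on the bytes and the encoding).
print repr(b.decode('latin-1'))  # u'caf\xc3\xa9' -- mojibake
```

Note that in Python 2 a byte string also has an .encode method and a unicode string also has a .decode method; calling them triggers an implicit ASCII conversion first, which is a common source of unexpected UnicodeDecodeError (see the Stack Overflow links below).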
References:
- Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
- Some very good questions and answers from Stack Overflow:
  - https://stackoverflow.com/questions/10288016/usage-of-unicode-and-encode-functions-in-python?noredirect=1&lq=1
  - https://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode?noredirect=1&lq=1
  - https://stackoverflow.com/questions/447107/what-is-the-difference-between-encode-decode