How is unicode implemented behind the scenes?
ned at nedbatchelder.com
Sun Mar 9 04:48:51 CET 2014
On 3/8/14 9:08 PM, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
> But how is it stored in RAM?
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
In abstract terms, a Unicode string is a sequence of integers (code
points). There are lots of ways to store a sequence of integers.
In Python 2.x, it's either a vector of 16-bit ints, or 32-bit ints.
These are the Unicode representations known as UTF-16 and UTF-32,
respectively, and which you have depends on whether you have a "narrow"
or "wide" build of Python. You can tell the difference by examining
sys.maxunicode, which is 65535 (narrow) or 1114111 (wide).
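A quick way to run that check is sketched below. The helper name build_kind is made up for illustration; only sys.maxunicode comes from Python itself. On Python 3.3 and later the narrow/wide distinction is gone, so sys.maxunicode is always 1114111 there.

```python
import sys

def build_kind(maxunicode=sys.maxunicode):
    """Classify a build from its sys.maxunicode value (hypothetical helper)."""
    if maxunicode == 0xFFFF:        # 65535: 16-bit code units, "narrow" build
        return "narrow"
    if maxunicode == 0x10FFFF:      # 1114111: full code point range, "wide" build
        return "wide"
    return "unknown"

print(sys.maxunicode, build_kind())
```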
In Python 3.3, the representation was changed from narrow/wide to the
so-called Flexible String Representation which others here have
described. It uses 1, 2, or 4 bytes per code point, depending on the
largest code point in the string. It's specified in PEP 393.
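As a rough illustration, you can watch the per-code-point width change by comparing sys.getsizeof for strings of different lengths. This is only a sketch: it assumes CPython 3.3 or later, the helper name bytes_per_code_point is made up here, and the numbers are an implementation detail you shouldn't depend on.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for a string made only of ch.

    Subtracting the sizes of two strings of different lengths cancels
    the fixed object header, leaving just the character data.
    """
    small = sys.getsizeof(ch * n)
    large = sys.getsizeof(ch * (2 * n))
    return (large - small) // n

for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    # On CPython 3.3+ these come out as 1, 1, 2 and 4 respectively.
    print("U+%04X" % ord(ch), bytes_per_code_point(ch))
```

The width is chosen per string, not per character, so a single code point above U+FFFF in an otherwise ASCII string makes the whole string use 4 bytes per code point.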
Ned Batchelder, http://nedbatchelder.com