How is unicode implemented behind the scenes?

Mark Lawrence breamoreboy at
Sun Mar 9 15:53:05 CET 2014

On 09/03/2014 10:32, Rustom Mody wrote:
> On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm... at wrote:
>> On Sunday, March 9, 2014 03:40:28 UTC+1, MRAB wrote:
>>> On 2014-03-09 02:08, Dan Stromberg wrote:
>>>> OK, I know that Unicode data is stored in an encoding on disk.
>>>> But how is it stored in RAM?
>>>> I realize I shouldn't write code that depends on any relevant
>>>> implementation details, but knowing some of the more common
>>>> implementation options would probably help build an intuition for
>>>> what's going on internally.
>>>> I've heard that characters are no longer all c bytes wide internally,
>>>> so is it sometimes utf-8?
>>> No.
>>> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>>> In Python terms:
>>> if all(c <= '\xFF' for c in string):
>>>       use 1 byte per codepoint
>>> elif all(c <= '\uFFFF' for c in string):
>>>       use 2 bytes per codepoint
>>> else:
>>>       use 4 bytes per codepoint
>> A very, very nice recursive mathematical absurdity.
> As a profoundly astute mathematician
> v v n r m a
> can be parsed in 42 different ways (the 5th Catalan number)
> Which parse did you intend?
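
For anyone who wants to check MRAB's description quoted above, here is a minimal sketch, assuming CPython 3.3 or later. The helper names bytes_per_codepoint and measured_width are made up for illustration. The first restates the quoted rule as runnable code; the second estimates the real per-code-point storage by comparing sys.getsizeof for two strings of different lengths. Both are implementation details of CPython, and other interpreters may store strings differently.

```python
import sys


def bytes_per_codepoint(s: str) -> int:
    """Width the rule quoted above would pick for s (hypothetical helper)."""
    if not s:
        return 1
    highest = max(map(ord, s))  # largest code point in the string
    if highest <= 0xFF:
        return 1
    elif highest <= 0xFFFF:
        return 2
    return 4


def measured_width(ch: str, n: int = 1000) -> int:
    """Estimate bytes stored per code point on CPython (hypothetical helper).

    The fixed object header cancels out when two sizes are subtracted,
    leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


if __name__ == "__main__":
    for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
        print(hex(ord(ch)), bytes_per_codepoint(ch), measured_width(ch))
```

Note that the width is chosen per string, not per character: a single code point above U+FFFF makes the whole string use 4 bytes per code point.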

Please don't feed this particular troll, it's a complete waste of time.

My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
