[Python-Dev] PEP 393: Flexible String Representation

Antoine Pitrou solipsis at pitrou.net
Wed Jan 26 00:22:32 CET 2011

For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with, say, 10 million code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).

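Typical French text seems to have about 5% non-ASCII characters. Each
of those accented characters takes two bytes in UTF-8, so the number of
UTF-8 bytes needed to represent a French text would only be about 5%
higher than the number of code points (roughly 10.5MB for the example
quoted above, not 15MB).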
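
A quick way to check this arithmetic is sketched below. The helper name
utf8_overhead and the synthetic sample are made up for illustration, and
the sketch assumes every non-ASCII character lies below U+0800 (true for
French accented letters), so that each one encodes to exactly two bytes:

```python
def utf8_overhead(text):
    """Return (code_points, utf8_bytes, overhead) for a non-empty string.

    overhead is the fraction by which the UTF-8 byte count exceeds the
    number of code points (0.05 means 5% more bytes than code points).
    """
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    return code_points, utf8_bytes, utf8_bytes / code_points - 1.0


# Synthetic stand-in for French text: 5 accented characters per 100.
# Real text would have to be measured; this ratio is an assumption.
sample = "\u00e9" * 5 + "a" * 95

code_points, utf8_bytes, overhead = utf8_overhead(sample)
print(code_points, utf8_bytes)  # 100 code points, 105 bytes
```

Running the same function over an actual French corpus would show how
close the 5% estimate is for real text.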

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”


