[Python-Dev] PEP 393: Flexible String Representation

Antoine Pitrou solipsis at pitrou.net
Wed Jan 26 00:22:32 CET 2011


For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).

Typical French text seems to have 5% non-ASCII characters. So the
number of UTF-8 bytes needed to represent a French text would only be
5% higher than the number of code points.

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”

Regards

Antoine.




More information about the Python-Dev mailing list