[Python-Dev] PEP 393 Summer of Code Project

Victor Stinner victor.stinner at haypocalc.com
Wed Aug 24 23:10:32 CEST 2011


On Wednesday, 24 August 2011 at 20:52:51, Glenn Linderman wrote:
> Given the required variability of character size in all presently
> Unicode defined encodings, I tend to agree with Tom that UTF-8, together
> with some technique of translating character index to code unit offset,
> may provide the best overall space utilization, and adequate CPU
> efficiency.

UTF-8 can use more space than latin1 or UCS2:
>>> text="abc"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 3)
>>> text="ééé"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 6)
>>> text="€€€"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(6, 9)
>>> text="北京"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(4, 6)

UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters 
(or few non-BMP characters).
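
To give an idea of where the crossover is, here is a minimal sketch that
computes the character payload a PEP 393 string would need (one width per
string, chosen by its widest code point) next to the UTF-8 size. The
pep393_payload helper is made up for illustration, it is not code from the
branch:

def pep393_payload(text):
    # Hypothetical helper: PEP 393 picks one width (1, 2 or 4 bytes)
    # for the whole string, based on its widest code point.
    top = max(ord(ch) for ch in text)
    return len(text) * (1 if top < 0x100 else 2 if top < 0x10000 else 4)

for text in ("abc", "ééé", "€€€", "北京", "abc\U0001F40D" * 10):
    print(len(text), pep393_payload(text), len(text.encode("utf8")))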

About speed, I expect O(n) indexing (UTF-8) to be slower than O(1) indexing 
(PEP 393).
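
For the curious, here is roughly what code point indexing into UTF-8 has to
do (a sketch, not CPython code): to reach the i-th code point you have to
walk all the bytes before it, skipping continuation bytes, which is why it
is O(n):

def utf8_index(data, index):
    # Return the byte offset of code point `index` in the UTF-8 bytes `data`.
    offset = 0
    for _ in range(index):
        offset += 1
        # Skip continuation bytes (0b10xxxxxx).
        while offset < len(data) and data[offset] & 0xC0 == 0x80:
            offset += 1
    return offset

text = "ab€北"
data = text.encode("utf8")
for i, ch in enumerate(text):
    assert data[utf8_index(data, i):].decode("utf8").startswith(ch)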

> ...  Applications that support long
> strings are more likely to be bitten by the occasional "outlier" character
> that is longer than the average character, doubling or quadrupling the
> space needed to represent such strings, and eliminating a significant
> portion of the space savings the PEP is providing for other
> applications.

In these worst cases, PEP 393 is not worse than the current implementation: 
it uses just as much memory as Python in wide mode (the mode used on Linux 
and Mac OS X, where wchar_t is 32 bits), but twice as much as Python in 
narrow mode (Windows).
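
A quick back-of-the-envelope example (character payload only, object headers
ignored): a string of 10000 characters where a single one is non-BMP:

n = 10000
text = "a" * (n - 1) + "\U00010000"

narrow = (len(text) + 1) * 2       # narrow build: UTF-16 code units,
                                   # the non-BMP char takes a surrogate pair
wide   = len(text) * 4             # wide build: 4 bytes per code point
pep393 = len(text) * 4             # PEP 393: the widest code point is non-BMP,
                                   # so the whole string uses 4 bytes per char
utf8   = len(text.encode("utf8"))  # 9999 ASCII bytes + 4 bytes for the outlier

print(narrow, wide, pep393, utf8)  # 20002 40000 40000 10003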

I agree that UTF-8 is better in these corner cases, but I also bet that most 
Python programs will use less memory and be faster with PEP 393. You can 
already try the pep-393 branch on your own programs.

> Benchmarks may or may not fully reflect the actual
> requirements of all applications, so conclusions based on benchmarking
> can easily be blind-sided by the realities of other applications, unless
> the benchmarks are carefully constructed.

I used stringbench and "./python -m test test_unicode". I plan to try iobench.

Which other benchmark tool should be used? Should we write a new one?
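
Here is the kind of quick timeit micro-benchmark I have in mind to complement 
them (just a sketch; the statements and string sizes are arbitrary):

import timeit

setup = 'text = "abc€" * 10000'
for stmt in ('text[30000]', 'text.encode("utf8")', 'text.find("€z")'):
    # Take the best of 3 runs of 1000 executions each.
    best = min(timeit.repeat(stmt, setup, number=1000, repeat=3))
    print("%-25s %.6f s" % (stmt, best))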

> It is possible that the ideas in PEP 393, with its support for multiple
> underlying representations, could be the basis for some more complex
> representations that would better support characters rather than only
> supporting code points, ...

I don't think that the *default* Unicode type is the best place for this. The 
base Unicode type has to be *very* efficient.

If you have unusual needs, write your own type. Maybe based on the base type?
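
For example (a purely hypothetical sketch, nothing from the PEP): a thin
wrapper around str that indexes by user-perceived characters, i.e. a base
code point grouped with its combining marks:

import unicodedata

class GraphemeString:
    def __init__(self, text):
        self._text = text
        # Precompute the offset where each base (non-combining) character starts.
        self._starts = [i for i, ch in enumerate(text)
                        if not unicodedata.combining(ch)]

    def __len__(self):
        return len(self._starts)

    def __getitem__(self, index):
        start = self._starts[index]
        end = (self._starts[index + 1] if index + 1 < len(self._starts)
               else len(self._text))
        return self._text[start:end]

s = GraphemeString("e\u0301tude")   # 'e' + COMBINING ACUTE ACCENT
print(len(s), s[0])                 # 5 'é' (returned as a 2-code-point cluster)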

Victor


