RE Module Performance
steve+comp.lang.python at pearwood.info
Thu Jul 25 07:56:46 CEST 2013
On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF:
> His most recent argument that Python should use UTF as a representation
> is very strange to be honest.
He's not arguing for anything, he is just hating on anything that gives
even the tiniest benefit to ASCII users. This isn't about Python 3.3
hurting non-ASCII users, because that is demonstrably untrue: they are
*better off* in Python 3.3. This is about denying even a tiny benefit to
ASCII users.
In Python 3.3, non-ASCII users have these advantages compared to previous
versions:
- strings will usually take less memory, and aside from trivial changes
to the object header, they never take more memory than a wide build would
use;
- consequently nearly all objects will take less memory (especially
builtins and standard library objects, which are all ASCII), since
objects contain dozens of internal strings (attribute and method names in
__dict__, class name, etc.);
- consequently whole-application benchmarks show most applications will
use significantly less memory, which leads to faster speeds;
- you cannot break surrogate pairs apart by accident, which you can do in
narrow builds;
- in previous versions, code which works when run in a wide build may
fail in a narrow build, but that is no longer an issue since the
distinction between wide and narrow builds is gone;
- Latin1 users, including JMF himself, will likewise see memory
savings, since Latin1 strings will take half the size of narrow builds
and a quarter the size of wide builds.
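The memory claims above can be checked roughly from the interactive
interpreter. This is only a sketch, and it assumes CPython 3.3 or later;
sys.getsizeof reports implementation-specific numbers, and the helper
name bytes_per_char is made up for this illustration. Differencing two
sizes cancels out the object header, leaving the per-character cost.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for strings made only of `ch`.

    Hypothetical helper: subtracting the size of a 1-character string
    from the size of an (n+1)-character string cancels the header.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n

# One sample character from each range the flexible representation
# distinguishes (assumption: CPython 3.3+).
samples = {
    "ascii": "a",            # U+0061
    "latin1": "\xe9",        # U+00E9, fits in one byte
    "bmp": "\u20ac",         # U+20AC, needs two bytes
    "astral": "\U0001F600",  # U+1F600, needs four bytes
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}

for name, width in widths.items():
    print(name, width)
```

On a narrow build every one of these characters would have cost at least
2 bytes, and on a wide build 4 bytes, which is where the "half" and
"quarter" figures for Latin1 come from.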
The cost of all these benefits is a small overhead when creating a string
in the first place, and some purely internal added complication to the
implementation.
I'm the first to argue against complication unless there is a
corresponding benefit. This is a case where the benefit has proven itself
doubly: Python 3.3's Unicode implementation is *more correct* than
before, and it uses less memory to do so.
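To make the correctness point concrete, here is a small sketch. A narrow
build cannot be run from Python 3.3, so the narrow-build behaviour is
only simulated here by slicing UTF-16 code units by hand; the variable
names are invented for this example.

```python
import struct

s = "a\U0001F600b"   # three code points, the middle one is astral

# Python 3.3+: indexing and slicing work on whole code points.
print(len(s))            # number of code points
print(s[:2] == "a\U0001F600")

# Simulated narrow build: the same text as 16-bit code units.
raw = s.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(raw) // 2), raw)

# Slicing after two *code units* cuts the surrogate pair in half,
# leaving a lone high surrogate at the end.
broken = struct.pack("<2H", *units[:2])
try:
    broken.decode("utf-16-le")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(len(units), hex(units[1]), hex(units[2]), slice_is_valid)
```

The string has three characters but four code units, and the naive slice
of the code units is no longer valid text. In 3.3 that mistake cannot be
made from Python code, because str never exposes the code units.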
> The cons of UTF are apparent and widely
> known. The main con is that UTF strings are O(n) for indexing a
> position within the string.
Not so for UTF-32.
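A sketch of why that is, working on encoded bytes rather than on str.
Both function names are hypothetical and this is not how CPython stores
strings internally; it only contrasts a fixed-width encoding with a
variable-width one.

```python
def utf32_index(data, i):
    """Code point i of UTF-32-LE bytes (no BOM): a fixed 4-byte offset."""
    if not 0 <= i < len(data) // 4:
        raise IndexError(i)
    return data[4 * i:4 * i + 4].decode("utf-32-le")

def utf8_index(data, i):
    """Code point i of UTF-8 bytes: must scan from the start, O(n)."""
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte
            count += 1
            if count == i:
                start = pos          # first byte of the wanted character
            elif count == i + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return data[start:].decode("utf-8")

text = "a\xe9\u20ac\U0001F600z"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")

print(utf32_index(as_utf32, 3) == utf8_index(as_utf8, 3) == text[3])
```

The UTF-32 lookup is a single slice at a computed offset, while the
UTF-8 lookup has to walk past every earlier character because each one
may occupy between one and four bytes.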