RE Module Performance
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Jul 26 23:37:20 EDT 2013
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation
>> is on a string-by-string basis: once Python has looked at the string
>> header, it can tell whether the *entire* string takes 1, 2 or 4 bytes
>> per character, and the string is then fixed-width. You can't do that
>> with UTF-8.
>
> UTF-8 does not use a flexible representation.
I disagree, and so does Jeremy Sanders who first pointed out the
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the
Emacs documentation again:
"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."
And the Python FSR, described in the same terms:
"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1, 2 or 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc."
See the similarity now? Both flexibly change the width used by code
points: UTF-8 based on the code point itself, regardless of the rest of
the string; Python based on the largest code point in the string.
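A rough way to see both behaviours from the interpreter, as a sketch only: the helper names utf8_width and fsr_width are made up for this illustration, and the second one relies on sys.getsizeof, so it is CPython-specific (3.3 or later) and tells you nothing about other implementations.

```python
import sys


def utf8_width(ch):
    """Bytes UTF-8 needs for this one character, whatever surrounds it."""
    return len(ch.encode("utf-8"))


def fsr_width(ch, n=1000):
    """Estimate bytes per character in a string made up only of `ch`.

    Hypothetical helper: compares the sizes of two strings of the same
    kind, so the fixed header cancels out. CPython-specific.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


if __name__ == "__main__":
    # ASCII, Latin-1, BMP and supplementary-plane examples.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print("U+%04X" % ord(ch), utf8_width(ch), fsr_width(ch))
```

On CPython 3.3+ I would expect the UTF-8 column to read 1, 2, 3, 4 and the FSR column 1, 1, 2, 4, which is the per-character versus per-string distinction in miniature.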
[...]
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he (or
> anybody else) is referring to when speaking of the FSR or "something
> like the FSR".
Whether JMF can see the similarities between different implementations of
strings or not is beside the point: those similarities do exist. As do
the differences, of course, but in this case the differences are in
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8
implementation *cannot know that*, and still has to walk the string
byte-by-byte, checking whether each code point requires 1, 2, 3, or 4
bytes, while an FSR implementation can simply record the fact that the
string is pure Latin1 at creation time, and then treat it as fixed-width
from then on.
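To make the difference concrete, here is a minimal sketch of indexing into the two kinds of buffer. None of this is Emacs's or CPython's actual code; utf8_index, fixed_index and _seq_len are names invented for the example, standard 1 to 4 byte UTF-8 is assumed (not Emacs's extended internal form), and the input is assumed to be valid.

```python
def _seq_len(lead):
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def utf8_index(data, i):
    """Return code point number i of valid UTF-8 bytes.

    Has to step over every earlier character, so the cost grows with i,
    even when the text happens to be pure ASCII or Latin-1.
    """
    pos = 0
    for _ in range(i):
        pos += _seq_len(data[pos])
    return data[pos:pos + _seq_len(data[pos])].decode("utf-8")


def fixed_index(data, width, i):
    """Return code point number i of a fixed-width little-endian buffer.

    The width is recorded once for the whole string, so the offset is a
    single multiplication.
    """
    chunk = data[i * width:(i + 1) * width]
    return chr(int.from_bytes(chunk, "little"))


if __name__ == "__main__":
    text = "na\u00efve caf\u00e9"
    print(utf8_index(text.encode("utf-8"), 8))       # scans 8 characters first
    print(fixed_index(text.encode("latin-1"), 1, 8)) # jumps straight there
```

The scan in utf8_index is what a plain UTF-8 buffer forces on you unless you bolt on an extra index or cache of positions; fixed_index needs nothing beyond the per-string width.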
JMF claims that FSR is "impossible" to use efficiently, and yet he
supports encoding schemes which are *less* efficient. Go figure. He tells
us he has no problem with any of the established UTF encodings, and yet
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not
UTF-16, since there are no surrogate pairs. But the difference is
insignificant.)
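For anyone who wants to check the UCS-2 versus UTF-16 remark, a small sketch using only the standard codecs (the function name utf16_units is made up here): for text confined to the BMP the two are byte-for-byte the same, and surrogate pairs only appear once a supplementary-plane character turns up, at which point the FSR would be using 4 bytes per character anyway.

```python
def utf16_units(s):
    """Number of 16-bit code units UTF-16 needs for the string."""
    return len(s.encode("utf-16-le")) // 2


if __name__ == "__main__":
    bmp = "\u20acuro"               # every character is in the BMP
    astral = "\U0001F600"           # one supplementary-plane character

    # BMP only: one 16-bit unit per code point, i.e. exactly UCS-2.
    print(len(bmp), utf16_units(bmp))

    # Outside the BMP: UTF-16 needs a surrogate pair (two units),
    # while UTF-32 still uses one 4-byte unit per code point.
    print(len(astral), utf16_units(astral), len(astral.encode("utf-32-le")))
```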
Having watched this issue from Day One when JMF first complained about
it, I believe this is entirely about denying any benefit to ASCII users.
Had Python implemented a system identical to the current FSR except that
it added a fourth category, "all ASCII", which used an eight-byte
encoding scheme (thus making ASCII strings twice as expensive as strings
including code points from the supplementary planes), JMF
would be the scheme's number one champion.
I cannot see any other rational explanation for why JMF prefers broken,
buggy Unicode implementations, or implementations which are equally
expensive for all strings, over one which is demonstrably correct,
demonstrably saves memory, and for realistic, non-contrived benchmarks,
demonstrably faster, except that he wants to punish ASCII users more than
he wants to support Unicode users.
--
Steven