RE Module Performance
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Jul 25 22:48:17 EDT 2013
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:
> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>>
>>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>>> <steve+comp.lang.python at pearwood.info> wrote:
>>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>>> that are codepoints of text characters within buffers and strings.
>>>>> Rather, Emacs uses a variable-length internal representation of
>>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes
>>>>> up 2 bytes, etc. We call this representation of text multibyte.
>>>>
>>>> Well, you've just proven what Vim users have always suspected: Emacs
>>>> doesn't really exist.
>>>
>>> ... lolwut?
>>
>>
>> JMF has explained that it is impossible, impossible I say!, to write an
>> editor using a flexible string representation. Since Emacs uses such a
>> flexible string representation, Emacs is impossible, and therefore
>> Emacs doesn't exist.
>>
>> QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR. It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jfm would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string representation.
UTF-8 uses a flexible representation on a character-by-character basis.
When parsing UTF-8, you need to look at EVERY character (specifically,
its lead byte) to decide how many bytes to read. In Python 3.3, the
flexible representation is on a string-by-string basis: once Python has
looked at the string header, it can tell whether the *entire* string
takes 1, 2 or 4 bytes per character, and the string is then fixed-width.
You can't do that with UTF-8.
To put it in terms of pseudo-code:
# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))
# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char.
        if b uses 1 byte:
            char = convert(b)
        elif b uses 2 bytes:
            char = convert(b, next(1 byte))
        elif b uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))
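Here's a rough, runnable version of the same comparison, assuming
CPython 3.3 or later (sys.getsizeof reports implementation details, so
treat the exact byte counts as illustrative rather than guaranteed):

import sys

def bytes_per_char(c):
    # Internal storage cost of c: the extra memory needed to store one
    # more copy of it in a string.
    return sys.getsizeof(c * 1001) - sys.getsizeof(c * 1000)

for c in ('a', '\xe9', '\u20ac', '\U0001F40D'):
    print(hex(ord(c)), bytes_per_char(c), len(c.encode('utf-8')))
# Internally: 1, 1, 2 and 4 bytes per character respectively.
# In UTF-8:   1, 2, 3 and 4 bytes per character respectively.

# A single astral character widens the *whole* string to 4 bytes per char:
print(sys.getsizeof('a' * 1000))                # roughly 1000 bytes
print(sys.getsizeof('a' * 999 + '\U0001F40D'))  # roughly 4000 bytes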
So UTF-8 requires much more runtime overhead than Python 3.3's
representation, and Emacs's variant can in fact require more bytes per
character than either. (UTF-8 and Python 3.3 can require up to four
bytes per character, Emacs up to five.)
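The four-byte ceiling is easy to check from Python. (The Emacs five-byte
case can't be shown this way, since it covers Emacs's 22-bit character
range quoted above, which goes beyond what standard UTF-8 and Python
will encode.)

import sys

# U+10FFFF is the largest Unicode code point: 4 bytes in UTF-8, and
# 4 bytes per character internally in CPython 3.3+.
biggest = '\U0010FFFF'
print(len(biggest.encode('utf-8')))                                   # 4
print(sys.getsizeof(biggest * 1001) - sys.getsizeof(biggest * 1000))  # 4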
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of
his depth, and is a fine example of the Dunning-Kruger effect in action.
He is so sure he is right based on so little evidence.
One advantage of UTF-8 is that for some BMP characters, you can get away
with only three bytes instead of the four a fixed-width representation
needs. For transmitting data over the wire, or storing it on disk,
that's potentially up to a 25% reduction in space, which is not to be
sneezed at. (Although in practice the saving is usually much less than
that, since the most common characters only need 1 or 2 bytes, in UTF-8
and in a flexible-width representation alike.) But that comes at the
cost of much more runtime overhead, which in my opinion makes UTF-8 a
second-class string representation compared to fixed-width
representations.
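To put rough numbers on both cases (again assuming CPython 3.3+, with
getsizeof figures that are implementation details):

import sys

# Best case for UTF-8: BMP characters needing 3 bytes each, in a string
# that the flexible representation stores at 4 bytes per character
# because a single astral character is present.
mixed = '\u2603' * 999 + '\U0001F40D'   # 999 snowmen plus one astral char
print(len(mixed.encode('utf-8')))       # about 3000 bytes
print(4 * len(mixed))                   # 4000 bytes of character data

# Typical mostly-ASCII text: UTF-8 and the flexible representation both
# use 1 byte per character, so there is nothing to save.
text = 'hello world ' * 100
print(len(text.encode('utf-8')))                  # 1200 bytes
print(sys.getsizeof(text) - sys.getsizeof(''))    # about 1200 bytes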
--
Steven