RE Module Performance
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Jul 25 05:22:17 EDT 2013
On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>>
>>> If nobody had ever thought of doing a multi-format string
>>> representation, I could well imagine the Python core devs debating
>>> whether the cost of UTF-32 strings is worth the correctness and
>>> consistency improvements... and most likely concluding that narrow
>>> builds get abolished. And if any other language (eg ECMAScript)
>>> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
>>> the move, even if it broke code to do so.
>>
>> Unfortunately, so long as most language designers are European-centric,
>> there is going to be a lot of push-back against any attempt to fix
>> (say) Javascript, or Java just for the sake of "a bunch of dead
>> languages" in the SMPs. Thank goodness for emoji. Wait til the young
>> kids start complaining that their emoticons and emoji are broken in
>> Javascript, and eventually it will get fixed. It may take a decade, for
>> the young kids to grow up and take over Javascript from the
>> old-codgers, but it will happen.
>
> I don't know that that'll happen like that. Emoticons aren't broken in
> Javascript - you can use them just fine. You only start seeing problems
> when you index into that string. People will start to wonder why, for
> instance, a "500 character maximum" field deducts two from the limit
> when an emoticon goes in.
I get that. I meant *Javascript developers*, not end-users. The young
kids today who become Javascript developers tomorrow will grow up in a
world where they expect to be able to write band names like
"▼□■□■□■" (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but emoji aren't, and I
guarantee that even as we speak some new hipster band is trying to decide
whether to name themselves "Smiling 😢" or "Crying 😊".
:-)
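(For the record, Python 3.3+ with PEP 393 strings gets this right, while the UTF-16 view shows the two-unit cost. A quick sketch, using the emoji above:)

```python
# Emoji live in the Supplementary Multilingual Plane (outside the BMP),
# so in UTF-16 each one costs two 16-bit code units (a "surrogate pair").
s = "Crying \U0001F60A"              # "Crying" + smiling face emoji
print(len(s))                        # 8 code points: PEP 393 counts correctly
utf16_units = len(s.encode('utf-16-le')) // 2
print(utf16_units)                   # 9 units: a UTF-16 language "loses" one
```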
>> It is *possible* to have non-buggy string routines using UTF-16, but
>> the implementation is a lot more complex than most language developers
>> can be bothered with. I'm not aware of any language that uses UTF-16
>> internally that doesn't give wrong results for surrogate pairs.
>
> The problem isn't the underlying representation, the problem is what
> gets exposed to the application. Once you've decided to expose
> codepoints to the app (abstracting over your UTF-16 underlying
> representation), the change to using UTF-32, or mimicking PEP 393, or
> some other structure, is purely internal and an optimization. So I doubt
> any language will use UTF-16 internally and UTF-32 to the app. It'd be
> needlessly complex.
To be honest, I don't understand what you are trying to say.
What I'm trying to say is that it is possible to use UTF-16 internally,
but *not* assume that every code point (character) is represented by a
single 2-byte unit. For example, the len() of a UTF-16 string should not
be calculated by counting the number of bytes and dividing by two. You
actually need to walk the string, inspecting each 16-bit code unit:
# calculate length
count = 0
inside_surrogate = False
for bb in buffer:  # get one 16-bit code unit at a time
    if is_high_surrogate(bb):  # lead unit, 0xD800-0xDBFF
        if inside_surrogate:
            raise ValueError("missing low surrogate")
        inside_surrogate = True
        continue
    if is_low_surrogate(bb):  # trail unit, 0xDC00-0xDFFF
        if not inside_surrogate:
            raise ValueError("missing high surrogate")
        count += 1
        inside_surrogate = False
        continue
    if inside_surrogate:
        raise ValueError("missing low surrogate")
    count += 1
if inside_surrogate:
    raise ValueError("missing low surrogate")
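Here is that loop as runnable Python, checked against Python's own idea of the length. (The helper names `is_high_surrogate`, `is_low_surrogate` and `utf16_len` are mine, not from any library.)

```python
def is_high_surrogate(unit):
    # Lead unit of a pair: 0xD800-0xDBFF.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Trail unit of a pair: 0xDC00-0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF

def utf16_len(data):
    """Count code points in a UTF-16-LE byte string, validating pairs."""
    count = 0
    inside_surrogate = False
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i+2], 'little')
        if is_high_surrogate(unit):
            if inside_surrogate:
                raise ValueError("missing low surrogate")
            inside_surrogate = True
        elif is_low_surrogate(unit):
            if not inside_surrogate:
                raise ValueError("missing high surrogate")
            count += 1
            inside_surrogate = False
        else:
            if inside_surrogate:
                raise ValueError("missing low surrogate")
            count += 1
    if inside_surrogate:
        raise ValueError("missing low surrogate")
    return count

s = "a\U0001F60Ab"
assert utf16_len(s.encode('utf-16-le')) == len(s)  # 3 code points, 4 units
```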
Given immutable strings, you could validate the string once, on creation,
and from then on assume they are well-formed:
# calculate length, assuming the string is well-formed:
count = 0
skip = False
for bb in buffer:  # get one 16-bit code unit at a time
    if skip:
        # trail half of a surrogate pair, already counted
        skip = False
        continue
    if is_surrogate(bb):
        skip = True  # the pair counts as a single code point
    count += 1
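That "validate once, then trust" idea might look like this. (The `UTF16String` class is entirely my own invention for illustration; nothing like it exists in the stdlib. I lean on Python's strict utf-16-le codec to do the one-time validation, since it rejects unpaired surrogates.)

```python
class UTF16String:
    """Immutable UTF-16 string: validate on creation, then assume
    well-formed everywhere else.  A sketch, not a real implementation."""

    def __init__(self, data):
        # data: UTF-16-LE bytes.  Decoding with the strict codec raises
        # UnicodeDecodeError on unpaired surrogates, so validation
        # happens exactly once, here.
        data.decode('utf-16-le')
        self._data = data

    def __len__(self):
        # Safe to use the fast "skip the trail unit" loop now.
        count = 0
        skip = False
        for i in range(0, len(self._data), 2):
            if skip:
                skip = False
                continue
            unit = int.from_bytes(self._data[i:i+2], 'little')
            if 0xD800 <= unit <= 0xDBFF:  # high surrogate: pair follows
                skip = True
            count += 1
        return count

s = UTF16String("Crying \U0001F60A".encode('utf-16-le'))
assert len(s) == 8  # 8 code points, though the buffer holds 9 units
```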
String operations such as slicing become much more complex once you can
no longer assume a 1:1 relationship between code points and code units,
whether they are 1, 2 or 4 bytes. Most (all?) language developers don't
handle that complexity, and push responsibility for it back onto the
coder using the language.
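To see why slicing is the hard case: cutting a UTF-16 buffer at a raw code-unit index can land in the middle of a pair, so a correct implementation has to translate code point indices into code-unit offsets by walking the buffer. A sketch of that translation (the helper name is mine, and it assumes well-formed input):

```python
def code_point_slice(data, start, stop):
    """Slice a well-formed UTF-16-LE byte string by *code point* index."""
    offsets = [0]  # byte offset where each code point begins
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i+2], 'little')
        # A high surrogate means this code point spans 4 bytes, not 2.
        i += 4 if 0xD800 <= unit <= 0xDBFF else 2
        offsets.append(i)
    return data[offsets[start]:offsets[stop]]

buf = "a\U0001F60Ab".encode('utf-16-le')
assert code_point_slice(buf, 1, 2).decode('utf-16-le') == "\U0001F60A"
# A naive code-unit slice, buf[2:4], would grab only half the emoji.
```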
--
Steven