[Python-Dev] PEP 393: Flexible String Representation
Stefan Behnel
stefan_ml at behnel.de
Fri Jan 28 16:22:37 CET 2011
Florian Weimer, 28.01.2011 15:27:
> * Stefan Behnel:
>
>> The nice thing about Py_UNICODE is that it basically gives you native
>> Unicode code points directly, without needing to decode UTF-8 byte
>> runs and the like. In Cython, it allows you to do things like this:
>>
>>     def test_for_those_characters(unicode s):
>>         for c in s:
>>             # warning: randomly chosen Unicode escapes ahead
>>             if c in u"\u0356\u1012\u3359\u4567":
>>                 return True
>>         else:
>>             return False
>>
>> The loop runs in plain C, using the somewhat obvious implementation
>> with a loop over Py_UNICODE characters and a switch statement for the
>> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
>> strings.
>
> Not really, because UTF-8 is quite search-friendly. (The if would
> have to invoke a memmem()-like primitive.) Random subscripts are
> problematic.
>
> However, why would one want to write loops like the above? Don't you
> have to take combining characters (comprising multiple codepoints)
> into account most of the time when you look at individual characters?
> Then UTF-32 does not offer much of a simplification.
Hmm, I think this discussion is pointless. Regardless of the memory layout,
you can always go down to the byte level and use an efficient
(multi-)substring search algorithm (which is obviously helped if you know
the layout at compile time *wink*).
Bad example, I guess.
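
For the record, here is a rough Python sketch of the two approaches being
compared: a per-code-point membership test and a byte-level substring search
over UTF-8 data. The function names and the TARGETS constant are made up for
illustration, and this is plain Python, not what Cython generates. The
byte-level variant assumes the input is valid UTF-8; in that case a match on a
complete encoded character cannot start in the middle of another character.

```python
# Hypothetical names throughout; same randomly chosen escapes as above.
TARGETS = "\u0356\u1012\u3359\u4567"


def contains_any_codepoint(s: str) -> bool:
    """Code-point view: test each character of the string against the set."""
    return any(c in TARGETS for c in s)


def contains_any_utf8(data: bytes) -> bool:
    """Byte view: search for each target's UTF-8 encoding as a substring.

    Assumes `data` is valid UTF-8, so a full-sequence match is always
    aligned on a character boundary.
    """
    needles = [c.encode("utf-8") for c in TARGETS]
    return any(needle in data for needle in needles)
```

Both functions should agree on any string and its UTF-8 encoding; a real
implementation would likely use a single multi-pattern search instead of one
pass per needle.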
Stefan