[Python-Dev] PEP 393: Flexible String Representation

Fri Jan 28 16:22:37 CET 2011

Florian Weimer, 28.01.2011 15:27:
> * Stefan Behnel:
>
>> The nice thing about Py_UNICODE is that it basically gives you native
>> Unicode code points directly, without needing to decode UTF-8 byte
>> runs and the like. In Cython, it allows you to do things like this:
>>
>>      def test_for_those_characters(unicode s):
>>          for c in s:
>>              # warning: randomly chosen Unicode escapes ahead
>>              if c in u"\u0356\u1012\u3359\u4567":
>>                  return True
>>          else:
>>              return False
>>
>> The loop runs in plain C, using the somewhat obvious implementation
>> with a loop over Py_UNICODE characters and a switch statement for the
>> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
>> strings.
>
> Not really, because UTF-8 is quite search-friendly.  (The if would
> have to invoke a memmem()-like primitive.)  Random subscripts are
> problematic.
>
> However, why would one want to write loops like the above?  Don't you
> have to take combining characters (comprising multiple codepoints)
> into account most of the time when you look at individual characters?
> Then UTF-32 does not offer much of a simplification.

Hmm, I think this discussion is pointless. Regardless of the memory layout,
you can always go down to the byte level and use an efficient
(multi-)substring search algorithm (which is obviously helped if you know
the layout at compile time *wink*).
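
A minimal sketch of the byte-level idea, assuming UTF-8 storage and Python 3.
The helper name contains_any_utf8 is hypothetical and not from this thread,
and the plain per-needle search stands in for a real multi-substring
algorithm:

```python
# Illustrative sketch only; contains_any_utf8 is a hypothetical helper name.
def contains_any_utf8(data: bytes, chars: str) -> bool:
    """Return True if UTF-8 encoded `data` contains any character in `chars`."""
    # Encode each needle once. In valid UTF-8, lead bytes and continuation
    # bytes use disjoint ranges, so a byte-level hit always lines up with a
    # whole character and cannot start in the middle of another one.
    needles = [ch.encode("utf-8") for ch in chars]
    # bytes.__contains__ does a plain substring search, one pass per needle.
    return any(needle in data for needle in needles)


targets = "\u0356\u1012\u3359\u4567"  # the escapes from the quoted example
hit = contains_any_utf8("abc \u3359 def".encode("utf-8"), targets)
miss = contains_any_utf8("plain ASCII text".encode("utf-8"), targets)
print(hit, miss)  # prints: True False
```

No decoding to code points is needed here, which is the point being
conceded: the search works on the raw bytes whatever the layout.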

Bad example, I guess.

Stefan