[Python-Dev] PEP 393: Flexible String Representation

Fri Jan 28 15:27:39 CET 2011

* Stefan Behnel:

> The nice thing about Py_UNICODE is that it basically gives you native
> Unicode code points directly, without needing to decode UTF-8 byte
> runs and the like. In Cython, it allows you to do things like this:
>
>     def test_for_those_characters(unicode s):
>         for c in s:
>             # warning: randomly chosen Unicode escapes ahead
>             if c in u"\u0356\u1012\u3359\u4567":
>                 return True
>         else:
>             return False
>
> The loop runs in plain C, using the somewhat obvious implementation
> with a loop over Py_UNICODE characters and a switch statement for the
> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
> strings.

Not really, because UTF-8 is quite search-friendly.  (The "if" would
have to invoke a memmem()-like primitive.)  Random subscripts are the
problematic case, because finding the n-th code point requires a scan.
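
To illustrate the search-friendliness point, here is a minimal sketch
in plain Python (not Cython, and not anyone's actual implementation).
The names TARGETS and contains_any_utf8 are made up for this example,
and the "in" operator on bytes stands in for a memmem()-like primitive.
It assumes the input is valid UTF-8.

```python
# Hypothetical helper: test whether any of a set of characters occurs in
# UTF-8 encoded data, without decoding the data first.
TARGETS = [ch.encode("utf-8") for ch in "\u0356\u1012\u3359\u4567"]


def contains_any_utf8(data: bytes) -> bool:
    # In valid UTF-8, lead bytes and continuation bytes use disjoint
    # ranges, so the encoding of one code point can only match at a
    # character boundary. A plain byte-substring search therefore gives
    # no false positives.
    return any(target in data for target in TARGETS)


text = "abc\u1012def"
data = text.encode("utf-8")

found = contains_any_utf8(data)                      # byte-level search
missing = contains_any_utf8("abc\u1013def".encode("utf-8"))

# Random subscripts are the awkward part: index 3 is one code point in
# the str, but only the first byte of a three-byte sequence in the bytes.
code_point = text[3]
first_byte = data[3]
```

The search never decodes anything; only positional access needs to
know where the character boundaries are.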

However, why would one want to write loops like the above?  Don't you
have to take combining character sequences (where one user-perceived
character comprises multiple code points) into account most of the
time when you look at individual characters?  Then UTF-32 does not
offer much of a simplification.
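
A small sketch of the combining-character concern, again in plain
Python.  The helper contains_char is hypothetical and only mirrors the
per-code-point loop quoted above; normalizing to NFC is one possible
mitigation and does not cover sequences that have no precomposed form.

```python
import unicodedata

decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE


def contains_char(s: str, ch: str) -> bool:
    # Per-code-point comparison, as in the quoted loop.
    return any(c == ch for c in s)


# Both strings render as the same character, but the code-point loop
# only finds the precomposed form.
raw_match = contains_char(decomposed, precomposed)

# After NFC normalization the two code points are composed into one.
normalized = unicodedata.normalize("NFC", decomposed)
normalized_match = contains_char(normalized, precomposed)
```

Having fixed-width code points does not help here: the loop sees two
code points either way until the text is normalized.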

-- 
Florian Weimer                <fweimer at bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99