[Python-3000] Are strs sequences of characters or disguised byte strings?

Wed Oct 3 05:28:56 CEST 2007

String objects are arrays of code units. They can represent normalized
and unnormalized Unicode text equally well, and even invalid data,
such as half of a surrogate pair or other illegal code units. It is up to the
application (or perhaps at some point the library) to implement
various checks and normalizations. AFAIK this is the same stance that
Java and C# take -- the String types there don't concern themselves
with the higher levels of Unicode standard compliance. (Though those
languages probably have more library support than Python does --
perhaps someone can contribute something, like wrappers for ICU?)
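
As a rough sketch of what such an application-level check might look
like, using only the existing unicodedata module: the helper name
nfc_equal is made up for illustration, and the choice of NFC (rather
than NFD or one of the compatibility forms) is an assumption that each
application would have to make for itself.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC).

    Hypothetical helper for illustration; not part of the stdlib.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


c1 = "\u00C7"   # precomposed LATIN CAPITAL LETTER C WITH CEDILLA
c2 = "C\u0327"  # 'C' followed by COMBINING CEDILLA

raw_equal = (c1 == c2)                # plain str comparison, code point by code point
canonical_equal = nfc_equal(c1, c2)   # comparison after normalization

# Normalizing keys on the way in collapses both spellings into one entry.
d = {}
for key, value in ((c1, 1), (c2, 2)):
    d[unicodedata.normalize("NFC", key)] = value

print(raw_equal, canonical_equal, len(d))
```

With this approach the str type itself stays a plain array of code
units, and the cost of normalization is paid only by code that asks
for it.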

However, for identifiers occurring in source code, we *do* normalize
before comparing them. PEP 3131 should explain this.
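
A small sketch of that identifier behavior, assuming a Python 3
interpreter that implements PEP 3131 (which specifies NFKC for
identifiers); the namespace dict is only there to keep the example
self-contained:

```python
# Run two assignments whose target names differ only in normalization form.
namespace = {}
exec("C\u0327 = 5", namespace)    # identifier spelled as 'C' + COMBINING CEDILLA
exec("\u00C7 = -7", namespace)    # identifier spelled with precomposed U+00C7

# exec() adds '__builtins__' to the dict; filter it out before inspecting names.
names = [name for name in namespace if name != "__builtins__"]
print(names, namespace["\u00C7"])
```

Both spellings end up bound to the same name, so the second assignment
overwrites the first, while the string literals passed to exec() are
themselves left untouched.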

--Guido

On 10/2/07, Mark Summerfield <mark at qtrac.eu> wrote:
> In Python 3.0a1, exec() appears to normalize strings, but in other cases
> they are not normalized, and this leads to results that seem
> counter-intuitive in some cases, at least to me.
>
>     >>> c1 = "\u00C7"
>     >>> c2 = "C\u0327"
>     >>> c3 = "\u0043\u0327"
>     >>> c1, c2, c3
>     ('\xc7', 'C\u0327', 'C\u0327')
>     >>> print(c1, c2)
>     Ç Ç
>
> Clearly c1 and c2 are different at the byte level. But if we use them to
> create variables using exec(), Python appears to normalize them:
>
>     >>> dir()
>     ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3']
>     >>> exec("C\u0327 = 5")
>     >>> dir()
>     ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
>     >>> Ç
>     5
>     >>> exec("\u00C7 = -7")
>     >>> dir()
>     ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
>     >>> Ç
>     -7
>
> This seems to be the right behaviour to me, since from the point of view
> of a programmer, Ç is the name of the variable, no matter what the
> underlying byte encoding used to represent the variable's name.
>
>     >>> print(c1, c2)
>     Ç Ç
>     >>> c1.encode("utf8") == c2.encode("utf8")
>     False
>
> This is what I'd expect, since here I'm comparing the actual bytes.
>
> But when I compare them as strings I really expect them to be compared
> as sequences of characters (in a human sense), so this:
>
>     >>> c1 == c2
>     False
>
> seems counter-intuitive to me. It is easy to fix:
>
>     >>> from unicodedata import normalize
>     >>> normalize("NFKD", c1) == normalize("NFKD", c2)
>     True
>
> but isn't it asking a lot of Python users to use normalize() whenever
> they want to perform such a basic operation as string comparison?
>
> Another issue that arises is that you can end up with duplicate
> dictionary keys and set elements. (The duplication is in human terms; in
> byte terms the keys/set elements differ, of course):
>
>     >>> d = {c1: 1, c2: 2}
>     >>> d
>     {'C\u0327': 2, '\xc7': 1}
>     >>> for k, v in d.items():
>     ...     print(k, v)
>     ...
>     Ç 2
>     Ç 1
>
> I think this is surprising.
>
>     >>> s = {c1, c2}
>     >>> s
>     {'C\u0327', '\xc7'}
>     >>> for x in s:
>     ...     print(x)
>     ...
>     Ç
>     Ç
>
> The same result applies to sets, of course.
>
> I don't know what the performance costs would be for always normalizing
> strings, but it seems to me that if strings are not normalized, then
> they are really being treated as byte strings thinly disguised as
> strings rather than as true sequences of characters whose byte
> representation is a detail that programmers can ignore (unless they
> choose to explicitly decode).
>
> --
> Mark Summerfield, Qtrac Ltd., www.qtrac.eu

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)