[Python-3000] Are strs sequences of characters or disguised byte strings?

Mark Summerfield mark at qtrac.eu
Wed Oct 3 04:24:50 CEST 2007


In Python 3.0a1, exec() appears to normalize strings, but in other
contexts they are not normalized, and this leads to results that seem
counter-intuitive, at least to me.

    >>> c1 = "\u00C7"
    >>> c2 = "C\u0327"
    >>> c3 = "\u0043\u0327"
    >>> c1, c2, c3
    ('\xc7', 'C\u0327', 'C\u0327')
    >>> print(c1, c2)
    Ç Ç

Clearly c1 and c2 are different at the byte level. But if we use them to
create variables using exec(), Python appears to normalize them:

    >>> dir()
    ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3']
    >>> exec("C\u0327 = 5")
    >>> dir()
    ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
    >>> Ç
    5
    >>> exec("\u00C7 = -7")
    >>> dir()
    ['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
    >>> Ç
    -7

This seems to be the right behaviour to me, since from the point of
view of a programmer, Ç is the name of the variable, no matter what
underlying byte encoding is used to represent the variable's name.
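
As far as I can tell this matches PEP 3131, which says that
identifiers are converted to normal form NFKC while parsing; a quick
check with the strings above:

    >>> from unicodedata import normalize
    >>> normalize("NFKC", c2) == c1
    True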

    >>> print(c1, c2)
    Ç Ç
    >>> c1.encode("utf8") == c2.encode("utf8")
    False

This is what I'd expect, since here I'm comparing the actual bytes.
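
(And indeed the underlying UTF-8 byte sequences are different:)

    >>> c1.encode("utf8"), c2.encode("utf8")
    (b'\xc3\x87', b'C\xcc\xa7')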

But when I compare them as strings, I really expect them to be
compared as sequences of characters (in the human sense), so this:

    >>> c1 == c2
    False

seems counter-intuitive to me. It is easy to fix:

    >>> from unicodedata import normalize
    >>> normalize("NFKD", c1) == normalize("NFKD", c2)
    True

but isn't it asking a lot of Python users to use normalize() whenever
they want to perform such a basic operation as string comparison?
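
In practice I suppose one ends up wrapping it in a little helper
(just a sketch, and the name is my own):

    >>> def chareq(a, b):
    ...     # compare as sequences of characters by normalizing both
    ...     # sides first, so precomposed and decomposed forms match
    ...     return normalize("NFKD", a) == normalize("NFKD", b)
    ...
    >>> chareq(c1, c2)
    True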

Another issue that arises is that you can end up with duplicate
dictionary keys and set elements. (The duplication is in human terms;
in byte terms the keys/set elements differ, of course):

    >>> d = {c1: 1, c2: 2}
    >>> d
    {'C\u0327': 2, '\xc7': 1}
    >>> for k, v in d.items():
    ...     print(k, v)
    ...
    Ç 2
    Ç 1

I think this is surprising.

    >>> s = {c1, c2}
    >>> s
    {'C\u0327', '\xc7'}
    >>> for x in s:
    ...     print(x)
    ...
    Ç
    Ç

And the same result applies to sets of course.
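
The workaround I can see is to normalize the keys (or elements)
myself on the way in, something like this (just a sketch):

    >>> d = {}
    >>> for k, v in ((c1, 1), (c2, 2)):
    ...     d[normalize("NFKD", k)] = v
    ...
    >>> d
    {'C\u0327': 2}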

I don't know what the performance cost of always normalizing strings
would be, but it seems to me that if strings are not normalized, then
they are really byte strings thinly disguised as strings, rather than
true sequences of characters whose byte representation is a detail
that programmers can ignore (unless they choose to decode
explicitly).
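
I don't know exactly what "always normalizing" would look like, but
just to illustrate the idea (purely a sketch, and the class name is
my own invention), a str subtype that normalizes on construction
behaves the way I would expect:

    >>> class NStr(str):
    ...     # store the text in NFKD form, so equality, hashing, and
    ...     # therefore dict keys and set membership work by character
    ...     def __new__(cls, value=""):
    ...         return str.__new__(cls, normalize("NFKD", value))
    ...
    >>> NStr(c1) == NStr(c2)
    True
    >>> len({NStr(c1), NStr(c2)})
    1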

-- 
Mark Summerfield, Qtrac Ltd., www.qtrac.eu


