[Tutor] sort() method and non-ASCII

eryk sun eryksun at gmail.com
Sun Feb 5 02:52:45 EST 2017


On Sun, Feb 5, 2017 at 3:52 AM, boB Stepp <robertvstepp at gmail.com> wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

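list.sort uses a less-than comparison. What you really want to know is
how Python compares strings. They're compared by ordinal at
corresponding indexes, i.e. ord(s1[i]) vs ord(s2[i]) for i less than
min(len(s1), len(s2)). The first index at which the ordinals differ
decides the result. If they never differ, the shorter string is less.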
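
To make that rule concrete, here is a rough sketch of the comparison
written out by hand. The name less_than is made up for illustration.
CPython does this in C, not like this, but the result should agree
with the built-in < operator on str:

```python
def less_than(s1, s2):
    """Return True if s1 sorts before s2, comparing code points pairwise."""
    for c1, c2 in zip(s1, s2):
        if ord(c1) != ord(c2):
            # The first differing position decides the result.
            return ord(c1) < ord(c2)
    # Every shared position is equal, so the shorter string sorts first.
    return len(s1) < len(s2)


# Spot-check the sketch against the built-in comparison.
pairs = [('abc', 'abd'), ('ab', 'abc'), ('abc', 'abc'),
         ('Z', 'a'), ('C\u0327', '\xc7')]
for a, b in pairs:
    assert less_than(a, b) == (a < b)
    assert less_than(b, a) == (b < a)
```

Note that 'Z' < 'a' is True under this rule, since ord('Z') is 90 and
ord('a') is 97.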

This gets a bit interesting when you're comparing characters that have
composed and decomposed Unicode forms, i.e. a single code point vs a
base character followed by one or more combining code points. For
example:

    >>> s1 = '\xc7'
    >>> s2 = 'C' + '\u0327'
    >>> print(s1, s2)
    Ç Ç
    >>> s2 < s1
    True

where U+0327 is a combining cedilla. As characters, s1 and s2 are the
same (canonically equivalent, in Unicode terms). However, codewise s2
is less than s1 because 0x43 ("C") is less than 0xc7 ("Ç"). In this
case you can first normalize the strings to either composed or
decomposed form [1]. For example:

    >>> import functools, unicodedata
    >>> strings = ['\xc7', 'C\u0327', 'D']
    >>> sorted(strings)
    ['Ç', 'D', 'Ç']

    >>> norm_nfc = functools.partial(unicodedata.normalize, 'NFC')
    >>> sorted(strings, key=norm_nfc)
    ['D', 'Ç', 'Ç']

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

