[Tutor] sort() method and non-ASCII
Steven D'Aprano
steve at pearwood.info
Sun Feb 5 03:32:37 EST 2017
On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?
Correct, except that there is only one Unicode chart: every character
has a single, fixed code point.
You may be thinking of the legacy Windows "code pages" system, where
changing the code page re-interprets the same byte as a different
character. E.g. the byte 0xF0 is ð in code page 1252 (Western European)
but π in code page 1253 (Greek).
Python supports encoding and decoding to and from the legacy code page
encodings, but Unicode itself does away with the idea of using separate
code pages. It is effectively a single, giant code page with room for
over a million characters (1,114,112 code points). It's also a superset
of ASCII: the first 128 code points are the ASCII characters, so pure
ASCII text is byte-for-byte identical when encoded as UTF-8.
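To make the code page point concrete, here is a small sketch using
Python's built-in "cp1252" and "cp1253" codecs. The variable names are
only for illustration.

```python
# One byte, two legacy code pages, two different characters.
raw = bytes([0xF0])

western = raw.decode("cp1252")  # Western European code page
greek = raw.decode("cp1253")    # Greek code page
print(western, greek)

# Once decoded, each character has its own fixed Unicode code point.
print(hex(ord(western)), hex(ord(greek)))

# Pure ASCII text produces the same bytes in ASCII and in UTF-8.
print("spam".encode("ascii") == "spam".encode("utf-8"))
```

The first print should show two different characters for the same
byte, which is exactly the ambiguity Unicode removes.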
Anyhoo, since Unicode supports dozens of languages from all over the
world, it defines "collation rules" for sorting text in various
languages. For example, sorting in Austria is different from sorting in
Germany, despite them both using the same alphabet. Even in English,
sorting rules can vary: some phone books sort Mc and Mac together, some
don't.
However, Python's built-in sorting doesn't apply those rules. It just
provides a single basic lexicographic sort based on the ord() (the
Unicode code point) of each character in the string.
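As a rough illustration of that code point ordering, the sketch below
sorts a made-up word list twice: once with the default comparison, and
once with a key that spells out the ord() of every character. The two
results should agree.

```python
# Hypothetical sample data, chosen to mix upper case, lower case
# and one accented letter.
words = ["zebra", "apple", "Zulu", "ábaco", "Apple"]

# Default sort: strings are compared character by character.
default_order = sorted(words)

# The same comparison made explicit: compare lists of code points.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

print(default_order)
print(default_order == by_code_point)
```

Upper case letters have smaller code points than lower case ones, and
"á" has a larger code point than any ASCII letter, so the capitalised
words land first and the accented word last.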
> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character.
Correct:
py> "á" > "z"
True
py> sorted('áz')
['z', 'á']
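If you want "á" to sort next to "a", one crude workaround is a sort key
that strips accents first. This is only a sketch: fold_accents is a
made-up helper name, and dropping combining marks is an approximation,
not real language-aware collation. For that you would look at
locale.strxfrm in the standard library (which depends on the locales
installed on your system) or a third-party library such as PyICU.

```python
import unicodedata

def fold_accents(text):
    """Hypothetical helper: drop combining marks so 'á' compares like 'a'."""
    # NFD splits 'á' into 'a' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(sorted("áz"))                    # plain code point order
print(sorted("áz", key=fold_accents))  # accent-folded order
```

The key only affects how items are compared; the original strings,
accents included, are what end up in the sorted list.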
--
Steve