[Tutor] sort() method and non-ASCII

Sun Feb 5 03:32:37 EST 2017

On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

Correct, except that there is only one Unicode chart: each character has 
a single code point, and that number is what gets compared.

You may be thinking of the legacy Windows "code pages" system, where you 
can change the code page to re-interpret characters as different 
characters. E.g. ð in code page 1252 (Western European) becomes π in 
code page 1253 (Greek).
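
To make the code page idea concrete, here is a small sketch that decodes 
the same single byte under both code pages. The variable names are just 
for illustration; "cp1252" and "cp1253" are the codec names Python uses 
for those two Windows code pages.

```python
# One byte, two meanings: the code page decides which character it is.
raw = b"\xf0"

western = raw.decode("cp1252")  # Windows Western European
greek = raw.decode("cp1253")    # Windows Greek

print(western, greek)

# In Unicode the two characters are distinct, each with its own code point.
print(hex(ord(western)), hex(ord(greek)))
```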

Python supports encoding and decoding to and from legacy code page 
forms, but Unicode itself does away with the idea of using separate code 
pages. It is effectively a single, giant code page with room for over a 
million characters. It's also a superset of ASCII: the first 128 code 
points are the ASCII characters, so pure ASCII text is byte-for-byte 
identical when encoded as UTF-8.
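
A quick way to check the ASCII claim for yourself, as a rough sketch 
(the sample string is arbitrary, and UTF-8 is my choice of Unicode 
encoding for the comparison):

```python
text = "Hello, Tutor!"

# Encode the same pure-ASCII text two ways.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# Every character's code point is below 128 and equals its ASCII byte.
code_points = [ord(c) for c in text]
all_below_128 = all(cp < 128 for cp in code_points)

print(same, all_below_128)
```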

Anyhoo, since Unicode supports dozens of languages from all over the 
world, it defines "collation rules" for sorting text in various 
languages. For example, sorting in Austria is different from sorting in 
Germany, despite them both using the same alphabet. Even in English, 
sorting rules can vary: some phone books sort Mc and Mac together, some 
don't.

However, Python's sort doesn't apply those rules. It just does a single 
basic lexicographic sort based on the ord() of each character in the 
string.
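
As an illustration, the default string sort gives the same result as 
sorting on an explicit list of ord() values. The word list is made up, 
and the key function is only there to spell out the comparison; it is 
not how the sort is implemented internally.

```python
# "\u00e1" is á (code point 225), which is larger than "z" (122).
words = ["zebra", "apple", "\u00e1baco", "Zulu"]

default_order = sorted(words)

# Compare strings as sequences of code points, character by character.
by_code_points = sorted(words, key=lambda s: [ord(c) for c in s])

print(default_order)
print(default_order == by_code_points)
```

Note that uppercase "Z" (90) also sorts before every lowercase letter, 
for the same reason.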

> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character. 

Correct:

py> "á" > "z"
True
py> sorted('áz')
['z', 'á']
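
If you want accented letters to sort next to their plain counterparts, 
one crude workaround is a key function that strips the accents before 
comparing. This is only a sketch: strip_accents is a name I made up, and 
it is not a real collation algorithm, so it won't follow any particular 
language's rules.

```python
import unicodedata

def strip_accents(s):
    """Return s with combining marks removed (hypothetical helper)."""
    # NFD splits "á" into "a" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

words = ["zebra", "\u00e1baco", "apple"]  # "\u00e1" is á

plain_sort = sorted(words)
folded_sort = sorted(words, key=strip_accents)

print(plain_sort)
print(folded_sort)
```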

-- 
Steve