[Tutor] sort() method and non-ASCII

Steven D'Aprano steve at pearwood.info
Sun Feb 5 20:23:56 EST 2017


On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 <random832 at fastmail.com> wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
> 
> As I said in my response to Steve just now:  I was looking at
> http://unicode.org/charts/  Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the 
convenience of people wishing to find particular characters.


> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally 
given in hexadecimal. They are numbered from 0 to 1114111 in 
decimal (although not all codes are currently used).
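
To tie this back to the original sorting question, here is a minimal sketch showing that the default sort compares strings one code point at a time, so uppercase ASCII sorts before lowercase and accented letters sort after 'z'. The word list is made up for illustration.

```python
# Illustrative word list (an assumption, not from the original thread).
# '\u00e9' is the precomposed letter e-acute.
words = ['zebra', 'Apple', '\u00e9clair', 'apple']

# Default ordering: strings are compared code point by code point.
by_default = sorted(words)

# The same ordering made explicit: sort on lists of code point numbers.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

print(by_default)
print(by_default == by_code_point)
```

For a language-aware ordering you would need something like the key=locale.strxfrm that Random832 mentioned, and its result depends on the locale in effect.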

The terminology used is that a "code point" is what I've been calling a 
"character", although not all code points are characters. Code points 
are usually written either as the character itself, e.g. 'A', or using 
the notation U+0041 where there are at least four and no more than six 
hexadecimal digits following the "U+". 
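
As a rough illustration of that notation, the sketch below formats a character as U+XXXX. The helper name u_plus is hypothetical, chosen only for this example.

```python
def u_plus(ch):
    """Return the U+XXXX notation for a single-character string.

    Hypothetical helper for illustration: pads to at least four
    hex digits and uses more when the code point needs them.
    """
    return 'U+{:04X}'.format(ord(ch))


print(u_plus('A'))           # U+0041
print(u_plus('\u03C0'))      # U+03C0
print(u_plus('\U0001F600'))  # U+1F600
```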

Bringing this back to Python, if you know the code point (as a number), 
you can use the chr() function to return it as a string:

py> chr(960)
'π'


Don't forget that Python understands hex too!

py> chr(0x03C0)  # better than chr(int('03C0', 16))
'π'
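
Going the other way, ord() turns a one-character string back into its code point. The following sketch, which assumes Python 3.3 or later, also shows the upper limit mentioned earlier and what happens if you go past it:

```python
import sys

# ord() is the inverse of chr().
print(ord('\u03C0'))       # 960
print(hex(ord('\u03C0')))  # 0x3c0

# The largest valid code point, 0x10FFFF.
print(sys.maxunicode)      # 1114111

# Anything beyond the last code point is rejected by chr().
out_of_range = False
try:
    chr(sys.maxunicode + 1)
except ValueError:
    out_of_range = True
print(out_of_range)        # True
```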


Alternatively, you can embed it right in the string. For code points 
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U 
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'


Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'
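
If you don't know a character's name, or want to go from a name to a character at run time, the standard unicodedata module can help. A small sketch:

```python
import unicodedata

# Look up the official name of a character.
name = unicodedata.name('\u03C0')
print(name)  # GREEK SMALL LETTER PI

# Look up a character by its official name.
char = unicodedata.lookup('GREEK SMALL LETTER PI')
print(char == '\u03C0')  # True
```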


One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points 
("characters"). UTF-8 is one particular concrete encoding used to store 
or transmit such strings. Here are three possible encodings of the 
string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(The byte order of UTF-16 and UTF-32 depends on the platform's 
endianness, so the bytes could be reversed, e.g. C003 7A00 for UTF-16. 
UTF-8 has no such ambiguity.)
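
The byte sequences above can be reproduced with str.encode(). This sketch assumes Python 3.5 or later for bytes.hex(), and names the byte order explicitly ("-be" for big-endian, "-le" for little-endian) so that no byte order mark is added:

```python
s = '\u03C0z'  # the string 'πz'

# Encode with an explicit byte order so the output has no BOM.
encoded = {
    codec: s.encode(codec).hex()
    for codec in ('utf-16-be', 'utf-16-le',
                  'utf-32-be', 'utf-32-le', 'utf-8')
}

for codec, hex_bytes in encoded.items():
    print(codec, hex_bytes)
```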

Prior to version 3.3, there was a build-time option to select either 
"narrow" or "wide" Unicode strings. A narrow build used a fixed two 
bytes per code point, together with an incomplete and not quite correct 
scheme for using two code points together to represent the supplementary 
Unicode characters U+10000 through U+10FFFF. (This is sometimes called 
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at 
least an incomplete and "buggy" implementation of UTF-16.)
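
For the curious, here is a sketch of how proper UTF-16 splits a supplementary code point into a pair of 16-bit values (a "surrogate pair"). The function name surrogate_pair is hypothetical. It illustrates the standard UTF-16 arithmetic, not the internals of the old narrow builds.

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary code point')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F600)
print('{:04X} {:04X}'.format(high, low))       # D83D DE00

# Python's own UTF-16 encoder produces the same pair.
print('\U0001F600'.encode('utf-16-be').hex())  # d83dde00

# From Python 3.3 on, the string itself is one code point long.
print(len('\U0001F600'))                       # 1
```

On a narrow build, that last len() call would have reported 2, which is one of the ways the old scheme leaked through.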


-- 
Steve

