[Tutor] sort() method and non-ASCII
Steven D'Aprano
steve at pearwood.info
Sun Feb 5 20:23:56 EST 2017
On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 <random832 at fastmail.com> wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.
Ah, that makes sense! They're just reference tables ("charts") for the
convenience of people wishing to find particular characters.
> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.
Correct, although strictly speaking the codes are only conventionally
given in hexadecimal. They are numbered from 0 to 1114111 in
decimal (although not all codes are currently used).
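To tie that numbering back to the original sorting question, here is a minimal sketch. The word list is made up purely for illustration; ord() and sys.maxunicode are standard.

```python
import sys

# Highest code point Python supports: 0x10FFFF, i.e. 1114111 in decimal.
highest = sys.maxunicode

# Hypothetical sample data, chosen only for illustration.
# '\u00e9' is the precomposed letter e-acute.
words = ['zebra', 'Zulu', 'apple', '\u00e9clair']

# With no key function, strings compare code point by code point.
by_code_point = sorted(words)

# First code point of each word, in the sorted order.
first_codes = [ord(w[0]) for w in by_code_point]

print(highest)
print(by_code_point)
print(first_codes)
```

Uppercase 'Z' (90) sorts before lowercase 'a' (97), and the accented letter (233) lands after 'z' (122), which is why a plain sort() rarely matches dictionary order.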
The terminology used is that a "code point" is what I've been calling a
"character", although not all code points are characters. Code points
are usually written either as the character itself, e.g. 'A', or using
the notation U+0041 where there are at least four and no more than six
hexadecimal digits following the "U+".
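If you want to produce that notation yourself, something like the following should do. The helper name u_notation is my own invention, not anything in the standard library.

```python
def u_notation(ch):
    """Return the U+XXXX notation for a single-character string."""
    # 04X pads to at least four hex digits; larger values use five or six.
    return 'U+{:04X}'.format(ord(ch))


print(u_notation('A'))
print(u_notation('\u03c0'))       # Greek small letter pi
print(u_notation('\U0001F600'))   # a code point beyond U+FFFF
```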
Bringing this back to Python, if you know the code point (as a number),
you can use the chr() function to return it as a string:
py> chr(960)
'π'
Don't forget that Python understands hex too!
py> chr(0x03C0) # better than chr(int('03C0', 16))
'π'
Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
escapes:
py> 'pi = \u03C0' # requires exactly four hex digits
'pi = π'
py> 'pi = \U000003C0' # requires exactly eight hex digits
'pi = π'
Lastly, you can use the code point's name:
py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'
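As a quick check that all of those spellings produce the very same one-character string, here is a small sketch using the standard unicodedata module to go between names and characters:

```python
import unicodedata

pi = chr(0x03C0)

# Four different ways of spelling the same code point.
spellings = [
    chr(960),
    '\u03C0',
    '\U000003C0',
    '\N{GREEK SMALL LETTER PI}',
]
all_equal = all(s == pi for s in spellings)

# Going from character to name, and from name back to character.
name = unicodedata.name(pi)
looked_up = unicodedata.lookup('GREEK SMALL LETTER PI')

print(all_equal, name, looked_up == pi)
```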
One last comment: Random832 said:
"Python 3 strings are unicode-unicode, not UTF-8."
To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is one particular concrete encoding used to
store or transmit such strings. Here are examples of three possible
encoding forms for the string 'πz':
UTF-16: either two, or four, bytes per character: 03C0 007A
UTF-32: exactly four bytes per character: 000003C0 0000007A
UTF-8: between one and four bytes per character: CF80 7A
(UTF-16 and UTF-32 depend on the byte order of the hardware, so the
bytes could appear reversed, e.g. C003 7A00. UTF-8 does not.)
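You can reproduce those byte sequences with str.encode(). This sketch names the byte order explicitly with the "-be" (big-endian) and "-le" (little-endian) codec variants; the plain 'utf-16' and 'utf-32' codecs would also prepend a byte order mark, which I have left out to match the listing above.

```python
s = '\u03c0z'   # the string 'πz'

codecs = ('utf-16-be', 'utf-16-le', 'utf-32-be', 'utf-32-le', 'utf-8')

# Map each codec name to the encoded bytes, shown as hex digits.
encoded = {codec: s.encode(codec).hex() for codec in codecs}

for codec in codecs:
    print(codec, encoded[codec])
```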
Prior to Python 3.3, there was a build-time option to select either
"narrow" or "wide" Unicode strings. A narrow build used a fixed two
bytes per code point, together with an incomplete and not quite correct
scheme for using two code points together to represent the supplementary
Unicode characters U+10000 through U+10FFFF. (This is sometimes called
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at
least an incomplete and "buggy" implementation of UTF-16.)
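For the curious, here is a rough sketch of how UTF-16 splits a supplementary code point into two 16-bit "surrogates". The function surrogate_pair is a hypothetical helper written for this email, checked against Python's own utf-16-be codec.

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary code point')
    offset = code_point - 0x10000        # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


ch = '\U0001F600'
pair = surrogate_pair(ord(ch))
utf16 = ch.encode('utf-16-be').hex()
length = len(ch)

print([hex(n) for n in pair], utf16, length)
```

On Python 3.3 or later, len(ch) is 1, because the string holds one code point. A narrow build would have reported 2, since it stored the two surrogates as separate items.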
--
Steve