[Tutor] sort() method and non-ASCII

Sun Feb 5 23:27:41 EST 2017

On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
>> On Sat, Feb 4, 2017 at 10:50 PM, Random832 <random832 at fastmail.com> wrote:
>> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:

> Alternatively, you can embed it right in the string. For code points
> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
> escapes:
>
> py> 'pi = \u03C0'  # requires exactly four hex digits
> 'pi = π'
>
> py> 'pi = \U000003C0'  # requires exactly eight hex digits
> 'pi = π'
>
>
> Lastly, you can use the code point's name:
>
> py> 'pi = \N{GREEK SMALL LETTER PI}'
> 'pi = π'
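
For anyone following along, the three escape forms quoted above can be compared directly. This is only a minimal sketch, assuming Python 3; the variable names are mine and purely illustrative:

```python
import unicodedata

# Three ways to spell the same code point, U+03C0, inside a string literal.
short_escape = 'pi = \u03C0'                     # \u takes exactly four hex digits
long_escape = 'pi = \U000003C0'                  # \U takes exactly eight hex digits
named_escape = 'pi = \N{GREEK SMALL LETTER PI}'  # \N takes the Unicode character name

# All three literals produce the identical string.
print(short_escape == long_escape == named_escape)  # True

# The final character is the same code point in every case.
print(hex(ord(short_escape[-1])))          # 0x3c0
print(unicodedata.name(short_escape[-1]))  # GREEK SMALL LETTER PI
```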

You have surprised me here by using single quotes to enclose each
entire assignment statement.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?

>
> One last comment: Random832 said:
>
> "Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended), I was merely
indicating I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.

> To be pedantic, Unicode strings are sequences of abstract code points
> ("characters"). UTF-8 is a particular concrete implementation that is
> used to store or transmit such code strings. Here are examples of three
> possible encoding forms for the string 'πz':
>
> UTF-16: either two, or four, bytes per character: 03C0 007A
>
> UTF-32: exactly four bytes per character: 000003C0 0000007A
>
> UTF-8: between one and four bytes per character: CF80 7A
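
The byte values quoted above can be reproduced with str.encode(). This is a minimal sketch, assuming Python 3; the big-endian codec names are chosen here only so that no byte order mark is added to the output, and the variable names are illustrative:

```python
s = 'πz'

# Encode with an explicit big-endian byte order so no byte order mark is added.
utf16 = s.encode('utf-16-be')
utf32 = s.encode('utf-32-be')
utf8 = s.encode('utf-8')

print(utf16.hex())  # 03c0007a
print(utf32.hex())  # 000003c00000007a
print(utf8.hex())   # cf807a

# Each encoding decodes back to the same two-character string.
print(utf16.decode('utf-16-be') == utf32.decode('utf-32-be') == utf8.decode('utf-8') == s)  # True
```

The string itself has length 2 in every case; only the number of bytes used to store it differs between the encodings.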

I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:

> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
> reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?
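
For reference, the byte-order point in the quoted text can be tried out in the interpreter. This is a minimal sketch, assuming Python 3 and its standard codecs; the variable names are illustrative:

```python
import codecs

s = 'πz'

big = s.encode('utf-16-be')     # most significant byte of each 16-bit unit first
little = s.encode('utf-16-le')  # least significant byte of each 16-bit unit first

print(big.hex())     # 03c0007a
print(little.hex())  # c0037a00

# The plain 'utf-16' codec prepends a byte order mark (BOM) so that a reader
# can tell which of the two orders was used when the bytes were written.
with_bom = s.encode('utf-16')
print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True

# UTF-8 is defined as a sequence of single bytes, so it has no byte-order variants.
print(s.encode('utf-8').hex())  # cf807a

# Either byte order decodes back to the same string when the matching codec is used.
print(big.decode('utf-16-be') == little.decode('utf-16-le') == with_bom.decode('utf-16') == s)  # True
```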

Thank you (and the others in this thread) for taking the time to
clarify these matters.

-- 
boB