[Tutor] Why difference between printing string & typing its object reference at the prompt?

eryksun eryksun at gmail.com
Tue Oct 9 11:29:44 CEST 2012


On Mon, Oct 8, 2012 at 10:35 PM, boB Stepp <robertvstepp at gmail.com> wrote:
>
> I am not up (yet) on the details of Unicode that Python 3 defaults to
> for strings, but I believe I comprehend the general concept. Looking
> at the string escape table of chapter 2 it appears that Unicode
> characters can be either 16-bit or 32-bit. That must be a lot of
> potential characters!

There are 1114112 possible code points (65536 per plane * 17 planes),
but some are reserved, and only about 10% are assigned. Here's a list
by category:

http://www.fileformat.info/info/unicode/category/index.htm
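
As a quick check of that arithmetic, here is a minimal sketch. It
assumes Python 3.3 or later, where sys.maxunicode is always 0x10FFFF
(older "narrow" builds report 0xFFFF instead). The constant names are
my own, just for illustration. It also shows that the 16-bit (\u) and
32-bit (\U) escapes are two spellings for a code point, not two kinds
of character:

```python
import sys

# Illustrative constant names, not anything from the standard library
PLANES = 17
CODES_PER_PLANE = 2 ** 16          # 65536

total = PLANES * CODES_PER_PLANE
print(total)                       # 1114112
print(hex(total - 1))              # 0x10ffff, the highest code point

# On Python 3.3+ every build covers the full range
print(sys.maxunicode == total - 1)

# \u takes 4 hex digits and \U takes 8; both name a single code point
print("\u00fc" == "\U000000fc")    # True

# A code point outside the first plane needs the \U form
snake = "\U0001F40D"
print(len(snake))                  # 1 on Python 3.3+
```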

Python 3 lets you use any Unicode letter in an identifier, including
modifier letters ("Lm") and letter numbers ("Nl"). For example:

    >>> aꘌꘌb = True
    >>> aꘌꘌb
    True

    >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
    >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
    (1, 2, 3, 4, 5)
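
If you want to check a character before using it in a name, something
like the following should work. It's only a sketch: the variable names
are mine, and I'm using escapes instead of the literal characters so
it doesn't depend on the file's encoding. One thing to be aware of is
that the parser normalizes identifiers to NFKC, so the Roman numeral
above actually ends up stored under the plain ASCII name "I":

```python
from unicodedata import category, normalize

vai = "\ua60c"        # VAI SYLLABLE LENGTHENER, as used in the name above
roman_one = "\u2160"  # ROMAN NUMERAL ONE

print(category(vai))        # Lm
print(category(roman_one))  # Nl

# str.isidentifier() reports whether a string is a valid identifier
name_ok = ("a" + vai + vai + "b").isidentifier()
print(name_ok)                   # True
print(roman_one.isidentifier())  # True

# VULGAR FRACTION ONE HALF is category "No", which is not allowed
print("\u00bd".isidentifier())   # False

# Identifiers are converted to NFKC while parsing
namespace = {}
exec(roman_one + " = 1", namespace)
print("I" in namespace)               # True
print(normalize("NFKC", roman_one))   # I
```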

A potential gotcha in Unicode is the design choice to have both
[C]omposed and [D]ecomposed forms of characters. For example:

    >>> from unicodedata import name, normalize

    >>> s1 = "ü"
    >>> name(s1)
    'LATIN SMALL LETTER U WITH DIAERESIS'

    >>> s2 = normalize("NFD", s1)
    >>> list(map(name, s2))
    ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

These combine as one glyph when printed:

    >>> print(s2)
    ü

Different forms of the 'same' character won't compare as equal unless
you first normalize them to the same form:

    >>> s1 == s2
    False
    >>> normalize("NFC", s1) == normalize("NFC", s2)
    True
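
If you compare text from different sources a lot, you could wrap that
in a small function. This is just a sketch; same_text is a name I made
up, not something in the standard library, and NFC as the default is
an arbitrary choice (any form works as long as both sides use the same
one):

```python
from unicodedata import normalize

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    same_text is a hypothetical helper, not a standard function.
    """
    return normalize(form, a) == normalize(form, b)

composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS

print(len(composed), len(decomposed))    # 1 2
print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
print(same_text(composed, decomposed, form="NFD"))   # True
```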

> I don't see a mention of byte strings mentioned in the index of my
> text. Are these just the ASCII character set?

A bytes object (and its mutable cousin, bytearray) is a sequence of
integers, each in the range of a byte (0-255). A bytes literal starts
with b, such as b'spam', and can contain only ASCII characters; any
other byte value has to be written as an escape such as \xfc. The repr
of bytes follows the same rule. Slicing returns a new bytes object,
but indexing or iterating returns integer values:

    >>> b'spam'[:3]
    b'spa'
    >>> b'spam'[0]
    115
    >>> list(b'spam')
    [115, 112, 97, 109]
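
Here's a slightly longer sketch of the same points, including the
difference between bytes and bytearray. The variable names are just
for illustration:

```python
data = b"spam"

# Indexing gives an int; a length-1 slice gives a bytes object
print(data[0])      # 115
print(data[0:1])    # b's'

# bytes() accepts an iterable of ints in range(256)
rebuilt = bytes([115, 112, 97, 109])
print(rebuilt == data)   # True

# Values outside 0-255 are rejected
try:
    bytes([256])
except ValueError as exc:
    print("ValueError:", exc)

# bytearray is mutable; bytes is not
buf = bytearray(data)
buf[0] = ord("S")
print(buf)          # bytearray(b'Spam')
try:
    data[0] = ord("S")
except TypeError as exc:
    print("TypeError:", exc)

# Non-ASCII byte values show up as \x escapes in the repr
print(bytes([252]))  # b'\xfc'
```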

bytes objects have many of the str methods as a convenience, such as
find, split, and partition. They also have the method decode(), which
uses a specified encoding such as "utf-8" to create a string from an
encoded sequence of bytes.
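
For example, here is a rough sketch of a round trip. It also uses
str.encode(), the counterpart of decode() that goes the other way,
and "latin-1" as a second encoding purely for contrast:

```python
text = "\u00fc"                  # LATIN SMALL LETTER U WITH DIAERESIS
encoded = text.encode("utf-8")
print(encoded)                   # b'\xc3\xbc'
print(len(text), len(encoded))   # 1 2
print(encoded.decode("utf-8") == text)   # True

# Same character, different encoding, different bytes
latin = text.encode("latin-1")
print(latin)                     # b'\xfc'

# Decoding with the wrong encoding can fail outright
try:
    latin.decode("utf-8")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)

# The str-like methods on bytes take and return bytes
words = b"spam and eggs".split()
print(words)                            # [b'spam', b'and', b'eggs']
print(b"key=value".partition(b"="))     # (b'key', b'=', b'value')
print(b"spam".find(b"am"))              # 2
```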

