[Tutor] Why difference between printing string & typing its object reference at the prompt?
eryksun
eryksun at gmail.com
Tue Oct 9 11:29:44 CEST 2012
On Mon, Oct 8, 2012 at 10:35 PM, boB Stepp <robertvstepp at gmail.com> wrote:
>
> I am not up (yet) on the details of Unicode that Python 3 defaults to
> for strings, but I believe I comprehend the general concept. Looking
> at the string escape table of chapter 2 it appears that Unicode
> characters can be either 16-bit or 32-bit. That must be a lot of
> potential characters!
There are 1114112 possible code points (65536 per plane * 17 planes),
but some are reserved, and only about 10% are assigned. Here's a list
by category:
http://www.fileformat.info/info/unicode/category/index.htm
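As a quick check of that arithmetic, here is a small sketch. The
variable names are mine, and it assumes Python 3.3 or later, where
sys.maxunicode is always 0x10FFFF (older "narrow" builds report
0xFFFF instead):

```python
import sys
import unicodedata

PLANES = 17
CODES_PER_PLANE = 0x10000  # 65536 code points in each plane

total = PLANES * CODES_PER_PLANE
print(total)                 # 1114112
print(hex(sys.maxunicode))   # highest code point this build supports

# chr() accepts any code point up to 0x10FFFF on Python 3.3+
last = chr(0x10FFFF)
print(ord(last) == total - 1)

# unicodedata.category() gives the two-letter general category
print(unicodedata.category("A"))  # Lu (Letter, uppercase)
```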
Python 3 lets you use any Unicode letter in an identifier, including
modifier letters ("Lm") and letter numbers ("Nl"). For example:
>>> aꘌꘌb = True
>>> aꘌꘌb
True
>>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
>>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
(1, 2, 3, 4, 5)
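If you want to check a name without typing it at the prompt, something
like the following should work. It is only a sketch: the variable names
are mine, and the characters are written as escapes so the code
survives mail clients that mangle non-ASCII text. Note that Python
normalizes identifiers to NFKC while parsing (PEP 3131), so the Roman
numeral names above are actually stored as plain I, II, III, IV, V:

```python
import unicodedata

vai = "\ua60c"        # VAI SYLLABLE LENGTHENER, used in the name above
roman_one = "\u2160"  # ROMAN NUMERAL ONE

print(unicodedata.category(vai))        # Lm
print(unicodedata.category(roman_one))  # Nl

# str.isidentifier() reports whether a string is a valid identifier
name_ok = ("a" + vai + vai + "b").isidentifier()
roman_ok = roman_one.isidentifier()
digit_first = "1abc".isidentifier()
print(name_ok, roman_ok, digit_first)   # True True False

# The parser applies NFKC, which maps the Roman numeral to ASCII "I"
folded = unicodedata.normalize("NFKC", roman_one)
print(folded)                           # I
```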
A potential gotcha in Unicode is the design choice to have both
[C]omposed and [D]ecomposed forms of characters. For example:
>>> from unicodedata import name, normalize
>>> s1 = "ü"
>>> name(s1)
'LATIN SMALL LETTER U WITH DIAERESIS'
>>> s2 = normalize("NFD", s1)
>>> list(map(name, s2))
['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']
These combine as one glyph when printed:
>>> print(s2)
ü
Different forms of the 'same' character won't compare as equal unless
you first normalize them to the same form:
>>> s1 == s2
False
>>> normalize("NFC", s1) == normalize("NFC", s2)
True
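If you compare text from different sources a lot, you could wrap the
normalization in a small helper. This is just a sketch; same_text is a
name I made up, not something in the standard library:

```python
from unicodedata import normalize

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)

composed = "\u00fc"     # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```

Either form works for the comparison, as long as both sides use the
same one.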
> I don't see a mention of byte strings mentioned in the index of my
> text. Are these just the ASCII character set?
A bytes object (and its mutable cousin bytearray) is a sequence of
integers, each in the range of a byte (0-255). bytes literals start
with b, such as b'spam', and can only contain ASCII characters; the
repr of bytes is likewise ASCII-only, with other byte values shown as
\x escapes. Slicing returns a new bytes object, but indexing or
iterating yields integer values:
>>> b'spam'[:3]
b'spa'
>>> b'spam'[0]
115
>>> list(b'spam')
[115, 112, 97, 109]
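To illustrate the "mutable cousin" remark, here is a short sketch (the
variable names are arbitrary) showing that a bytearray can be changed
in place while a bytes object cannot:

```python
data = bytearray(b"spam")
data[0] = ord("S")   # item assignment takes an integer in 0-255
data.append(33)      # 33 is the code for "!"
print(data)          # bytearray(b'Spam!')

frozen = bytes(data)  # immutable copy
try:
    frozen[0] = 115
    mutated = True
except TypeError:
    # bytes does not support item assignment
    mutated = False
print(mutated)       # False

# bytes can also be built directly from a list of integers
rebuilt = bytes([115, 112, 97, 109])
print(rebuilt)       # b'spam'
```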
bytes objects have many of the str methods as a convenience, such as
find, split, and partition. They also have the decode() method, which
uses a specified encoding such as "utf-8" to create a string from an
encoded byte sequence.
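Here is a small sketch of the round trip between str and bytes, plus a
few of the string-like methods. The sample values are my own:

```python
# Encoding a str produces bytes; decoding bytes produces a str
raw = "\u00fc".encode("utf-8")
print(raw)        # b'\xc3\xbc'
print(len(raw))   # 2 (one character, two bytes in UTF-8)

text = raw.decode("utf-8")
print(text == "\u00fc")   # True

# Decoding with the wrong encoding can fail
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)   # False

# String-like methods take and return bytes, not str
words = b"spam and eggs".split()
parts = b"key=value".partition(b"=")
index = b"spam".find(b"am")
print(words)      # [b'spam', b'and', b'eggs']
print(parts)      # (b'key', b'=', b'value')
print(index)      # 2
```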