How to turn a string into a list of integers?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Sep 6 07:47:45 CEST 2014


Kurt Mueller wrote:

> Could someone please explain the following behavior to me:
> Python 2.7.7, MacOS 10.9 Mavericks
> 
>>>> import sys
>>>> sys.getdefaultencoding()
> 'ascii'

That's technically known as a "lie", since if it were *really* ASCII it
would refuse to deal with characters with the high-bit set. But it doesn't,
it treats them in an unpredictable and implementation-dependent manner.

>>>> [ord(c) for c in 'AÄ']
> [65, 195, 132]

In this case, it looks like your terminal is using UTF-8, so the character Ä
is represented in memory by bytes 195, 132:

py> u'Ä'.encode('utf-8')
'\xc3\x84'
py> for c in u'Ä'.encode('utf-8'):
...     print ord(c)
...
195
132

If your terminal was set to use a different encoding, you probably would
have got different results. When you type whatever key combination you used
to get Ä, your terminal receives the bytes 195, 132, and displays Ä. But
when Python processes those bytes, it's not expecting arbitrary Unicode
characters, it's expecting ASCII-ish bytes, and so treats it as two bytes
rather than a single character:

py> 'AÄ'
'A\xc3\x84'

That's not *really* ASCII, because ASCII doesn't include anything above 127,
but we can pretend that "ASCII plus arbitrary bytes between 128 and 255" is
just called ASCII. The important thing here is that although your terminal
is interpreting those two bytes \xc3\x84 (decimal 195, 132) as the
character Ä, it isn't anything of the sort. It's just two arbitrary bytes.
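
To make the "different encoding, different results" point concrete, here is
a small sketch that lists the bytes a few common encodings produce for Ä.
The helper name byte_values is my own invention for illustration, and the
choice of encodings is arbitrary. It should run on Python 2.7 and Python 3:

```python
from __future__ import print_function


def byte_values(text, encoding):
    """Return the byte values of `text` after encoding it with `encoding`."""
    # bytearray iterates as integers on both Python 2 and Python 3.
    return list(bytearray(text.encode(encoding)))


# U+00C4, LATIN CAPITAL LETTER A WITH DIAERESIS, written as an escape so the
# example does not depend on the encoding of the source file.
A_UMLAUT = u"\u00c4"

for encoding in ("utf-8", "latin-1", "cp437"):
    print(encoding, byte_values(A_UMLAUT, encoding))
```

Only the UTF-8 line gives 195, 132; a terminal using Latin-1 or CP437 would
hand Python a single, different byte for the same keypress.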

>>>> [ord(c) for c in u'AÄ']
> [65, 196]

Here, you have a proper Unicode string, so Python is expecting to receive
arbitrary Unicode characters and can treat the two bytes 195, 132 as Ä, and
that character has ordinal value 196:

py> ord(u"Ä")
196
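
If you already have the raw bytes and want the code points instead, one way
(a sketch, assuming you know the bytes are UTF-8; code_points is a
hypothetical helper name) is to decode first and only then call ord():

```python
def code_points(raw, encoding="utf-8"):
    """Decode a byte string, then return one integer per character."""
    return [ord(ch) for ch in raw.decode(encoding)]


# The three bytes a UTF-8 terminal produces for A followed by U+00C4.
raw = b"A\xc3\x84"

print(list(bytearray(raw)))       # the raw byte values
print(code_points(raw))           # decoded as UTF-8: two code points
print(code_points(raw, "latin-1"))  # wrong guess: every byte becomes a character
```

Decoding with the wrong encoding does not raise an error here; it quietly
gives you three characters instead of two, which is the same list of numbers
the byte string produced.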



> My obviously wrong understanding:
> 'AÄ' in 'ascii' are two characters
>      one with ord A=65 and
>      one with ord Ä=196 ISO8859-1 <depends on code table>

As soon as you start talking about code tables, *it isn't ASCII anymore*.
(Technically, ASCII *is* a code table, but it's one that only covers 128
different characters, numbered 0 through 127.)
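
You can see how strict the real ASCII codec is by asking for it explicitly.
This is only a sketch, and is_pure_ascii is a name I made up for it:

```python
from __future__ import print_function


def is_pure_ascii(raw):
    """Return True if every byte in `raw` is a valid ASCII value (0..127)."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# All byte values from 0 to 127 decode without complaint.
print(is_pure_ascii(bytes(bytearray(range(128)))))

# The UTF-8 bytes for U+00C4 are both above 127, so strict ASCII rejects them.
print(is_pure_ascii(b"A\xc3\x84"))

# Going the other way fails too: U+00C4 has no ASCII representation.
try:
    u"A\u00c4".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```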

When you type AÄ on your keyboard, or paste them, or however they were
entered, the *actual bytes* the terminal receives will vary, but regardless
of how they vary, the terminal *almost certainly* will interpret the first
byte (or possibly more than one byte, who knows?) as the ASCII character A.

(Most, but not all, code pages agree that byte 65 is A, 66 is B, and so on.)

The second (third? fifth?) byte, and possibly subsequent bytes, will
*probably* be displayed by the terminal as Ä, but Python only sees the raw
bytes. The important thing here is that unless you have some bizarre and
broken configuration, Python can correctly interpret the A as A, but what
you get for the Ä depends on the interaction of keyboard, OS, terminal and
the phase of the moon.
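
If you want to see what Python itself believes about your environment, a
rough sketch like the following may help. The values differ from machine to
machine, and sys.stdin.encoding can be None when input is not a terminal,
so treat the output as a hint rather than a guarantee:

```python
from __future__ import print_function

import locale
import sys

# getattr guards against streams that lack an `encoding` attribute, or that
# have been replaced by None.
stdin_encoding = getattr(sys.stdin, "encoding", None)
stdout_encoding = getattr(sys.stdout, "encoding", None)

# The encoding the locale settings suggest for text data.
preferred_encoding = locale.getpreferredencoding(False)

print("stdin encoding:    ", stdin_encoding)
print("stdout encoding:   ", stdout_encoding)
print("preferred encoding:", preferred_encoding)
```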

>      --> why [65, 195, 132]

Since Python is expecting to interpret those bytes as an ASCII-ish byte
string, it grabs the raw bytes and ends up (in your case) with 65, 195,
132, or 'A\xc3\x84', even though your terminal displays it as AÄ.

This does not happen with Unicode strings.

> u'AÄ' is an Unicode string
>      --> why [65, 196]

In this case, Python knows that you are dealing with a Unicode string, and Ä
is a valid character in Unicode. Python deals with the internal details of
converting from whatever-damn-bytes your terminal sends it, and ends up
with a string of characters A followed by Ä.

If you could peer under the hood, and see what implementation Python uses to
store that string, you would see something version dependent. In Python
2.7, you would see an object vaguely like this:

[object header containing various fields]
[length = 2]
[array of bytes = 0x0041 0x00C4]


That's for a so-called "narrow build" of Python. If you have a "wide build",
it will be something like this:

[object header containing various fields]
[length = 2]
[array of bytes = 0x00000041 0x000000C4]

In Python 3.3 and later, "narrow builds" and "wide builds" are gone, and you'll have
something conceptually like this:

[object header containing various fields]
[length = 2]
[tag = one byte per character]
[array of bytes = 0x41 0xC4]

Some other implementations of Python could use UTF-8 internally:

[object header containing various fields]
[length = 2]
[array of bytes = 0x41 0xC3 0x84]


or even something more complex. But the important thing is, regardless of
the internal implementation, Python guarantees that a Unicode string is
treated as a fixed array of code points. Each code point has a value
between 0 and, not 127, not 255, not 65535, but 1114111.
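
You can't inspect those internal buffers directly from Python code, but as a
rough illustration you can use encodings as stand-ins for the layouts above.
The pairing of each layout with an encoding is my own approximation, not a
description of what the interpreter literally does:

```python
from __future__ import print_function

import sys

text = u"A\u00c4"  # A followed by U+00C4

# Encodings used only as stand-ins for the storage layouts sketched above.
layouts = (
    ("narrow build, 2 bytes per code unit", "utf-16-le"),
    ("wide build, 4 bytes per code point", "utf-32-le"),
    ("Python 3.3+, 1 byte per character for this string", "latin-1"),
    ("UTF-8 based implementation", "utf-8"),
)

stored_sizes = {}
for label, encoding in layouts:
    stored_sizes[encoding] = len(text.encode(encoding))
    print(label, "->", stored_sizes[encoding], "bytes")

# Whatever the storage, the string is still two code points long.
print(len(text), [ord(ch) for ch in text])

# Largest code point this interpreter supports: 1114111 on Python 3.3+ and on
# wide builds of Python 2, but 65535 on narrow builds of Python 2.
print(sys.maxunicode)
```

The byte counts differ from layout to layout, while len() and ord() give the
same answers every time, which is the guarantee that matters.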


-- 
Steven



