How to turn a string into a list of integers?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Sep 6 06:27:05 CEST 2014


Chris Angelico wrote:

> On Fri, Sep 5, 2014 at 12:09 PM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
>> On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav at gmail.com> wrote:
>>> If it's a Unicode string (which is the default in Python 3), all
>>> Unicode characters will work correctly.
>>
>> Assuming the library that needs this is expecting codepoints and will
>> accept integers greater than 255.
> 
> They're still valid integers. It's just that someone might not know
> how to work with them. Everyone has limits - I don't think repr()
> would like to be fed Graham's Number, for instance, but we still say
> that it accepts integers :)

If you can fit Graham's Number into memory, repr() will happily deal with
it. Although, it might take a while to print out...

[...]
> I just don't like people talking about "Unicode characters" being
> somehow different from "normal text" or something, and being something
> that you need to be careful of. It's not that there are some
> characters that behave nicely, and then other ones ("Unicode" ones)
> that don't.

"Behave nicely" depends on what behaviour you're expecting.

There is a sense in which Unicode is different from ASCII text. ASCII is a
7-bit character set. In principle, you could have different implementations
of ASCII, but in practice it has been so long since any machine you're
likely to come across used anything but exactly a single 8-bit byte for each
ASCII character that we might as well say that ASCII has a single
implementation:

* 1 byte code units, fixed width characters 

That is, every character takes exactly one 8-bit byte.

(Reminder: "byte" does not necessarily mean 8 bits.)

Unicode, on the other hand, has *at least* nine different implementations
which you are *likely* to come across:

* UTF-8 has 1-byte code units, variable width characters: every character
takes between 1 and 4 bytes;

* UTF-8 with a so-called "Byte Order Mark" at the beginning of the file;

* UTF-16-BE has 2-byte code units, variable width characters: every
character takes either 2 or 4 bytes;

* UTF-16-LE is the same, but the bytes are in opposite order;

* UTF-16 with a Byte Order Mark at the beginning of the file;

* UTF-32-BE has 4-byte code units, fixed width characters; every character
takes exactly 4 bytes;

* UTF-32-LE is the same, but the bytes are in opposite order;

* UTF-32 with a Byte Order Mark at the beginning of the file;

* UCS-2 is a subset of Unicode with 2-byte code units, fixed width
characters; every character takes exactly 2 bytes (UCS-2 is effectively
UTF-16 restricted to characters in the Basic Multilingual Plane, so it
never needs surrogate pairs).

Plus various more obscure or exotic encodings.
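
To make the widths listed above concrete, here is a minimal Python 3 sketch
that encodes a few sample characters and counts the resulting bytes. The
helper name encoded_sizes and the choice of sample characters are
hypothetical, picked only for illustration, and UCS-2 is left out.

```python
# Encoded size, in bytes, of a single character under several codecs.
CODECS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encoded_sizes(char):
    """Map each codec name to the number of bytes needed to encode char."""
    return {codec: len(char.encode(codec)) for codec in CODECS}


# An ASCII letter, an accented Latin letter, the euro sign, and an emoji
# from outside the Basic Multilingual Plane.
for char in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(ascii(char), encoded_sizes(char))
```

The UTF-8 sizes come out as 1, 2, 3 and 4 bytes, the UTF-16 sizes as 2, 2, 2
and 4, and the UTF-32 sizes are always 4. The BOM variants add a prefix on
top of that: "A".encode("utf-8-sig") is three bytes longer than
"A".encode("utf-8"), and the plain "utf-16" and "utf-32" codecs prepend a
BOM and then use the machine's native byte order.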

So, while it is not *strictly* correct to say that ASCII character 'A' is
always the eight bits 01000001, the exceptions are so rare that there might
as well not be any. But the Unicode character 'A' could be:

01000001
01000001 00000000
00000000 01000001
01000001 00000000 00000000 00000000
00000000 00000000 00000000 01000001


and possibly more.
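
The bit patterns above can be reproduced with a short Python 3 sketch. The
helper name bits is hypothetical, and the list of codecs is just the BOM-free
ones, so that the output does not depend on the machine's byte order.

```python
def bits(data):
    """Render a bytes object as space-separated groups of 8 bits."""
    return " ".join(format(byte, "08b") for byte in data)


# Print the encoded form of 'A' under each BOM-free codec.
for codec in ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]:
    print(codec.ljust(10), bits("A".encode(codec)))
```

This also bears on the original question: list("A".encode("utf-16-le"))
gives [65, 0], a list of byte values that depends on the encoding, whereas
[ord(c) for c in "A"] gives [65], the code point, whichever encoding is used
to store the string.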


-- 
Steven



