[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 10:56:02 CEST 2014

On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:

> Guido's mantra is something like "Python's str doesn't contain
> characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make 
Python text strings containing code units that aren't code points, e.g. 
something like "\U12340000" or chr(0x12340000) should work, but neither 
does. As far as I can tell, there is no way to build a string containing 
items which aren't code points.
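
For the record, here is a quick sketch of what happens when you try, 
assuming Python 3.3 or later (the variable names are mine, purely for 
illustration). The escape is compiled from a raw string so that the 
failure shows up at run time instead of stopping the whole script.

```python
import sys

# The largest code point a str can hold is U+10FFFF.
largest = chr(sys.maxunicode)

# chr() refuses anything beyond that range.
try:
    chr(0x12340000)
except ValueError as exc:
    chr_error = exc

# The equivalent escape sequence is rejected when the literal is compiled.
try:
    compile(r'"\U12340000"', "<example>", "eval")
except SyntaxError as exc:
    escape_error = exc
```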

I don't think it is useful to say that strings *contain* code units; 
it is more accurate to say that they *are made up from* code units. Code 
units are the implementation: 16-bit code units in narrow builds, 32-bit 
code units in wide builds, and 8-, 16- or 32-bit code units (chosen per 
string) in Python 3.3 and beyond. (I don't know of any Python 
implementation which uses UTF-8 internally, but if there were one, it 
would use 8-bit code units.)
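
A rough way to see the difference between the two levels, assuming 
Python 3.3 or later: len() counts code points, while the number of code 
units depends entirely on which encoding form you pick. The sample 
string and variable names below are just my illustration.

```python
# Four code points: U+0041, U+00E9, U+20AC and U+1F600.
s = "A\u00e9\u20ac\U0001F600"

code_points = len(s)

# Count code units by encoding and dividing by the code unit size in bytes.
utf8_units = len(s.encode("utf-8"))            # 8-bit code units
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(s.encode("utf-32-le")) // 4  # 32-bit code units
```

The interface only ever shows you the first number; the other three 
belong to whichever implementation or encoding happens to be in use.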

It isn't very useful to say that in Python 3.3 the string "A" *contains*
the 8-bit code unit 0x41. That's conflating two different levels of 
explanation (the high-level interface and the underlying implementation) 
and potentially leads to user confusion like

# 8-bit code units are bytes, right?
assert b'\x41' in "A"

which is Not Even Wrong.
http://rationalwiki.org/wiki/Not_even_wrong
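
To spell out why, here is a small sketch assuming Python 3 (the names 
are mine): the containment test is not False, it is a TypeError, because 
bytes and str live at different levels. The meaningful comparison is 
between code points.

```python
# Asking whether a bytes object is "in" a str is a type error, not False.
try:
    b'\x41' in "A"
except TypeError as exc:
    containment_error = exc

# At the code point level the question makes sense.
same_code_point = ord("A") == 0x41

# Bytes only enter the picture once an encoding is chosen explicitly.
encoded = "A".encode("ascii")
```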

I think it is correct to say that Python strings are sequences of 
Unicode code points U+0000 through U+10FFFF. There are no other 
restrictions, e.g. strings can contain surrogates, noncharacters, or 
nonsensical combinations of code points such as a U+0300 COMBINING GRAVE 
ACCENT combined with U+000A (newline).
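
A minimal sketch of those three cases, assuming Python 3.3 or later 
(variable names are mine). All three strings are accepted; it is only 
when you try to encode the lone surrogate as UTF-8 that anything 
complains.

```python
import unicodedata

surrogate = "\ud800"     # a lone surrogate code point
noncharacter = "\uffff"  # one of the Unicode noncharacters
odd_pair = "\n\u0300"    # newline followed by COMBINING GRAVE ACCENT

# General categories: Cs (surrogate), Cn (unassigned), Mn (nonspacing mark).
categories = [unicodedata.category(c)
              for c in (surrogate, noncharacter, "\u0300")]

# The str type accepts the surrogate; the strict UTF-8 encoder does not.
try:
    surrogate.encode("utf-8")
except UnicodeEncodeError as exc:
    encode_error = exc
```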

> Implying
> that dealing with characters (or the grapheme globs that occasionally
> raise their ugly heads here) is an issue for higher-level facilities
> than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and 
that such a facility belongs as a high-level library, not a built-in 
type.

Also agreed that talking about characters is sloppy. Nevertheless, for 
English speakers at least, "code point = character" isn't too awful a 
first approximation.

> The point being that
> 
>  > Basically, we are pretending that the each smuggled byte is single
>  > character
> 
> is something of a misstatement (good enough for present purpose of
> discussing email, but not good enough for the general case of
> understanding how this is supposed to work when porting the construct
> to other Python implementations), while
> 
>  > for string parsing purposes...but they don't match any of our
>  > parsing constants.
> 
> is precisely Pythonically correct.  You might want to add "because all
> parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.

>  > [*] I worried a lot that this was re-introducing the bytes/string
>  > problem from python2.
> 
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text.
> 
> That is not true here because the representations of characters vs.
> smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here.

> Footnotes: 
> [1]  In Unicode terminology, a code unit is the smallest computer
> object that can represent a character (this is uniquely and sanely
> defined for all real Unicode transformation formats aka UTFs).  A code
> point is an integer 0 - (17*256*256-1) that can represent a character,
> but many code points such as surrogates and 0xFFFF are defined to be
> non-characters.

Actually not quite. "Noncharacter" is concretely defined in Unicode, and 
there are only 66 of them, many fewer than the surrogate code points 
alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1
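
The counts are easy to check. This is a sketch using the ranges given in 
the Unicode FAQ linked above; the helper name is_noncharacter is my own, 
not anything from the standard library.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# The code space is 17 planes of 256*256 code points each.
code_space = 17 * 256 * 256

noncharacter_count = sum(is_noncharacter(cp) for cp in range(code_space))

# Surrogate code points run from U+D800 to U+DFFF inclusive.
surrogate_count = 0xDFFF - 0xD800 + 1
```

That is 32 noncharacters in the U+FDD0 block plus 2 at the end of each 
of the 17 planes, against 2048 surrogate code points.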

It is wrong to talk about "surrogate characters", but perhaps you mean 
to say that surrogates (by which I understand you to mean surrogate code 
points) are "not human-meaningful characters", which is not the same 
thing as a Unicode noncharacter.

> Characters are those code points that may be assigned
> an interpretation as a character, including undefined characters
> (private space and reserved).

So characters are code points which are characters, including undefined 
characters? :-)

http://www.unicode.org/glossary/#character

-- 
Steven