[Python-Dev] PEP 393 Summer of Code Project

Thu Sep 1 00:02:53 CEST 2011

On 8/31/2011 1:10 PM, Guido van Rossum wrote:

> This is why I find the issue of Python, the language (and stdlib), as
> a whole "conforming to the Unicode standard" such a troublesome
> concept -- I think it is something that an application may claim, but
> the language should make much more modest claims, such as "the regular
> expression syntax supports features X, Y and Z from the Unicode
> recommendation XXX", or "the UTF-8 codec will never emit a sequence of
> bytes that is invalid according to Unicode specification YYY". (As long
> as the Unicode references are also versioned or dated.)

This will be a great improvement. It was both embarrassing and
frustrating to have to respond to Tom C.'s (and others') issue with "Our
unicode type is too vaguely documented to tell whether you are reporting
a bug or making a feature request."

> But if you can observe (valid) surrogate pairs it is still UTF-16.
...
> Ok, I dig this, to some extent. However saying it is UCS-2 is equally
> bad.

As I said on the tracker, our narrow builds are in-between (while moving 
closer to UTF-16), and both terms are deceptive, at least to some.

> At the same time I think it would be useful if certain string
> operations like .lower() worked in such a way that *if* the input were
> valid UTF-16, *then* the output would also be, while *if* the input
> contained an invalid surrogate, the result would simply be something
> that is no worse (in particular, those are all mapped to themselves).
> We could even go further and have .lower() and friends look at
> graphemes (multi-code-point characters) if the Unicode std has a
> useful definition of e.g. lowercasing graphemes that differed from
> lowercasing code points.
>
> An analogy is actually found in .lower() on 8-bit strings in Python 2:
> it assumes the string contains ASCII, and non-ASCII characters are
> mapped to themselves. If your string contains Latin-1 or EBCDIC or
> UTF-8 it will not do the right thing. But that doesn't mean strings
> cannot contain those encodings, it just means that the .lower() method
> is not useful if they do. (Why ASCII? Because that is the system
> encoding in Python 2.)

Good analogy.
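
As far as I know, the same behavior is still visible on bytes objects in
Python 3, so the analogy can be tried directly. A minimal sketch (the
sample word is my own choice, not something from this thread):

```python
# bytes.lower() changes only ASCII letters; every other byte maps to itself.
word = "\u00c9COLE"                  # 'ÉCOLE', starting with U+00C9

latin1 = word.encode("latin-1")      # b'\xc9COLE'
utf8 = word.encode("utf-8")          # b'\xc3\x89COLE'

lowered_latin1 = latin1.lower()      # the 0xC9 byte is left alone
lowered_utf8 = utf8.lower()          # the 0xC3 0x89 bytes are left alone

# str.lower() works on code points and does know about U+00C9.
lowered_text = word.lower()

print(lowered_latin1, lowered_utf8, lowered_text)
```

The bytes results are not wrong so much as not useful for non-ASCII
data, which is the point of the analogy.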

> Let's call those things graphemes (Tom C's term, I quite like leaving
> "character" ambiguous) -- they are sequences of multiple code points
> that represent a single "visual squiggle" (the kind of thing that
> you'd want to be swappable in vim with "xp" :-). I agree that APIs are
> needed to manipulate (match, generate, validate, mutilate, etc.)
> things at the grapheme level. I don't agree that this means a separate
> data type is required.

I presume that by 'separate data type' you mean a base-level builtin class
like int or str, and that you would allow wrapper classes built on
top of str, since those are not really 'separate'. For the grapheme level
and higher, we should certainly start with wrappers, and probably with
alternate versions based on different strategies.
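
To make 'wrapper' concrete, here is a rough sketch of one possible
strategy. The class name GraphemeView is hypothetical, and the grouping
rule (a code point followed by any combining marks) is a deliberate
simplification of my own; it is not the Unicode segmentation algorithm,
and a real version would need to follow the standard's rules.

```python
import unicodedata


class GraphemeView:
    """Hypothetical wrapper that views a str as a sequence of clusters.

    Simplifying assumption: a cluster is one code point plus any
    following code points with a nonzero canonical combining class.
    """

    def __init__(self, text):
        self.text = text
        self.clusters = []
        for ch in text:
            if self.clusters and unicodedata.combining(ch):
                # Attach the combining mark to the preceding cluster.
                self.clusters[-1] += ch
            else:
                self.clusters.append(ch)

    def __len__(self):
        return len(self.clusters)

    def __getitem__(self, index):
        return self.clusters[index]


# 'e' + COMBINING ACUTE ACCENT + 'a': three code points, two clusters.
view = GraphemeView("e\u0301a")
print(len(view.text), len(view))
```

The underlying str is untouched, so code that wants code points keeps
using view.text while grapheme-aware code indexes the wrapper.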

> There are ever-larger units of information
> encoded in text strings, with ever farther-reaching (and more vague)
> requirements on valid sequences. Do you want to have a data type that
> can represent (only valid) words in a language? Sentences? Novels?
...
> I think that at this point in time the best we can do is claim that
> Python (the language standard) uses either 16-bit code units or 21-bit
> code points in its string datatype, and that, thanks to PEP 393,
> CPython 3.3 and further will always use 21-bit code points (but Jython
> and IronPython may forever use their platform's native 16-bit code
> unit representing string type). And then we add APIs that can be used
> everywhere to look for code points (even if the string contains code
> points), graphemes, or larger constructs. I'd like those APIs to be
> designed using a garbage-in-garbage-out principle, where if the input
> conforms to some Unicode requirement, the output does too, but if the
> input doesn't, the output does what makes most sense. Validation is
> then limited to codecs, and optional calls.
>
> If you index or slice a string, or create a string from chr() of a
> surrogate or from some other value that the Unicode standard considers
> an illegal code point, you better know what you are doing. I want
> chr(i) to be valid for all values of i in range(2**21),

Actually, it is range(0x110000) == range(1114112), so that UTF-8 uses at
most 4 bytes per code point. The '21 bits' is log2(0x110000), about 20.1,
rounded up.
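
A quick check of that arithmetic, as a sketch for a current Python 3
(the variable names are mine):

```python
import math

LIMIT = 0x110000                 # one past the largest code point
MAX_CODE_POINT = LIMIT - 1       # 0x10FFFF

exact_bits = math.log2(LIMIT)    # a little over 20
whole_bits = MAX_CODE_POINT.bit_length()

# The largest code point needs the longest UTF-8 sequence.
utf8_length = len(chr(MAX_CODE_POINT).encode("utf-8"))

# chr() rejects anything at or beyond the limit.
try:
    chr(LIMIT)
    rejected = False
except ValueError:
    rejected = True

print(LIMIT, round(exact_bits, 1), whole_bits, utf8_length, rejected)
```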

> so it can be
> used to create a lone surrogate, or (on systems with 16-bit
> "characters") a surrogate pair. And also ord(chr(i)) == i for all i in
> range(2**21).

for i in range(0x110000):  # 1114112
    if ord(chr(i)) != i:
        print(i)
# prints nothing (on Windows)

> I'm not sure about ord() on a 2-character string
> containing a surrogate pair on systems where strings contain 21-bit
> code points; I think it should be an error there, just as ord() on
> other strings of length != 1. But on systems with 16-bit "characters",
> ord() of strings of length 2 containing a valid surrogate pair should
> work.

And now does, thanks to whoever fixed this (within the last year, I think).
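
For reference, the pair arithmetic that ord() has to do on a 16-bit
build can be sketched in pure Python. The helper names below are
hypothetical, not anything in the stdlib. The last part assumes a build
where strings hold full code points (as PEP 393 proposes), where a
2-character string is an error for ord().

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a valid UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_surrogates(code_point):
    """Return the (high, low) surrogate pair for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = split_surrogates(0x1F600)
combined = combine_surrogates(high, low)

# With full code points, the two surrogates form a string of length 2,
# so ord() refuses it like any other string of length != 1.
pair = chr(high) + chr(low)
try:
    ord(pair)
    ord_failed = False
except TypeError:
    ord_failed = True

print(hex(high), hex(low), hex(combined), ord_failed)
```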

-- 
Terry Jan Reedy