[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 1 09:13:03 CEST 2011


Where I cut your words, we are in 100% agreement.  (FWIW :-)

Guido van Rossum writes:
 > On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull
 > <stephen at xemacs.org> wrote:

 > > Well, that's why I wrote "intended to be suggestive".  The Unicode
 > > Standard does not specify at all what the internal representation of
 > > characters may be, it only specifies what their external behavior must
 > > be when two processes communicate.  (For "process" as used in the
 > > standard, think "Python modules" here, since we are concerned with the
 > > problems of folks who develop in Python.)  When observing the behavior
 > > of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
 > > even UTF-32 arrays; only arrays of characters.
 > 
 > Hm, that's not how I would read "process". IMO that is an
 > intentionally vague term,

I agree.  I'm sorry that I didn't make myself clear.  The reason I
read "process" as "module" is that some modules of Python, and
therefore Python as a whole, cannot conform to the Unicode standard.
Eg, anything that inputs or outputs bytes.  Therefore only "modules"
and "types" can be asked to conform.  (I don't think it makes sense to
ask anything lower level to conform.  See below where I comment on
your .lower() example.)

What I am advocating (for the long term) is provision of *one* module
(or type) such that if the text processing done by the application is
done entirely in terms of this module (type), it will conform (to some
specified degree, chosen to balance user wants with implementation and
support costs).  It may be desirable to provide others for
sufficiently important particular use cases, but at present I see a
clear need for *one*.  Unicode conformance is going to be a common
requirement for apps used by global enterprises.

I oppose trying to make str into that type.  We need str, just as it
is, for many reasons.

 > and we are free to decide how to interpret it. I don't think it
 > will work very well to define a process as a Python module; what
 > about Python modules that agree about passing along arrays of code
 > units (or streams of UTF-8, for that matter)?

Certainly a group of cooperating modules could form a conforming
process, just as you describe it for one example.  The "one module"
mentioned above need not implement everything internally, but it would
take responsibility for providing guarantees (eg, unit tests) of
whatever conformance claims it makes.

 > > Thus, according to the rules of handling a UTF-16 stream, it is an
 > > error to observe a lone surrogate or a surrogate pair that isn't a
 > > high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
 > > C8-C10).  That's what I mean by "can't tell it's UTF-16".
 > 
 > But if you can observe (valid) surrogate pairs it is still UTF-16.

In the concrete implementation I have in mind, surrogate pairs are
represented by a str containing 2 code units.  But in that case
s[i][1] is an error, and s[i][0] == s[i].  print(s[i][0]) and
print(s[i]) will print the same character to the screen.  If you
encode it to bytes, well, it's not a str any more so what have you
proved?  Ie, what you will see is *code points* not in the BMP.

You don't have to agree that such "surrogate containment" behavior is
as valuable as I think it is, but that's what I have in mind as one
requirement for a "conforming implementation of UTF-16".
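
To make that concrete, here is a rough sketch of what I mean by
surrogate containment.  The class name and API are invented purely
for illustration (this is not a proposal), and the assertions assume
a CPython 3.3+ str indexed by code point:

    # Sketch only: a str-like wrapper stored as UTF-16 code units but
    # indexed by code point, so indexing can never expose half of a
    # surrogate pair.  Class name and API are hypothetical.
    class UTF16Text:
        def __init__(self, s):
            data = s.encode('utf-16-be')          # UTF-16 code units
            self.units = [int.from_bytes(data[i:i+2], 'big')
                          for i in range(0, len(data), 2)]

        def __getitem__(self, i):
            pos = 0
            for j, unit in enumerate(self.units):
                if 0xDC00 <= unit <= 0xDFFF:      # low half, already consumed
                    continue
                if pos == i:
                    if 0xD800 <= unit <= 0xDBFF:  # high surrogate: pair it up
                        low = self.units[j + 1]
                        return chr(0x10000 + ((unit - 0xD800) << 10)
                                   + (low - 0xDC00))
                    return chr(unit)
                pos += 1
            raise IndexError(i)

    t = UTF16Text("a\U0001F600b")    # astral character in the middle
    assert t[1] == "\U0001F600"      # a whole character, not a lone surrogate
    assert t[1][0] == t[1]           # s[i][0] == s[i]; t[1][1] is an IndexError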

 > At the same time I think it would be useful if certain string
 > operations like .lower() worked in such a way that *if* the input were
 > valid UTF-16, *then* the output would also be, while *if* the input
 > contained an invalid surrogate, the result would simply be something
 > that is no worse (in particular, those are all mapped to
 > themselves).

I don't think it's a good idea to go for conformance at the
method level.  It would be a nice feature for apps that don't claim
full conformance, because they would nevertheless give good results in
more cases.  The downside is that Python apps using str will pass
conformance tests written for, say, Western Europe, but end users in
Kuwait and Kuala Lumpur will report bugs.

 > An analogy is actually found in .lower() on 8-bit strings in Python 2:
 > it assumes the string contains ASCII, and non-ASCII characters are
 > mapped to themselves. If your string contains Latin-1 or EBCDIC or
 > UTF-8 it will not do the right thing. But that doesn't mean strings
 > cannot contain those encodings, it just means that the .lower() method
 > is not useful if they do. (Why ASCII? Because that is the system
 > encoding in Python 2.)

Sure.  I think that approach is fine for str, too, except that I would
hope it looks up BMP base characters in the case-mapping database.
The fact is that with very few exceptions non-BMP characters are going
to be symbols (mathematical operators and emoticons, for example).
This is good enough, except when it's not---but when it's not, only
100% conformance is really a reasonable target.  IMO, of course.
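
For what it's worth, Python 3 already shows both behaviors side by
side: bytes.lower() is the ASCII-only mapping you describe for Python
2 str, while str.lower() consults the Unicode case-mapping database
for BMP characters:

    # bytes.lower() only knows ASCII: the UTF-8 bytes of 'É' pass through
    # untouched, so the result decodes to 'Élan', not 'élan'.
    data = "ÉLAN".encode("utf-8")
    print(data.lower().decode("utf-8"))   # Élan

    # str.lower() uses the Unicode case-mapping database, so non-ASCII
    # BMP characters are handled correctly.
    print("ÉLAN".lower())                 # élan
    print("ΩΜΕΓΑ".lower())                # ωμεγα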

 > I think we should just document how it behaves and not get hung up on
 > what it is called. Mentioning UTF-16

If you also say, "this type can represent all characters in Unicode,
as well as certain non-characters", why mention UTF-16 at all?

 > Let's call those things graphemes (Tom C's term, I quite like leaving
 > "character" ambiguous)

OK, but those definitions need to be made clear, as "grapheme cluster"
and "combined character" are defined in the Unicode standard, and in
fact mean slightly different things from each other.

 > -- they are sequences of multiple code points that represent a
 > single "visual squiggle" (the kind of thing that you'd want to be
 > swappable in vim with "xp" :-). I agree that APIs are needed to
 > manipulate (match, generate, validate, mutilate, etc.)  things at
 > the grapheme level. I don't agree that this means a separate data
 > type is required.

Clear enough.
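
For the record, grapheme-level access layered on top of plain str is
already possible with the third-party regex module, whose \X pattern
matches extended grapheme clusters; I mention it only as one possible
shape for such APIs, not as a proposal:

    # Grapheme-level operations on an ordinary str, using the third-party
    # 'regex' module (pip install regex); \X matches an extended grapheme
    # cluster.
    import regex

    s = "e\u0301x"                        # 'e' + COMBINING ACUTE ACCENT + 'x'
    print(len(s))                         # 3 code points
    clusters = regex.findall(r"\X", s)
    print(clusters)                       # ['é', 'x'] -- 2 graphemes

    # The vim-style "xp" swap of the first two graphemes:
    clusters[0], clusters[1] = clusters[1], clusters[0]
    print("".join(clusters))              # 'xé'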

 > There are ever-larger units of information encoded in text strings,
 > with ever farther-reaching (and more vague) requirements on valid
 > sequences. Do you want to have a data type that can represent (only
 > valid) words in a language? Sentences? Novels?

No, and I can tell you why!  The difference between characters and
words is much more important than that between code point and grapheme
cluster for most users and the developers who serve them.  Even small
children recognize typographical ligatures as being composite objects,
while at least this Spanish-as-a-second-language learner was taught
that `ñ' is an atomic character represented by a discontiguous glyph,
like `i', and it is no more related to `n' than `m' is.  Users really
believe that characters are atomic.  Even in the cases of Han
characters and Hangul, users think of the characters as being
"atomic," but in the sense of Bohr rather than that of Democritus.

I think the situation for text processing is analogous to chemistry
where the atom, with a few fairly gross properties (the outer electron
orbitals) is the fundamental unit, not the elementary particles like
electrons and protons and structures like inner orbitals.  Sure, there
are higher order structures like molecules, phases, and crystals, but
it is elements that have the most regular and simply described
behavior for the chemist, and it does not become any simpler for the
chemist if you decompose the atom.  The composed character or grapheme
cluster is the analogue of the atom for most processing at the level
of "text".  The only real exceptions I can imagine are in the domain
of linguistics.

 > I think that at this point in time the best we can do is claim that
 > Python (the language standard) uses either 16-bit code units or 21-bit
 > code points in its string datatype, and that, thanks to PEP 393,
 > CPython 3.3 and further will always use 21-bit code points (but Jython
 > and IronPython may forever use their platform's native 16-bit code
 > unit representing string type). And then we add APIs that can be used
 > everywhere to look for code points (even if the string contains code
 > units), graphemes, or larger constructs. I'd like those APIs to be
 > designed using a garbage-in-garbage-out principle, where if the input
 > conforms to some Unicode requirement, the output does too, but if the
 > input doesn't, the output does what makes most sense. Validation is
 > then limited to codecs, and optional calls.

Clear enough.  I disagree that that will be enough for constructing
large-scale Unicode-conformant applications.  Somebody is going to
have to produce batteries for those applications, and I think they
should be included in Python.  I agree that it's proper that I and
those who think the same way take responsibility for writing and
implementing a PEP.
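
For reference, the code-unit/code-point distinction you describe is
easy to observe: on CPython 3.3+ an astral character is a single
element of a str, while a 16-bit-code-unit implementation (Jython, or
an old narrow build) sees two:

    # CPython 3.3+ (PEP 393): str is indexed by code point.
    import sys

    s = "\U0001F600"                  # U+1F600, outside the BMP
    print(hex(sys.maxunicode))        # 0x10ffff
    print(len(s))                     # 1 here; 2 on a 16-bit-code-unit build
    print([hex(ord(c)) for c in s])   # ['0x1f600'] vs ['0xd83d', '0xde00']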

 > If you index or slice a string, or create a string from chr() of a
 > surrogate or from some other value that the Unicode standard considers
 > an illegal code point, you better know what you are doing.

I think that's like asking a toddler to know that the stove is hot.
The consequences for the toddler of her ignorance are much greater,
but the informational requirement is equally stringent.  Of course
application writers are adults who could be asked to learn, but
economically I think it makes a lot more sense to include those
batteries.  IMHO YMMV, obviously.
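
For concreteness, the garbage-in-garbage-out division of labor you
describe is what CPython 3 does today: chr() and slicing never
validate, and conformance is checked at the codec boundary:

    # chr() happily produces a lone surrogate; only the codec objects.
    lone = chr(0xD800)                     # no error
    print(len(lone), hex(ord(lone)))       # 1 0xd800

    try:
        lone.encode("utf-8")               # validation happens here
    except UnicodeEncodeError as exc:
        print("codec rejects it:", exc)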

 > I want chr(i) to be valid for all values of i in range(2**21),

I quite agree (ie, for str).  Thus I perceive a need for another type.


