[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Wed Jun 13 12:24:33 CEST 2007


On 6/13/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> What you are saying is that if you write a 10-line script that claims
> Unicode conformance, you are responsible for the Unicode-correctness of
> all modules you call implicitly as well as that of the Python interpreter.

If text files are by default normalized when read and noncharacters are
stripped, where will you get problems in practice? A higher-level string
type may be useful, but there's no single obvious design.
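As a minimal sketch of what "normalize and strip noncharacters on read"
could look like (this is my illustration, not anything proposed in the
thread; the helper names are made up):

```python
import unicodedata

def is_noncharacter(ch):
    # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code
    # points of every plane (U+xxFFFE and U+xxFFFF).
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def clean(text):
    # Normalize to NFC, then drop any noncharacters.
    return ''.join(ch for ch in unicodedata.normalize('NFC', text)
                   if not is_noncharacter(ch))
```

With this, a decomposed 'e' + combining acute comes out as the single
code point U+00E9, and noncharacters silently disappear.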

>  > Practically speaking, there's little need to interpret surrogate pairs
>  > as two code points instead of as one non-BMP code point.
>
> Again, a mistake.  In the standard library, the question is not "do I
> need this?", but "what happens if somebody else does it?"  They may
> receive the same answer, but then again they may not.

What I meant is that the stdlib should only have string operations that
effectively work on (1) sequences of code units or (2) sequences of code
points, and that the choice between these two should be made reasonably.

One way to check whether a choice is reasonable is to consider what it
would mean for UTF-8, as there the difference between code units (0...ff)
and code points (0...10ffff) is the easiest to see. E.g. normalization
doesn't make any sense on code units, but slicing does.
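The UTF-8 case can be seen directly in Python (my example, stated in
Python 3 terms where `bytes` holds the code units):

```python
s = 'naïve'
units = s.encode('utf-8')    # code units: b'na\xc3\xafve' (6 bytes)

# Slicing code units can cut a character in half: this keeps only the
# first byte of the two-byte sequence for 'ï' ...
broken = units[:3]           # b'na\xc3'
# ... so broken.decode('utf-8') would raise UnicodeDecodeError.

# Slicing code points keeps characters whole:
whole = s[:3]                # 'naï'
```

So slicing is meaningful at both levels (with different results), while
normalization is only defined on code points.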

Once you have determined that the reasonable choice is code points for some
operation in general, then you shouldn't use the UCS-2 interpretation for
16-bit strings in particular, because it muddies the underlying rule,
and Unicode is clear as mud without extra muddying already :-)

> For example, suppose you have a supplier-consumer pair sharing a
> fixed-length buffer of 2-octet code units.  If it should happen that
> the supplier uses the UCS-2 interpretation, then a surrogate pair may
> get split when the buffer is full.

I.e. you have a supplier that works on code units. If you document this,
then there's no problem, especially if that's what the user expects.
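The buffer scenario is easy to simulate (again my illustration; the
buffer size is arbitrary): a supplier that fills a fixed-size buffer of
16-bit code units will happily split a surrogate pair at the boundary.

```python
text = 'A\U00010000B'               # non-BMP char -> surrogate pair in UTF-16
units = text.encode('utf-16-le')    # 2 bytes per code unit, 8 bytes total
BUF = 4                             # buffer holds two code units
chunks = [units[i:i + BUF] for i in range(0, len(units), BUF)]
# chunks[0] ends with the high surrogate 0xD800; the matching low
# surrogate 0xDC00 only arrives in chunks[1].
```

Decoding chunks[0] on its own as strict UTF-16 would fail, which is
exactly the situation a code-unit supplier imposes on its consumer.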

> Will a UTF-16 consumer be prepared for this?

This also needs to be documented, especially if the consumer isn't
prepared for split pairs. The consumer is more useful if it is. I've
been excavating some Cambrian-period discussions on the topic recently,
and this brings one post to mind:
http://mail.python.org/pipermail/i18n-sig/2001-June/001010.html

> Almost surely some will not, because that would imply maintaining an
> internal buffer, which is stupidly inefficient if you have an external
> buffer protocol.

You only need to buffer one code unit at most, so it's not inefficient.
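A sketch of such a consumer (mine, not from the thread; it assumes each
chunk contains whole 16-bit code units, i.e. has even length):

```python
class Utf16LEConsumer:
    def __init__(self):
        self.pending = b''   # at most one buffered code unit (2 bytes)

    def feed(self, chunk):
        data = self.pending + chunk
        self.pending = b''
        if len(data) >= 2:
            last = int.from_bytes(data[-2:], 'little')
            if 0xD800 <= last <= 0xDBFF:
                # Chunk ends with a high surrogate: hold it back until
                # its low-surrogate partner arrives in the next chunk.
                self.pending, data = data[-2:], data[:-2]
        return data.decode('utf-16-le')
```

Python's own incremental decoders (e.g.
`codecs.getincrementaldecoder('utf-16-le')`) do essentially this
internally, so the single-unit buffer is hardly exotic.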

> The problem is, suppose somehow you get a UCS-2 source?  Whose
> responsibility is it to detect that?

The user should check the API documentation. If the documentation is
missing, then you have to test or UTSL it (testing is good to do anyway).
If the documentation is wrong, then it's a bug.

> But the Unicode standard itself gives (the equivalent of) u'\ud800' +
> u'\udc00' as an example of the kind of thing you *should be able to
> do*.  Because, you know, clients of the standard library *will* be
> doing half-witted[1] things like that.

For UTF-16, yes, but for UTF-32, no. Any surrogate code units make
UTF-32 ill-formed, so there's no need to use them to make UTF-32 strings.
In UTF-16 surrogate pairs are allowed, and allowing isolated surrogates
makes some operations simpler. Kind of like how negative integers make
intermediate calculations simpler, even if the end result is always
non-negative.
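In today's Python 3, where strict UTF-16 refuses lone surrogates, the
standard's example can be reproduced with the `surrogatepass` error
handler (my illustration of the semantics being discussed):

```python
# Joining two isolated surrogates and round-tripping through UTF-16
# recovers the single non-BMP code point they encode.
pair = '\ud800' + '\udc00'
joined = (pair.encode('utf-16-le', 'surrogatepass')
              .decode('utf-16-le'))
# joined is the single code point U+10000.

# In UTF-32 the same surrogates are simply ill-formed:
# pair.encode('utf-32-le') raises UnicodeEncodeError.
```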

Python itself has both UTF-16 and UTF-32 behavior on UCS-4 builds, but
that's an original invention probably intended to make code written for
UTF-16 work unchanged on UCS-4 builds, following the rule "be lenient
in what you accept and strict in what you emit".

> Footnotes:
> [1]  What I wanted to say was いい加減にしろよ! ("cut it out already!") <wink>

しかたがあるまい。("It can't be helped.")
