[I18n-sig] Re: UCS-4 configuration

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 01:22:22 +0100

On Tue, 26 Jun 2001, guido@digicool.com wrote:
> Here's another weird failure in 4-byte mode, with a manually
> constructed surrogate pair (using marshal, but direct use of
> u.encode('utf8') would show the same problem):
> >>> u = u'\ud800\udc00'
> >>> u
> u'\ud800\udc00'
> >>> len(u)
> 2
> >>> import marshal
> >>> s = marshal.dumps(u)
> >>> s
> 'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80'
> >>> marshal.loads(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding
> Note how the utf8 codec has encoded the surrogate pair as two 3-byte
> utf8 sequences.  I think it should either spit out an error or (I
> think this is better -- "be forgiving in what you accept") recognize
> the surrogate pair and spit out a 4-byte utf8 sequence.  Note that
> in 2-byte mode, this same string literal can be marshalled and
> unmarshalled just fine!

I think that the best compromise is to discourage programmers from
creating non-BMP characters by manually splicing together surrogate
values, and encourage them to use unichr(appropriate non-BMP value)
instead.  This is not only more readable, but avoids this kind of
problem.  Perhaps the Python parser ought to produce a warning when it
encounters such a string constant, to help catch this sort of bug.  On
the other hand, disallowing unichr(some surrogate value) is probably
going too far: you should either allow all nonsensical values, or none
at all.
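
A minimal sketch of the difference between the two spellings, written
for a current Python 3 interpreter rather than the 2001 builds under
discussion (so chr() stands in for unichr(), and the "surrogatepass"
error handler is used only to reproduce the byte sequences quoted
above).  The helper name combine_surrogates is made up for
illustration.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to the code point it denotes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Spliced by hand: two separate surrogate code units.
spliced = "\ud800\udc00"

# Built from the scalar value: one non-BMP character.
direct = chr(combine_surrogates(0xD800, 0xDC00))

# Encoding each surrogate on its own yields two 3-byte sequences,
# which is not well-formed UTF-8.
spliced_bytes = spliced.encode("utf-8", "surrogatepass")

# Encoding the real character yields one 4-byte sequence.
direct_bytes = direct.encode("utf-8")
```

The two strings look alike in source form but are different values:
the first has length 2, the second length 1, and only the second has a
well-formed UTF-8 encoding.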

> I think I'm going to withdraw my recommendation that in 4-byte mode
> \U and unichr() would accept any 32-bit value; the use of UTF8 by
> marshal effectively rules this out.

UTF-8 is easily extended to store any 31-bit value; in fact the
current ISO definition of UTF-8 is like that, though it will be
changed to match the Unicode definition in the next version.  There is
an obvious tweak to store 32-bit values as well.
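
As a rough illustration of the 31-bit form (the one with sequences of
up to six bytes, as in RFC 2279), here is a sketch for a current
Python 3 interpreter.  The function name is hypothetical, and the code
deliberately rejects nothing except values outside the 31-bit range;
the 32-bit tweak mentioned above is not covered.

```python
def encode_utf8_31bit(cp):
    """Encode an integer in 0..0x7FFFFFFF using the original 1-6 byte UTF-8 scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])

    # (payload bits available, sequence length, lead-byte marker)
    layouts = (
        (11, 2, 0xC0),
        (16, 3, 0xE0),
        (21, 4, 0xF0),
        (26, 5, 0xF8),
        (31, 6, 0xFC),
    )
    for bits, length, lead in layouts:
        if cp < (1 << bits):
            break

    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    out[0] = lead | cp
    return bytes(out)
```

For values that standard UTF-8 can represent, this produces the same
bytes as the ordinary codec; it differs only in also accepting
surrogates and values above 0x10FFFF.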

Of course, using such a scheme means that UTF-8 is not used for
marshalling, just some closely related encoding.  But since we "own"
the marshalling format, this might not be such a problem.

> Or should we change the marshalling format to do something that's
> more transparent?  It feels uncomfortable that in 2-byte mode we can
> easily create unicode strings containing illegal sequences
> (e.g. lone surrogates), but these strings can't be marshalled.
> Marshal has no business being judgemental about the value of the
> data.

Just encode the lone surrogate as though it was a proper Unicode
scalar value.  This is a no-no if you go by the standard and I know
that I've been arguing against doing things like that in the standard
UTF-8 codec, but in the context of a private file format I think that
it is ok to use a private variation of UTF-8.  All we have to do is
make sure that it is referred to by a name different from UTF-8
("marshall" would be fine, I suppose) and that we never expose this
private goo to anything outside Python.
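
In a current Python 3 interpreter (an assumption; nothing like this
existed in the builds under discussion), the standard "surrogatepass"
error handler for the utf-8 codec behaves much like the private
variation described above: a lone surrogate is encoded as the
three-byte form its numeric value would get, and decoded back again.
The wrapper names below are hypothetical and only sketch the idea.

```python
def private_dumps(text):
    """Encode text, letting lone surrogates through as 3-byte sequences."""
    return text.encode("utf-8", "surrogatepass")


def private_loads(data):
    """Decode bytes produced by private_dumps, accepting encoded surrogates."""
    return data.decode("utf-8", "surrogatepass")


lone = "\ud800"
blob = private_dumps(lone)
roundtrip = private_loads(blob)

# The strict codec still refuses the same string, so the relaxed
# behaviour stays confined to the private wrappers.
try:
    lone.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```

Keeping the relaxed handling behind separately named functions is one
way to make sure the non-standard bytes are never mistaken for real
UTF-8 by other code.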

Big Gaute                               http://www.srcf.ucam.org/~gs234/
I am having a CONCEPTION--