[Python-Dev] Re: Regression in unicodestr.encode()?

François Pinard pinard@iro.umontreal.ca
09 Apr 2002 20:31:47 -0400

[Guido van Rossum]

> > [Barry A. Warsaw]
> > > My very limited Tim-enlightened understanding is that encoding a string
> > > to UTF-8 should never produce a string with NULs.

> [François]
> > Besides the fact that NULs encode to themselves.  In fact, 0-127 encode to
> > themselves, and are never produced otherwise.  Also, 254 and 255 are never
> > produced; I heard that some use these to escape out within a UTF-8 string.

> Hm, but isn't there a way to encode a NUL that doesn't produce a NUL?
> In some variant?

Not in UTF-8, that I know of.  The 128 characters of ASCII are invariant,
in and out, by design.  Please forgive me if I'm merely repeating things
you already know, but it goes this way, looking at a UTF-8 string:

1) any byte with the eighth bit clear represents itself (so NUL encodes
   to NUL),

2) any byte with the eighth bit set is part of a multi-byte sequence,

2a) any byte with both the eighth and seventh bits set is the start of
    a multi-byte sequence,

2b) any byte with the eighth bit set and the seventh bit clear is a
    continuation of a multi-byte sequence.

Point 2) implies that a multi-byte sequence never contains a NUL byte.
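The rules above can be sketched with a few lines of Python, checking them
against the built-in UTF-8 codec (the `classify` helper is mine, just for
illustration):

```python
def classify(byte):
    """Classify one byte of a UTF-8 stream per rules 1, 2a and 2b."""
    if byte < 0x80:        # eighth bit clear: the byte represents itself
        return "single"
    if byte & 0x40:        # eighth and seventh bits set: start of a sequence
        return "start"
    return "continuation"  # eighth bit set, seventh clear

# NUL encodes to itself (rule 1)...
assert "\x00".encode("utf-8") == b"\x00"

# ...and every byte of a multi-byte sequence has the eighth bit set,
# so such a sequence can never contain a NUL (point 2).
assert all(b >= 0x80 for b in "é€".encode("utf-8"))
assert [classify(b) for b in "é".encode("utf-8")] == ["start", "continuation"]
```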

There is also a rule about the shortest coding.  It is invalid UTF-8 to
use more bytes than required: a given UCS character has a unique valid
UTF-8 representation.  Moreover, decoders should raise an exception on
non-minimal UTF-8 encodings, and I do not know how Python behaves with
this.  The Gambit
author once told me he found a way to implement the test very efficiently.
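For what it's worth, current CPython's codec does reject non-minimal
forms.  A quick check with the classic overlong two-byte encoding of NUL,
b'\xc0\x80':

```python
# b'\xc0\x80' is a non-minimal (overlong) encoding of U+0000; a strict
# decoder must reject it, and CPython's UTF-8 codec does.
overlong_nul = b"\xc0\x80"

try:
    overlong_nul.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```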

One could construct a multi-byte sequence, containing no NULs, that would
fool a lazy UTF-8 decoder into producing a NUL.  But to do so, one has to
break the shortest coding rule, and so start from invalid UTF-8.
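To make that concrete, here is a deliberately lazy two-byte decoder (a
hypothetical helper of my own, not anything in Python's codecs): it applies
the bit arithmetic for a 110xxxxx 10yyyyyy sequence without checking for
minimal coding, so the invalid input b'\xc0\x80', which contains no NUL
byte, nevertheless "decodes" to NUL.

```python
def lazy_decode_2byte(lead, cont):
    """Decode a 110xxxxx 10yyyyyy pair with no validation at all."""
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

# The overlong encoding of NUL slips through the lazy decoder...
assert lazy_decode_2byte(0xC0, 0x80) == "\x00"
# ...while a legitimate two-byte sequence still decodes correctly (U+00E9).
assert lazy_decode_2byte(0xC3, 0xA9) == "é"
```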

François Pinard   http://www.iro.umontreal.ca/~pinard