[Python-Dev] Re: Regression in unicodestr.encode()?

Tue, 9 Apr 2002 21:46:11 -0500

On Tue, Apr 09, 2002 at 08:50:23PM -0400, Guido van Rossum wrote:
> I knew all that, but I thought I'd read about a hack to encode NUL
> using c0 80, specifically to get around the limitation on encoded
> strings containing a NUL.  But I can't find the reference so I'll shut
> up.

Tcl does this, and still did in a CVS checkout from a few weeks ago.  It's
done deliberately, presumably because some internal APIs don't handle
NUL-containing strings correctly.  I am certain that I saw a paper about
precisely this detail of Tcl, but apparently it's been taken down in shame.
I did find:
    TCL does its best to accept anything, but produce only shortest-form
    output.  The one special case is embedded nulls (0x0000), where
    Tcl produces 0xC0 0x80 in order to avoid possible null-termination
    problems with non-UTF aware code.  It probably wouldn't break
    anything to disallow non-shortest form UTF-8 for all but this
    one case.  If you eliminate the 0xc080 case, you'll have to check
    to make sure *everything* is length encoded.
	-- http://mail.nl.linux.org/linux-utf8/2001-03/msg00029.html

About Java:
    The interfaces java.io.DataInput and java.io.DataOutput have methods
    called `readUTF' and `writeUTF' respectively. But note that they don't
    use UTF-8; they use a modified UTF-8 encoding: the NUL character
    is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00,
    and a 0x00 byte is added at the end. Encoded this way, strings can
    contain NUL characters and nevertheless need not be prefixed with a
    length field - the C <string.h> functions like strlen() and strcpy()
    can be used to manipulate them.
	-- http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html
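
To make the NUL substitution concrete, here is a rough Python sketch.  The
function names are made up for illustration, and it covers only the NUL
handling described in the quote, not the rest of what Java does in its
modified encoding:

```python
def encode_nul_free(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as C0 80 (hypothetical helper).

    In UTF-8 a 0x00 byte can only ever stand for U+0000, so a plain byte
    replacement is enough.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Reverse encode_nul_free (hypothetical helper).

    The byte 0xC0 never occurs in well-formed UTF-8, so every C0 80 pair
    in the input must be a substituted NUL.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_free("a\x00b")
# The encoded form has no 0x00 byte, so a C-style terminator can follow it.
c_string = encoded + b"\x00"
```

The point of the trick is visible in the last line: the only 0x00 in
c_string is the terminator, so strlen() would see the whole string.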

Why Python refuses to do it this way:
    for security reasons, the UTF-8 codec gives you an "illegal encoding"
    error in this case.
	-- http://aspn.activestate.com/ASPN/Mail/Message/i18n-sig/581440
	(our very own Mr. Fredrik Lundh, also quoting the Gospel of RFC,
	chapter 2279)
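
A small check of that behaviour, assuming a current Python 3 interpreter
(the helper name is mine, not part of any library):

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A real NUL byte is ordinary, well-formed UTF-8.
nul_ok = is_strict_utf8(b"\x00")

# C0 80 is a non-shortest ("overlong") spelling of NUL and is rejected.
overlong_nul_ok = is_strict_utf8(b"\xc0\x80")

# C0 AF is an overlong spelling of "/", the kind of sequence that could
# slip past a byte-level check for "/" if decoders accepted it.
overlong_slash_ok = is_strict_utf8(b"\xc0\xaf")
```

The third case illustrates the sort of security reason the quote alludes
to: a filter that scans bytes for "/" would never see one in C0 AF.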

Ah, and here's the article where I originally found the c0 80 idea,
presented as a way to make existing programs handle embedded NULs:
    Now going the other way. In orthodox UTF-8, a NUL byte(\x00) is
    represented by a NUL byte. Plain enough. But in Tcl we sometimes
    want NUL bytes inside "binary" strings (e.g. image data), without
    them terminating it as a real NUL byte does. To represent a NUL byte
    without any physical NUL bytes, we treat it like a character above
    ASCII, which must be a minimum two bytes long:

	(110)00000 (10)000000 => C0 80

    Whoops. Took us a while, but now we can read UTF-8, bit by bit. 
	-- http://mini.net/tcl/1211.html
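
The bit layout in that excerpt can be worked through in a few lines of
Python.  This is a deliberately lenient two-byte decoder of my own, meant
only to show the arithmetic; it is not how any real codec is written:

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Combine a 110xxxxx 10yyyyyy byte pair into a code point.

    Lenient on purpose: it does not reject non-shortest forms.
    """
    if lead & 0xE0 != 0xC0:
        raise ValueError("lead byte is not of the form 110xxxxx")
    if trail & 0xC0 != 0x80:
        raise ValueError("trail byte is not of the form 10yyyyyy")
    # Five payload bits from the lead byte, six from the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_overlong_two_byte(lead: int, trail: int) -> bool:
    """True if the pair encodes something that fits in a single byte."""
    return decode_two_byte(lead, trail) < 0x80


nul = decode_two_byte(0xC0, 0x80)       # all payload bits are zero
e_acute = decode_two_byte(0xC3, 0xA9)   # a legitimate two-byte character
```

A strict decoder does the same arithmetic and then refuses any result
below 0x80, which is exactly the check that rules out C0 80.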

I'm terribly glad that Python has gotten this detail right.

Jeff