[Python-Dev] Re: Regression in unicodestr.encode()?
jepler@unpythonic.dhs.org
Tue, 9 Apr 2002 21:46:11 -0500
On Tue, Apr 09, 2002 at 08:50:23PM -0400, Guido van Rossum wrote:
> I knew all that, but I thought I'd read about a hack to encode NUL
> using c0 80, specifically to get around the limitation on encoded
> strings containing a NUL. But I can't find the reference so I'll shut
> up.
Tcl does this, even in a CVS checkout from a few weeks ago. It's done
deliberately, apparently because some internal APIs don't handle
NUL-containing strings correctly. I am certain that I saw a paper about
precisely this detail of Tcl, but apparently it's been taken down in shame.
I did find:
TCL does its best to accept anything, but produce only shortest-form
output. The one special case is embedded nulls (0x0000), where
Tcl produces 0xC0 0x80 in order to avoid possible null-termination
problems with non-UTF aware code. It probably wouldn't break
anything to disallow non-shortest form UTF-8 for all but this
one case. If you eliminate the 0xc080 case, you'll have to check
to make sure *everything* is length encoded.
-- http://mail.nl.linux.org/linux-utf8/2001-03/msg00029.html
About Java:
The interfaces java.io.DataInput and java.io.DataOutput have methods
called `readUTF' and `writeUTF' respectively. But note that they don't
use UTF-8; they use a modified UTF-8 encoding: the NUL character
is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00,
and a 0x00 byte is added at the end. Encoded this way, strings can
contain NUL characters and nevertheless need not be prefixed with a
length field - the C <string.h> functions like strlen() and strcpy()
can be used to manipulate them.
-- http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html
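A minimal sketch of the scheme as the HOWTO describes it, written in
present-day Python. It only covers the NUL handling and the trailing
terminator; it is not a reimplementation of readUTF/writeUTF, and the
helper names encode_nul_free and decode_nul_free are made up for
illustration:

```python
def encode_nul_free(text: str) -> bytes:
    """Encode text so the only 0x00 byte is the trailing terminator.

    Hypothetical helper: U+0000 becomes the two-byte form C0 80, every
    other character is plain UTF-8, and one 0x00 is appended at the end.
    """
    body = text.encode("utf-8")
    # In well-formed UTF-8 a 0x00 byte can only stand for U+0000,
    # so a bytewise replace is safe here.
    return body.replace(b"\x00", b"\xc0\x80") + b"\x00"


def decode_nul_free(data: bytes) -> str:
    """Reverse encode_nul_free, reading up to the first 0x00 like strlen()."""
    body = data[:data.index(b"\x00")]
    # 0xC0 never occurs in well-formed UTF-8, so C0 80 can only be
    # the two-byte NUL produced by the encoder.
    return body.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_free("a\x00b")
print(encoded)                   # the embedded NUL is spelled C0 80
print(encoded.index(b"\x00"))    # offset of the terminator, i.e. strlen()
print(decode_nul_free(encoded) == "a\x00b")
```

The point of the trick is visible in the second print: a scan for the
first zero byte runs past the embedded NUL and stops only at the
terminator.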
Why Python refuses to do it this way:
for security reasons, the UTF-8 codec gives you an "illegal encoding"
error in this case.
-- http://aspn.activestate.com/ASPN/Mail/Message/i18n-sig/581440
(our very own Mr. Fredrik Lundh, also quoting the Gospel of RFC,
chapter 2279)
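The behaviour is easy to check. The snippet below is a small sketch
against a present-day Python 3 interpreter rather than the 2002 codec
under discussion, and is_strict_utf8 is a made-up helper name:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if the built-in UTF-8 codec accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# NUL is encoded in its shortest form, a single zero byte.
print("\x00".encode("utf-8") == b"\x00")

# The shortest form decodes; the overlong two-byte form is refused.
print(is_strict_utf8(b"\x00"))
print(is_strict_utf8(b"\xc0\x80"))
```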
Ah, and here's the article where I originally found the c0 80 idea,
presented as a way to make existing programs handle embedded NULs:
Now going the other way. In orthodox UTF-8, a NUL byte (\x00) is
represented by a NUL byte. Plain enough. But in Tcl we sometimes
want NUL bytes inside "binary" strings (e.g. image data), without
them terminating it as a real NUL byte does. To represent a NUL byte
without any physical NUL bytes, we treat it like a character above
ASCII, which must be a minimum two bytes long:
(110)00000 (10)000000 => C0 80
Whoops. Took us a while, but now we can read UTF-8, bit by bit.
-- http://mini.net/tcl/1211.html
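The bit pattern in that excerpt can be reproduced with a few lines of
Python. This is only a sketch of the two-byte layout (110xxxxx
10xxxxxx); two_byte_form is a hypothetical name, and it deliberately
skips the shortest-form check that a conforming encoder would apply:

```python
LEAD_PREFIX = 0b110_00000   # marks the first byte of a two-byte sequence
CONT_PREFIX = 0b10_000000   # marks a continuation byte


def two_byte_form(code_point: int) -> bytes:
    """Pack a code point below 0x800 into the two-byte UTF-8 layout.

    No shortest-form check is made, so code points below 0x80
    (including NUL) come out in overlong form.
    """
    if not 0 <= code_point < 0x800:
        raise ValueError("two-byte form holds only 11 bits")
    lead = LEAD_PREFIX | (code_point >> 6)     # top 5 bits
    cont = CONT_PREFIX | (code_point & 0x3F)   # low 6 bits
    return bytes([lead, cont])


print(two_byte_form(0x00).hex())                     # the overlong NUL
print(two_byte_form(0xE9) == "\u00e9".encode("utf-8"))  # a legitimate two-byte character
```

For code points from 0x80 up to 0x7FF this layout is the ordinary
UTF-8 encoding; below 0x80 it yields exactly the non-shortest forms
that a strict decoder rejects.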
I'm terribly glad that Python has gotten this detail right.
Jeff