[Python-Dev] Unicode support in getargs.c

Martin v. Loewis martin@v.loewis.de
Sun, 6 Jan 2002 01:10:42 +0100


> [leaving only one example in:]
> >                 version = "HTTP/0.9"
> >                 status = "200"
> >                 reason = ""
> > 
> > Protocol elements, thus byte string.
> 
> I think you're taking it too far now. I think we should assume that
> ASCII survives. 

That is not the issue. That string *is* a byte string. The HTTP
protocol is not defined in terms of character sequences, but in terms
of byte sequences, or else interoperability would be lost.

If those strings were converted to character strings (i.e. Unicode
strings), it would still work, but it would no longer be correct. That's
just like giving a file size as a double: it would probably work, but
it would not be correct.
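
To illustrate the distinction, here is a minimal sketch. It assumes
present-day Python 3 syntax, where byte strings and character strings
are separate types; the helper name parse_status_line is hypothetical
and not taken from httplib. The protocol elements stay byte strings,
and turning one into a character string is a separate, explicit step.

```python
def parse_status_line(line: bytes):
    """Split a raw HTTP status line into byte-string protocol elements."""
    # Hypothetical helper: the wire format is bytes, so the parts stay bytes.
    parts = line.rstrip(b"\r\n").split(b" ", 2)
    version = parts[0]
    status = parts[1] if len(parts) > 1 else b""
    reason = parts[2] if len(parts) > 2 else b""
    return version, status, reason


version, status, reason = parse_status_line(b"HTTP/1.0 200 OK\r\n")

# Converting to a character string is an explicit decision by the caller.
version_text = version.decode("ascii")
```

The decode call is the only place where a character set is assumed;
everything before it deals in bytes only.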

> Also, as these things are readable they should be treated as such. It
> should be possible to do
> >>> print u"Funny reply to my "+unicode(version)+u" message"
> especially when the "funny reply" bit is in Japanese.

That is a nice property of so-called "text" protocols. That still
doesn't make it a character-oriented protocol; HTTP *is* a
byte-oriented protocol. If you have a binary protocol, there is likely
also a version field in it, but you'd have to write

print u"Funny reply to my "+XDRversion2string(version)+u" message"
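
A conversion function of that kind might look like the sketch below.
It assumes Python 3 syntax and assumes the version field is a single
XDR unsigned integer (4 bytes, big-endian); the name
xdr_version_to_string is hypothetical, and a real protocol may lay the
field out differently.

```python
import struct


def xdr_version_to_string(raw: bytes) -> str:
    """Render a binary version field as a character string."""
    # Assumption: the field is one XDR unsigned int (4 bytes, big-endian).
    (number,) = struct.unpack(">I", raw)
    return str(number)


message = ("Funny reply to my "
           + xdr_version_to_string(b"\x00\x00\x00\x02")
           + " message")
```

The text protocol needs the same kind of step; it is just that the
conversion happens to be a plain ASCII decode there.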

> What I would agree with, I think, is if we tag these strings as
> "ascii". 

That is pointless. Having strings tagged with their encoding is also a
possible architecture for a programming language, but not one that
Python has chosen. Instead, Python has chosen to have only a single
data type for character data, namely Unicode.

> Python sourcecode is ASCII, and if you put 8 bit characters in there
> you're living dangerously.
[...]
> Only when octal or hex escapes appear in a sourcecode string can it be
> anything other than ascii.

The octal escapes, in themselves, are also ASCII, or else you could
not put them into source code. The traditional string type in Python
really is a byte string type first of all. It can be used as a
character string type only if you imply a character set and an
encoding. The source being ASCII just gives you a guarantee about the
bytes you get at runtime.
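
A small sketch of that point, assuming Python 3 syntax (where byte
string literals carry a b prefix) and picking Latin-1 and KOI8-R purely
as example encodings: the escape is spelled in ASCII, the runtime value
is a single byte, and which character that byte stands for depends
entirely on the encoding you imply.

```python
# The escape below is written with ASCII characters only, yet the value
# at runtime is the single non-ASCII byte 0xE4.
raw = b"\344"

# The same byte becomes a different character under each implied encoding.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8-r")

# Pure ASCII bytes happen to decode identically under both encodings.
ascii_only = b"HTTP/0.9"
same_text = ascii_only.decode("latin-1") == ascii_only.decode("koi8-r")
```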

Regards,
Martin