Re: [Python-Dev] Unicode support in getargs.c
Recently, "Martin v. Loewis" wrote:
When the discussion of tagging binary strings in source code came up, I started to look into the standard library which string literals would have to be tagged as byte strings, and which are really character strings.
I found that the overwhelming majority of string literals in the standard Python library really denotes byte strings, if you ignore doc strings. Sometimes, it isn't obvious that they are binary strings, hence the smiley.
[leaving only one example in:]
    version = "HTTP/0.9"
    status = "200"
    reason = ""
Protocol elements, thus byte string.
I think you're taking it too far now. I think we should assume that ASCII survives. If Python runs on an EBCDIC machine (does it?) I assume that at some point the conversion of EBCDIC<->ASCII is handled semi-transparently. Also, as these things are readable they should be treated as such. It should be possible to do
    print u"Funny reply to my "+unicode(version)+u" message"
especially when the "funny reply" bit is in Japanese.
What I would agree with, I think, is if we tag these strings as "ascii". And that is also what the BDFL pronounced at some point: Python source code is ASCII, and if you put 8-bit characters in there you're living dangerously. Only when octal or hex escapes appear in a source code string can it be anything other than ASCII.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.cwi.nl/~jack        | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
[leaving only one example in:]
    version = "HTTP/0.9"
    status = "200"
    reason = ""
Protocol elements, thus byte string.
I think you're taking it too far now. I think we should assume that ASCII survives.
That is not the issue. That string *is* a byte string. The HTTP protocol is not defined in terms of character sequences, but in terms of byte sequences, or else interoperability would be lost. If those strings were converted to character strings (i.e. Unicode strings), it would still work, but it wouldn't be correct anymore. That's just like giving a file size as a double: it would probably work, but it wouldn't be correct.
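To make the distinction concrete, here is a minimal sketch. It assumes Python 3's separate bytes and str types rather than the string type under discussion in this thread, and both the sample status line and the helper name parse_status_line are made up for illustration. The protocol elements stay in the byte domain, and turning them into characters is an explicit step that names an encoding.

```python
# Sketch only: Python 3 spelling of the byte/character distinction.
# The status line and parse_status_line are illustrative assumptions.
status_line = b"HTTP/1.0 200 OK\r\n"   # bytes as they appear on the wire


def parse_status_line(line):
    """Split a raw status line into (version, status, reason), all bytes."""
    version, status, reason = line.rstrip(b"\r\n").split(b" ", 2)
    return version, status, reason


version, status, reason = parse_status_line(status_line)

# Protocol-level comparisons stay in the byte domain.
is_http_1_0 = (version == b"HTTP/1.0")

# Converting to characters is an explicit step that names an encoding.
message = "Funny reply to my " + version.decode("ascii") + " message"
print(message)
```

The decode call is where the "ASCII survives" assumption gets written down explicitly instead of being implied.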
Also, as these things are readable they should be treated as such. It should be possible to do
    print u"Funny reply to my "+unicode(version)+u" message"
especially when the "funny reply" bit is in Japanese.
That is a nice property of so-called "text" protocols. That still doesn't make it a character-oriented protocol; HTTP *is* a byte oriented protocol. If you have a binary protocol, there is likely also a version field in it, but you'd have to write
    print u"Funny reply to my "+XDRversion2string(version)+u" message"
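As a rough sketch of what such a conversion function might look like: the helper below is hypothetical (a stand-in for the XDRversion2string named above), it is written in Python 3 syntax, and it assumes the version field is a single 4-byte big-endian unsigned integer, which is how XDR encodes unsigned ints.

```python
import struct


def xdr_version_to_string(field):
    """Hypothetical helper: turn a 4-byte big-endian unsigned int into text."""
    (number,) = struct.unpack(">I", field)
    return str(number)


# Assumed wire format for the sake of the example: version 2 in 4 bytes.
version = struct.pack(">I", 2)
message = "Funny reply to my " + xdr_version_to_string(version) + " message"
print(message)
```

In both the text and the binary case, something has to map protocol bytes to characters; the text protocol merely makes that mapping look trivial.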
What I would agree with, I think, is if we tag these strings as "ascii".
That is pointless. Having strings tagged with their encoding is also a possible architecture for a programming language, but not one that Python has chosen to take. Instead, Python has chosen to have only a single data type for character data, namely Unicode.
Python source code is ASCII, and if you put 8-bit characters in there you're living dangerously. [...] Only when octal or hex escapes appear in a source code string can it be anything other than ASCII.
The octal escapes, in themselves, are also ASCII, or else you could not put them into source code. The traditional string type in Python really is a byte string type first of all. It can be used as a character string type only if you imply a character set and an encoding. The source being ASCII just gives you a guarantee about the bytes you get at runtime.

Regards,
Martin
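A small sketch of the last point in the message above, written with Python 3's bytes type as an assumption (the literal and variable names are illustrative only): the source text is pure ASCII, the runtime value is a fixed pair of bytes, and which characters those bytes stand for depends entirely on the encoding you imply.

```python
# ASCII-only source text; the octal escapes yield two non-ASCII bytes.
raw = b"\303\251"
byte_count = len(raw)

# The same bytes read as characters under two different implied encodings.
as_utf8 = raw.decode("utf-8")      # one character, U+00E9
as_latin1 = raw.decode("latin-1")  # two characters, U+00C3 and U+00A9

print(byte_count, len(as_utf8), len(as_latin1))
```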
jack wrote:
If Python runs on an EBCDIC machine (does it?)
http://home.no.net/pgummeda/ (2.2 on as/400)
http://www-1.ibm.com/servers/eserver/zseries/zos/unix/python.html (1.4 on os/390)

</F>
participants (3)
- Fredrik Lundh
- Jack Jansen
- Martin v. Loewis