Re: [Python-Dev] Unicode support in getargs.c

Sigh, I let myself be drawn in again, despite my previous assertion....

Recently, "Martin v. Loewis" <martin@v.loewis.de> said:
>> For this it should be as backward-compatible as possible, i.e. if some API expects a unicode filename and I pass "a.out" it should interpret it as u"a.out".
> That works fine with the current API.
No, it doesn't, that is the whole point of why I started this thread!!!! If the Python wrapper around the API uses PyArg_Parse("u") then it will barf on "a.out"; if the wrapper uses "u#" it will not barf but will instead completely misinterpret the StringObject containing "a.out", interpreting it as the binary representation of 3 unicode characters or something far worse! Yes, there is a workaround with the "O" format and three more function calls, but I wouldn't call that "works fine"...
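To make the "3 unicode characters" misreading concrete, here is a rough sketch in present-day Python. It is not the getargs.c code path. It only reinterprets the bytes of "a.out" as 16-bit code units, under three loudly stated assumptions: the wide character type is 2 bytes, the machine is little-endian, and the terminating NUL of the C buffer gets swept into the last unit. The helper name misread_as_utf16_units is hypothetical.

```python
import struct

def misread_as_utf16_units(raw):
    """Reinterpret a byte string as 16-bit little-endian code units.

    Assumptions for illustration only: 2-byte wide characters,
    little-endian layout, and a trailing NUL read as part of the data.
    """
    padded = raw + b"\0"          # the C buffer ends with a NUL byte
    if len(padded) % 2:           # keep the length a multiple of 2
        padded += b"\0"
    count = len(padded) // 2
    return list(struct.unpack("<%dH" % count, padded))

units = misread_as_utf16_units(b"a.out")
print(len(units), [hex(u) for u in units])
```

Under those assumptions the five bytes of "a.out" come out as three code units that have nothing to do with the intended filename, which is the failure mode described above.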
>> Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer.
> This is a conclusion I cannot agree with. Most strings are really binary, if you look at them closely enough :-)

I'm not sure I understand this remark. If you made it just for the smiley: never mind. If you really don't agree: please explain why.

--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.cwi.nl/~jack        | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

> No, it doesn't, that is the whole point of why I started this thread!!!!
Oops, right. I was thinking the other way around: passing u"a.out" where "a.out" is expected works fine; for this case, the memory management issues come into play.
>>> Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer.
>> This is a conclusion I cannot agree with. Most strings are really binary, if you look at them closely enough :-)
> I'm not sure I understand this remark. If you made it just for the smiley: never mind. If you really don't agree: please explain why.
When the discussion of tagging binary strings in source code came up, I started to look into the standard library to see which string literals would have to be tagged as byte strings, and which are really character strings. I found that the overwhelming majority of string literals in the standard Python library really denote byte strings, if you ignore doc strings.

Sometimes, it isn't obvious that they are binary strings, hence the smiley. Look at httplib.py:

    __all__ = ["HTTP", ...

Not sure: Are Python function names byte strings or character strings? Probably doesn't matter either way. Python source code is definitely byte-oriented, explicitly without any assumed encoding, so I'd lean towards byte strings here.

    _UNKNOWN = 'UNKNOWN'

Looks like a character string. However, it is used in

    self.version = _UNKNOWN # HTTP-Version

self.version is later sent on the byte-oriented HTTP protocol, so _UNKNOWN *is* a byte string.

    _CS_IDLE = 'Idle'

These are enumerators, let's say they are character strings.

    self.fp = sock.makefile('rb', 0)

Not sure. Could be a character string.

    print "reply:", repr(line)

Definitely a character string.

    version = "HTTP/0.9"
    status = "200"
    reason = ""

Protocol elements, thus byte string.

So, I'm arguing that byte strings are far more common than you may think at first sight. In particular, everything passed to .read(), either of a file, or of a socket, is a byte string, since files and network connections are byte-oriented. In the particular case of network connections, applying system conventions for narrow strings would be foolish.

Regards,
Martin
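The byte-string versus character-string distinction argued above can be sketched with the separate bytes and str types of present-day Python 3, which postdates this thread. This is an illustration only, not httplib code: io.BytesIO stands in for a socket file object such as one from sock.makefile('rb'), and the status line and the variable names are made up for the example.

```python
import io

# Stand-in for a byte-oriented connection; a real one would come from
# something like sock.makefile('rb'). The status line is invented.
fake_socket_file = io.BytesIO(b"HTTP/1.0 200 OK\r\n")

# Reading from a binary stream yields bytes, not characters.
status_line = fake_socket_file.readline()

# Protocol elements stay byte strings while they are being parsed.
version, status, reason = status_line.rstrip(b"\r\n").split(None, 2)

# Turning a protocol element into text for a human is an explicit step
# with an explicitly chosen encoding, not a system-wide convention.
reason_text = reason.decode("ascii")
print(repr(version), repr(status), repr(reason_text))
```

The point of the sketch is only that what comes off the wire is bytes, and that any move to character data requires naming an encoding.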