
The problem with "s" and "s#" is that they're already semantically overloaded, and will become more so with support for multiple charsets.

* Some modules use "s#" when they mean "give me a pointer to an area of memory and its length". Writing to binary files is an example of this.
* Some modules use it to mean "give me a pointer to a string". Writing to a text file is (probably) an example of this.
* Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This is the case if we're going to actually look at the contents (think of string.upper() and such).

I think that the only real solution is to define what "s" means, come up with new getarg-formats for the other two use cases, and convert all modules to use the new standard. It'll still cause grief to extension modules that aren't part of the core, but at least the problem will go away after a while.

--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
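[ Editor's note: a minimal sketch of the overloading Jack describes. Both functions below are hypothetical and use the same "s#" format, yet the first treats the data as opaque memory while the second inspects it as ASCII text. ]

    #include "Python.h"
    #include <ctype.h>
    #include <stdio.h>

    /* Use 1: "give me a pointer to an area of memory and its length";
       the contents are never interpreted as characters. */
    static PyObject *
    write_binary(PyObject *self, PyObject *args)
    {
        char *buf;
        int len;                        /* 1.5-era lengths are ints */
        if (!PyArg_ParseTuple(args, "s#", &buf, &len))
            return NULL;
        fwrite(buf, 1, len, stdout);    /* raw bytes out */
        Py_INCREF(Py_None);
        return Py_None;
    }

    /* Use 3: "give me an 8-bit ASCII string" -- same format string,
       but the contents are examined character by character. */
    static PyObject *
    count_upper(PyObject *self, PyObject *args)
    {
        char *s;
        int len, i, n = 0;
        if (!PyArg_ParseTuple(args, "s#", &s, &len))
            return NULL;
        for (i = 0; i < len; i++)
            if (isupper((unsigned char)s[i]))
                n++;
        return PyInt_FromLong(n);
    }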

This was done last year!! We have "s#" meaning "give me some bytes." We have "t#" meaning "give me some 8-bit characters." The Python distribution has been completely updated to use the appropriate format in each call. This was done *specifically* to support the introduction of a Unicode type.

The intent was that "s#" returns the *raw* bytes of the Unicode string -- NOT a UTF-8 encoding! As a separate argument, MAL can argue that "t#" should create an internal, associated buffer to hold a UTF-8 encoding and then return that. But "s#" should return the raw bytes!

[ and I'll argue against the response to "t#" anyhow... ]

-g

On Fri, 12 Nov 1999, Jack Jansen wrote:
-- Greg Stein, http://www.lyra.org/
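[ Editor's note: a sketch of the distinction Greg is drawing, using the 1.5.2-era API, where "s#" accepts any read-only buffer and "t#" requires 8-bit character data. Under Greg's reading, a future Unicode object passed through "s#" would yield its raw internal bytes, while "t#" would yield an 8-bit *encoding* of them. ]

    #include "Python.h"

    static PyObject *
    lengths(PyObject *self, PyObject *args)
    {
        char *bytes, *text;
        int blen, tlen;

        /* "s#": raw bytes -- for Unicode, the internal representation */
        /* "t#": 8-bit characters -- an encoding of the Unicode data  */
        if (!PyArg_ParseTuple(args, "s#t#", &bytes, &blen, &text, &tlen))
            return NULL;
        return Py_BuildValue("(ii)", blen, tlen);
    }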

[Greg writes]
Hmm. Climbing over these dead bodies could get a bit smelly :-)

I'm inclined to agree that holding 2 internal buffers for the unicode object is not ideal. However, I _am_ concerned with getting decent PyArg_ParseTuple and Py_BuildValue support, and if the cost is an extra buffer I will survive. So let's look for solutions that don't require it, rather than holding it up as evil when no other solution is obvious.

My requirements appear to me to be very simple (for an anglophile): Let's say I have a platform Unicode value - e.g., I got a Unicode value from some external library (say COM :-). Let's assume for now that the Unicode string is fully representable as ASCII - say a file or directory name that COM gave me. I simply want to be able to pass this Unicode object to "open()" and have it work. This assumes that open() will not become "native unicode", simply as the underlying C support is not unicode aware - the value needs to be converted to a "char *" (i.e., it will use the "t#" format).

The second side of the equation is when I expose a Python function that talks Unicode - e.g., I need to _pass_ a platform Unicode value to an external library. The Python programmer should be able to pass a Unicode object (no problem), or a PyString object.

In code terms:

  Prob1:
    name = SomeComObject.GetFileName()  # A Unicode object
    f = open(name)

  Prob2:
    SomeComObject.SetFileName("foo.txt")

IMO it is important that we have a good strategy for dealing with this for extensions. MAL addresses one direction, but not the other. Maybe if we toss around general solutions for this, the implementation will fall out. MAL's idea of the additional buffer starts to address this, but isn't the whole story. Any ideas on this?
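[ Editor's note: the two directions, sketched from the C side. GetAsPlatformUnicode is an entirely invented placeholder for whatever conversion helper gets chosen; no such API existed at the time. ]

    #include "Python.h"
    #include <stdlib.h>

    /* Prob1 direction: open() only speaks 8-bit chars ("t#"); a Unicode
       argument must somehow satisfy that format. */
    static PyObject *
    open_sketch(PyObject *self, PyObject *args)
    {
        char *name;
        int len;
        if (!PyArg_ParseTuple(args, "t#", &name, &len))
            return NULL;
        /* ... fopen(name, "r") ... */
        Py_INCREF(Py_None);
        return Py_None;
    }

    /* Prob2 direction: the wrapper needs platform Unicode from either a
       PyString or a Unicode object. */
    static PyObject *
    set_filename_sketch(PyObject *self, PyObject *args)
    {
        PyObject *ob;
        wchar_t *wname;

        if (!PyArg_ParseTuple(args, "O", &ob))
            return NULL;
        wname = GetAsPlatformUnicode(ob);  /* hypothetical helper */
        if (wname == NULL)
            return NULL;
        /* ... pSomeComObject->SetFileName(wname) ... */
        free(wname);                       /* assume the helper mallocs */
        Py_INCREF(Py_None);
        return Py_None;
    }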

On Sat, 13 Nov 1999, Mark Hammond wrote:
I believe Py_BuildValue is pretty straightforward. Simply state that it is allowed to perform conversions and place the resulting object into the resulting tuple (with appropriate refcounting). In other words:

  tuple = Py_BuildValue("U", stringOb);

The stringOb will be converted to a Unicode object. The new Unicode object will go into the tuple (with the tuple holding the only reference!). The stringOb will NOT acquire any additional references.

[ "U" format may be wrong; it is here for example purposes ]
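[ Editor's note: the refcounting Greg describes, as a sketch. "U" is his placeholder format, not an existing one; the parentheses in "(U)" force an actual tuple result, which is real Py_BuildValue behavior. ]

    /* inside some C function returning PyObject * */
    PyObject *stringOb, *tuple;

    stringOb = PyString_FromString("hello");
    if (stringOb == NULL)
        return NULL;
    tuple = Py_BuildValue("(U)", stringOb);  /* hypothetical "U" */
    /* stringOb gained no references: the converted Unicode object
       lives only in the tuple, which holds the sole reference to it. */
    Py_DECREF(stringOb);   /* we still own, and now release, the original */
    return tuple;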
Okay... now PyArg_ParseTuple() is the *real* kicker.

Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a string-like object which can be passed to the OS as an 8-bit string. In Prob2, you want a string-like object which can be passed to the OS as a Unicode string. I see three options for PyArg_ParseTuple:

1) Allow it to return NEW objects which must be DECREF'd. [ current policy only loans out references ]

   This option could be difficult in the presence of errors during the parse. For example, the current idiom is:

     if (!PyArg_ParseTuple(args, "..."))
         return NULL;

   If an object was produced, but a later argument caused a failure, then who is responsible for freeing the object?

2) Like option (1), but PyArg_ParseTuple is smart enough NOT to return any new objects when an error occurred. This basically answers the last question in option (1) -- ParseTuple is responsible.

3) Return loaned-out references to objects which have been tested for convertibility. Helper functions perform the conversion, and the caller then frees the reference. [ this is the model used in PyWin32 ]

   Code in PyWin32 typically looks like:

     if (!PyArg_ParseTuple(args, "O", &ob))
         return NULL;
     if ((unicodeOb = GiveMeUnicode(ob)) == NULL)
         return NULL;
     ...
     Py_DECREF(unicodeOb);

   [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ]

   In a "real" situation, the ParseTuple format would be "U" and the object would be type-tested for PyStringType or PyUnicodeType. Note that GiveMeUnicode() would also do a type-test, but it can't produce as *specific* an error as ParseTuple (e.g. "string/unicode object expected" vs "parameter 3 must be a string/unicode object").

Are there more options? Anybody?

All three of these avoid the secondary buffer. The last is cleanest w.r.t. keeping the existing "loaned references" behavior, but can get a bit wordy when you need to convert a bunch of string arguments. Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it would need to keep a "free list" in case an error occurred. Option (1) adds DECREF logic to callers to ensure they clean up. The add'l logic isn't much more than the other two options (the only change is adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..." condition). Note that the caller would probably need to initialize each object to NULL before calling ParseTuple.

Personally, I prefer (3) as it makes it very clear that a new object has been created and must be DECREF'd at some point. Also note that GiveMeUnicode() could accept a second argument for the type of decoding to do (or NULL meaning "UTF-8").

Oh: note there are equivalents of all options for going from unicode-to-string; the above is all about string-to-unicode. However, the tricky part of unicode-to-string is determining whether backwards compatibility will be a requirement. I.e., does existing code that uses the "t" format suddenly acquire the capability to accept a Unicode object? This obviously causes problems in all three options: since a new reference must be created to handle the situation, who DECREF's it? The old code certainly doesn't.

[ <IMO> I'm with Fredrik in saying "no, old code *doesn't* suddenly get the ability to accept a Unicode object." The Python code must use str() to do the encoding manually (until the old code is upgraded to one of the above three options). </IMO> ]

I think that's it for me. In the several years I've been thinking on this problem, I haven't come up with anything but the above three. There may be a whole new paradigm for argument parsing, but I haven't tried to think on that one (and just fit it in around ParseTuple).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
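[ Editor's note: option (3) as a fuller sketch. GiveMeUnicode is Greg's own descriptive placeholder, here given the second argument he suggests (NULL meaning "UTF-8"); do_something_with is likewise hypothetical. ]

    #include "Python.h"

    static PyObject *
    some_method(PyObject *self, PyObject *args)
    {
        PyObject *ob, *unicodeOb, *result;

        /* ParseTuple only type-checks and loans the reference out */
        if (!PyArg_ParseTuple(args, "O", &ob))
            return NULL;

        /* explicit conversion; the caller owns the new reference */
        unicodeOb = GiveMeUnicode(ob, NULL);   /* NULL: decode as UTF-8 */
        if (unicodeOb == NULL)
            return NULL;

        result = do_something_with(unicodeOb); /* hypothetical */
        Py_DECREF(unicodeOb);                  /* always released here */
        return result;
    }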

[Lamenting about PyArg_ParseTuple and managing memory buffers for String/Unicode conversions.]

So what is really wrong with Marc's proposal about the extra pointer on the Unicode object? And to double the carnage, why not add the equivalent native Unicode buffer to the PyString object? These would only ever be filled when requested by the conversion routines. They have no other effect than that their memory is managed by the object itself; they are simply a convenience to avoid having extension modules manage the conversion buffers.

The only overheads appear to be:

* The conversion buffers may be slightly (or much :-) longer-lived - i.e., they are not freed until the object itself is freed.
* The String object is slightly bigger, and slightly slower to destroy.

It appears to solve the problems, and the cost doesn't seem too high...

Mark.
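[ Editor's note: a sketch of the cached-buffer idea under discussion. The struct and field names are invented for illustration; the point is only that the object owns the converted form and frees it in its own dealloc, so extensions never manage it. ]

    #include "Python.h"

    typedef struct {
        PyObject_HEAD
        Py_UNICODE *str;      /* the canonical Unicode data */
        int length;
        char *utf8_cache;     /* NULL until a conversion is requested */
    } UnicodeObject_sketch;

    /* filled lazily by the "t#" machinery, e.g.: */
    static char *
    get_utf8(UnicodeObject_sketch *u)
    {
        if (u->utf8_cache == NULL)
            u->utf8_cache = encode_utf8(u->str, u->length); /* hypothetical */
        return u->utf8_cache;  /* freed only in the object's dealloc */
    }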
