Unicode support in getargs.c
I posted a question on Unicode support in getargs.c last month (working on a different project), but now that I'm trying to support unicode-based APIs more seriously I find that it leaves even more to be desired. I'd like to help fix this, but I need some direction on how things should be fixed. Here are some of the issues I ran into today:

- Unicode objects have a companion string object, meaning that you can pass a unicode object to an "s" format and have the right thing happen. String objects have no such accompanying unicode object, and I think they should have. Right now you cannot pass a string object when the C routine expects a unicode object.

- There is no unicode equivalent of "c", the single character.

- "u#" does something useful, but something completely different from what "s#" does. More to the point, it probably does something dangerous, if I understand correctly. If I write a C routine with a "u#" format and the Python code passes a string object, the string object will be used as a buffer object and its binary contents will be interpreted as unicode. If the argument in question is a filename this will produce very surprising results:-)

To sum things up: I'd like unicode objects to get a little more first-class citizenship, especially in the light of operating systems that are primarily (or exclusively) unicode based, such as Mac OS X or Windows CE.
Jack Jansen wrote:
I posted a question on Unicode support in getargs.c last month (working on a different project), but now that I'm trying to support unicode-based APIs more seriously I find that it leaves even more to be desired. I'd like to help fix this, but I need some direction on how things should be fixed.
Here are some of the issues I ran into today:

- Unicode objects have a companion string object, meaning that you can pass a unicode object to an "s" format and have the right thing happen. String objects have no such accompanying unicode object, and I think they should have. Right now you cannot pass a string object when the C routine expects a unicode object.
You can: parse the object and then pass it to PyUnicode_FromObject().
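(A minimal sketch of that pattern, assuming the Python 2.x C API of the era; the function name get_title is hypothetical:)

    #include <Python.h>

    /* Parse with "O", normalize to Unicode via PyUnicode_FromObject(),
       and drop the extra reference when done. */
    static PyObject *
    get_title(PyObject *self, PyObject *args)
    {
        PyObject *obj, *uni;

        if (!PyArg_ParseTuple(args, "O", &obj))
            return NULL;
        uni = PyUnicode_FromObject(obj);    /* new reference, or NULL */
        if (uni == NULL)
            return NULL;
        /* PyUnicode_AS_UNICODE(uni) and PyUnicode_GET_SIZE(uni) give
           the Py_UNICODE buffer and its length for the C API call. */
        Py_DECREF(uni);
        Py_INCREF(Py_None);
        return Py_None;
    }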
- There is no unicode equivalent of "c", the single character.

- "u#" does something useful, but something completely different from what "s#" does. More to the point, it probably does something dangerous, if I understand correctly. If I write a C routine with a "u#" format and the Python code passes a string object, the string object will be used as a buffer object and its binary contents will be interpreted as unicode. If the argument in question is a filename this will produce very surprising results:-)
True; "u#" does exactly the same as "s#" -- it interprets the input as binary buffer.
To sum things up: I'd like unicode objects to get a little more first-class citizenship, especially in the light of operating systems that are primarily (or exclusively) unicode based, such as Mac OS X or Windows CE.
You would be far better off using the Unicode API on the objects which are passed into the function rather than relying on the getargs parser to try to apply some magic to the input objects.

It might be worthwhile extending the parser markers a bit more, e.g. by introducing "us#" to return Unicode objects much like "es#" returns strings... I think we'd need some examples of use though before deciding what's the right way to do this ("es#" was implemented after a request by Mark Hammond to be able to handle Unicode file names for Win CE).

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
True; "u#" does exactly the same as "s#" -- it interprets the input as binary buffer.
It doesn't do exactly the same. If s# is applied to a Unicode object, it transparently invokes the default encoding, which is sensible. If u# is applied to a byte string, it does not apply the default encoding. Instead, it interprets the string "as-is".

I cannot see an application where this is useful, but I can see many applications where it is clearly wrong. IMO, u# cannot and should not be symmetric to s#. Instead, it should accept just Unicode objects, and raise TypeErrors for everything else.

Regards,
Martin
"Martin v. Loewis" wrote:
True; "u#" does exactly the same as "s#" -- it interprets the input as binary buffer.
It doesn't do exactly the same. If s# is applied to a Unicode object, it transparently invokes the default encoding, which is sensible. If u# is applied to a byte string, it does not apply the default encoding.
That's because the buffer interface on Unicode objects doesn't return the raw binary buffer. If you pass in a memory mapped file or a buffer object wrapping some memory area, u# will take the input as a raw binary stream. All this weird behaviour is needed to make Unicode objects behave well together with s#. The implementation of u# is, however, completely symmetric to that of s#.

I agree, though, that it would make more sense to special-case Unicode objects here and have u# return a pointer to the raw internal buffer of the Unicode object.

Jack will probably also need a way to say "decode this encoded object into Unicode using the encoding xyz". Something like the Unicode version of "es#". How about "eu#", which would pass through Unicode as-is while decoding all other objects according to the given encoding?

--
Marc-Andre Lemburg
That's because the buffer interface on Unicode objects doesn't return the raw binary buffer. If you pass in a memory mapped file or a buffer object wrapping some memory area, u# will take the input as a raw binary stream.
All this weird behaviour is needed to make Unicode objects behave well together with s#.
I don't believe this. Why would the implementation of u# have any effect on making s# work?
Jack will probably also need a way to say "decode this encoded object into Unicode using the encoding xyz". Something like the Unicode version of "es#". How about "eu#", which would pass through Unicode as-is while decoding all other objects according to the given encoding?
I'd like to see the requirements, in terms of real-world problems, before considering any extensions.

Regards,
Martin
"Martin v. Loewis" wrote:
That's because the buffer interface on Unicode objects doesn't return the raw binary buffer. If you pass in a memory mapped file or a buffer object wrapping some memory area, u# will take the input as a raw binary stream.
All this weird behaviour is needed to make Unicode objects behave well together with s#.
I don't believe this. Why would the implementation of u# have any effect on making s# work?
To make s# work, we had to map the read buffer interface to the encoded version of Unicode -- not the binary version, which would have been the "right" choice in terms of the buffer interface (s# maps to the read buffer interface, while t# maps to the character buffer interface). u# is simply a copy&paste implementation of s#, interpreting the results of the read buffer interface as a Py_UNICODE array.

As I mentioned in another mail, we should probably let u# pass through Unicode objects as-is without going through the read buffer interface. This functionality is clearly missing and should be added to make u# useful.
Jack will probably also need a way to say "decode this encoded object into Unicode using the encoding xyz". Something like the Unicode version of "es#". How about "eu#", which would pass through Unicode as-is while decoding all other objects according to the given encoding?
I'd like to see the requirements, in terms of real-world problems, before considering any extensions.
Agreed. Jack should post some examples of what he needs for his application.

--
Marc-Andre Lemburg
All this weird behaviour is needed to make Unicode objects behave well together with s#.
I don't believe this. Why would the implementation of u# have any effect on making s# work? [...] u# is simply a copy&paste implementation of s#, interpreting the results of the read buffer interface as a Py_UNICODE array.
Ok. That explains its history, but it also clarifies that changing the u# implementation has *no* effect whatsoever on the proper operation of s#. Therefore, I still think that u# should reject string objects, instead of silently doing the wrong thing.
As I mentioned in another mail, we should probably let u# pass through Unicode objects as-is without going through the read buffer interface.
Yes, that would be nice. The only use of u# I can see is that it gives you the number of Py_UNICODE characters, so that the caller doesn't have to look for the terminating NUL.

Regards,
Martin
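(To make the contrast concrete, a small sketch under the same era's API; the function name is hypothetical:)

    #include <Python.h>

    /* "u" yields only a NUL-terminated Py_UNICODE pointer; "u#" also
       yields an explicit length, so embedded NULs survive. */
    static PyObject *
    compare_markers(PyObject *self, PyObject *args)
    {
        Py_UNICODE *s;      /* from "u"  */
        Py_UNICODE *t;      /* from "u#" */
        int n;

        if (!PyArg_ParseTuple(args, "uu#", &s, &t, &n))
            return NULL;
        return PyInt_FromLong((long)n);
    }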
On Wed, 2 Jan 2002, Martin v. Loewis wrote:
Jack will probably also need a way to say "decode this encoded object into Unicode using the encoding xyz". Something like the Unicode version of "es#". How about "eu#", which would pass through Unicode as-is while decoding all other objects according to the given encoding?
I'd like to see the requirements, in terms of real-world problems, before considering any extensions.
I have a number of MacOSX APIs that expect Unicode buffers, passed as "long count, UniChar *buffer". I have the machinery in bgen to generate code for this, iff "u#" (or something else) would work the same as "s#", i.e. it returns you a pointer and a size, and it would work equally well for unicode objects as for classic strings (after conversion).

The trick with O and using PyUnicode_FromObject() may do the trick for me, as my code is generated, so a little more glue call doesn't really matter. But as a general solution it doesn't look right:

"How do I call a C routine with a string parameter?"
"Use the "s" format and you get the string pointer to pass."

"How do I call a C routine with a unicode string parameter?"
"Use O and PyUnicode_FromObject() and PyUnicode_AsUnicode and make sure you get all your decrefs right and....."

The "es#" is a very strange beast, and a similar "eu#" would help me a little, but it has some serious drawbacks. Aside from it being completely different from the other converters (being a prefix operator instead of a postfix one, and having a value-return argument), I would also have to pre-allocate the buffer in advance, and that sort of defeats the purpose.

--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++
http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
I have a number of MacOSX APIs that expect Unicode buffers, passed as "long count, UniChar *buffer".
Well, my first question would be: Are you sure that UniChar has the same underlying integral type as Py_UNICODE? If not, you lose. So you may need to do even more conversion.
I have the machinery in bgen to generate code for this, iff "u#" (or something else) would work the same as "s#", i.e. it returns you a pointer and a size, and it would work equally well for unicode objects as for classic strings (after conversion).
I see. u# could be made to work for Unicode objects alone, but it would have to reject string objects.
But as a general solution it doesn't look right:

"How do I call a C routine with a string parameter?"
"Use the "s" format and you get the string pointer to pass."

"How do I call a C routine with a unicode string parameter?"
For that, the answer is u. But you want the length also. So for that, the answer is u#. But your question is "How do I call a C routine with either a Unicode object or a string object, getting a reasonable Py_UNICODE* and the length?". For that, I'd recommend using O&, with a conversion function:

    PyObject *Py_UnicodeOrString(PyObject *o, void *ignored)
    {
        if (PyUnicode_Check(o)) {
            Py_INCREF(o);
            return o;
        }
        if (PyString_Check(o)) {
            return PyUnicode_FromObject(o);
        }
        PyErr_SetString(PyExc_TypeError, "unicode object expected");
        return NULL;
    }
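(One caveat, not from the thread itself: PyArg_ParseTuple's O& protocol expects a converter of type int (*)(PyObject *, void *) that fills in the given address and returns 1 for success, 0 for failure, so in practice the helper above needs a small adapter along these lines:)

    /* Editorial sketch: adapt Py_UnicodeOrString() to the O& converter
       protocol -- store a new reference through `addr`, return 1 on
       success and 0 on failure. */
    static int
    unicode_converter(PyObject *o, void *addr)
    {
        PyObject *u = Py_UnicodeOrString(o, NULL);
        if (u == NULL)
            return 0;
        *(PyObject **)addr = u;
        return 1;
    }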
"Use O and PyUnicode_FromObject() and PyUnicode_AsUnicode and make sure you get all your decrefs right and.....".
With the function above, this becomes:

Use O&, passing the function and a PyObject** to fill, using PyUnicode_AS_UNICODE and PyUnicode_GET_SIZE, and performing a single DECREF at the end [allowing an encoding to be specified is optional].

In this scenario, somebody *has* to deallocate memory; you cannot get around this. It is your choice whether it is Py_DECREF or PyMem_Free that you have to call (as with the "esomething" conversions); the DECREF is more efficient, as it will not copy a Unicode object.
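(Spelled out as a usage sketch, built on the hypothetical adapter above; wrap_api is a made-up name:)

    static PyObject *
    wrap_api(PyObject *self, PyObject *args)
    {
        PyObject *u;

        if (!PyArg_ParseTuple(args, "O&", unicode_converter, &u))
            return NULL;
        /* PyUnicode_AS_UNICODE(u) and PyUnicode_GET_SIZE(u) give the
           buffer and length for the call. */
        Py_DECREF(u);       /* the single deallocation step */
        Py_INCREF(Py_None);
        return Py_None;
    }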
The "es#" is a very strange beast, and a similar "eu#" would help me a little, but it has some serious drawbacks. Aside from it being completely different from the other converters (being a prefix operator in stead of a postfix one, and having a value-return argument) I would also have to pre-allocate the buffer in advance, and that sort of defeats the purpose.
You don't. If you set the buffer to NULL before invoking getargs, you have to PyMem_Free it afterwards.

Regards,
Martin
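(A sketch of that calling convention for "es#" under Python 2.x; the encoding is an arbitrary example and the function name is hypothetical:)

    #include <Python.h>

    /* With "es#", a NULL buffer pointer asks getargs to allocate the
       result buffer; the caller then owns it and must PyMem_Free it. */
    static PyObject *
    takes_encoded(PyObject *self, PyObject *args)
    {
        char *buf = NULL;   /* NULL => getargs allocates */
        int len;

        if (!PyArg_ParseTuple(args, "es#", "utf-8", &buf, &len))
            return NULL;
        /* ... use the len bytes at buf ... */
        PyMem_Free(buf);
        Py_INCREF(Py_None);
        return Py_None;
    }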
"Martin v. Loewis" wrote:
I have a number of MacOSX APIs that expect Unicode buffers, passed as "long count, UniChar *buffer".
Well, my first question would be: Are you sure that UniChar has the same underlying integral type as Py_UNICODE? If not, you lose.
So you may need to do even more conversion.
This should be the first thing to check. Also note that Python has two different flavors of Unicode support: UCS-2 and UCS-4, so you'll have to be careful about this too.
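(One way to guard against both pitfalls at compile time, assuming the Py_UNICODE_SIZE macro as provided by Python 2.2's headers; editorial sketch:)

    #include <Python.h>

    /* Refuse to build on a wide (UCS-4) Python if the platform's
       UniChar is a 16-bit type, as on Mac OS X. */
    #if Py_UNICODE_SIZE != 2
    #error "this module assumes a narrow (UCS-2) Python build"
    #endif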
I have the machinery in bgen to generate code for this, iff "u#" (or something else) would work the same as "s#", i.e. it returns you a pointer and a size, and it would work equally well for unicode objects as for classic strings (after conversion).
I see. u# could be made to work for Unicode objects alone, but it would have to reject string objects.
Martin, I don't agree here: string objects could hold binary UCS-2/UCS-4 data.

Jack, u# cannot auto-convert strings to Unicode, since this would require allocation of a temporary object and there's no logic there to free that object after usage. es# has logic in place which allows either copying the raw data to a buffer you provide or having it allocate a buffer of the right size for you. That's why I proposed to extend it to support Unicode raw data as well.
But as a general solution it doesn't look right:

"How do I call a C routine with a string parameter?"
"Use the "s" format and you get the string pointer to pass."

"How do I call a C routine with a unicode string parameter?"
For that, the answer is u. But you want the length also. So for that, the answer is u#. But your question is "How do I call a C routine with either a Unicode object or a string object, getting a reasonable Py_UNICODE* and the length?".
For that, I'd recommend using O&, with a conversion function:
    PyObject *Py_UnicodeOrString(PyObject *o, void *ignored)
    {
        if (PyUnicode_Check(o)) {
            Py_INCREF(o);
            return o;
        }
        if (PyString_Check(o)) {
            return PyUnicode_FromObject(o);
        }
        PyErr_SetString(PyExc_TypeError, "unicode object expected");
        return NULL;
    }
Martin, note that PyUnicode_FromObject() already does the Unicode pass-through (even more: it makes sure that you get a true Unicode object, not a subclass).
"Use O and PyUnicode_FromObject() and PyUnicode_AsUnicode and make sure you get all your decrefs right and.....".
With the function above, this becomes
Use O&, passing a PyObject**, the function, and a NULL pointer, using PyUnicode_AS_UNICODE and PyUnicode_SIZE, performing a single DECREF at the end [allowing to specify an encoding is optional]
In this scenario, somebody *has* to deallocate memory; you cannot get around this. It is your choice whether it is Py_DECREF or PyMem_Free that you have to call (as with the "esomething" conversions); the DECREF is more efficient, as it will not copy a Unicode object.
The "es#" is a very strange beast, and a similar "eu#" would help me a little, but it has some serious drawbacks. Aside from it being completely different from the other converters (being a prefix operator in stead of a postfix one, and having a value-return argument) I would also have to pre-allocate the buffer in advance, and that sort of defeats the purpose.
You don't. If you set the buffer to NULL before invoking getargs, you have to PyMem_Free it afterwards.
Right. Let me see if I can summarize this:

Jack wants to get string and Unicode objects converted to Unicode automagically and then receive a pointer to a Py_UNICODE buffer and a size.

The current solution for this is to use the "O" parser, fetch the object, pass it through PyUnicode_FromObject(), then use PyUnicode_GET_SIZE() and PyUnicode_AS_UNICODE() to access the Py_UNICODE buffer, and finally Py_DECREF() the object returned by PyUnicode_FromObject().

What I proposed was to extend the "es#" parser marker with a new modifier, "eu#", which does all of the above except that it either copies the Py_UNICODE data to a buffer you provide or to a newly allocated buffer which you then have to PyMem_Free() after usage.

How does this sound?

--
Marc-Andre Lemburg
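(For concreteness, a sketch of how the proposed "eu#" might read, by analogy with "es#". The marker never existed in getargs.c; this is purely an illustration of the proposal, with a hypothetical function name:)

    /* Hypothetical "eu#" usage, modeled on "es#": decode strings with
       the given encoding, pass Unicode through, and hand back the
       Py_UNICODE data in a buffer the caller must PyMem_Free(). */
    static PyObject *
    takes_text(PyObject *self, PyObject *args)
    {
        Py_UNICODE *buf = NULL;   /* NULL => getargs would allocate */
        int len;

        if (!PyArg_ParseTuple(args, "eu#", "utf-8", &buf, &len))
            return NULL;
        /* ... use the len Py_UNICODE units at buf ... */
        PyMem_Free(buf);
        Py_INCREF(Py_None);
        return Py_None;
    }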
I see. u# could be made to work for Unicode objects alone, but it would have to reject string objects.
Martin, I don't agree here: string objects could hold binary UCS-2/UCS-4 data.
They could. Most likely, they don't. Explicit is better than implicit: anybody wishing to pass UCS-2 binary data to a function expecting character strings should do

    function(unicode(data, "UCS-2BE"))   # or LE if appropriate
es# has logic in place which allows either copying the raw data to a buffer you provide or having it allocate a buffer of the right size for you. That's why I proposed to extend it to support Unicode raw data as well.
Even though es# is cleanly defined, it is still undesirable to use, IMO: it requires more copies of data than necessary. If explicit memory management is required, it should be exposed through Py_DECREF. That is easy to understand, and it allows immutable objects to be shared, thus avoiding copies.
    PyObject *Py_UnicodeOrString(PyObject *o, void *ignored)
    {
        if (PyUnicode_Check(o)) {
            Py_INCREF(o);
            return o;
        }
        if (PyString_Check(o)) {
            return PyUnicode_FromObject(o);
        }
        PyErr_SetString(PyExc_TypeError, "unicode object expected");
        return NULL;
    }
Martin, note that PyUnicode_FromObject() already does the Unicode pass-through (even more: it makes sure that you get a true Unicode object, not a subclass).
I noticed. However, I'd like Py_UnicodeOrString to fail if you are not passing a character string (and I'd see no problem in accepting Unicode subtypes without copying them). This is a minor point, though - I might have written

    PyObject *Py_UnicodeOrString(PyObject *o, void *ignored)
    {
        return PyUnicode_FromObject(o);
    }

as well.
Jack wants to get string and Unicode objects converted to Unicode automagically and then receive a pointer to a Py_UNICODE buffer and a size.
The current solution for this is to use the "O" parser, fetch the object, pass it through PyUnicode_FromObject(), then use PyUnicode_GET_SIZE() and PyUnicode_AS_UNICODE() to access the Py_UNICODE buffer, and finally Py_DECREF() the object returned by PyUnicode_FromObject().
That is the solution, although I would claim that using the O& parser is simpler and more flexible.
What I proposed was to extend the "es#" parser marker with a new modifier, "eu#", which does all of the above except that it either copies the Py_UNICODE data to a buffer you provide or to a newly allocated buffer which you then have to PyMem_Free() after usage.
How does this sound?
Terrible. It copies a Unicode object without any need. It also adds to the inflation of format specifiers for getargs; this inflation is terrible in itself.

Regards,
Martin
I'm going to jump out of this discussion for a while. Martin and Mark have a completely different view on Unicode than I do, apparently, and I think I should first try and see if I can use the current implementation.

For the record: my view of Unicode is really "ascii done right", i.e. a datatype that allows you to get richer characters than what 1960s ascii gives you. For this it should be as backward-compatible as possible, i.e. if some API expects a unicode filename and I pass "a.out" it should interpret it as u"a.out". All the converting to different charsets is icing on the cake; the number one priority should be that unicode is as compatible as possible with the 8-bit convention used on the platform (whatever it may be). No, make that the number 2 priority: the number one priority is compatibility with 7-bit ascii.

Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer. UnicodeObjects and StringObjects should behave pretty orthogonal to how FloatObjects and IntObjects behave.

--
Jack Jansen
For the record: my view of Unicode is really "ascii done right", i.e. a datatype that allows you to get richer characters than what 1960s ascii gives you.
Exactly, with the stress on *ASCII*. Almost everybody could agree on ASCII; it is the 8-bit character sets where the troubles start.
For this it should be as backward-compatible as possible, i.e. if some API expects a unicode filename and I pass "a.out" it should interpret it as u"a.out".
That works fine with the current API.
All the converting to different charsets is icing on the cake; the number one priority should be that unicode is as compatible as possible with the 8-bit convention used on the platform (whatever it may be).
The problem is that there are multiple conventions on many systems, and only the application can know which of these to apply.
Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer.
This is a conclusion I cannot agree with. Most strings are really binary, if you look at them closely enough :-)
UnicodeObjects and StringObjects should behave pretty orthogonal to how FloatObjects and IntObjects behave.
For the Python programmer: yes. For the C programmer: memory management makes that inherently difficult, a problem which you don't have for int vs. float.

Regards,
Martin
Jack Jansen wrote:
I'm going to jump out of this discussion for a while. Martin and Mark have a completely different view on Unicode than I do, apparently, and I think I should first try and see if I can use the current implementation.
For the record: my view of Unicode is really "ascii done right", i.e. a datatype that allows you to get richer characters than what 1960s ascii gives you. For this it should be as backward-compatible as possible, i.e. if some API expects a unicode filename and I pass "a.out" it should interpret it as u"a.out". All the converting to different charsets is icing on the cake; the number one priority should be that unicode is as compatible as possible with the 8-bit convention used on the platform (whatever it may be). No, make that the number 2 priority: the number one priority is compatibility with 7-bit ascii. Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer. UnicodeObjects and StringObjects should behave pretty orthogonal to how FloatObjects and IntObjects behave.
It would be nice if Unicode could be made to behave that way, but unfortunately the 8-bit world is so differentiated, with lots of different encodings, that not even Harry Potter would have much luck finding the right magic to apply.

Another problem is that of the getargs.c API itself: since it returns pointers to data buffers, auto-conversions (if at all possible) which involve temporary objects must be handled differently than normal Python string objects. Now, the question is whether you are willing to pay for the comfort of getting direct access to a Py_UNICODE buffer (or char buffer) with extra copy-action and additional PyMem_Free() cleanup overhead or not. The "O" parser marker doesn't provide any magic on its own, but also reduces the need for copying data and handling memory management in your APIs.

In my last message on this thread, I proposed to add "eu#", which returns a Py_UNICODE buffer, possibly decoding a string object using the given encoding first. As Martin noted, this option requires extra copying but simplifies the C coding somewhat.

--
Marc-Andre Lemburg
In my last message on this thread, I proposed to add "eu#", which returns a Py_UNICODE buffer, possibly decoding a string object using the given encoding first. As Martin noted, this option requires extra copying but simplifies the C coding somewhat.
Also, while it simplifies processing compared to "O", I cannot see any simplification compared to "O&". So I'd be more in favor of offering standard conversion functions for O& instead of inventing new getargs modifiers all the time.

This would also simplify creation of cross-version extension modules: people could just incorporate the code of the conversion function into their code base, trusting that O& has been available for ages.

Regards,
Martin
It might be worthwhile extending the parser markers a bit more, e.g. by introducing "us#" to return Unicode objects much like "es#" returns strings... I think we'd need some examples of use though before deciding what's the right way to do this ("es#" was implemented after a request by Mark Hammond to be able to handle Unicode file names for Win CE).
Actually, it was for Windows itself, allowing the nt module to use Unicode objects correctly for the platform.

Mark.
String objects have no such accompanying unicode object, and I think they should have.
No. That would either give you cyclic structures, or an ever-growing chain of unicode->string->unicode->string objects that could easily result in unacceptable memory consumption. Furthermore, I consider the existence of the embedded string object in a Unicode object a flaw in itself, as it relies on the default encoding. IMO, the default encoding shouldn't be used if possible, as it only serves the transition towards Unicode, and only in limited ways.
- There is no unicode equivalent of "c", the single character.
Why do you need that?
- "u#" does something useful, but something completely different from what "s#" does. More to the point, it probably does something dangerous, if I understand correctly. If I write a C routine with an "u#" format and the Python code passes a string object the string object will be used as a buffer object and its binary contents will be interpreted as unicode.
That sounds like a bug to me. Passing a string to u# most certainly does not do the right thing; it is bad that it does so silently.

OTOH, why do you need u#? Normally, you use s# if a string can have embedded null bytes; you do that if the string is "binary". For Unicode, that is useless: a Unicode string typically won't have any embedded null bytes, and it definitely isn't "binary".

Regards,
Martin
"JJ" == Jack Jansen
writes:
JJ> I'd like unicode objects to be get a little more first class JJ> citizenship, especially in the light of operating systems that JJ> are primarily (or exclusively) unicode based, such as Mac OS X JJ> or Windows CE, to sum things up. string/unicode unification? -Barry
participants (5)

- barry@zope.com
- Jack Jansen
- M.-A. Lemburg
- Mark Hammond
- Martin v. Loewis