Re: [XML-SIG] printing Unicode xml to StringIO

[Over to python-dev. Jaco noticed that writing Unicode objects to a StringIO object stopped working in 2.2, see http://mail.python.org/pipermail/xml-sig/2001-December/006891.html ] Marc-Andre writes
IMO, "strings" should include both byte strings and Unicode strings. Mixing them may not be allowed, but that is a different story. In fact, there is an open bug (#216388) that cStringIO rejects Unicode objects. If that gets fixed, we get the funny scenario that StringIO rejects Unicode objects, whereas cStringIO accepts them.
There are many developers who take this note literally. Claiming that this was not intentional is a mistake.
Please use the .encode() method on Unicode objects before writing them to a StringIO object.
If you want to end up with a byte string, this is a good idea. But I think it is pointless to require encoding them when you want to end up with a Unicode string; you'd have to invoke unicode() on the result, for no apparent reason but a bug in the StringIO implementation. Regards, Martin
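The round trip described above can be sketched as follows. This is only an illustration in present-day Python 3 (an assumption; the thread concerns Python 2.2 and its StringIO/cStringIO modules), with io.BytesIO standing in for a byte-oriented buffer and io.StringIO for a text-oriented one. The helper names via_byte_buffer and via_text_buffer are hypothetical.

```python
import io


def via_byte_buffer(snippets, encoding="utf-8"):
    """Encode every snippet before writing, then decode the joined result."""
    buf = io.BytesIO()
    for snippet in snippets:
        buf.write(snippet.encode(encoding))
    # The extra decode step is what the byte-only buffer forces on the caller.
    return buf.getvalue().decode(encoding)


def via_text_buffer(snippets):
    """Write the Unicode snippets directly; no encode/decode round trip."""
    buf = io.StringIO()
    for snippet in snippets:
        buf.write(snippet)
    return buf.getvalue()


parts = ["<name>", "L\u00f6wis", "</name>"]
assert via_byte_buffer(parts) == via_text_buffer(parts)
```

Both helpers return the same text; the first one simply pays for an encode on every write and a decode at the end.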

"Martin v. Loewis" wrote:
StringIO and cStringIO use different methods for storing the snippets: StringIO makes use of a buffer list which gets compressed every now and then, while cStringIO uses a raw memory buffer for this purpose. Both of these implementations are targeted at providing a file-IO-like interface to in-memory "files". Since Python file objects don't magically support Unicode, I wonder where the idea came from that StringIO/cStringIO should. The patch I applied to StringIO/cStringIO for 2.2 was aimed at making these two more compatible with the standard Python file object. The latter uses the "s#" parser marker for .write() and thus can also accept memory buffers. This was previously not possible with either of the two StringIO implementations (StringIO.py failed when trying to join different buffer-compatible objects, cStringIO only accepted real string objects).
That's the idea behind StringIO objects... they are in-memory file object emulators.
This is a different application. It should be easy enough to subclass StringIO as UnicodeIO class and then have this class implement fast Unicode snippet joining. I'm not sure whether the same can be done with cStringIO's type. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/
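A minimal sketch of the snippet-joining idea mentioned above, written for present-day Python 3 where str is the Unicode type (an assumption; the thread is about Python 2.2). Unlike the suggestion in the message, this hypothetical UnicodeIO class does not subclass the old StringIO module; it is a standalone, write-only buffer that only shows the "list of snippets, compressed on demand" storage scheme.

```python
class UnicodeIO:
    """Write-only in-memory text buffer that stores snippets in a list."""

    def __init__(self):
        self._snippets = []

    def write(self, text):
        # Accept Unicode text only; byte strings must be decoded by the caller.
        if not isinstance(text, str):
            raise TypeError(
                "UnicodeIO.write() expects str, got %s" % type(text).__name__
            )
        if text:
            self._snippets.append(text)

    def getvalue(self):
        # "Compress" the list: join all snippets once and keep the result,
        # so repeated getvalue() calls do not re-join the same pieces.
        if len(self._snippets) > 1:
            self._snippets = ["".join(self._snippets)]
        return self._snippets[0] if self._snippets else ""


out = UnicodeIO()
out.write("<doc>")
out.write("\u00fcber")
out.write("</doc>")
assert out.getvalue() == "<doc>\u00fcber</doc>"
```

Appending to a list and joining once avoids the repeated copying that string concatenation on every write would cause.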

The exact source of this idea is unknown. However, there are many early references to it:
- codecs.open returns an "encoded file" which "will only accept ... Unicode objects". That is perhaps the earliest precedent of a file object supporting Unicode.
- At some point in time, you said that it is a bug that cStringIO does not support Unicode strings, see http://mail.python.org/pipermail/i18n-sig/2000-November/000550.html
- The documentation of StringIO suggests that they should accept Unicode.
So I would not blame the users for adopting far-off ideas, when the Python core itself suggests that these ideas are Pythonic.
There is nothing wrong with that. The patch should just have special-cased Unicode objects (and that bug can still be corrected). Regards, Martin

Whoa!
- Since we added a note to the docs that StringIO supports Unicode, we clearly should continue to support that, and it's a bug if it doesn't.
- OTOH, Unicode for cStringIO should be considered at best a feature request. I don't mind if cStringIO doesn't support Unicode -- it never has, AFAIK, so it won't break much code. I don't believe it's much faster than StringIO, unless you use the C API (like cPickle does).
- Of course, when Unicode is supported, mixing ASCII and Unicode should be supported too. (But not necessarily mixing 8-bit strings containing characters in the range \200-\377, since there's no default encoding for this range.)
- Since this changed from 2.1 to 2.2, we should restore this capability in 2.2.1; I would say that 2.2.1 can't go out until this is fixed.
--Guido van Rossum (home page: http://www.python.org/~guido/)
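The mixing rule in the message above (ASCII byte strings may mix with Unicode, bytes in the range \200-\377 may not) can be illustrated with a small sketch. It is written for present-day Python 3, which is an assumption, and the helper name to_text is hypothetical; it is not taken from StringIO.py.

```python
def to_text(snippet):
    """Coerce a snippet to Unicode text, accepting only ASCII byte strings."""
    if isinstance(snippet, str):
        return snippet
    if isinstance(snippet, (bytes, bytearray)):
        # Bytes in the range \200-\377 (0x80-0xFF) have no default encoding,
        # so decoding as ASCII raises UnicodeDecodeError for them.
        return bytes(snippet).decode("ascii")
    raise TypeError("expected str or bytes, got %s" % type(snippet).__name__)


# Pure-ASCII bytes and Unicode text mix without trouble.
assert to_text(b"caf") + to_text("\u00e9") == "caf\u00e9"

# An 8-bit byte string outside ASCII is rejected rather than guessed at.
try:
    to_text(b"caf\xe9")
except UnicodeDecodeError:
    rejected = True
else:
    rejected = False
assert rejected
```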

Guido van Rossum wrote:
I still believe that the docs are wrong, but never mind. I'll fix StringIO.py to continue to support Unicode in addition to strings and buffer objects. It's basically only about special-casing Unicode in the .write() method. BTW, I was never aware of the doc changes in this area and the test suite didn't bring up the issues either.
Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
In StringIO.py this is not much of a problem since it uses a list of snippets. Note that this is also why StringIO.py "supported" Unicode in the first place (and that's why I think it was more an artifact of the implementation than true intent).
-- Marc-Andre Lemburg

Thanks.
BTW, I was never aware of the doc changes in this area and the test suite didn't bring up the issues either.
Can you please add something to the test suite that makes sure this feature works?
That's why I don't care much about it. :-)
But it was useful! :-)
Try to mark the checkin messages as "2.2.1 bugfix", for the 2.2.1 patch czar. --Guido van Rossum

Guido van Rossum wrote:
Checked in. -- Marc-Andre Lemburg

Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
Not necessarily. You could add a flag saying whether those bytes represent Unicode strings or not. If they do, you pick an encoding of your choice (perhaps unicode-internal), and convert the existing bytes to that encoding on first sighting of a Unicode string (assuming that everything so far is in the system encoding). When returning the bytes to the user, you build a Unicode object if the flag is set. Of course, you'd still have to touch every method to analyse the flag... Regards, Martin
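The flag approach outlined above could look roughly like this. It is a sketch in present-day Python 3, not the cStringIO C code: the class name FlaggedBuffer and its parameters are hypothetical, "ascii" is assumed as the system encoding, and UTF-8 is used as a stand-in for the unicode-internal encoding mentioned in the message.

```python
class FlaggedBuffer:
    """Raw byte buffer plus a flag telling whether the bytes hold Unicode."""

    def __init__(self, system_encoding="ascii", internal_encoding="utf-8"):
        self._buf = bytearray()
        self._is_unicode = False
        self._system_encoding = system_encoding
        self._internal_encoding = internal_encoding

    def write(self, data):
        if isinstance(data, str):
            if not self._is_unicode:
                # First sighting of Unicode: re-encode everything written so
                # far, assuming it was in the system encoding.
                text = self._buf.decode(self._system_encoding)
                self._buf = bytearray(text.encode(self._internal_encoding))
                self._is_unicode = True
            self._buf += data.encode(self._internal_encoding)
        elif isinstance(data, (bytes, bytearray)):
            raw = bytes(data)
            if self._is_unicode:
                # Later byte strings must be converted to the internal encoding.
                raw = raw.decode(self._system_encoding).encode(
                    self._internal_encoding
                )
            self._buf += raw
        else:
            raise TypeError("expected str or bytes")

    def getvalue(self):
        # Build a Unicode object only if the flag is set.
        if self._is_unicode:
            return self._buf.decode(self._internal_encoding)
        return bytes(self._buf)


buf = FlaggedBuffer()
buf.write(b"abc")
assert buf.getvalue() == b"abc"          # still a byte string
buf.write("\u00e9")
buf.write(b"def")
assert buf.getvalue() == "abc\u00e9def"  # now a Unicode string
```

Only write() and getvalue() are shown; as the message notes, every other method (read, seek, and so on) would also have to consult the flag.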
participants (3)
- Guido van Rossum
- M.-A. Lemburg
- Martin v. Loewis