Re: [XML-SIG] printing Unicode xml to StringIO
[Over to python-dev. Jaco noticed that writing Unicode objects to a StringIO object stopped working in 2.2, see http://mail.python.org/pipermail/xml-sig/2001-December/006891.html ]
Marc-Andre writes
Actually, I think that this is a bug in the documentation, not the code. StringIO and cStringIO were never meant to work on anything but strings and memory buffers.
IMO, "strings" should include both byte strings and Unicode strings. Mixing them may not be allowed, but that is a different story. In fact, there is an open bug (#216388) that cStringIO rejects Unicode objects. If that gets fixed, we get the funny scenario that StringIO rejects Unicode objects, whereas cStringIO accepts them.
The note that Fred added to the docs about StringIO's capability of storing Unicode in its buffer list is simply an artifact of the implementation.
There are many developers who take this note literally. Claiming that this was not intentional is a mistake.
Please use the .encode() method on Unicode objects before writing them to a StringIO object.
If you want to end up with a byte string, this is a good idea. But I think it is pointless to require encoding them when you want to end up with a Unicode string; you'd have to invoke unicode() on the result, for no apparent reason but a bug in the StringIO implementation.
Regards, Martin
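The round trip objected to above can be sketched as follows. This is only an illustration written for current Python, where `io.BytesIO` stands in for a byte-only in-memory file and `str` is the Unicode type; the UTF-8 encoding and the variable names are assumptions made for the example, not something prescribed in this thread.

```python
import io

text = "caf\u00e9"  # a Unicode string containing a non-ASCII character

# Byte-oriented buffer: the caller has to encode before every write ...
buf = io.BytesIO()
buf.write(text.encode("utf-8"))
buf.write(" au lait".encode("utf-8"))

# ... and decode again afterwards to get a Unicode string back.
raw = buf.getvalue()
result = raw.decode("utf-8")
```

If only a byte string is wanted, the final decode step is unnecessary; the extra conversion matters only when the caller wants Unicode out again.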
"Martin v. Loewis" wrote:
[Over to python-dev. Jaco noticed that writing Unicode objects to a StringIO object stopped working in 2.2, see http://mail.python.org/pipermail/xml-sig/2001-December/006891.html ]
Marc-Andre writes
Actually, I think that this is a bug in the documentation, not the code. StringIO and cStringIO were never meant to work on anything but strings and memory buffers.
IMO, "strings" should include both byte strings and Unicode strings. Mixing them may not be allowed, but that is a different story.
In fact, there is an open bug (#216388) that cStringIO rejects Unicode objects. If that gets fixed, we get the funny scenario that StringIO rejects Unicode objects, whereas cStringIO accepts them.
StringIO and cStringIO use different methods for storing the snippets: StringIO makes use of a buffer list which gets compressed every now and then, while cStringIO uses a raw memory buffer for this purpose.
Both of these implementations are targeted at providing a file-I/O-like interface to in-memory "files". Since Python file objects don't magically support Unicode, I wonder where the idea came from that StringIO/cStringIO should.
That patch I applied to StringIO/cStringIO for 2.2 was aimed at making these two more compatible with the standard Python file object. The latter uses the "s#" parser marker for .write() and thus can also accept memory buffers. This was previously not possible with either of the two StringIO implementations (StringIO.py failed when trying to join different buffer-compatible objects, cStringIO only accepted real string objects).
The note that Fred added to the docs about StringIO's capability of storing Unicode in its buffer list is simply an artifact of the implementation.
There are many developers who take this note literally. Claiming that this was not intentional is a mistake.
Please use the .encode() method on Unicode objects before writing them to a StringIO object.
If you want to end up with a byte string, this is a good idea.
That's the idea behind StringIO objects... they are in-memory file object emulators.
But I think it is pointless to require encoding them when you want to end up with a Unicode string; you'd have to invoke unicode() on the result, for no apparent reason but a bug in the StringIO implementation.
This is a different application. It should be easy enough to subclass StringIO as a UnicodeIO class and then have this class implement fast Unicode snippet joining. I'm not sure whether the same can be done with cStringIO's type.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
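A rough sketch of what such a UnicodeIO class with snippet joining might look like is given here. It is written for current Python, where `str` is already the Unicode type, and it does not subclass StringIO as suggested above; the class body, the method set and the sample data are all hypothetical and serve only to show the list-of-snippets technique.

```python
class UnicodeIO:
    """Hypothetical write-only in-memory text file built on a snippet list."""

    def __init__(self):
        self._snippets = []  # pieces written so far, joined lazily

    def write(self, s):
        if not isinstance(s, str):
            raise TypeError("UnicodeIO accepts only Unicode strings")
        self._snippets.append(s)

    def getvalue(self):
        # Collapse the list into one string so repeated calls stay cheap.
        if len(self._snippets) > 1:
            self._snippets = ["".join(self._snippets)]
        return self._snippets[0] if self._snippets else ""


out = UnicodeIO()
out.write("<name>")
out.write("M\u00fcller")
out.write("</name>")
xml = out.getvalue()
```

Appending to a list and joining once avoids building a new string on every write, which is the point of the "fast snippet joining" remark.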
Both of these implementations are targeted at providing a file-I/O-like interface to in-memory "files". Since Python file objects don't magically support Unicode, I wonder where the idea came from that StringIO/cStringIO should.
The exact source of this idea is unknown. However, there are many early references to it:
- codecs.open returns an "encoded file" which "will only accept ... Unicode objects". That is perhaps the earliest precedent of a file object supporting Unicode.
- At some point in time, you said that it is a bug that cStringIO does not support Unicode strings, see http://mail.python.org/pipermail/i18n-sig/2000-November/000550.html
- the documentation of StringIO suggests that they should accept Unicode.
So I would not blame the users for adopting far-off ideas, when the Python core itself suggests that these ideas are Pythonic.
That patch I applied to StringIO/cStringIO for 2.2 was aimed at making these two more compatible with the standard Python file object. The latter uses the "s#" parser marker for .write() and thus can also accept memory buffers. This was previously not possible with either of the two StringIO implementations (StringIO.py failed when trying to join different buffer-compatible objects, cStringIO only accepted real string objects).
There is nothing wrong with that. The patch should just have special-cased Unicode objects (and that bug can still be corrected).
Regards, Martin
Whoa!
- Since we added a note to the docs that StringIO supports Unicode, we clearly should continue to support that, and it's a bug if it doesn't.
- OTOH, Unicode for cStringIO should be considered at best a feature request. I don't mind if cStringIO doesn't support Unicode -- it never has, AFAIK, so it won't break much code. I don't believe it's much faster than StringIO, unless you use the C API (like cPickle does).
- Of course, when Unicode is supported, mixing ASCII and Unicode should be supported too. (But not necessarily mixing 8-bit strings containing characters in the range \200-\377, since there's no default encoding for this range.)
- Since this changed from 2.1 to 2.2, we should restore this capability in 2.2.1; I would say that 2.2.1 can't go out until this is fixed.
--Guido van Rossum (home page: http://www.python.org/~guido/)
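The mixing rule in the third point above can be illustrated with a small sketch. It uses current Python, with `bytes` in the role of 8-bit strings and `str` as Unicode; the helper name `to_text` and the sample data are hypothetical and are not part of StringIO.

```python
def to_text(snippet):
    """Return snippet as str; byte strings are accepted only if pure ASCII."""
    if isinstance(snippet, bytes):
        # Raises UnicodeDecodeError for any byte in the range 0x80-0xFF
        # (octal \200-\377), because no encoding is assumed for that range.
        return snippet.decode("ascii")
    return snippet


# ASCII byte strings and Unicode strings can be combined safely.
pieces = [to_text(b"<p>"), to_text("\u20ac 5"), to_text(b"</p>")]
joined = "".join(pieces)

# An 8-bit byte string with no declared encoding is rejected.
try:
    to_text(b"\xe9")
    high_byte_ok = True
except UnicodeDecodeError:
    high_byte_ok = False
```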
Guido van Rossum wrote:
Whoa!
- Since we added a note to the docs that StringIO supports Unicode, we clearly should continue to support that, and it's a bug if it doesn't.
I still believe that the docs are wrong, but never mind. I'll fix StringIO.py to continue to support Unicode in addition to strings and buffer objects. It's basically only about special-casing Unicode in the .write() method.
BTW, I was never aware of the doc changes in this area and the test suite didn't bring up the issues either.
- OTOH, Unicode for cStringIO should be considered at best a feature request. I don't mind if cStringIO doesn't support Unicode -- it never has, AFAIK, so it won't break much code. I don't believe it's much faster than StringIO, unless you use the C API (like cPickle does).
Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
- Of course, when Unicode is supported, mixing ASCII and Unicode should be supported too. (But not necessarily mixing 8-bit strings containing characters in the range \200-\377, since there's no default encoding for this range.)
In StringIO.py this is not much of a problem since it uses a list of snippets. Note that this is also why StringIO.py "supported" Unicode in the first place (and that's why I think it was more an artifact of the implementation than true intent).
- Since this changed from 2.1 to 2.2, we should restore this capability in 2.2.1; I would say that 2.2.1 can't go out until this is fixed.
-- Marc-Andre Lemburg
- Since we added a note to the docs that StringIO supports Unicode, we clearly should continue to support that, and it's a bug if it doesn't.
I still believe that the docs are wrong, but never mind. I'll fix StringIO.py to continue to support Unicode in addition to strings and buffer objects. It's basically only about special-casing Unicode in the .write() method.
Thanks.
BTW, I was never aware of the doc changes in this area and the test suite didn't bring up the issues either.
Can you please add something to the test suite that makes sure this feature works?
- OTOH, Unicode for cStringIO should be considered at best a feature request. I don't mind if cStringIO doesn't support Unicode -- it never has, AFAIK, so it won't break much code. I don't believe it's much faster than StringIO, unless you use the C API (like cPickle does).
Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
That's why I don't care much about it. :-)
- Of course, when Unicode is supported, mixing ASCII and Unicode should be supported too. (But not necessarily mixing 8-bit strings containing characters in the range \200-\377, since there's no default encoding for this range.)
In StringIO.py this is not much of a problem since it uses a list of snippets. Note that this is also why StringIO.py "supported" Unicode in the first place (and that's why I think it was more an artifact of the implementation than true intent).
But it was useful! :-)
- Since this changed from 2.1 to 2.2, we should restore this capability in 2.2.1; I would say that 2.2.1 can't go out until this is fixed.
Try to mark the checkin messages as "2.2.1 bugfix", for the 2.2.1 patch czar.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
- Since we added a note to the docs that StringIO supports Unicode, we clearly should continue to support that, and it's a bug if it doesn't.
I still believe that the docs are wrong, but never mind. I'll fix StringIO.py to continue to support Unicode in addition to strings and buffer objects. It's basically only about special-casing Unicode in the .write() method.
Thanks.
BTW, I was never aware of the doc changes in this area and the test suite didn't bring up the issues either.
Can you please add something to the test suite that makes sure this feature works?
- OTOH, Unicode for cStringIO should be considered at best a feature request. I don't mind if cStringIO doesn't support Unicode -- it never has, AFAIK, so it won't break much code. I don't believe it's much faster than StringIO, unless you use the C API (like cPickle does).
Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
That's why I don't care much about it. :-)
- Of course, when Unicode is supported, mixing ASCII and Unicode should be supported too. (But not necessarily mixing 8-bit strings containing characters in the range \200-\377, since there's no default encoding for this range.)
In StringIO.py this is not much of a problem since it uses a list of snippets. Note that this is also why StringIO.py "supported" Unicode in the first place (and that's why I think it was more an artifact of the implementation than true intent).
But it was useful! :-)
- Since this changed from 2.1 to 2.2, we should restore this capability in 2.2.1; I would say that 2.2.1 can't go out until this is fixed.
Try to mark the checkin messages as "2.2.1 bugfix", for the 2.2.1 patch czar.
Checked in.
-- Marc-Andre Lemburg
Unicode support in cStringIO would require a new implementation since the machinery uses raw byte buffers.
Not necessarily. You could add a flag saying whether those bytes represent Unicode strings or not. If they do, you pick an encoding of your choice (perhaps unicode-internal), and convert the existing bytes to that encoding on first sighting of a Unicode string (assuming that everything so far is in the system encoding). When returning the bytes to the user, you build a Unicode object if the flag is set. Of course, you'd still have to touch every method, to analyse the flag...
Regards, Martin
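The flag scheme described above might be sketched roughly like this. It is written in current Python rather than in C, it models only write() and getvalue(), and it assumes ASCII as the "system encoding" and UTF-32-LE as a stand-in for the internal encoding; the class name and both constants are hypothetical.

```python
class FlaggedBuffer:
    """Hypothetical raw-byte buffer that switches to Unicode on demand."""

    SYSTEM_ENCODING = "ascii"        # assumed encoding of plain byte writes
    INTERNAL_ENCODING = "utf-32-le"  # assumed fixed-width internal form

    def __init__(self):
        self._data = bytearray()
        self._is_unicode = False     # the flag from the proposal

    def write(self, s):
        if isinstance(s, str):
            if not self._is_unicode:
                # First Unicode string seen: re-encode what is stored so far.
                old = bytes(self._data).decode(self.SYSTEM_ENCODING)
                self._data = bytearray(old.encode(self.INTERNAL_ENCODING))
                self._is_unicode = True
            self._data += s.encode(self.INTERNAL_ENCODING)
        elif self._is_unicode:
            # Later byte writes are converted to the internal form as well.
            text = bytes(s).decode(self.SYSTEM_ENCODING)
            self._data += text.encode(self.INTERNAL_ENCODING)
        else:
            self._data += s

    def getvalue(self):
        # Build a Unicode object only if the flag is set.
        if self._is_unicode:
            return bytes(self._data).decode(self.INTERNAL_ENCODING)
        return bytes(self._data)


plain = FlaggedBuffer()
plain.write(b"abc")
plain_value = plain.getvalue()

mixed = FlaggedBuffer()
mixed.write(b"abc")
mixed.write("\u00e9")
mixed.write(b"def")
mixed_value = mixed.getvalue()
```

As noted above, a real implementation would have to consult the flag in every method, including read, seek and tell, which this sketch leaves out.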
participants (3)
- Guido van Rossum
- M.-A. Lemburg
- Martin v. Loewis