[Python-Dev] Re: [XML-SIG] printing Unicode xml to StringIO

M.-A. Lemburg mal@lemburg.com
Fri, 28 Dec 2001 11:50:25 +0100


"Martin v. Loewis" wrote:
> 
> [Over to python-dev. Jaco noticed that writing Unicode objects to
>  a StringIO object stopped working in 2.2, see
> http://mail.python.org/pipermail/xml-sig/2001-December/006891.html
> ]
> 
> Marc-Andre writes
> > Actually, I think that this is a bug in the documentation, not the
> > code. StringIO and cStringIO were never meant to work on anything but
> > strings and memory buffers.
> 
> IMO, "strings" should include both byte strings and Unicode strings.
> Mixing them may not be allowed, but that is a different story.
> 
> In fact, there is an open bug (#216388) that cStringIO rejects Unicode
> objects. If that gets fixed, we get the funny scenario that StringIO
> rejects Unicode objects, whereas cStringIO accepts them.

StringIO and cStringIO use different methods for storing the
snippets: StringIO makes use of a buffer list which gets
compressed every now and then, while cStringIO uses a raw
memory buffer for this purpose.
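
To make the difference concrete, here is a rough sketch of the two
storage strategies. It is not the actual StringIO.py or cStringIO
source; the class names BufferListIO and RawBufferIO are made up for
illustration, and the code is written for a current Python 3
interpreter rather than 2.2.

```python
class BufferListIO:
    """Hypothetical sketch: keep snippets in a list and join them lazily."""

    def __init__(self):
        self.buflist = []

    def write(self, snippet):
        # Appending is cheap; no copying happens at write time.
        self.buflist.append(snippet)

    def getvalue(self):
        # "Compress" the list into a single snippet so that repeated
        # calls do not join the same pieces over and over.
        if len(self.buflist) > 1:
            self.buflist = ["".join(self.buflist)]
        return self.buflist[0] if self.buflist else ""


class RawBufferIO:
    """Hypothetical sketch: copy every write into one growing byte buffer."""

    def __init__(self):
        self.buf = bytearray()

    def write(self, data):
        # Data is copied into the buffer immediately.
        self.buf += data

    def getvalue(self):
        return bytes(self.buf)
```

The list-based variant can hold whatever objects it is handed until
join time, which is why type problems only surface late; the raw
buffer has to commit to bytes on every write.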

Both of these implementations are targeted at providing
a file-like I/O interface to in-memory "files". Since Python
file objects don't magically support Unicode, I wonder where the
idea came from that StringIO/cStringIO should.

The patch I applied to StringIO/cStringIO for 2.2 was
aimed at making these two more compatible with the standard
Python file object. The latter uses the "s#" parser
marker for .write() and thus can also accept memory
buffers. This was previously not possible with either
of the two StringIO implementations (StringIO.py failed
when trying to join different buffer-compatible objects;
cStringIO only accepted real string objects).
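
As an illustration of "accepts anything buffer compatible, but not
Unicode", the snippet below uses io.BytesIO from a current Python 3
as a stand-in. That substitution is an assumption made for the sake
of a runnable example; it is not the 2.2 StringIO/cStringIO code and
says nothing about how "s#" is implemented.

```python
import io

buf = io.BytesIO()

# Any object exposing a byte buffer is accepted by .write().
buf.write(b"plain bytes ")
buf.write(bytearray(b"bytearray "))
buf.write(memoryview(b"memoryview"))
data = buf.getvalue()

# Text is rejected rather than silently encoded.
try:
    buf.write("text")
    accepted_text = True
except TypeError:
    accepted_text = False
```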

> > The note that Fred added to the docs about StringIO's capability of
> > storing Unicode in its buffer list is simply an artifact of the
> > implementation.
> 
> There are many developers who take this note literally. Claiming that
> this was not intentional is a mistake.
>
> > Please use the .encode() method on Unicode objects before writing
> > them to a StringIO object.
> 
> If you want to end up with a byte string, this is a good idea. 

That's the idea behind StringIO objects... they are in-memory file 
object emulators.
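
A minimal sketch of the encode-before-write approach, again using
io.BytesIO from a current Python 3 as a stand-in for the in-memory
file. The choice of UTF-8 is only an assumption for the example; in
2.2 the spelling would be StringIO.StringIO for the file and
unicode(raw, "utf-8") for the final conversion.

```python
import io

text = "caf\u00e9 <tag/>"

# Encode explicitly, then write the resulting byte string.
out = io.BytesIO()
out.write(text.encode("utf-8"))
raw = out.getvalue()

# Getting text back requires an explicit decode with the same encoding.
round_tripped = raw.decode("utf-8")
```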

> But I
> think it is pointless to require encoding them when you want to end up
> with a Unicode string; you'd have to invoke unicode() on the result,
> for no apparent reason but a bug in the StringIO implementation.

This is a different application. It should be easy enough to
subclass StringIO as a UnicodeIO class and then have this class
implement fast Unicode snippet joining. I'm not sure whether
the same can be done with cStringIO's type.
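
Something along these lines is what I have in mind. This is only a
sketch: UnicodeIO does not exist anywhere, it is written as a
standalone class for a current Python 3 (where str is the Unicode
type) instead of as a real StringIO subclass, and it implements only
write() and getvalue().

```python
class UnicodeIO:
    """Hypothetical text-only in-memory file: collects Unicode snippets."""

    def __init__(self):
        self._snippets = []

    def write(self, text):
        # Refuse byte strings so text and bytes are never mixed.
        if not isinstance(text, str):
            raise TypeError(
                "UnicodeIO.write() expects text, got %s" % type(text).__name__
            )
        self._snippets.append(text)
        return len(text)

    def getvalue(self):
        # One join over all snippets; the result is a Unicode string,
        # so no encode/decode round trip is needed.
        return "".join(self._snippets)
```

Usage would then be u = UnicodeIO(); u.write(some_text);
result = u.getvalue(), with no .encode() on the way in and no
conversion on the way out.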

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/