[Python-bugs-list] [ python-Bugs-547537 ] cStringIO mangles Unicode

noreply@sourceforge.net
Sat, 27 Apr 2002 08:13:29 -0700


Bugs item #547537, was opened at 2002-04-23 12:52
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=547537&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Guido van Rossum (gvanrossum)
Assigned to: M.-A. Lemburg (lemburg)
Summary: cStringIO mangles Unicode

Initial Comment:
The last few comments added to bug 216388 indicate a
new problem in cStringIO. Rather than abusing that bug
report, I'm opening a new one here. The problem is that
cStringIO now accepts Unicode strings to write(), but
when you use this, getvalue() returns binary garbage.
The cause is apparently MAL's checkin for cStringIO
2.30, which enabled read buffers.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-27 15:13

Message:
Logged In: YES 
user_id=38388

Another note: the bug title is wrong. cStringIO doesn't
mangle Unicode; it just returns the raw binary data. Not
that this is of much use, but it's in sync with what the
file object does.
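
To illustrate what "raw binary data" can look like, here is a small
Python 3 sketch. It assumes a narrow (UCS-2), little-endian build and
approximates the internal buffer of a Unicode string by encoding it as
UTF-16-LE. That is an assumption made for illustration only; the actual
bytes depend on the build's Unicode width and byte order.

```python
# Sketch only: approximate the internal buffer of a Unicode string on a
# narrow (UCS-2), little-endian build by encoding it as UTF-16-LE.
text = "abc"
raw = text.encode("utf-16-le")   # stand-in for the raw buffer that "s#" exposes
print(repr(raw))                 # b'a\x00b\x00c\x00'
```

Seen this way, the result of getvalue() is not scrambled text but the
string's in-memory representation, which is rarely what the caller wants.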

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-27 15:02

Message:
Logged In: YES 
user_id=38388

The idea behind ripping out the old string-only approach was to make
cStringIO more compatible with the file object implementation.

Rather than switching from s# to t#, the cStringIO object
should maintain a binary switch just like the file
object does and then use s# for pseudo files opened
in binary mode (default) and t# for text mode ones.
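
A rough sketch of such a switch, written in Python 3 rather than in the
C of cStringIO, might look like the following. The class name
PseudoFile, the binary flag, and DEFAULT_ENCODING are hypothetical names
chosen for illustration; this is not the cStringIO API, and the mapping
to "s#" and "t#" is only a loose analogy.

```python
import io

DEFAULT_ENCODING = "ascii"  # assumption: stands in for the default encoding


class PseudoFile:
    """Hypothetical pseudo file with a binary/text switch (not the cStringIO API)."""

    def __init__(self, binary=True):
        self.binary = binary
        self._buf = io.BytesIO()

    def write(self, data):
        if self.binary:
            # Binary mode: take the object's raw buffer as-is (loosely like "s#").
            # In Python 3, str has no buffer interface, so text raises TypeError.
            self._buf.write(memoryview(data))
        else:
            # Text mode: accept text only if it fits the default encoding
            # (loosely like "t#"); other input is written unchanged.
            if isinstance(data, str):
                data = data.encode(DEFAULT_ENCODING)
            self._buf.write(data)

    def getvalue(self):
        return self._buf.getvalue()
```

With binary=True the object stores whatever bytes it is handed, while
binary=False only lets through text that the default encoding can
represent.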

Note that in any case, Unicode should be explicitly
encoded before writing it to a file. 
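
As a minimal illustration of explicit encoding, the Python 3 sketch
below uses io.BytesIO as a stand-in for a binary pseudo file and UTF-8
as an arbitrarily chosen encoding; both are assumptions, not part of
this report.

```python
import io

buf = io.BytesIO()                 # stand-in for a binary pseudo file
text = "Gr\u00fc\u00dfe"           # non-ASCII sample text
buf.write(text.encode("utf-8"))    # encode explicitly before writing

# Reading back with the same encoding recovers the original text.
assert buf.getvalue().decode("utf-8") == text
```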

Simply switching to t# would cause compatibility 
problems, since a different buffer API would be used
for all input objects.



----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-04-26 21:08

Message:
Logged In: YES 
user_id=6380

Should I just check this in? It looks pretty safe to me...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-04-23 12:59

Message:
Logged In: YES 
user_id=6380

I wonder if perhaps the fix is as simple as using "t#"
instead of "s#" in the PyArg_... format string in P_write().
That accepts Unicode strings as args to write() only when
they are ASCII (actually, it uses the default encoding).

Marc-Andre, can you explain the reason for the change in the
first place (other than fixing a dubious dependency on
PyString_GetSize() raising an exception for a non-string
object)?
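
The effect of accepting Unicode only when it fits the default encoding
can be approximated in Python 3 as below. The helper name
accept_default_encoded is hypothetical, and "ascii" is assumed as the
default encoding; the real check would happen in C during argument
parsing.

```python
def accept_default_encoded(arg, default_encoding="ascii"):
    """Hypothetical helper: return bytes for text that fits the default encoding."""
    # Raises UnicodeEncodeError when the text cannot be represented.
    return arg.encode(default_encoding)


print(accept_default_encoded("plain ASCII"))   # accepted
try:
    accept_default_encoded("caf\u00e9")        # contains a non-ASCII character
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```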

----------------------------------------------------------------------
