cStringIO unicode weirdness

Mon Jun 18 19:12:50 EDT 2007

On Jun 19, 8:56 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>     Python 2.5 (r25:51908, Oct  6 2006, 15:24:43)
>     [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import StringIO, cStringIO
>     >>> StringIO.StringIO('a').getvalue()
>     'a'
>     >>> cStringIO.StringIO('a').getvalue()
>     'a'
>     >>> StringIO.StringIO(u'a').getvalue()
>     u'a'
>     >>> cStringIO.StringIO(u'a').getvalue()
>     'a\x00\x00\x00'
>     >>>
>
> I would have thought StringIO and cStringIO would return the
> same result for this ascii-encodeable string.

Looks like a bug to me.
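
One possible explanation, offered as a guess rather than a diagnosis: the
three trailing NULs are what the raw internal buffer of u'a' would look like
on a wide (UCS-4), little-endian build, which suggests cStringIO is reading
the unicode object's buffer instead of encoding it. A minimal sketch of that
guess, written for Python 3 so it runs on a current interpreter. Using the
utf-32-le codec as a stand-in for the internal buffer is an assumption here,
not something the session above confirms:

```python
# Assumption: on a wide (UCS-4), little-endian Python 2 build, the internal
# buffer of u'a' holds the same bytes that the utf-32-le codec produces.
text = u'a'
raw_buffer = text.encode('utf-32-le')  # 4 bytes per code point, no BOM

print(repr(raw_buffer))  # b'a\x00\x00\x00'
```

If the guess is right, a narrow (UCS-2) build would show one trailing NUL
instead of three.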

> Worse:
>
>     >>> StringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
>     u'a'
>
> does the right thing, but
>
>     >>> cStringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
>     u'a\x00\x00\x00'
>
> looks bogus.  Am I misunderstanding something?

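Not worse, and no more bogus than before: the damage was already done by
getvalue(). Note that an explicit design feature of UTF-8 is that ASCII
characters (ord(c) < 128) are unchanged by the encoding. NUL is an ASCII
character too, so the round trip just passes the bad bytes through.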

# IMPLICIT conversion to unicode (effectively .decode('ascii')), then
# encoding as UTF-8:
>>> 'a\x00\x00\x00'.encode('utf-8')
'a\x00\x00\x00' # no change to original buggy result
>>>
>>> 'a\x00\x00\x00'.decode('utf-8')
u'a\x00\x00\x00' # as expected
>>>
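
The same point can be checked on Python 3, where text and bytes are separate
types and there is no implicit conversion step. This is only a sketch: the
name `buggy` is made up for illustration and stands for the string that
cStringIO returned above.

```python
# Python 3 sketch: the buggy cStringIO result, written here as a text string.
buggy = 'a\x00\x00\x00'

# Every code point is below 128, so UTF-8 encodes each one as the same
# single byte that ASCII would use.
encoded = buggy.encode('utf-8')
decoded = encoded.decode('utf-8')

print(all(ord(c) < 128 for c in buggy))  # True
print(encoded == buggy.encode('ascii'))  # True
print(decoded == buggy)                  # True
```

So the encode/decode round trip neither repairs nor worsens the string; it
only carries the NULs along.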