cStringIO unicode weirdness
John Machin
sjmachin at lexicon.net
Mon Jun 18 19:12:50 EDT 2007
On Jun 19, 8:56 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Python 2.5 (r25:51908, Oct 6 2006, 15:24:43)
> [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import StringIO, cStringIO
> >>> StringIO.StringIO('a').getvalue()
> 'a'
> >>> cStringIO.StringIO('a').getvalue()
> 'a'
> >>> StringIO.StringIO(u'a').getvalue()
> u'a'
> >>> cStringIO.StringIO(u'a').getvalue()
> 'a\x00\x00\x00'
> >>>
>
> I would have thought StringIO and cStringIO would return the
> same result for this ascii-encodeable string.
Looks like a bug to me.
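A guess at where the three NUL bytes come from, sketched in Python 3 because cStringIO does not exist there. The assumption, which nothing in this thread confirms, is that cStringIO copied the unicode object's raw internal buffer and that the interpreter was a UCS-4 little-endian build. If so, the bytes would be the same as the UTF-32-LE encoding of u'a':

```python
# Python 3 sketch. ASSUMPTION: the buggy value is the raw UCS-4
# little-endian buffer of u'a'; this is a guess, not a confirmed diagnosis.
raw = 'a'.encode('utf-32-le')   # one code point stored as 4 bytes, low byte first
buggy = b'a\x00\x00\x00'        # the bytes cStringIO returned in the session above

print(raw == buggy)   # True
print(len(raw))       # 4
```

On a narrow (UCS-2) build the same assumption would predict 'a\x00' instead.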
> Worse:
>
> >>> StringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
> u'a'
>
> does the right thing, but
>
> >>> cStringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
> u'a\x00\x00\x00'
>
> looks bogus. Am I misunderstanding something?
Not worse, and no more bogus than before: the damage was already done by
getvalue(). Note that an explicit design feature of UTF-8 is that ASCII
characters (ord(c) < 128) are unchanged by the transformation.
>>> 'a\x00\x00\x00'.encode('utf-8')
# IMPLICIT conversion to unicode (effectively .decode('ascii')), then
# encoding as UTF-8
'a\x00\x00\x00' # no change to original buggy result
>>>
>>> 'a\x00\x00\x00'.decode('utf-8')
u'a\x00\x00\x00' # as expected
>>>
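The same round trip with the implicit step written out, as a Python 3 sketch (the session above is Python 2, where str.encode() does the ASCII decode silently). The variable names are only for illustration:

```python
# Python 3 sketch: every conversion that Python 2 performed implicitly
# is spelled out here.
buggy = b'a\x00\x00\x00'          # the bytes cStringIO returned

text = buggy.decode('ascii')      # the step Python 2 did behind the scenes
encoded = text.encode('utf-8')    # code points below 128 become the same single byte

print(encoded == buggy)                             # True
print(encoded.decode('utf-8') == 'a\x00\x00\x00')   # True
```

NUL is an ordinary ASCII code point (0), so UTF-8 passes it through untouched, just like 'a'.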