harmful str(bytes)

Fri Oct 8 15:34:51 EDT 2010

Steven D'Aprano writes:
>On Fri, 08 Oct 2010 15:31:27 +0200, Hallvard B Furuseth wrote:
>> That's not the point - the point is that for 2.* code which _uses_ str
>> vs unicode, the equivalent 3.* code uses str vs bytes.  Yet not the same
>> way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str.  So
>> upgraded old code will have to expect both str and bytes.
>
> I'm sorry, this makes no sense to me. I've read it repeatedly, and I 
> still don't understand what you're trying to say.

OK, here is a simplified example after 2to3:

    try:                from urlparse     import urlparse, urlunparse # Python 2.6
    except ImportError: from urllib.parse import urlparse, urlunparse # Python 3.2a

    foo, bar = b"/foo", b"bar" # Data from network, bar normally empty

    # Statement inserted for 3.2 when urlparse below said TypeError
    if isinstance(foo, bytes): foo = foo.decode("ASCII")

    p = list(urlparse(foo))
    if bar: p[3] = bar
    print(urlunparse(p))

2.6 prints "/foo;bar", 3.2a prints "/foo;b'bar'"

You have a module which receives some strings/bytes, maybe data which
originates on the net or in a database.  The module _and its callers_
may date back to before the 'bytes' type, maybe before 'unicode'.
The module is supposed to work with this data and produce some 'str's
or bytes to output.  _Not_ a Python representation like "b'bar'".

The module doesn't always know which input is 'bytes' and which is
'str'.  Or the callers don't know what it expects, or haven't kept
track.  Maybe the input originated as bytes and was converted to
str at some point, maybe not.

Look at urllib/parse.py and its isinstance(<data>, <str or bytes>)
calls.  urlencode() looks particularly gross, though that one has code
which could be factored out.  They didn't catch everything either: I
posted this when a 2to3'ed module of mine produced URLs with "b'bar'".
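
As a rough sketch of what "factored out" could look like: the helper
below is hypothetical (the name to_text and the ASCII default are my
assumptions, not anything in urllib.parse).  It accepts str or bytes,
returns str, and raises for anything else instead of falling back to
a Python representation.

```python
def to_text(value, encoding="ascii"):
    """Return value as str: pass str through, decode bytes, reject the rest.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Raises UnicodeDecodeError if the data does not fit the encoding.
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

In the example above, writing p[3] = to_text(bar) would give every
element of p the same type before urlunparse sees it.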

In the pre-'unicode type' Python (was that early Python 2, or should
I have said Python 1?) that was a non-issue - it Just Worked, sans
possible charset issues.

In Python 2 with unicode, the module would get it right or raise an
exception, which helps the programmer fix any charset issues.

In Python 3, the module does not raise an exception, it produces
"b'bar'" when it was supposed to produce "bar".

>> In 2.*, str<->unicode conversion failed or produced the equivalent
>> character/byte data.  Yes, there could be charset problems if the
>> defaults were set up wrong, but that's a smaller problem than in 3.*. In
>> 3.*, the bytes->str conversion always _silently_ produces garbage.
>
> So you say, but I don't see it. Why is this garbage?

To the user of the module, stuff with Python syntax is garbage.  It
was supposed to be text/string data.

> >>> b = b'abc\xff'
> >>> str(b)
> "b'abc\\xff'"
>
> That's what I would expect from the str() function called with a bytes 
> argument. Since decoding bytes requires a codec, which you haven't given, 
> it can only return a string representation of the bytes.
>
> If you want to decode bytes into a string, you need to specify a codec:

Except I didn't intend to decode anything - I just intended to output
the contents of the string - which was stored in a 'bytes' object.
But __str__ got called because a lot of code does that.  It wasn't
even my code which did it.
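
A small Python 3 illustration of how that happens without anyone
writing str(...) on purpose.  The variable names are made up for this
sketch.  String formatting calls str() on its arguments, so the bytes
come out as Python syntax, while str.join refuses them with a
TypeError.

```python
param = b"bar"                       # bytes that arrived from the network

# Each of these converts the bytes object with str() behind the scenes.
formatted = "%s" % param
via_format = "{}".format(param)
explicit = str(param)

# An explicit decode gives the text that was actually wanted.
decoded = param.decode("ascii")

# By contrast, str.join does not convert: it rejects the bytes item.
try:
    ";".join(["/foo", param])
    join_failed = False
except TypeError:
    join_failed = True

print(formatted, via_format, explicit, decoded, join_failed)
```

Only the join case fails loudly.  The other three quietly produce
"b'bar'".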

There's often no obvious place to decide when to consider a stream of
data as raw bytes and when to consider it text, and no obvious time
to convert between bytes and str.  When writing a program, one simply
has to decide, as with network data (bytes) vs urllib URLs (str)
in my program.  And the decision is different from what one would
decide for when to use str and when to use unicode in Python 2.

In this case I'll report the urlunparse bug to python.org, but there'll be
a _lot_ of such code around.  And without an Exception getting raised,
it'll take time to find it.  So it looks like it'll be a long time
before I dare entrust my data to Python 3, except maybe with modules
written from scratch.

-- 
Hallvard