[Web-SIG] Fwd: bytes / unicode

Graham Dumpleton graham.dumpleton at gmail.com
Mon Jun 21 06:17:26 CEST 2010


Can you please join the Python WEB-SIG and continue the existing
conversation there.

  http://groups.google.com/group/python-web-sig?lnk=

At the time I was merely facilitating a discussion and am not an
expert on the issues.

I have cc'd the web-sig for those who still may be interested in this.

Graham


---------- Forwarded message ----------
From: Terry Reedy <tjreedy at udel.edu>
Date: 21 June 2010 13:56
Subject: Re: bytes / unicode
To:
Cc: graham.dumpleton at gmail.com


On 6/20/2010 9:33 PM, P.J. Eby wrote:
>
> At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:
>>
>> Do you have in mind any tools that could and should operate on both,
>> but do not?
>
>  From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

Thanks for the concrete examples in this and your other post.
I am cc-ing the author of the above.

> """The problem which arises is that unquoting of URLs in Python 3.X
> stdlib can only be done on unicode strings.

Actually, I believe this is an encoding rather than bytes versus unicode issue.

> If though a string contains non UTF-8 encoded characters it can fail."""

Which is to say, I believe, if the ascii text in the (unicode) string
has a % encoding of a byte that is not a legal utf-8 encoding of
anything.

The specific example is

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is white ? in dark diamond, indicating an error.
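
To make the failure concrete, here is a minimal sketch of what happens
to the byte behind %e0. It assumes only the stdlib functions
urllib.parse.unquote and urllib.parse.unquote_to_bytes; the variable
names are just for illustration.

```python
from urllib.parse import unquote, unquote_to_bytes

# %e0 unquotes to the single byte 0xE0.
raw = unquote_to_bytes('b%e0')

# 0xE0 is the lead byte of a three-byte UTF-8 sequence, so on its own
# it is invalid; errors='replace' substitutes U+FFFD for it.
as_utf8 = raw.decode('utf-8', errors='replace')

# Latin-1 maps every byte to a code point, so the same byte decodes
# cleanly there.
as_latin1 = raw.decode('latin-1')

print(repr(raw), repr(as_utf8), repr(as_latin1))

# unquote() with its defaults takes the UTF-8 path shown above.
print(repr(unquote('b%e0')))
```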

parse_qsl() splits that input on '=' and sends each piece to
urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by
their single-character equivalent." unquote has an encoding parameter
that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not
have an encoding parameter. If it did, and it passed that to unquote,
then the above example would become (simulated interaction)

>>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding "encoding='latin-1'"
to the unquote call.
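
The same effect can be sketched without editing the stdlib file. The
helper below is hypothetical (parse_qsl_with_encoding is not a stdlib
name); it is a simplified stand-in that splits only on '&' and '=' and
forwards encoding and errors to unquote(), which does accept them.

```python
from urllib.parse import unquote

def parse_qsl_with_encoding(qs, encoding='utf-8', errors='replace'):
    """Hypothetical helper: split a query string into (name, value)
    pairs and unquote each piece with a caller-chosen encoding."""
    result = []
    for pair in qs.split('&'):
        if not pair:
            continue  # skip empty segments such as a trailing '&'
        name, _, value = pair.partition('=')
        # '+' stands for a space in query strings; convert it before
        # resolving the %xx escapes.
        name = unquote(name.replace('+', ' '),
                       encoding=encoding, errors=errors)
        value = unquote(value.replace('+', ' '),
                        encoding=encoding, errors=errors)
        result.append((name, value))
    return result

print(parse_qsl_with_encoding('a=b%e0'))
print(parse_qsl_with_encoding('a=b%e0', encoding='latin-1'))
```

With the default encoding the value ends in the replacement character;
with encoding='latin-1' it ends in 'à'.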

Does this solve this problem?
Has anything like this been added for 3.2?
Should it be?

> I don't have any direct experience with the specific issue demonstrated
> in that post, but in the context of the discussion as a whole, I
> understood the overall issue as "if you pass bytes to certain stdlib
> functions, you might get back unicode, an explicit error, or (at least
> in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the
application of the new model, not the model itself.

Just for 'fun', I tried feeding bytes to the function.
>>> p.parse_qsl(b'a=b%e0')
Traceback (most recent call last):
 File "<pyshell#2>", line 1, in <module>
   p.parse_qsl(b'a=b%e0')
 File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
   pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to
split bytes with unicode is (now, at least) a mistake. This could be
'fixed' by replacing the typed literals with expressions that match
the type of the input. But I am not sure if that is sensible since the
next step is to unquote and decode to unicode anyway. I just do not
know the use case.
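
For what it is worth, a sketch of what "expressions that match the type
of the input" might look like for just the splitting step. split_pairs
is a hypothetical name, and this says nothing about whether the later
unquote step should accept bytes.

```python
def split_pairs(qs):
    """Hypothetical helper: split a query string on '&' and ';' using
    separators of the same type (str or bytes) as the input."""
    if isinstance(qs, bytes):
        amp, semi = b'&', b';'
    else:
        amp, semi = '&', ';'
    # Same comprehension as in the traceback above, but with
    # separators chosen to match the input type.
    return [s2 for s1 in qs.split(amp) for s2 in s1.split(semi)]

print(split_pairs('a=b%e0&c=d;e=f'))
print(split_pairs(b'a=b%e0&c=d;e=f'))
```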

Terry Jan Reedy

