[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Ian Bicking ianb at colorstudy.com
Tue Sep 21 17:57:11 CEST 2010


On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

>  > What are the cases you believe will cause new mojibake?
>
> Calling operations like urlsplit on byte sequences in non-ASCII
> compatible encodings and operations like urljoin on byte sequences
> that are encoded with different encodings. These errors differ from
> the URL escaping errors you cite, since they can produce true mojibake
> (i.e. a byte sequence without a single consistent encoding), rather
> than merely non-compliant URLs. However, if someone has let their
> encodings get that badly out of whack in URL manipulation they're
> probably doomed anyway...
>

FWIW, while I understand the problems non-ASCII-compatible encodings can
create, I've never encountered them, perhaps because ASCII-compatible
encodings are so dominant.

There are ways you can get a URL (HTTP specifically) where there is no
notion of Unicode.  I think the use case everyone has in mind here is where
you get a URL from one of these sources, and you want to handle it.  I have
a hard time imagining the sequence of events that would lead to mojibake.
Naive parsing of a document in bytes couldn't do it, because if you have a
non-ASCII-compatible document your ASCII-based parsing will also fail (e.g.,
looking for b'href="(.*?)"').  I suppose if you did
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end
up with the problem.
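
A minimal sketch of the parsing failure described above, assuming a made-up
one-line HTML document and UTF-16-LE as the non-ASCII-compatible encoding
(both are illustrative choices, not taken from the thread). It shows the
ASCII-based pattern finding the link in UTF-8 bytes and finding nothing in
UTF-16 bytes, so naive byte parsing fails before any URL is ever extracted.

```python
import re

# Hypothetical document; the URL and markup are illustrative only.
html = '<a href="http://example.com/caf\u00e9">link</a>'

# The ASCII-based pattern from the message above.
pattern = re.compile(b'href="(.*?)"')

utf8_doc = html.encode("utf-8")       # ASCII-compatible encoding
utf16_doc = html.encode("utf-16-le")  # not ASCII-compatible

# In UTF-8 the ASCII characters keep their single-byte values, so the
# pattern matches and yields the URL as bytes.
utf8_match = pattern.search(utf8_doc)

# In UTF-16 every ASCII character is followed by a NUL byte, so the
# contiguous bytes b'href="' never occur and the search finds nothing.
utf16_match = pattern.search(utf16_doc)
```

With the UTF-16 document the parser simply comes back empty-handed, which is
a visible failure rather than silently corrupted output.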

All this is unrelated to the question, though -- a separate byte-oriented
function won't help any case I can think of.  If the programmer is
implementing something like
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because
they *want* to get bytes out.  So if it's named urlparse.urlsplit_bytes()
they'll just use that, with the same corruption.  Since bytes and text don't
interact well, the choice of bytes in and bytes out will be a deliberate
one.  *Or*, bytes will unintentionally come through, but that will just
delay the error a while when the bytes out don't work (e.g.,
urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)).  Delaying the
error is a little annoying, but a delayed error doesn't lead to mojibake.
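
A hedged sketch of that delayed error, written against urllib.parse as it
exists in current Python 3 releases rather than the urlparse module named
above (that substitution, and the example URLs, are assumptions for
illustration). Bytes go in and bytes come out of urlsplit; the mismatch only
surfaces later, when the bytes path meets a text URL in urljoin.

```python
from urllib.parse import urljoin, urlsplit

# Illustrative inputs: one text URL and one URL that arrived as bytes.
text_url = "http://example.com/a/"
byte_url = b"http://example.org/b/c"

# Bytes in, bytes out: the path component is a bytes object.
byte_path = urlsplit(byte_url).path

# Mixing the str base with the bytes path is rejected outright,
# so the mistake shows up as an exception instead of corrupted data.
try:
    urljoin(text_url, byte_path)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```

The error arrives one call later than the original mistake, but it is an
exception, not a malformed URL.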

Mojibake is caused by allowing bytes and text to intermix, and the
polymorphic functions as proposed don't add new dangers in that regard.
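
For contrast, a small sketch of what "true mojibake" in the quoted sense
looks like: a byte sequence without a single consistent encoding. Plain byte
concatenation stands in for a URL join here, and the two encodings and the
example strings are assumptions chosen only to make the effect visible.

```python
# Two fragments that reached the program in different encodings.
base = "http://example.com/caf\u00e9/".encode("latin-1")
rel = "na\u00efve".encode("utf-8")

# Joining at the byte level raises no error at all.
joined = base + rel

# No single decoding recovers the intended text: UTF-8 rejects the
# Latin-1 byte, and Latin-1 misreads the two-byte UTF-8 sequence.
try:
    joined.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

as_latin1 = joined.decode("latin-1")
```

Here the damage is silent at the point of the join, which is what
distinguishes it from the delayed-but-loud failure of mixing bytes and text.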

-- 
Ian Bicking  |  http://blog.ianbicking.org