[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Ian Bicking ianb at colorstudy.com
Tue Sep 21 17:57:11 CEST 2010

On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> > What are the cases you believe will cause new mojibake?
> Calling operations like urlsplit on byte sequences in non-ASCII
> compatible encodings and operations like urljoin on byte sequences
> that are encoded with different encodings. These errors differ from
> the URL escaping errors you cite, since they can produce true mojibake
> (i.e. a byte sequence without a single consistent encoding), rather
> than merely non-compliant URLs. However, if someone has let their
> encodings get that badly out of whack in URL manipulation they're
> probably doomed anyway...

FWIW, while I understand the problems non-ASCII-compatible encodings can
create, I've never encountered them, perhaps because ASCII-compatible
encodings are so dominant.

There are ways you can get a URL (HTTP specifically) where there is no
notion of Unicode.  I think the use case everyone has in mind here is where
you get a URL from one of these sources, and you want to handle it.  I have
a hard time imagining the sequence of events that would lead to mojibake.
Naive parsing of a document in bytes couldn't do it, because if you have a
non-ASCII-compatible document your ASCII-based parsing will also fail (e.g.,
looking for b'href="(.*?)"').  I suppose if you did
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end
up with the problem.
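
As a rough illustration of the naive-parsing point (a sketch in Python 3 syntax, with a made-up document snippet and placeholder URL, not code from this thread): the same ASCII-based pattern finds the link in UTF-8 bytes but finds nothing at all in UTF-16 bytes, so the parse fails before any URL function is ever called.

```python
import re

# Hypothetical document snippet; the URL is only a placeholder.
html = '<a href="http://example.com/page">link</a>'

# The ASCII-based byte pattern mentioned in the text above.
HREF = re.compile(rb'href="(.*?)"')

# UTF-8 is ASCII-compatible, so the byte pattern matches.
utf8_hits = HREF.findall(html.encode("utf-8"))

# UTF-16 puts a NUL byte next to every ASCII character, so the
# literal bytes b'href="' never occur and nothing matches.
utf16_hits = HREF.findall(html.encode("utf-16-le"))

print(utf8_hits)
print(utf16_hits)
```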

All this is unrelated to the question, though -- a separate byte-oriented
function won't help any case I can think of.  If the programmer is
implementing something like
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because
they *want* to get bytes out.  So if it's named urlparse.urlsplit_bytes()
they'll just use that, with the same corruption.  Since bytes and text don't
interact well, the choice of bytes in and bytes out will be a deliberate
one.  *Or*, bytes will unintentionally come through, but that will just
delay the error a while when the bytes out don't work (e.g.,
urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)).  Delaying the
error is a little annoying, but a delayed error doesn't lead to mojibake.
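
A minimal sketch of that delayed error, assuming the bytes-in, bytes-out behaviour that urllib.parse has in current Python 3 (the example URLs are placeholders): the split on bytes succeeds, and the failure only shows up when the bytes result meets a text URL.

```python
from urllib.parse import urljoin, urlsplit

# Placeholder URLs, one text and one bytes.
text_url = "http://example.com/a/"
byte_url = b"http://example.org/b/c"

# Bytes in, bytes out: this step works and returns a bytes path.
path = urlsplit(byte_url).path

# Mixing the bytes path with a text base is rejected outright
# instead of being silently combined.
error = None
try:
    urljoin(text_url, path)
except TypeError as exc:
    error = exc

print(type(path).__name__, repr(error))
```

The error arrives one call later than it ideally would, but nothing corrupt is ever produced.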

Mojibake is caused by allowing bytes and text to intermix, and the
polymorphic functions as proposed don't add new dangers in that regard.
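
For reference, the corruption Nick describes in the quoted text (a byte sequence without a single consistent encoding) takes an explicit mix of encodings to produce. A minimal sketch using plain byte concatenation, with made-up URL pieces and a hypothetical helper named decodes_cleanly:

```python
# Two fragments encoded differently (placeholder values).
latin1_part = "http://example.com/café/".encode("latin-1")
utf8_part = "naïve".encode("utf-8")

# Joining them as raw bytes gives a sequence with no single encoding.
mixed = latin1_part + utf8_part


def decodes_cleanly(data, encoding):
    """Return True if data decodes under encoding without errors."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The Latin-1 byte for 'é' is not valid UTF-8 here.
utf8_ok = decodes_cleanly(mixed, "utf-8")

# Latin-1 accepts any byte, but the UTF-8 fragment comes out garbled.
latin1_text = mixed.decode("latin-1")

print(utf8_ok)
print(latin1_text)
```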

Ian Bicking  |  http://blog.ianbicking.org