[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Ian Bicking
ianb at colorstudy.com
Tue Sep 21 17:57:11 CEST 2010
On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > What are the cases you believe will cause new mojibake?
>
> Calling operations like urlsplit on byte sequences in non-ASCII
> compatible encodings and operations like urljoin on byte sequences
> that are encoded with different encodings. These errors differ from
> the URL escaping errors you cite, since they can produce true mojibake
> (i.e. a byte sequence without a single consistent encoding), rather
> than merely non-compliant URLs. However, if someone has let their
> encodings get that badly out of whack in URL manipulation they're
> probably doomed anyway...
>
FWIW, while I understand the problems non-ASCII-compatible encodings can
create, I've never encountered them, perhaps because ASCII-compatible
encodings are so dominant.
There are ways you can get a URL (HTTP specifically) where there is no
notion of Unicode. I think the use case everyone has in mind here is where
you get a URL from one of these sources, and you want to handle it. I have
a hard time imagining the sequence of events that would lead to mojibake.
Naive parsing of a document in bytes couldn't do it, because if you have a
non-ASCII-compatible document your ASCII-based parsing will also fail (e.g.,
looking for b'href="(.*?)"'). I suppose if you did
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end
up with the problem.
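
A minimal sketch of the parsing point above, assuming a made-up HTML snippet and UTF-16-LE as the example of a non-ASCII-compatible encoding (neither comes from the thread). It only illustrates that an ASCII-based byte pattern never finds a URL in such a document, so there is nothing to hand to the URL functions in the first place:

```python
import re

# Hypothetical document fragment, used only for illustration.
html = '<a href="http://example.com/caf\u00e9">link</a>'

# ASCII-compatible encoding: ASCII characters keep their single-byte values.
utf8_doc = html.encode("utf-8")
# Non-ASCII-compatible encoding: each ASCII character is followed by a NUL byte.
utf16_doc = html.encode("utf-16-le")

# The naive byte-oriented pattern mentioned in the text.
pattern = re.compile(rb'href="(.*?)"')

utf8_match = pattern.search(utf8_doc)    # finds the attribute value as bytes
utf16_match = pattern.search(utf16_doc)  # no contiguous b'href="' to find

print(utf8_match.group(1) if utf8_match else None)
print(utf16_match)
```

In the UTF-8 case the extracted bytes can still be decoded consistently afterwards; in the UTF-16 case the search simply fails rather than producing a corrupted URL.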
All this is unrelated to the question, though -- a separate byte-oriented
function won't help any case I can think of. If the programmer is
implementing something like
urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because
they *want* to get bytes out. So if it's named urlparse.urlsplit_bytes()
they'll just use that, with the same corruption. Since bytes and text don't
interact well, the choice of bytes in and bytes out will be a deliberate
one. *Or*, bytes will unintentionally come through, but that will just
delay the error a while when the bytes out don't work (e.g.,
urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)). Delaying the
error is a little annoying, but a delayed error doesn't lead to mojibake.
Mojibake is caused by allowing bytes and text to intermix, and the
polymorphic functions as proposed don't add new dangers in that regard.
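
A small sketch of the delayed-error case, assuming a current Python 3 where these functions live in urllib.parse and accept either str or bytes (the thread's urlparse spelling is the older module name). The URLs are made up for illustration. Bytes in gives bytes out, and passing a bytes component back in alongside text raises an error rather than mixing the two:

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical inputs, named after the placeholders used in the text.
text_url = "http://example.com/a/"
byte_url = b"http://example.com/b/c?x=1"

# Bytes in, bytes out: the path component is a bytes object.
byte_path = urlsplit(byte_url).path

# Text in, text out.
text_path = urlsplit(text_url).path

# Mixing text and bytes fails with TypeError instead of silently
# combining values that may not share an encoding.
try:
    urljoin(text_url, byte_path)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(type(byte_path).__name__, type(text_path).__name__)
print(type(mixed_error).__name__)
```

The error surfaces at the urljoin call rather than at the original urlsplit call, which is the "delayed" part, but no mixed-encoding value is ever produced.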
--
Ian Bicking | http://blog.ianbicking.org