[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Mon Sep 20 15:23:20 CEST 2010

On Mon, Sep 20, 2010 at 10:12 PM, Chris McDonough <chrism at plope.com> wrote:
> urllib.parse.urlparse/urllib.parse.urlsplit will never need to decode
> anything when passed bytes input.

Correct. Supporting manipulation of bytes directly is primarily a
speed hack for when an application wants to avoid the
decoding/encoding overhead needed to perform the operations in the
text domain when the fragments being manipulated are all already
correctly encoded ASCII text.

However, supporting direct manipulation of bytes *implicitly* in the
current functions is problematic, since it means that the function may
fail silently when given bytes that are encoded with an ASCII
incompatible codec (or which there are many, especially when it comes
to multibyte codecs and other stateful codecs). Even ASCII compatible
codecs are a potential source of hard to detect bugs, since using
different encodings for different fragments will lead directly to
mojibake.

Moving the raw bytes support out to separate APIs allows their
constraints to be spelled out clearly and for programmers to make a
conscious decision that that is what they want to do. The onus is then
on the programmer to get their encodings correct.

If we decide to add implicit support later, that's pretty easy (just
have urllib.parse.* delegate to urllib.parse.*b when given bytes).
Taking implicit support *away* after providing it, however, means
going through the whole deprecation song and dance. Given the choice,
I prefer the API design that allows me to more easily change my mind
later if I decide I made the wrong call.

> There's effectively already a "shadow" bytes-only API in the urlparse module in the form of the *_to_bytes and *_from_bytes functions in most places where it counts.

If by "most places where it counts" you mean "quote" and "unquote",
then sure. However, those two functions predate most of the work on
fixing the bytes/unicode issues in the OS facing libraries, so they
blur the lines more than may be desirable (although reading
http://bugs.python.org/issue3300 shows that there were a few other
constraints in play when it comes to those two operations, especially
those related to the encoding of the original URL *prior* to
percent-encoding for transmission over the wire).

Regardless, quoteb and unquoteb will both be strictly bytes->bytes
functions, whereas the existing quoting APIs attempt to deal with both
text encoding and URL quoting all at the same time (and become a fair
bit more complicated as a result).

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia