[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Chris McDonough chrism at plope.com
Mon Sep 20 20:30:35 CEST 2010

On Mon, 2010-09-20 at 23:23 +1000, Nick Coghlan wrote:
> On Mon, Sep 20, 2010 at 10:12 PM, Chris McDonough <chrism at plope.com> wrote:
> > urllib.parse.urlparse/urllib.parse.urlsplit will never need to decode
> > anything when passed bytes input.
> Correct. Supporting manipulation of bytes directly is primarily a
> speed hack for when an application wants to avoid the
> decoding/encoding overhead needed to perform the operations in the
> text domain when the fragments being manipulated are all already
> correctly encoded ASCII text.

The urllib.parse.urlparse/urlsplit functions should never need to know
or care whether the input they're passed is correctly encoded.  They
actually don't care right now: both will happily consume non-ASCII
characters and spit out nonsense in the parse results.  If passed
garbage, they'll return garbage:

  >>> urlparse('http://www.cwi.nl:80/%7Eguido/LaPeña.html')
  ParseResult(scheme='http', netloc='www.cwi.nl:80', 
              path='/%7Eguido/LaPeña.html', params='', 
              query='', fragment='')

> However, supporting direct manipulation of bytes *implicitly* in the
> current functions is problematic, since it means that the function may
> fail silently when given bytes that are encoded with an ASCII
> incompatible codec (of which there are many, especially when it comes
> to multibyte codecs and other stateful codecs). Even ASCII compatible
> codecs are a potential source of hard to detect bugs, since using
> different encodings for different fragments will lead directly to
> mojibake.

The "path" component result above is potentially useless and broken, and
if it is inserted into a web page as a link, it may cause mojibake, but
urlparse doesn't (can't) complain.  As far as I can tell, there would be
no more and no less potential for mojibake if the same API could be fed
a bytes object.  The result can already be nonsense, and allowing bytes
as input doesn't add any greater potential to receive nonsense.
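To make that concrete, here's a minimal sketch of what bytes support could look like without any charset knowledge at all (the wrapper name is hypothetical, not a proposal for the stdlib): decode with latin-1, which maps every byte 0-255 to a code point losslessly, parse in the str domain, and re-encode.  Garbage in, the same garbage out, exactly as with str input today:

```python
# Hypothetical sketch: a bytes-polymorphic wrapper around urlparse.
# latin-1 round-trips every byte value unchanged, so the parser never
# needs to know or care what encoding the bytes are actually in.
from urllib.parse import urlparse

def urlparse_polymorphic(url):
    if isinstance(url, bytes):
        parts = urlparse(url.decode('latin-1'))
        # re-encode each component; the original bytes come back untouched
        return tuple(p.encode('latin-1') for p in parts)
    return urlparse(url)

raw = 'http://www.cwi.nl:80/%7Eguido/LaPeña.html'.encode('utf-8')
print(urlparse_polymorphic(raw)[2])  # b'/%7Eguido/LaPe\xc3\xb1a.html'
```

The path component comes back byte-for-byte identical to what went in, nonsense or not, which is all the str version guarantees anyway.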

Most APIs in urllib.parse exhibit the same behavior today, e.g. urljoin:

   >>> urljoin('http://gooñle.com', '%7Eguido/LaPeña.html')
   'http://gooñle.com/%7Eguido/LaPeña.html'

The resulting URL is total nonsense.

> Moving the raw bytes support out to separate APIs allows their
> constraints to be spelled out clearly and for programmers to make a
> conscious decision that that is what they want to do. The onus is then
> on the programmer to get their encodings correct.

I guess my argument is that the onus already *is* on the programmer to
get their encodings right.  They can just as easily screw up while using
str inputs.
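For instance, quote() will happily percent-encode the same character two different ways depending on the encoding you hand it, so a programmer mixing encodings in the str domain produces mojibake-bound fragments just as easily as one working with bytes:

```python
# Mixing encodings in the str domain: quote() percent-encodes the same
# character differently under different encodings, so str input is every
# bit as capable of producing mojibake as bytes input would be.
from urllib.parse import quote

path = quote('/LaPeña.html')                           # UTF-8 (default): ñ -> %C3%B1
query = quote('q=peña', safe='=', encoding='latin-1')  # Latin-1: ñ -> %F1

url = 'http://example.com' + path + '?' + query
print(url)  # http://example.com/LaPe%C3%B1a.html?q=pe%F1a
```

No consumer can decode both escapes in that URL with a single charset; the library can't detect the mistake, and nobody suggests it should.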

> If we decide to add implicit support later, that's pretty easy (just
> have urllib.parse.* delegate to urllib.parse.*b when given bytes).
> Taking implicit support *away* after providing it, however, means
> going through the whole deprecation song and dance. Given the choice,
> I prefer the API design that allows me to more easily change my mind
> later if I decide I made the wrong call.
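For what it's worth, the delegation Nick describes is a few lines per function.  A sketch (the wrapper name is hypothetical; urlsplit stands in for any of the str-only APIs): strictly decode bytes as ASCII, call the existing str implementation, and re-encode on the way out, so ASCII-incompatible input fails loudly instead of silently:

```python
# Hypothetical sketch of implicit bytes support by delegation: decode
# strictly as ASCII, reuse the str implementation, re-encode the result.
from urllib.parse import urlsplit

def urlsplit_implicit(url):
    if isinstance(url, bytes):
        # strict ASCII: non-ASCII bytes raise UnicodeDecodeError rather
        # than silently passing through with a wrong encoding
        parts = urlsplit(url.decode('ascii'))
        return tuple(p.encode('ascii') for p in parts)
    return urlsplit(url)

print(urlsplit_implicit(b'http://www.cwi.nl:80/%7Eguido/')[1])  # b'www.cwi.nl:80'
```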

Existing APIs save for "quote" don't really need to deal with charset
encodings at all, at least on any level that Python needs to care about.
The potential already exists to emit garbage which will turn into
mojibake from almost all existing APIs.  The only remaining issue seems
to be fear of making a design mistake while designing APIs.

IMO, having a separate module for all urllib.parse APIs, each designed
for only bytes input is a design mistake greater than any mistake that
could be made by allowing for both bytes and str input to existing APIs
and returning whatever type was passed.  The existence of such a module
will make it more difficult to maintain a codebase which straddles
Python 2 and Python 3.

- C
