[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Mon Sep 20 14:12:18 CEST 2010

On Sun, 2010-09-19 at 12:03 +1000, Nick Coghlan wrote:
> On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <nagle at animats.com> wrote:
> > On 9/18/2010 2:29 AM, python-dev-request at python.org wrote:
> >>
> >> Polymorphic best practices [was: (Not) delaying the 3.2 release]
> >
> >   If you're hung up on this, try writing the user-level documentation
> > first.  Your target audience is a working-level Web programmer, not
> > someone who knows six programming languages and has a CS degree.
> > If the explanation is too complex, so is the design.
> >
> >   Coding in this area is quite hard to do right.  There are
> > issues with character set, HTML encoding, URL encoding, and
> > internationalized domain names.  It's often done wrong;
> > I recently found a Google service which botched it.
> > Python libraries should strive to deliver textual data to the programmer
> > in clean Unicode.  If someone needs the underlying wire representation
> > it should be available, but not the default.
> 
> Even though URL byte sequences are defined as using only an ASCII
> subset, I'm currently inclined to add raw bytes support to
> urllib.parse by providing parallel APIs (i.e. urllib.parse.urlsplitb,
> etc) rather than doing it implicitly in the normal functions.
> 
> My rationale is as follows:
> - while URLs are *meant* to be encoded correctly as an ASCII subset,
> the real world isn't always quite so tidy (i.e. applications treat as
> URLs things that technically are not because the encoding is wrong)
> - separating the APIs forces the programmer to declare that they know
> they're working with the raw bytes off the wire to avoid the
> decode/encode overhead that comes with working in the Unicode domain
> - easier to change our minds later. Adding implicit bytes support to
> the normal names can be done any time, but removing it would require
> an extensive deprecation period
> 
> Essentially, while I can see strong use cases for wanting to
> manipulate URLs in wire format, I *don't* see strong use cases for
> manipulating URLs without *knowing* whether they're in wire format
> (encoded bytes) or display format (Unicode text). For some APIs that
> work for arbitrary encodings (e.g. os.listdir) switching based on
> argument type seems like a reasonable idea. For those that may
> silently produce incorrect output for ASCII-incompatible encodings,
> the os.environ/os.environb pattern seems like a better approach.

urllib.parse.urlparse/urllib.parse.urlsplit will never need to decode
anything when passed bytes input.  Both could just put the bytes
comprising the percent-encoded components (the path and query string)
into their respective places in the parse results, just as they do now
for string input.  As far as I can tell, the only thing preventing them
from working against bytes right now is the use of string literals in
the source instead of literals whose type is dictated by the input.
There should not really be any need to create a
"urllib.parse.urlsplitb" unless the goal is to continue down the (not
great IMO) precedent already set by the shadow bytes API in
urllib.parse (*_to_bytes, *_from_bytes) or we just want to make it
deliberately harder to parse URLs.
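
To illustrate what I mean by literals dictated by the input type, here
is a rough sketch.  The names simple_urlsplit and _literals are made up
for this example and are not part of urllib.parse, and the parsing is
far cruder than what the real urlsplit does; the point is only that no
decoding is needed when the delimiters match the type of the input.

```python
def _literals(url):
    """Return the delimiter literals in the same type as the input."""
    if isinstance(url, bytes):
        return b":", b"//", b"/", b"?", b"#", b""
    return ":", "//", "/", "?", "#", ""


def simple_urlsplit(url):
    """Hypothetical, simplified splitter that works on str or bytes.

    Returns (scheme, netloc, path, query, fragment), all of the same
    type as the input.  Nothing is decoded or encoded along the way.
    """
    colon, slashes, slash, qmark, hashmark, empty = _literals(url)
    scheme = netloc = empty

    # Treat the text before the first ":" as a scheme only if it is
    # purely alphabetic (a simplification of the real grammar).
    head, sep, rest = url.partition(colon)
    if sep and head.isalpha():
        scheme, url = head.lower(), rest

    if url.startswith(slashes):
        url = url[2:]
        # The netloc ends at the first "/", "?" or "#".
        end = len(url)
        for delim in (slash, qmark, hashmark):
            pos = url.find(delim)
            if pos != -1:
                end = min(end, pos)
        netloc, url = url[:end], url[end:]

    url, _, fragment = url.partition(hashmark)
    path, _, query = url.partition(qmark)
    return scheme, netloc, path, query, fragment
```

Calling simple_urlsplit("http://example.com/a%20b?x=1#frag") and
simple_urlsplit(b"http://example.com/a%20b?x=1#frag") should give the
same five components, as str in the first case and bytes in the second.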

The only decoding that APIs in urllib.parse need to do on potential
bytes input is of percent encodings in the path and query components
(handled entirely by "unquote" and "unquote_plus", which already deal
in bytes under the hood).  The only encoding that urllib.parse needs to
do is of input to the "urlencode" and "quote" APIs.  "quote" already
deals with bytes as input under the hood.  "urlencode" does not, but it
might be changed to use the same strategy that "quote" does now (by
using a "urlencode_to_bytes" under the hood).
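
For reference, this is roughly how the existing bytes-aware pieces
behave, as far as I understand them.  The sample path is made up;
unquote_to_bytes, quote_from_bytes, quote and unquote are the actual
urllib.parse functions.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

encoded = "/caf%C3%A9%20menu"

# Percent-decoding to raw bytes: no character encoding is assumed.
raw = unquote_to_bytes(encoded)

# Percent-decoding to text: the bytes are then decoded (UTF-8 by default).
text = unquote(encoded)

# Percent-encoding accepts either str or bytes, but always returns str.
from_text = quote(text)
from_bytes = quote(raw)
explicit = quote_from_bytes(raw)
```

Note that the quoting side is not symmetric: bytes go in, but a str
comes out, so a caller working on wire data still has to encode the
result back to ASCII.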

However, I think any thought about "adding raw bytes support" is largely
moot at this point.  This pool has already been peed in.  There's
effectively already a "shadow" bytes-only API in the urllib.parse module
in the form of the *_to_bytes and *_from_bytes functions in most places
where it counts.  So as I see it, the options are:

1) continue the *_to_bytes and *_from_bytes pattern as necessary.

2) create a new module (urllib.parse2) that has only polymorphic
   functions.

#1 is not very pleasant to think about as a web developer if I need to
maintain a codebase compatible with both Python 2 and 3.  Neither is #2,
really, if I have to support both Python 3.1 and 3.2.  From my
(obviously limited) perspective, a more attractive third option is a
backwards-incompatible change in a later Python 3 version, in which
encoding-aware functions like quote, urlencode, and unquote_plus become
polymorphic, accepting both bytes and string objects and returning
same-typed data.
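
A minimal sketch of what "returning same-typed data" could look like,
built on top of the existing functions.  poly_quote and poly_unquote
are hypothetical names invented for this example; they are not a
proposal for the actual spelling, and the str branch simply keeps the
current UTF-8 default.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes


def poly_quote(value, safe="/"):
    """Percent-encode value, returning the same type that was passed in."""
    if isinstance(value, bytes):
        # quote_from_bytes returns an ASCII-only str, so encoding it
        # back to bytes cannot fail.
        return quote_from_bytes(value, safe=safe).encode("ascii")
    return quote(value, safe=safe)


def poly_unquote(value):
    """Percent-decode value, returning the same type that was passed in."""
    if isinstance(value, bytes):
        # Bytes in, bytes out: no character encoding is ever assumed.
        return unquote_to_bytes(value)
    return unquote(value)
```

With this, poly_unquote(poly_quote(b"a b\xff")) gives back the original
bytes even though they are not valid UTF-8, and the str round trip works
exactly as it does today.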

- C