[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Sun Sep 19 04:03:03 CEST 2010

On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <nagle at animats.com> wrote:
> On 9/18/2010 2:29 AM, python-dev-request at python.org wrote:
>>
>> Polymorphic best practices [was: (Not) delaying the 3.2 release]
>
>   If you're hung up on this, try writing the user-level documentation
> first.  Your target audience is a working-level Web programmer, not
> someone who knows six programming languages and has a CS degree.
> If the explanation is too complex, so is the design.
>
>   Coding in this area is quite hard to do right.  There are
> issues with character set, HTML encoding, URL encoding, and
> internationalized domain names.  It's often done wrong;
> I recently found a Google service which botched it.
> Python libraries should strive to deliver textual data to the programmer
> in clean Unicode.  If someone needs the underlying wire representation
> it should be available, but not the default.

Even though URL byte sequences are defined as using only an ASCII
subset, I'm currently inclined to add raw bytes support to
urllib.parse by providing parallel APIs (e.g. urllib.parse.urlsplitb,
etc) rather than doing it implicitly in the normal functions.

My rationale is as follows:
- while URLs are *meant* to be encoded correctly as an ASCII subset,
the real world isn't always quite so tidy (i.e. applications treat
things as URLs that technically are not, because the encoding is wrong)
- separating the APIs forces the programmer to declare that they know
they're working with the raw bytes off the wire, in order to avoid the
decode/encode overhead that comes with working in the Unicode domain
- easier to change our minds later. Adding implicit bytes support to
the normal names can be done any time, but removing it would require
an extensive deprecation period
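
To make the parallel-API idea concrete, here is a minimal sketch of what
a bytes-only counterpart might look like. The name urlsplitb comes from
the proposal above, but the body is purely illustrative: it is not a
stdlib function, and the choice of latin-1 as a lossless byte-to-code-point
mapping is an assumption made only so the sketch can reuse urlsplit.

```python
from urllib.parse import urlsplit


def urlsplitb(url):
    """Hypothetical bytes-only counterpart to urlsplit (not a stdlib API)."""
    # Refuse text outright: the caller must state that they hold wire bytes.
    if not isinstance(url, (bytes, bytearray)):
        raise TypeError(
            "urlsplitb() requires bytes, got %s" % type(url).__name__
        )
    # Assumption: latin-1 maps each byte to exactly one code point, so the
    # round trip is lossless even for input that is not clean ASCII.
    parts = urlsplit(bytes(url).decode("latin-1"))
    # Return the five components (scheme, netloc, path, query, fragment)
    # as bytes again.
    return tuple(part.encode("latin-1") for part in parts)


clean = urlsplitb(b"http://example.com/a%20b?x=1#frag")
untidy = urlsplitb(b"http://example.com/caf\xe9")  # raw non-ASCII byte in path
```

Passing a str to this sketch raises TypeError instead of guessing, which
is the "declare what you are working with" property argued for above.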

Essentially, while I can see strong use cases for wanting to
manipulate URLs in wire format, I *don't* see strong use cases for
manipulating URLs without *knowing* whether they're in wire format
(encoded bytes) or display format (Unicode text). For some APIs that
work for arbitrary encodings (e.g. os.listdir) switching based on
argument type seems like a reasonable idea. For those that may
silently produce incorrect output for ASCII-incompatible encodings,
the os.environ/os.environb approach of separate, parallel APIs seems
like a better fit.
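
As a rough illustration of the two styles and of the failure mode, the
snippet below uses only existing os APIs. The temporary file name and
the "a/b" sample string are made up for the example, and the UTF-16
case is just one instance of an ASCII-incompatible encoding.

```python
import os
import tempfile

# Style 1: switch on argument type. os.listdir returns names of the same
# type as the path it was given.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    text_names = os.listdir(tmp)                # str in, str out
    byte_names = os.listdir(os.fsencode(tmp))   # bytes in, bytes out

# Style 2: parallel APIs. os.environ is the text view; os.environb is the
# bytes view and only exists where os.supports_bytes_environ is true.
env_key_types = {type(key) for key in os.environ}
if os.supports_bytes_environ:
    envb_key_types = {type(key) for key in os.environb}

# Failure mode: splitting bytes on an ASCII delimiter raises no error for
# an ASCII-incompatible encoding, but the pieces are no longer aligned on
# 2-byte UTF-16 code units.
raw = "a/b".encode("utf-16-le")
pieces = raw.split(b"/")
```

Here the second element of pieces has an odd length, so it is not valid
UTF-16 any more, yet nothing in the bytes-level split reported a problem.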

I could probably be persuaded to merge the APIs, but the email6
precedent suggests to me that separating the APIs better reflects the
mental model we're trying to encourage in programmers manipulating
text (i.e. the difference between the raw octet sequence and the text
character sequence/parsed data).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia