[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Sun Sep 19 04:18:58 CEST 2010

On 9/18/2010 10:03 PM, Nick Coghlan wrote:
> On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <nagle at animats.com> wrote:
>> On 9/18/2010 2:29 AM, python-dev-request at python.org wrote:
>>>
>>> Polymorphic best practices [was: (Not) delaying the 3.2 release]
>>
>>   If you're hung up on this, try writing the user-level documentation
>> first.  Your target audience is a working-level Web programmer, not
>> someone who knows six programming languages and has a CS degree.
>> If the explanation is too complex, so is the design.
>>
>>   Coding in this area is quite hard to do right.  There are
>> issues with character set, HTML encoding, URL encoding, and
>> internationalized domain names.  It's often done wrong;
>> I recently found a Google service which botched it.
>> Python libraries should strive to deliver textual data to the programmer
>> in clean Unicode.  If someone needs the underlying wire representation
>> it should be available, but not the default.
> 
> Even though URL byte sequences are defined as using only an ASCII
> subset, I'm currently inclined to add raw bytes support to
> urllib.parse by providing parallel APIs (i.e. urllib.parse.urlsplitb,
> etc) rather than doing it implicitly in the normal functions.
> 
> My rationale is as follows:
> - while URLs are *meant* to be encoded correctly as an ASCII subset,
> the real world isn't always quite so tidy (i.e. applications treat as
> URLs things that technically are not because the encoding is wrong)
> - separating the APIs forces the programmer to declare that they know
> they're working with the raw bytes off the wire to avoid the
> decode/encode overhead that comes with working in the Unicode domain
> - easier to change our minds later. Adding implicit bytes support to
> the normal names can be done any time, but removing it would require
> an extensive deprecation period
> 
> Essentially, while I can see strong use cases for wanting to
> manipulate URLs in wire format, I *don't* see strong use cases for
> manipulating URLs without *knowing* whether they're in wire format
> (encoded bytes) or display format (Unicode text). For some APIs that
> work for arbitrary encodings (e.g. os.listdir) switching based on
> argument type seems like a reasonable idea. For those that may
> silently produce incorrect output for ASCII-incompatible encodings,
> the os.environ/os.environb split seems like a better approach.
> 
> I could probably be persuaded to merge the APIs, but the email6
> precedent suggests to me that separating the APIs better reflects the
> mental model we're trying to encourage in programmers manipulating
> text (i.e. the difference between the raw octet sequence and the text
> character sequence/parsed data).
> 
That sounds pretty sane and coherent to me.
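
To make the parallel-API idea concrete, here is a minimal sketch of what a
bytes-only entry point might look like. The name urlsplitb comes from the
proposal quoted above; the body is only an assumption for illustration and
is not part of the standard library. It insists on bytes input, decodes
strictly as ASCII so that non-ASCII data fails loudly instead of being
silently mis-split, and hands back bytes components.

```python
from urllib.parse import urlsplit


def urlsplitb(url):
    """Hypothetical bytes-only counterpart to urllib.parse.urlsplit.

    Illustrative sketch only: the name follows the proposal in this
    thread, the implementation is an assumption.
    """
    # Force the caller to declare that they are working in wire format.
    if not isinstance(url, (bytes, bytearray)):
        raise TypeError(
            "urlsplitb() expects bytes, got %s" % type(url).__name__
        )
    # Strict ASCII decode: bytes outside the ASCII range raise
    # UnicodeDecodeError rather than producing a wrong split.
    text = bytes(url).decode("ascii")
    # Reuse the text parser, then return every component as bytes again.
    return tuple(part.encode("ascii") for part in urlsplit(text))


# Usage: bytes in, a 5-tuple of bytes out
# (scheme, netloc, path, query, fragment).
parts = urlsplitb(b"http://example.com/a%20b?q=1#frag")
```

Rejecting non-ASCII input outright is just one possible policy; a real
implementation aimed at untidy real-world data might well choose differently.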

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
DjangoCon US September 7-9, 2010    http://djangocon.us/
See Python Video!       http://python.mirocommunity.org/
Holden Web LLC                 http://www.holdenweb.com/