[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Mon Sep 20 23:12:13 CEST 2010

On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
> Existing APIs save for "quote" don't really need to deal with charset
> encodings at all, at least on any level that Python needs to care about.
> The potential already exists to emit garbage which will turn into
> mojibake from almost all existing APIs.  The only remaining issue seems
> to be fear of making a design mistake while designing APIs.
>
> IMO, having a separate module for all urllib.parse APIs, each designed
> for only bytes input is a design mistake greater than any mistake that
> could be made by allowing for both bytes and str input to existing APIs
> and returning whatever type was passed.  The existence of such a module
> will make it more difficult to maintain a codebase which straddles
> Python 2 and Python 3.

Failure to use quote/unquote correctly is a completely different
problem from using bytes with an ASCII-incompatible encoding, or
mixing bytes with different encodings. Yes, if you don't quote your
URLs you may end up with mojibake. That's not a justification for
creating a *new* way to accidentally create mojibake.
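
To make the distinction concrete, here is a minimal sketch of the failure
modes in question. The sample path and URL are made up for illustration,
and the calls assume the urllib.parse API as it exists in current Python 3.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"  # hypothetical path with one non-ASCII character

# Problem 1: quote/unquote used with mismatched encodings.
quoted = quote(path)                           # percent-encodes the UTF-8 bytes
roundtrip = unquote(quoted)                    # decoded as UTF-8: text is intact
garbled = unquote(quoted, encoding="latin-1")  # wrong codec: mojibake

# Problem 2: bytes in an ASCII-incompatible encoding. The URL delimiters
# are no longer single ASCII bytes, so byte-level parsing cannot find them.
utf16 = "http://example.com/a".encode("utf-16-le")
looks_like_url = utf16.startswith(b"http://")

# Problem 3: mixing bytes that came from different encodings. The result
# cannot be decoded correctly with any single codec.
mixed = "\u00e9".encode("utf-8") + "\u00e9".encode("latin-1")
```

Only the first problem is about quoting. The other two arise from the bytes
themselves, before quote or unquote is ever called.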

Separating the APIs means that application programmers will be
expected to know whether they are working with data formatted for
display to the user (i.e. Unicode text) or transfer over the wire
(i.e. ASCII-compatible bytes).
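
A rough sketch of what that separation could look like to a caller. The
names split_text_url and split_wire_url are hypothetical and not part of
any proposal or of the standard library. The sketch leans on urlsplit,
which in Python 3.2 and later accepts ASCII-compatible bytes as well as
str, purely to keep the example short.

```python
from urllib.parse import urlsplit


def split_text_url(url):
    """Hypothetical text-only entry point: str in, str components out."""
    if not isinstance(url, str):
        raise TypeError("expected str, got %s" % type(url).__name__)
    return urlsplit(url)


def split_wire_url(url):
    """Hypothetical wire-format entry point: bytes in, bytes components out."""
    if not isinstance(url, bytes):
        raise TypeError("expected bytes, got %s" % type(url).__name__)
    # urlsplit decodes bytes input strictly as ASCII, so non-ASCII bytes
    # raise UnicodeDecodeError instead of being silently misread.
    return urlsplit(url)


text_parts = split_text_url("http://example.com/a?b=c")
wire_parts = split_wire_url(b"http://example.com/a?b=c")
```

With this shape, passing the wrong kind of data fails at the call site
rather than producing a value of the wrong type further down the line.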

Can you give me a concrete use case where the application programmer
won't *know* which format they're working with? Py3k made the
conscious decision to stop allowing careless mixing of encoded and
unencoded text. This is just taking that philosophy and propagating it
further up the API stack (as has already been done with several
OS-facing APIs for 3.2).
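
For reference, a minimal sketch of that core-language behaviour (the
sample string is an assumption for illustration): combining bytes and str
raises an error instead of guessing an encoding, so the conversion has to
be written out explicitly.

```python
text = "caf\u00e9"               # unencoded text (str)
encoded = text.encode("utf-8")   # encoded data (bytes)

# Implicit mixing is rejected outright in Python 3.
try:
    encoded + text
except TypeError as exc:
    error = exc
else:
    error = None

# The programmer has to state which encoding the bytes are in.
combined = encoded.decode("utf-8") + text
```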

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia