[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Mon Sep 20 23:39:13 CEST 2010

On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:
> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
> > Existing APIs save for "quote" don't really need to deal with charset
> > encodings at all, at least on any level that Python needs to care about.
> > The potential already exists to emit garbage which will turn into
> > mojibake from almost all existing APIs.  The only remaining issue seems
> > to be fear of making a design mistake while designing APIs.
> >
> > IMO, having a separate module for all urllib.parse APIs, each designed
> > for only bytes input is a design mistake greater than any mistake that
> > could be made by allowing for both bytes and str input to existing APIs
> > and returning whatever type was passed.  The existence of such a module
> > will make it more difficult to maintain a codebase which straddles
> > Python 2 and Python 3.
> 
> Failure to use quote/unquote correctly is a completely different
> problem from using bytes with an ASCII incompatible encoding, or
> mixing bytes with different encodings. Yes, if you don't quote your
> URLs you may end up with mojibake. That's not a justification for
> creating a *new* way to accidentally create mojibake.

As far as I can tell, allowing bytes input creates no new way to
accidentally produce mojibake.

- If a user passes something that has character data outside the range
  0-127 to an API that expects a URL or a "component" (in the sense
  that urllib.parse.urlparse uses "component") of a URI, he can keep
  both pieces when it breaks.  Whether that data is represented via
  bytes or text is not relevant.  He provided bad input; he is going
  to lose one way or another.

- If a user passes a bytestring to ``quote``, because ``quote`` is
  implemented in terms of ``quote_from_bytes``, the case is *already*
  handled by ``quote_from_bytes`` implicitly failing to convert
  nonascii characters.
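
A minimal sketch of the second case, assuming Python 3 and the stdlib
``urllib.parse`` module (the sample values are made up for illustration).
Bytes passed to ``quote`` are percent-escaped byte by byte and never
decoded, so no charset guess is involved:

```python
from urllib.parse import quote, quote_from_bytes

# Illustrative input: Latin-1 encoded bytes that are not valid UTF-8.
raw = b"caf\xe9"

# Bytes input: each non-ASCII byte is escaped as-is, with no decoding step.
quoted_bytes = quote(raw)                  # 'caf%E9'
quoted_direct = quote_from_bytes(raw)      # same result, called directly

# Text input: the str is encoded (UTF-8 by default) before escaping.
quoted_text = quote("caf\u00e9")           # 'caf%C3%A9'
```

Either way the result is plain ASCII; whether it means anything to the
receiving end depends on the input the caller provided.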

What are the cases you believe will cause new mojibake? 

> Separating the APIs means that application programmers will be
> expected to know whether they are working with data formatted for
> display to the user (i.e. Unicode text) or transfer over the wire
> (i.e. ASCII compatible bytes).
> 
> Can you give me a concrete use case where the application programmer
> won't *know* which format they're working with? Py3k made the
> conscious decision to stop allowing careless mixing of encoded and
> unencoded text. This is just taking that philosophy and propagating it
> further up the API stack (as has already been done with several OS
> facing APIs for 3.2).

Yes.  Code that must explicitly deal with bytes input and output and is
meant to straddle both Python 2 and Python 3.  Please try to write some
code which 1) uses the same codebase to straddle Python 2.6 and Python
3.2 and 2) passes bytes input to, and expects bytes output from, say,
urlsplit.  It becomes complex very quickly.  A proposal to create yet
another bytes-only API only makes it more complex, AFAICT.
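
To illustrate, here is a rough sketch of the kind of shim such a codebase
ends up carrying if ``urlsplit`` is treated as text-only.  The name
``urlsplit_bytes`` is hypothetical (it is not a stdlib function), and the
latin-1 round trip is an assumption chosen only because it maps every
byte to one code point and back without loss:

```python
import sys

# The module moved between major versions, so the import itself must branch.
if sys.version_info[0] >= 3:
    from urllib.parse import urlsplit
else:
    from urlparse import urlsplit


def urlsplit_bytes(url):
    """Hypothetical helper: split a bytes URL, return bytes components."""
    # latin-1 maps bytes 0-255 one-to-one onto code points, so decoding
    # and re-encoding cannot fail or alter the data.
    text = url.decode("latin-1")
    return tuple(part.encode("latin-1") for part in urlsplit(text))


parts = urlsplit_bytes(b"http://example.com/a%20b?x=1#frag")
# parts == (b'http', b'example.com', b'/a%20b', b'x=1', b'frag')
```

Every other parse function would need the same treatment, and that is
before a second, bytes-only module enters the picture.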

- C