[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Tue Sep 21 00:28:44 CEST 2010

On Tue, 2010-09-21 at 08:19 +1000, Nick Coghlan wrote:
> On Tue, Sep 21, 2010 at 7:39 AM, Chris McDonough <chrism at plope.com> wrote:
> > On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:
> >> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
> >> > Existing APIs save for "quote" don't really need to deal with charset
> >> > encodings at all, at least on any level that Python needs to care about.
> >> > The potential already exists to emit garbage which will turn into
> >> > mojibake from almost all existing APIs.  The only remaining issue seems
> >> > to be fear of making a design mistake while designing APIs.
> >> >
> >> > IMO, having a separate module for all urllib.parse APIs, each designed
> >> > for only bytes input is a design mistake greater than any mistake that
> >> > could be made by allowing for both bytes and str input to existing APIs
> >> > and returning whatever type was passed.  The existence of such a module
> >> > will make it more difficult to maintain a codebase which straddles
> >> > Python 2 and Python 3.
> >>
> >> Failure to use quote/unquote correctly is a completely different
> >> problem from using bytes with an ASCII incompatible encoding, or
> >> mixing bytes with different encodings. Yes, if you don't quote your
> >> URLs you may end up with mojibake. That's not a justification for
> >> creating a *new* way to accidentally create mojibake.
> >
> > There's no new way to accidentally create new mojibake here by allowing
> > bytes input, as far as I can tell.
> >
> > - If a user passes something that has character data outside the range
> >  0-127 to an API that expects a URL or a "component" (in the
> >  definition that urllib.parse.urlparse uses for "component") of a URI,
> >  he can keep both pieces when it breaks.  Whether that data is
> >  represented via bytes or text is not relevant.  He provided
> >  bad input, he is going to lose one way or another.
> >
> > - If a user passes a bytestring to ``quote``, because ``quote`` is
> >  implemented in terms of ``quote_from_bytes``, the case is *already*
> >  handled by ``quote_from_bytes`` implicitly failing to convert
> >  non-ASCII characters.
> >
> > What are the cases you believe will cause new mojibake?
> 
> Calling operations like urlsplit on byte sequences in ASCII-incompatible
> encodings and operations like urljoin on byte sequences
> that are encoded with different encodings. These errors differ from
> the URL escaping errors you cite, since they can produce true mojibake
> (i.e. a byte sequence without a single consistent encoding), rather
> than merely non-compliant URLs. However, if someone has let their
> encodings get that badly out of whack in URL manipulation they're
> probably doomed anyway...

Right, the bytes issue here is really a red herring in both the urlsplit
and urljoin cases, I think.
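
To make the "true mojibake" case concrete, here is a minimal sketch that
does not use urllib.parse at all. The URL, the two encodings and the
helper name decodes_as are made up for illustration; the point is only
that joining bytes from two different encodings yields a sequence with
no single consistent encoding, whichever API did the joining.

```python
# Hypothetical inputs: a base URL encoded as Latin-1 and a relative
# part encoded as UTF-8 (both invented for this illustration).
base = "http://example.com/caf\u00e9/".encode("latin-1")
rel = "na\u00efve".encode("utf-8")

# A naive byte-level join: nothing checks that the encodings agree.
joined = base + rel


def decodes_as(data, encoding):
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The Latin-1 byte for U+00E9 is not valid UTF-8 in this position.
utf8_ok = decodes_as(joined, "utf-8")

# Latin-1 accepts any byte, but the UTF-8 part comes out garbled.
latin1_text = joined.decode("latin-1")
```

The caller mixed the encodings before any URL function was involved, so
the damage is the same whether the pieces travel as bytes or as text
decoded with the wrong codec.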

> It's certainly possible I hadn't given enough weight to the practical
> issues associated with migration of existing code from 2.x to 3.x
> (particularly with the precedent of some degree of polymorphism being
> set back when Issue 3300 was dealt with).
> 
> Given that a separate API still places the onus on the developer to
> manage their encodings correctly, I'm beginning to lean back towards
> the idea of a polymorphic API rather than separate functions. (the
> quote/unquote legacy becomes somewhat unfortunate in that situation,
> as they always return str objects rather than allowing the type of
> the result to be determined by the type of the argument. Something
> like quotep/unquotep may prove necessary in order to work around that
> situation and provide a bytes->bytes, str->str API)

Yay, sounds much, much better!
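
For what it's worth, a rough sketch of what a bytes->bytes, str->str
pair could look like, built on functions that already exist in
urllib.parse. The names quotep and unquotep are only the placeholders
Nick floated above, and the signatures are my assumption, not a
proposed API.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes


def quotep(value, safe="/"):
    """Hypothetical polymorphic quote: bytes in, bytes out; str in, str out."""
    if isinstance(value, bytes):
        # quote_from_bytes returns an ASCII-only str, so encoding is safe.
        return quote_from_bytes(value, safe=safe).encode("ascii")
    return quote(value, safe=safe)


def unquotep(value):
    """Hypothetical polymorphic unquote with the same type-preserving rule."""
    if isinstance(value, bytes):
        return unquote_to_bytes(value)
    return unquote(value)


quoted_bytes = quotep(b"a b/\xfc")   # raw byte 0xFC is escaped as-is
quoted_text = quotep("a b/\u00fc")   # str path encodes as UTF-8 first
```

The bytes path never guesses an encoding: each non-ASCII byte is
percent-escaped as it stands, so the caller stays responsible for
knowing what the bytes mean.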

- C