[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Nick Coghlan ncoghlan at gmail.com
Tue Sep 21 00:19:12 CEST 2010


On Tue, Sep 21, 2010 at 7:39 AM, Chris McDonough <chrism at plope.com> wrote:
> On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:
>> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
>> > Existing APIs save for "quote" don't really need to deal with charset
>> > encodings at all, at least on any level that Python needs to care about.
>> > The potential already exists to emit garbage which will turn into
>> > mojibake from almost all existing APIs.  The only remaining issue seems
>> > to be fear of making a design mistake while designing APIs.
>> >
>> > IMO, having a separate module for all urllib.parse APIs, each designed
>> > for only bytes input is a design mistake greater than any mistake that
>> > could be made by allowing for both bytes and str input to existing APIs
>> > and returning whatever type was passed.  The existence of such a module
>> > will make it more difficult to maintain a codebase which straddles
>> > Python 2 and Python 3.
>>
>> Failure to use quote/unquote correctly is a completely different
>> problem from using bytes with an ASCII incompatible encoding, or
>> mixing bytes with different encodings. Yes, if you don't quote your
>> URLs you may end up with mojibake. That's not a justification for
>> creating a *new* way to accidentally create mojibake.
>
> There's no new way to accidentally create mojibake here by allowing
> bytes input, as far as I can tell.
>
> - If a user passes something that has character data outside the range
>  0-127 to an API that expects a URL or a "component" (in the
>  definition that urllib.parse.urlparse uses for "component") of a URI,
>  he can keep both pieces when it breaks.  Whether that data is
>  represented via bytes or text is not relevant.  He provided
>  bad input, he is going to lose one way or another.
>
> - If a user passes a bytestring to ``quote``, because ``quote`` is
>  implemented in terms of ``quote_from_bytes``, the case is *already*
>  handled by quote_from_bytes implicitly failing to convert nonascii
>  characters.
>
> What are the cases you believe will cause new mojibake?

Calling operations like urlsplit on byte sequences in
non-ASCII-compatible encodings, and operations like urljoin on byte
sequences that are encoded with different encodings. These errors
differ from the URL escaping errors you cite, since they can produce
true mojibake (i.e. a byte sequence without a single consistent
encoding), rather than merely non-compliant URLs. However, if someone
has let their encodings get that badly out of whack in URL
manipulation, they're probably doomed anyway...
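
To make the "true mojibake" distinction concrete, here's a minimal
sketch of what a mixed-encoding result looks like at the byte level
(it deliberately uses plain byte concatenation rather than urljoin
itself, and the example URLs are invented):

    # Sketch only: a naive byte-level join of two differently encoded
    # URL fragments, standing in for what a bytes-blind urljoin would do
    base = "http://example.com/caf\u00e9/".encode("latin-1")  # Latin-1 path
    ref = "men\u00fc.html".encode("utf-8")                     # UTF-8 reference
    joined = base + ref

    # Neither codec recovers the intended text:
    print(joined.decode("latin-1"))
    # -> 'http://example.com/café/menÃ¼.html' (the UTF-8 bytes are misread)
    try:
        joined.decode("utf-8")
    except UnicodeDecodeError:
        # the Latin-1 0xe9 byte is illegal in UTF-8
        print("not valid UTF-8 either")

No single codec can turn the joined bytes back into the URL the user
meant, which is what distinguishes this from a merely unquoted (but
still consistently encoded) URL.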

It's certainly possible I hadn't given enough weight to the practical
issues associated with migration of existing code from 2.x to 3.x
(particularly with the precedent of some degree of polymorphism being
set back when Issue 3300 was dealt with).

Given that a separate API still places the onus on the developer to
manage their encodings correctly, I'm beginning to lean back towards
the idea of a polymorphic API rather than separate functions. (The
quote/unquote legacy becomes somewhat unfortunate in that situation,
as they always return str objects rather than allowing the type of
the result to be determined by the type of the argument. Something
like quotep/unquotep may prove necessary in order to work around that
situation and provide a bytes->bytes, str->str API.)
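
For the sake of argument, here's a rough, purely illustrative sketch
of what such a type-preserving wrapper could look like (quotep and
unquotep are just the placeholder names floated above, built on the
bytes handling that quote and unquote_to_bytes already provide):

    from urllib.parse import quote, unquote, unquote_to_bytes

    def quotep(string, safe='/'):
        # sketch only: bytes -> bytes, str -> str
        if isinstance(string, bytes):
            # quote() already accepts bytes, but always returns str;
            # re-encode so the output type matches the input type
            return quote(string, safe=safe).encode('ascii')
        return quote(string, safe=safe)

    def unquotep(string, encoding='utf-8', errors='replace'):
        # sketch only: bytes -> bytes, str -> str
        if isinstance(string, bytes):
            return unquote_to_bytes(string)
        return unquote(string, encoding=encoding, errors=errors)

That keeps a single set of names while still letting the type of the
result follow the type of the argument.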

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

