[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Wed Sep 22 04:59:23 CEST 2010

Neil Hodgson writes:

 >    Over time, the set of trail bytes used has expanded - in GB18030
 > digits are possible although many of the most important characters
 > for parsing such as ''' "#%&.?/''' are still safe as they may not
 > be trail bytes in the common double-byte character sets.

That's just not true.  Many double-byte character sets in use are
based on ISO-2022, which allows the whole GL range (0x21-0x7E, i.e.
every printable ASCII position) to be used for either byte.

Perhaps you're thinking about variable-width encodings like Shift JIS
and Big5, where I believe that restriction on trailing bytes for
double-byte characters holds.  However, 7-bit encodings with control
sequences remain common in several contexts, at least in Japan and
Korea.  I can't say how frequent it is, especially nowadays, but I
have seen ISO-2022-JP in URLs "on the wire".
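To make the difference concrete, here is a minimal sketch that checks
which of the delimiter bytes from the quoted list turn up in the
encoded form of some text.  The helper name delimiter_bytes and the
sample string are my own inventions for illustration, and it assumes
the codecs that ship with CPython.

```python
# Bytes that matter to a URL parser (the set named in the quoted message).
DELIMITERS = b"\"#%&.?/'"


def delimiter_bytes(text, encoding):
    """Return the delimiter bytes that occur in the encoded form of text."""
    encoded = text.encode(encoding)
    return bytes(sorted(set(encoded) & set(DELIMITERS)))


# Two hiragana characters; the text contains no ASCII punctuation at all.
sample = "\u304f\u305f"

for encoding in ("iso2022_jp", "shift_jis", "euc_jp", "utf-8"):
    print(encoding, sample.encode(encoding), delimiter_bytes(sample, encoding))
```

With this sample, the ISO-2022-JP line should report both '/' and '?'
among the encoded bytes, while the other three encodings report none.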

What really saves the day here is not that "common encodings just
don't do that".  It's that even when only the syntactically
significant bytes in the representation are URL-encoded, they *are*
URL-encoded.  As long as the parsing library restricts itself to
handling only wire-format input, you're OK.[1]  But once you start
doing things that involve decoding the URL-encoding, you can run into
trouble, because the decoded bytes may again look like delimiters.
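A rough sketch of that distinction, using urllib.parse.quote and
urllib.parse.unquote_to_bytes from current Python 3.  The path layout
and variable names are made up for illustration, and this is not a
proposal for how the library itself should behave.

```python
from urllib.parse import quote, unquote_to_bytes

# ISO-2022-JP bytes for two hiragana; they include raw '/' and '?' bytes.
raw_segment = "\u304f\u305f".encode("iso2022_jp")

# Wire format: everything outside the unreserved set is percent-encoded.
quoted = quote(raw_segment, safe="")
wire_path = "/dir/" + quoted + "/file"

# OK: parse the wire format first, then decode each segment separately.
segments = [unquote_to_bytes(part) for part in wire_path.split("/")]

# Trouble: decode the URL-encoding first, then split on the raw '/' byte.
broken = unquote_to_bytes(wire_path).split(b"/")

print(wire_path)
print(len(segments), segments[2].decode("iso2022_jp"))
print(len(broken), broken)
```

Splitting the wire format yields four segments, and the third one
decodes back to the original text.  Decoding first yields five pieces,
because a byte inside a double-byte character is mistaken for a path
separator.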

Footnotes: 
[1]  With conforming input.  I assume that the libraries know how to
defend themselves from non-conforming input, which could be any kind
of bug or attack, not just mojibake.