[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Nick Coghlan ncoghlan at gmail.com
Wed Sep 22 14:07:47 CEST 2010


On Wed, Sep 22, 2010 at 12:59 PM, Stephen J. Turnbull
<stephen at xemacs.org> wrote:
> Neil Hodgson writes:
>
>  >    Over time, the set of trail bytes used has expanded - in GB18030
>  > digits are possible although many of the most important characters
>  > for parsing such as ''' "#%&.?/''' are still safe as they may not
>  > be trail bytes in the common double-byte character sets.
>
> That's just not true.  Many double-byte character sets in use are
> based on ISO-2022, which allows the whole GL repertoire to be used.
>
> Perhaps you're thinking about variable-width encodings like Shift JIS
> and Big5, where I believe that restriction on trailing bytes for
> double-byte characters holds.  However, 7-bit encodings with control
> sequences remain common in several contexts, at least in Japan and
> Korea.  In particular, I can't say how frequent it is, especially
> nowadays, but I have seen ISO-2022-JP in URLs "on the wire".

Notably, UTF-16 and UTF-32 make no promises regarding avoidance of
ASCII character codes in trail bytes - only UTF-8 is guaranteed to be
compatible with parsing as if it were ASCII (and even then, you need
to be careful to split the string only at known ASCII characters
rather than at arbitrary points).
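
A minimal sketch of that difference, assuming Python 3; the sample
string and the variable names are purely illustrative. It splits
encoded bytes at '/' and shows that this is safe for UTF-8 but not for
UTF-16, and that even UTF-8 breaks when cut at an arbitrary offset.

```python
# Illustrative string: U+2F00 (KANGXI RADICAL ONE), '/', U+00E9 (e-acute).
text = "\u2f00/\u00e9"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8: every byte of a multibyte sequence is >= 0x80, so the only
# 0x2F byte is the real '/' and splitting there is safe.
utf8_parts = [part.decode("utf-8") for part in utf8.split(b"/")]

# UTF-16-LE: U+2F00 is encoded as the bytes 00 2F, so a 0x2F byte also
# appears inside a character and a byte-level split cuts it apart.
utf16_slashes = utf16.count(b"/")

# Even with UTF-8, a cut at an arbitrary offset can land mid-character.
try:
    utf8[:1].decode("utf-8")
    cut_is_safe = True
except UnicodeDecodeError:
    cut_is_safe = False

print(utf8_parts, utf16_slashes, cut_is_safe)
```

The UTF-16 bytes contain two 0x2F bytes even though the text contains
only one '/'.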

The known-ASCII-incompatible multibyte encodings I came up with when I
reviewed the list in the codecs module docs the other day were:
  - CP932 (the example posted here that prompted me to embark on this
    check in the first place)
  - UTF-7
  - UTF-16
  - UTF-32
  - shift-JIS
  - big5
  - iso-2022-*
  - EUC-CN/KR/TW

The only known-ASCII-compatible multibyte encodings I found were UTF-8
and EUC-JP (all of the non-EBCDIC single-byte encodings appeared to be
ASCII compatible, though).
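
One way such a check could be automated is sketched below. This is not
the procedure used for the survey above, and the helper name
first_ascii_leak is hypothetical. It tests only one narrow property:
whether any non-ASCII character in the Basic Multilingual Plane encodes
to a byte below 0x80. It does not check that ASCII text itself is
encoded unchanged, and the result for a given codec depends on the
Python build's codec tables.

```python
def first_ascii_leak(codec):
    """Return the first non-ASCII BMP character whose encoding in
    `codec` contains a byte below 0x80, or None if there is none.

    Hypothetical helper for illustration only.
    """
    for code_point in range(0x80, 0x10000):
        ch = chr(code_point)
        try:
            data = ch.encode(codec)
        except UnicodeError:
            # Not representable in this codec (or a lone surrogate).
            continue
        if any(byte < 0x80 for byte in data):
            return ch
    return None


for name in ("utf-8", "latin-1", "utf-16-le", "shift_jis"):
    leak = first_ascii_leak(name)
    status = "no ASCII bytes leaked" if leak is None else "leaks at %r" % leak
    print(name, "->", status)
```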

I didn't check any of the other CP* encodings though, since I already
had plenty of examples to show that the assumption of ASCII
compatibility isn't likely to be valid in general unless there is some
other constraint (such as the RFCs for safely encoding URLs to an
octet-sequence).
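
As a rough illustration of how percent-encoding supplies that
constraint, the sketch below (Python 3; the sample character is just
an example) encodes a Shift JIS character whose trail byte is 0x5C, the
ASCII backslash. Once quoted, the URL component contains only ASCII
characters, so it can be parsed safely regardless of the underlying
encoding.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

ch = "\u30bd"  # KATAKANA LETTER SO, Shift JIS bytes 0x83 0x5C

raw = ch.encode("shift_jis")
# The raw trail byte is indistinguishable from an ASCII backslash.
has_backslash = b"\\" in raw

# quote() encodes the text with the given codec, then percent-escapes
# every byte that is not in the unreserved or "safe" sets.
quoted = quote(ch, encoding="shift_jis")

# The escaped form round-trips back to the original bytes and text.
round_trip_bytes = unquote_to_bytes(quoted)
round_trip_text = unquote(quoted, encoding="shift_jis")

print(raw, has_backslash, quoted, round_trip_text == ch)
```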

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

