[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Neil Hodgson nyamatongwe at gmail.com
Wed Sep 22 01:15:13 CEST 2010


Ian Bicking:

> I think the use case everyone has in mind here is where
> you get a URL from one of these sources, and you want to handle it.  I have
> a hard time imagining the sequence of events that would lead to mojibake.
> Naive parsing of a document in bytes couldn't do it, because if you have a
> non-ASCII-compatible document your ASCII-based parsing will also fail (e.g.,
> looking for b'href="(.*?)"').

   It depends on what the particular ASCII-based parsing is doing. For
example, the set of trail bytes in Shift-JIS includes the same bytes
as some of the punctuation characters in ASCII as well as all the
letters. A search or split on '@' or '|' may find the trail byte in a
two-byte character rather than a true occurrence of that character so
the operation 'succeeds' but produces an incorrect result.

   Over time, the set of trail bytes used has expanded - in GB18030
digits are possible although many of the most important characters for
parsing such as ''' "#%&.?/''' are still safe as they may not be trail
bytes in the common double-byte character sets.

   Neil


More information about the Python-Dev mailing list