[Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

Xavier Morel catch-all at masklinn.net
Sat Jan 4 17:50:23 CET 2014


On 2014-01-04, at 17:24 , Chris Angelico <rosuav at gmail.com> wrote:

> On Sun, Jan 5, 2014 at 2:36 AM, Hugo G. Fierro <hugo at gfierro.com> wrote:
>> I am trying to download an HTML document. I get an HTTP 301 (Moved
>> Permanently) with a UTF-8 encoded Location header and http.client decodes it
>> as iso-8859-1. When there's a non-ASCII character in the redirect URL then I
>> can't download the document.
>> 
>> In client.py, in parse_headers(), I see the call to decode('iso-8859-1'). My
>> personal hack is to use whatever charset is defined in the Content-Type
>> HTTP header (utf8) or fall back to iso-8859-1.
>> 
>> At this point I am not sure where/how a fix should occur, so I thought I'd
>> run it by you in case I should file a bug. Note that I don't use http.client
>> directly, but through the python-requests library.
> 
> I'm not 100% sure, but I believe non-ASCII characters are outright
> forbidden in a Location: header. It's possible that an RFC2047 tag
> might be used, but my reading of RFC2616 is that that's only for text
> fields, not for Location. These non-ASCII characters ought to be
> percent-encoded, and anything doing otherwise is buggy.

That is also my reading: the Location field's value is defined as an
absoluteURI (RFC 2616, section 14.30):

> Location = "Location" ":" absoluteURI

Section 3.2.1 indicates that "absoluteURI" (and other related
concepts) are used as defined by RFC 2396 "Uniform Resource
Identifiers (URI): Generic Syntax", that is:

> absoluteURI = scheme ":" ( hier_part | opaque_part )

Both "hier_part" and "opaque_part" consist of some punctuation
characters, "escaped" and "unreserved". "escaped" covers %-encoded
characters, which leaves "unreserved", defined as "alphanum | mark".
"mark" is more punctuation and "alphanum" is ASCII's alphanumeric
ranges, so no non-ASCII character can appear unescaped.
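
As a rough illustration of that character set, here is a minimal
sketch of a character-level check. The helper name
uses_only_rfc2396_chars is made up for this example, and it is not a
full RFC 2396 parser: it only tests which characters appear, not
whether they are arranged into a valid absoluteURI.

```python
import re
import string

# Character classes from RFC 2396, sections 2.2 and 2.3 (all ASCII).
ALPHANUM = string.ascii_letters + string.digits
MARK = "-_.!~*'()"
RESERVED = ";/?:@&=+$,"
ALLOWED = set(ALPHANUM + MARK + RESERVED)

# "escaped" production: a percent sign followed by two hex digits.
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")


def uses_only_rfc2396_chars(value):
    """Hypothetical helper: True if value uses only characters that
    RFC 2396 permits in an absoluteURI (reserved, unreserved, escaped).

    This checks characters only; it does not validate URI structure.
    """
    # Remove well-formed %XX escapes; any '%' left over is a stray one
    # and fails the membership test below.
    stripped = ESCAPED.sub("", value)
    return all(ch in ALLOWED for ch in stripped)


print(uses_only_rfc2396_chars("http://example.com/caf%C3%A9"))
print(uses_only_rfc2396_chars("http://example.com/caf\u00e9"))
```

The first call should accept the percent-encoded URL and the second
should reject the one carrying a raw non-ASCII character.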

Furthermore, although RFC 3986 moves some stuff around and renames some
production rules, it seems to have kept this limitation.
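
For a client stuck with such a server, one possible workaround (a
sketch only, not something http.client or requests does for you) is
to undo the iso-8859-1 decoding and percent-encode the non-ASCII
bytes. The function name requote_location is hypothetical. It relies
on iso-8859-1 mapping every byte to exactly one code point, so
encoding the decoded string back gives the original bytes and there
is no need to guess which charset the server used.

```python
def requote_location(value):
    """Hypothetical helper: percent-encode the non-ASCII bytes of a
    header value that was decoded as ISO-8859-1.

    Assumes value came from an ISO-8859-1 decode, so every character
    is at most U+00FF; anything else raises UnicodeEncodeError.
    """
    # Recover the raw bytes exactly as they arrived on the wire.
    raw = value.encode("iso-8859-1")
    # Leave ASCII bytes alone (including existing %XX escapes and URI
    # delimiters) and escape every byte at or above 0x80.
    return "".join(
        chr(b) if b < 0x80 else "%{:02X}".format(b) for b in raw
    )


# Simulate what the client sees: UTF-8 bytes decoded as ISO-8859-1.
seen = "http://example.com/caf\u00e9".encode("utf-8").decode("iso-8859-1")
print(requote_location(seen))
```

The result contains only ASCII and can be handed to whatever follows
the redirect, though the server sending the header is still the
buggy party.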

