[Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

Hugo G. Fierro hugo at gfierro.com
Sat Jan 4 16:36:18 CET 2014


Hi Python devs,

I am trying to download an HTML document. I get an HTTP 301 (Moved
Permanently) with a UTF-8 encoded Location header and http.client decodes
it as iso-8859-1. When there's a non-ASCII character in the redirect URL
then I can't download the document.
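
A minimal sketch of the symptom, not tied to any particular server: the
header bytes and the path below are made up for illustration. It decodes
the same UTF-8 bytes first the way http.client does and then as UTF-8.

```python
# Hypothetical raw header line, as a server might send it with a UTF-8 'ä'.
raw = 'Location: /store/k\u00e4rntnerstrasse\r\n'.encode('utf-8')

# Each byte of the two-byte UTF-8 sequence becomes its own character.
as_latin1 = raw.decode('iso-8859-1')
# Decoding as UTF-8 gives back the intended character.
as_utf8 = raw.decode('utf-8')

print(repr(as_latin1))
print(repr(as_utf8))
```

The iso-8859-1 result contains 'Ã¤' where the UTF-8 result contains 'ä',
so a client that follows the first value requests a different URL.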

In client.py, in parse_headers(), I see the call to decode('iso-8859-1'). My
personal hack is to use whatever charset is defined in the Content-Type
HTTP header (utf8) or fall back to iso-8859-1.
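
A rough sketch of that kind of fallback, applied after the fact rather
than inside http.client. The helper name redecode_header and the sample
values are assumptions for illustration. Note that the Content-Type
charset formally describes the body, so using it for headers is only a
heuristic.

```python
from email.message import Message


def redecode_header(value, charset):
    """Hypothetical helper: undo the iso-8859-1 decoding, retry with charset.

    Returns the value unchanged when no charset is given, when the charset
    is unknown, or when the bytes are not valid in that charset.
    """
    if not charset:
        return value
    try:
        # iso-8859-1 maps bytes 0-255 one to one, so this recovers the raw bytes.
        return value.encode('iso-8859-1').decode(charset)
    except (UnicodeError, LookupError):
        return value


# Simulated response headers; http.client's header object is a Message subclass.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

# A Location value as it would look after being decoded as iso-8859-1.
location = '/store/k\u00e4rntnerstrasse'.encode('utf-8').decode('iso-8859-1')

charset = headers.get_content_charset()
fixed = redecode_header(location, charset)
print(repr(fixed))
```

Because the round trip through iso-8859-1 is lossless, the original
value is still returned whenever the retry does not apply.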

At this point I am not sure where or how a fix should occur, so I thought I'd
run it by you in case I should file a bug. Note that I don't use
http.client directly, but through the python-requests library.

I include some code to reproduce the problem below.

Cheers,

Hugo

-----

#!/usr/bin/env python3

# Trying to replicate what wget does with a 301 redirect:
# wget --server-response www.starbucks.com/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010

import http.client
import urllib.parse

s2='/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010'
s3='http://www.starbucks.com/store/158/at/karntnerstrasse/k%C3%A4rntnerstrasse-49-vienna-9-1010'

conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s2)
r = conn.getresponse()
print('Location', r.headers.get('Location'))
print('Expected', urllib.parse.unquote(s3))
assert r.status == 301
assert r.headers.get('Location') == urllib.parse.unquote(s3), \
    'decoded as iso-8859-1 instead of utf8'

conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s3)
r = conn.getresponse()
assert r.status == 200