UnicodeDecodeError having fetch web page
Rob Williscroft
rtw at rtw.me.uk
Tue May 25 16:12:18 EDT 2010
Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3@m21g2000vbr.googlegroups.com
in gmane.comp.python.general:
> Hi,
>
> The code below is giving me the error:
>
> Traceback (most recent call last):
> File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
> unexpected code byte
>
>
> What am i doing wrong?
It may not be you: en.wiktionary.org is sending gzip-encoded
content back. It seems to do this even if you set the Accept
header, as in:
request.add_header( "Accept", "text/html" )
But maybe I'm not doing it correctly.
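One possible reason the Accept header makes no difference: in HTTP, Accept negotiates the media type, while compression is negotiated through Accept-Encoding. The sketch below only builds a request that asks for an uncompressed ("identity") body and prints its headers; it does not contact the server. The helper name build_request and the shortened User-Agent value are made up for illustration, and whether en.wiktionary.org honours the header has not been checked here.

```python
import urllib.request


def build_request(url):
    """Build a request that asks the server not to compress the body.

    Hypothetical helper, not part of the original post.
    """
    headers = {
        # Shortened, illustrative User-Agent value.
        "User-Agent": "Mozilla/5.0",
        # "identity" means "no content coding", i.e. no gzip or deflate.
        "Accept-Encoding": "identity",
    }
    return urllib.request.Request(url, headers=headers)


request = build_request("http://en.wiktionary.org/wiki/baby")
# Inspect what would be sent; nothing goes over the network here.
print(request.header_items())
```

Even with this header set, it seems prudent to check Content-Encoding on the response before decoding the body.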
# encoding: utf-8
import gzip
import urllib.request
from io import BytesIO

# Send a browser-like User-Agent with the request.
request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                           'Gecko/20071127 Firefox/2.0.0.11'})
response = urllib.request.urlopen(request)

# The header lookup returns None when Content-Encoding is absent.
enc = response.info()['Content-Encoding']
print("Encoding: " + str(enc))

# Decompress the gzip body in memory, then decode the bytes as UTF-8.
buf = BytesIO(response.read())
unzipped = gzip.GzipFile(mode='rb', fileobj=buf)
html = unzipped.read().decode('utf-8')

# Escape non-ASCII characters so any console can print the result.
print(html.encode("ascii", "backslashreplace"))
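The byte 0x8b at position 1 in the traceback is consistent with gzip data, since a gzip stream starts with the bytes 0x1f 0x8b. The code above always decompresses, which would fail if the server ever sent a plain body. A minimal sketch of a more defensive variant follows, assuming Python 3.2 or later for gzip.decompress; the helper name decode_body and its parameters are hypothetical, and the sample data is built locally rather than fetched.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as text, gunzipping first only when the
    Content-Encoding value says the body is gzip-compressed.

    Hypothetical helper, not part of the original post.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Locally built sample data, so no network access is needed.
sample = "b\u00e9b\u00e9".encode("utf-8")
compressed = gzip.compress(sample)

# Same text either way; ascii() keeps the output printable on any console.
print(ascii(decode_body(compressed, "gzip")))
print(ascii(decode_body(sample, None)))
```

With a real response, the two arguments would come from response.read() and response.info()['Content-Encoding'].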
Rob.