UnicodeDecodeError having fetch web page

Rob Williscroft rtw at rtw.me.uk
Tue May 25 16:12:18 EDT 2010


Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3@m21g2000vbr.googlegroups.com
in gmane.comp.python.general:

> Hi,
> 
> The code below is giving me the error:
> 
> Traceback (most recent call last):
>   File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
> unexpected code byte
> 
> 
> What am I doing wrong?

It may not be you. en.wiktionary.org is sending gzip-encoded
content back, and it seems to do this even if you set
the Accept header, as in:

request.add_header( "Accept", "text/html" )

But maybe I'm not doing it correctly: Accept only negotiates the
media type, while compression is negotiated through Accept-Encoding.
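
The 0x8b in the traceback fits the gzip explanation: a gzip stream
starts with the two bytes 0x1f 0x8b, and 0x1f on its own is valid
UTF-8, so decoding only fails at position 1. A minimal sketch of that,
using a made-up payload (not the real page) compressed locally:

```python
import gzip

# Hypothetical payload standing in for the compressed page body.
payload = gzip.compress("<html>baby</html>".encode("utf-8"))

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
magic = payload[:2]

error_position = None
bad_byte = None
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x1f decodes fine (it is an ASCII control character),
    # so the first byte the codec rejects is the one at index 1.
    error_position = exc.start
    bad_byte = payload[exc.start]

print(magic, error_position, hex(bad_byte))
```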

# encoding: utf-8
# Fetch the page and gunzip the body, if needed, before decoding it.
import gzip
import urllib.request
from io import BytesIO

request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'},
)

response = urllib.request.urlopen(request)
# .get() returns None instead of raising when the header is absent.
enc = response.info().get('Content-Encoding')
print("Encoding:", enc)

body = response.read()
if enc == 'gzip':
    # The server compressed the body, so decompress it first.
    unzipped = gzip.GzipFile(mode='rb', fileobj=BytesIO(body))
    body = unzipped.read()
html = body.decode('utf-8')

# Escape non-ASCII characters so the console can always print the result.
print(html.encode("ascii", "backslashreplace"))
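
If you want something reusable, the decompress-then-decode step could
be pulled into a small helper. This is only a sketch: decode_body and
GZIP_MAGIC are names I made up, and it assumes the server uses either
gzip or no compression at all.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as text, gunzipping it first when it is gzip data.

    raw is the bytes from response.read(); content_encoding is the value
    of the Content-Encoding header, or None if the header was absent.
    """
    header_says_gzip = (content_encoding or "").strip().lower() == "gzip"
    # Fall back to sniffing the magic bytes in case the header is missing.
    if header_says_gzip or raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(charset)
```

With the script above it would be called as
decode_body(response.read(), response.info().get('Content-Encoding')).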

Rob.



