Deflate with urllib2...

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu Sep 18 17:10:34 EDT 2008


On Tue, 16 Sep 2008 21:58:31 -0300, Sam <samslists at gmail.com> wrote:

> Gabriel, et al.
>
> It's hard to find a web site that uses deflate these days.
>
> Luckily, slashdot to the rescue.
>
> I even wrote a test script.
>
> If someone can tell me what's wrong that would be great.
>
> Here's what I get when I run it:
> Data is compressed using deflate.  Length is:   107160
> Traceback (most recent call last):
>   File "my_deflate_test.py", line 19, in <module>
>     data = zlib.decompress(data)
> zlib.error: Error -3 while decompressing data: incorrect header check

And that's true. The slashdot server is sending bogus data:

py> import socket
py> s = socket.socket()
py> s.connect(('slashdot.org',80))
py> s.sendall("GET / HTTP/1.1\r\nHost: slashdot.org\r\nAccept-Encoding: deflate\r\n\r\n")
py> s.recv(500)
'HTTP/1.1 200 OK\r\nDate: Thu, 18 Sep 2008 20:48:34 GMT\r\nServer: Apache/1.3.41 (Unix) mod_perl/1.31-rc4\r\nSLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005001220\r\nX-Bender: Alright! Closure!\r\nCache-Control: private\r\nPragma: private\r\nConnection: close\r\nContent-Type: text/html; charset=iso-8859-1\r\nVary: Accept-Encoding, User-Agent\r\nContent-Encoding: deflate\r\nTransfer-Encoding: chunked\r\n\r\n1c76\r\n\x02\x00\x00\x00\xff\xff\x00\xc1\x0f>\xf0<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n             "http://www.w3.org/TR/html4/str...'

Note the 11 bytes starting with "\x02\x00\x00\x00\xff\xff..." right after
the chunk size (1c76), followed by the page contents in plain text.
According to RFC 2616 (HTTP 1.1), the deflate content coding consists of
the "zlib" format defined in RFC 1950 in combination with the "deflate"
compression mechanism described in RFC 1951. RFC 1950 says that the lower
4 bits of the first byte in a zlib stream represent the compression
method; the only compression method defined is "deflate", with value 8.
The first byte of the slashdot data is \x02, so that field contains a 2
instead, and the data is not a valid zlib stream.
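
The header rule above can be checked by hand. What follows is only a
minimal sketch: looks_like_zlib is a hypothetical helper name (it is not
part of the zlib module), and the second condition it tests, that the
first two bytes read as a big-endian number must be a multiple of 31, is
the check-bits rule from RFC 1950.

```python
import zlib

def looks_like_zlib(data):
    """Return True if data starts with a plausible RFC 1950 header."""
    # bytearray gives integers on both Python 2 str and Python 3 bytes
    header = bytearray(data[:2])
    if len(header) < 2:
        return False
    cmf, flg = header
    method = cmf & 0x0F  # lower 4 bits of the first byte: compression method
    # RFC 1950: method must be 8 (deflate), and CMF*256 + FLG must be
    # a multiple of 31
    return method == 8 and (cmf * 256 + flg) % 31 == 0

# The 11 bytes that precede the plain-text page in the response above
slashdot_prefix = b"\x02\x00\x00\x00\xff\xff\x00\xc1\x0f>\xf0"

print(looks_like_zlib(slashdot_prefix))          # False: method is 2, not 8
print(looks_like_zlib(zlib.compress(b"hello")))  # True
```

zlib.decompress performs this same validation internally, which is where
the "incorrect header check" message in the traceback comes from.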

> #!/usr/bin/env python
>
> import urllib2
> import zlib
>
> opener = urllib2.build_opener()
> opener.addheaders = [('Accept-encoding', 'deflate')]
>
> stream = opener.open('http://www.slashdot.org')
> data = stream.read()
> encoded = stream.headers.get('Content-Encoding')
>
> if encoded == 'deflate':
>     print "Data is compressed using deflate.  Length is:  ", str(len(data))
>     data = zlib.decompress(data)
>     print "After uncompressing, length is: ", str(len(data))
> else:
>     print "Data is not deflated."

The code is correct - try it with another server. I tested it with a
LightHTTPd server and it worked fine.
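
If a client has to cope with a server like this anyway, one possible
workaround is to retry the data as a raw RFC 1951 stream, with no zlib
header or trailer. This is only a sketch under that assumption: it has not
been checked against the full slashdot response, and inflate is a
hypothetical helper name, not something provided by urllib2 or zlib. It
relies on zlib.decompress accepting a negative wbits value for raw
streams.

```python
import zlib

def inflate(data):
    """Decompress a 'deflate' body, tolerating a missing zlib wrapper."""
    try:
        # What RFC 2616 asks for: a zlib (RFC 1950) stream
        return zlib.decompress(data)
    except zlib.error:
        # Assumption: the server sent raw deflate (RFC 1951) instead;
        # a negative wbits tells zlib not to expect header or trailer
        return zlib.decompress(data, -zlib.MAX_WBITS)

# Build a raw deflate stream locally to exercise the fallback path
raw = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
raw_body = raw.compress(b"hello") + raw.flush()

print(inflate(zlib.compress(b"hello")) == b"hello")  # zlib-wrapped input
print(inflate(raw_body) == b"hello")                 # raw deflate input
```

In the test script, that would mean calling inflate(data) where it now
calls zlib.decompress(data).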

-- 
Gabriel Genellina



