read() returns data of different sizes

Chris Rebert clp2 at rebertia.com
Sat Oct 2 14:13:18 CEST 2010


On Sat, Oct 2, 2010 at 4:58 AM, jimgardener <jimgardener at gmail.com> wrote:
> hi
> while trying out urllib.urlopen, I wrote this code to read a url and
> return the data length
>
> import datetime,time,urllib
>
> def get_page_size(pageurlstr):
>    h=urllib.urlopen(pageurlstr)
>    data=h.read()
>    return len(data)
>
> while True:
>     print 'reading url www.google.com at', datetime.datetime.now().isoformat(' ')
>     print 'size=%d' % get_page_size('http://www.google.com')
>     time.sleep(5)
>
>
> I got this output
>
> reading url www.google.com at 2010-10-02 17:22:24.691654
> size=9512
> reading url www.google.com at 2010-10-02 17:22:30.681236
> size=9530
> reading url www.google.com at 2010-10-02 17:22:36.886369
> size=9530
> reading url www.google.com at 2010-10-02 17:22:42.315392
> size=9512
> reading url www.google.com at 2010-10-02 17:22:48.763693
> size=9512
> reading url www.google.com at 2010-10-02 17:22:54.711666
> size=9548
> reading url www.google.com at 2010-10-02 17:23:00.151843
> size=9530
> reading url www.google.com at 2010-10-02 17:23:05.844620
> size=9548
>
>
> Why is it that the sizes are different?

Because Google does not always send back the *exact* same HTML every
time you request its homepage (note how small the variance is: 9512
to 9548 bytes). You can easily verify this by using your browser's
"Save Page" function and diffing the HTML from two different loads.
What varies is possibly some sort of tracking ID.
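
One way to check this without a browser is to fetch the page twice and
compare the two bodies. The sketch below is written for Python 3, where
urlopen lives in urllib.request rather than urllib. The helper names
fetch and differing_fragments are hypothetical, and decoding the body as
UTF-8 is an assumption about the page, not something the server
guarantees.

```python
import difflib
import urllib.request


def fetch(url):
    """Return the full response body of `url` as text (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def differing_fragments(first, second):
    """Return (old, new) pairs for every stretch where the two texts differ."""
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return [
        (first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Example use (needs network access, so it is left commented out):
# url = "http://www.google.com"
# for old, new in differing_fragments(fetch(url), fetch(url)):
#     print(repr(old), "->", repr(new))
```

Run against two loads of the same page, this should list only the small
pieces that changed between requests. What those pieces turn out to be
depends on what the server sends.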

> what must I do to ensure that the whole page is read ?

Nothing. Calling .read() with no size argument already reads until
EOF, so you get the whole response.
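
To illustrate why no extra work is needed: on a file-like object,
read() with no size argument keeps reading until end of file, which is
the same result you would get by looping over fixed-size chunks
yourself. This is a minimal Python 3 sketch. It uses io.BytesIO as a
stand-in for the response object (an assumption, so that the example
runs without a network connection), and read_in_chunks is a
hypothetical helper.

```python
import io


def read_in_chunks(stream, chunk_size=1024):
    """Read `stream` to EOF by hand, one fixed-size chunk at a time."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty result signals end of file
            break
        chunks.append(chunk)
    return b"".join(chunks)


payload = b"x" * 5000  # stand-in for a page body

by_hand = read_in_chunks(io.BytesIO(payload))
all_at_once = io.BytesIO(payload).read()  # no size argument: read to EOF

# Both approaches should yield the complete payload.
print(by_hand == all_at_once == payload)
```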

Cheers,
Chris
--
http://blog.rebertia.com


