Problem when fetching page using urllib2.urlopen

jitu nair.jitendra at gmail.com
Tue Aug 11 01:15:31 EDT 2009


Yes Piet, you were right, this works. But it does not seem to work on Google
App Engine, since it appends its own agent info, as seen below:

'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;
rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;
(+http://code.google.com/appengine)'

Anyway, thanks. Good to know about the User-Agent field.

Jitu


On Aug 11, 12:36 am, Piet van Oostrum <p... at cs.uu.nl> wrote:
> >>>>> jitu <nair.jiten... at gmail.com> (j) wrote:
> >j> Hi,
> >j> An HTML page contains 'anchor' elements whose 'href' attribute has
> >j> a semicolon in the URL. When fetching the page using
> >j> urllib2.urlopen, all such hrefs containing semicolons are
> >j> truncated.
> >j> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt...
> >j> gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i
> >j> The page I am talking about can be fetched from
> >j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_...
>
> It's not python that causes this. It is the server that sends you the
> URLs without these parameters (that's what they are).
>
> To get them you have to tell the server that you are a respectable
> browser. E.g.
>
> import urllib2
>
> url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt...
>
> url = 'http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_...
>
> hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
>        'Accept': 'image/*'}
>
> request = urllib2.Request(url = url, headers = hdrs)
> page = urllib2.urlopen(request).read()
>
> --
> Piet van Oostrum <p... at cs.uu.nl>
> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
> Private email: p... at vanoostrum.org
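
For anyone reading this thread later, here is a minimal sketch of the two
points above: the part after the ';' is what URL parsers call "params", and
the User-Agent header can be checked on the request object before anything
is sent. It makes no network call. The example.com URL and the _ylt value
are hypothetical stand-ins, since the real URLs in this thread are truncated
by the archive, and the try/except import is an assumption added so the same
code runs on Python 3 (urllib.request) as well as Python 2 (urllib2).

```python
try:
    from urllib.request import Request  # Python 3
    from urllib.parse import urlparse
except ImportError:
    from urllib2 import Request  # Python 2, as used in this thread
    from urlparse import urlparse

# Hypothetical URL: the real ones in this thread are truncated by the archive.
url = 'http://example.com/p-travelguide-i;_ylt=HYPOTHETICAL'

# urlparse splits the text after ';' in the last path segment into "params".
parts = urlparse(url)
print(parts.path)
print(parts.params)

# Same browser-like User-Agent string as in the quoted example.
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
                      'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'}
request = Request(url, headers=hdrs)

# Request stores header names capitalized, so the key becomes 'User-agent'.
print(request.get_header('User-agent'))
```

This only shows what the client intends to send. As noted at the top of this
message, a hosting environment such as App Engine may still append its own
text to the header, so the value the server sees can differ.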



