Problem when fetching page using urllib2.urlopen
Piet van Oostrum
piet at cs.uu.nl
Mon Aug 10 15:36:55 EDT 2009
>>>>> jitu <nair.jitendra at gmail.com> (j) wrote:
>j> Hi,
>j> An HTML page contains 'anchor' elements whose 'href' attribute has
>j> a semicolon in the URL. When fetching the page using
>j> urllib2.urlopen, all such hrefs containing semicolons are
>j> truncated.
>j> For example, the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
>j> gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i
>j> The page I am talking about can be fetched from
>j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--
It's not Python that causes this. The server itself sends you the
URLs without these parameters (the part after the semicolon is a URL
parameter).
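To see that the part after the semicolon is treated as a parameter rather than as part of the path, you can let the standard library's URL parser split it off. A minimal sketch, using one of the URLs from the question; the try/except import is only there so that it runs on both Python 2 and Python 3:

```python
try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

url = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
       ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')
parts = urlparse(url)
print(parts.path)    # the path, up to the semicolon
print(parts.params)  # the parameters that follow the semicolon
```

The params field holds exactly the piece that is missing from the truncated hrefs.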
To get them, you have to tell the server that you are a respectable
browser, e.g. by sending a browser-like User-Agent header:
import urllib2

# Both URLs from the question show the problem; the second assignment wins.
url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'
url = 'http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--'
# Headers that make the request look like it comes from a browser.
hdrs = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
    'Accept': 'image/*',
}
request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
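On Python 3, urllib2 was split into urllib.request and urllib.error, so the same idea might look roughly like the sketch below. The helper name build_browser_request is made up for illustration, and the User-Agent string is simply the one used above. Whether a given server responds differently to it is something to check against the actual site. The sketch only builds the request and inspects it; the fetch itself is left commented out.

```python
try:
    from urllib2 import Request, urlopen          # Python 2
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

USER_AGENT = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')


def build_browser_request(url, user_agent=USER_AGENT):
    """Hypothetical helper: return a Request with a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


request = build_browser_request(
    'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
    ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

# Request stores header names capitalized, hence 'User-agent' in the lookup.
print(request.get_header('User-agent'))
print(request.get_full_url())

# page = urlopen(request).read()   # uncomment to actually fetch the page
```

Note that the semicolon and the parameter after it stay in the request URL; nothing on the Python side strips them.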
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org