Re: get wikipedia source failed (urllib2)
Michael J. Fromberger
Michael.J.Fromberger at Clothing.Dartmouth.EDU
Tue Aug 7 10:18:05 EDT 2007
In article <1186476847.728759.166610 at o61g2000hsh.googlegroups.com>,
 shahargs at gmail.com wrote:
> Hi,
> I'm trying to get wikipedia page source with urllib2:
> usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
> data = usock.read()
> usock.close()
> return data
> I got an exception because of an HTTP 403 error. Why? With my browser
> I can access it without any problem.
>
> Thanks,
> Shahar.
It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:
import urllib

class CustomURLopener(urllib.FancyURLopener):
    # Report a browser-like agent string instead of urllib's default.
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()
I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates the request came from
somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
Cheers,
-M
--
Michael J. Fromberger | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA