[Tutor] fetching wikipedia articles
amit sethi
amit.pureenergy at gmail.com
Fri Jan 23 09:09:19 CET 2009
Well, that is interesting, but why should that happen when I am using a
different user agent? I tried

status = rp.can_fetch('Mozilla/5.0', "http://en.wikipedia.org/wiki/Sachin_Tendulkar")

but even that returns False. Is there something wrong with the syntax, or
is there a catch that I don't understand?
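
One likely catch, sketched here under the assumption of Python 2.x: the
agent string passed to can_fetch() only selects which robots.txt rules
apply; rp.read() still downloads robots.txt itself using urllib's default
User-Agent (something like "Python-urllib/x.y"). If that download is
refused with a 401/403, robotparser treats the whole site as disallowed,
so can_fetch() returns False no matter which agent string you pass. A
workaround is to fetch robots.txt with an explicit User-Agent header
(the "Mozilla/5.0" value below is only an illustration) and feed the
result to the parser:

    import urllib2
    import robotparser

    # Fetch robots.txt with an explicit User-Agent header instead of
    # letting rp.read() use urllib's default "Python-urllib/x.y" agent.
    req = urllib2.Request("http://en.wikipedia.org/robots.txt",
                          headers={"User-Agent": "Mozilla/5.0"})
    lines = urllib2.urlopen(req).read().splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(lines)  # parse the fetched lines instead of calling rp.read()

    print rp.can_fetch("*", "http://en.wikipedia.org/wiki/Sachin_Tendulkar")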
On Thu, Jan 22, 2009 at 10:45 PM, Andre Engels <andreengels at gmail.com> wrote:
> On Thu, Jan 22, 2009 at 6:08 PM, amit sethi <amit.pureenergy at gmail.com>
> wrote:
> > Hi, I need help with how I can fetch a Wikipedia article. I tried
> > changing my user agent but it did not work. As far as my knowledge of
> > robots.txt goes, looking at en.wikipedia.org/robots.txt it does not
> > seem it should block a user agent (*, which is what I would normally
> > use) from accessing a simple article like, say,
> > "http://en.wikipedia.org/wiki/Sachin_Tendulkar", but robotparser still
> > returns False:
> > status=rp.can_fetch("*", "http://en.wikipedia.org/wiki/Sachin_Tendulkar")
> > where rp is a robotparser object. Why is that?
>
> Yes, Wikipedia is blocking the Python default user agent. This was
> done to block the main internal bot in its early days (it was
> misbehaving by fetching each page twice); by the time the bot was
> allowed again, it had already switched to its own user agent string,
> and apparently unblocking the default string was not deemed
> necessary...
>
>
>
>
> --
> André Engels, andreengels at gmail.com
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>
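
Following up on the explanation above: since the block is keyed on the
request's User-Agent header, the article itself can be fetched by sending
an explicit header on the request. A minimal sketch, assuming Python 2's
urllib2; the "Mozilla/5.0" value is only a placeholder for whatever agent
string you choose to identify your script with:

    import urllib2

    # Send an explicit User-Agent so the request is not identified as
    # urllib's default "Python-urllib/x.y" agent.
    req = urllib2.Request("http://en.wikipedia.org/wiki/Sachin_Tendulkar",
                          headers={"User-Agent": "Mozilla/5.0"})
    html = urllib2.urlopen(req).read()
    print len(html)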
--
A-M-I-T S|S