urllib2 - 403 that _should_ not occur.
prologic at shortcircuit.net.au
Mon Jan 12 04:50:28 CET 2009
On Mon, Jan 12, 2009 at 1:25 PM, Philip Semanchuk <philip at semanchuk.com> wrote:
> Oooops, I guess it is my brain that's not working, then! Sorry about that.
> I tried your sample and got the 403. This works for me:
> Some sites ban UAs that look like bots. I know there's a Java-based bot with
> a distinct UA that was really badly-behaved when visiting my server. Ignored
> robots.txt, fetched pages as quickly as it could etc. That was worthy of
> banning. FWIW, when I try the code above with a UA of "funny fish" it still
> works OK, so it looks like the groups.google.com server has it out for UAs
> with Python in them, not just unknown ones.
> I'm sure that if you changed wget's UA string to something Pythonic it would
> start to fail too.
The problem I'm solving: I need a tool that periodically
checks a set of configured RSS feeds for updates. I was
going to use urllib2 to fetch the data and pass it off to
feedparser.parse(...). Because of the UA problem though
(which can be overcome), I decided to try a different
approach and use feedparser entirely (it uses urllib
internally). The problem is that feedparser doesn't store
the raw HTTP response content anywhere - only the parsed
results - *sigh*.
My solution now is to parse the feeds, store the data I
require in a simple object, pickle that to a set of cache
files, and compare hashes of the content to detect changes.
More information about the Python-list mailing list