[Moin-user] 403 response on lots of pages from crawlers?

Skip Montanaro skip at pobox.com
Sat Mar 13 15:19:01 EST 2004


Kevin Altis noticed a bunch of 403s in the www.python.org access log
summary.  A little investigation showed that almost all of them were from
Googlebot or Yahoo! Slurp.  Looking at the URLs being requested, I can't see
an obvious pattern other than that almost all of them have an action or
value parameter.  Unfortunately, robots.txt only lets you exclude paths by
prefix.  It's not a general-purpose pattern scheme, so I can't use something
like

    Disallow: /cgi-bin/moinmoin/*?action=
    Disallow: /cgi-bin/moinmoin/*?value=

to keep crawlers from traversing those sorts of URLs.  ("*" is only allowed
in the User-Agent line.)
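
For what it's worth, Python's standard robotparser module does the same
literal prefix matching, so rules like the above simply never match
anything.  A quick sketch to illustrate (the URL is made up):

    import robotparser

    # The rules I'd like to be able to write, fed to a by-the-book parser.
    rules = [
        "User-agent: *",
        "Disallow: /cgi-bin/moinmoin/*?action=",
        "Disallow: /cgi-bin/moinmoin/*?value=",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # A typical action URL of the sort showing up in the access log.
    url = "http://www.python.org/cgi-bin/moinmoin/FrontPage?action=raw"

    # The Disallow values are matched as literal path prefixes, so the
    # "*" never acts as a wildcard and the URL is reported as fetchable.
    print rp.can_fetch("Googlebot", url)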

I'd like to keep crawlers from requesting pages with a parameter but still
let them otherwise wander around in the Wiki.  I'm open to suggestions.
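
Just to make the intent precise, here's the rule I'd like crawlers to
follow, stated as code (a rough sketch only -- the function is hypothetical,
and it's exactly what robots.txt can't express):

    import urlparse

    def should_be_crawled(url):
        # Anything under the wiki is fair game unless the query string
        # carries a parameter (action=..., value=..., and friends).
        scheme, host, path, params, query, frag = urlparse.urlparse(url)
        if path.startswith("/cgi-bin/moinmoin/") and query:
            return False
        return True

    print should_be_crawled("http://www.python.org/cgi-bin/moinmoin/FrontPage")
    print should_be_crawled("http://www.python.org/cgi-bin/moinmoin/FrontPage?action=raw")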

Thanks,

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://www.spambayes.org/
skip at pobox.com



