[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

karl report at bugs.python.org
Mon Jun 23 02:00:28 CEST 2014


karl added the comment:

→ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>> 


Let's check the server logs:

127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"

Robotparser by default was using in 2.* the Python-urllib/1.17 user agent which is traditionally blocked by many sysadmins. A solution has been already proposed above:

This is the proposed test for 3.4

import urllib.robotparser
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)
rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
rp.read()


The issue is not anymore about changing the lib, but just about documenting on how to change the RobotFileParser default UA. We can change the title of this issue if it's confusing. Or close it and open a new one for documenting what makes it easier :)

Currently robotparser.py imports urllib user agent.
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364

It's a common failure we encounter when using urllib in general, including robotparser.


As for wikipedia, they fixed their server side user agent sniffing, and do not filter anymore python-urllib. 

GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894


Many other sites still do. :)

----------
versions: +Python 3.4 -Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue15851>
_______________________________________


More information about the Python-bugs-list mailing list