I'm a novice Python programmer, and I have made two changes to robotparser.py. I apologize if this is the wrong list to post this mail to.

1. Some sites (notably Wikipedia) return 403 when the default User-Agent is used, so I have changed the code to use urllib2 and added a set_user_agent method. This part is simple.
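
Roughly, the idea looks like this. This is only a sketch of the approach, not my exact diff: I changed robotparser.py in place, but for illustration I show it here as a subclass, and the class name and the default agent string are made up.

    import urllib2
    import robotparser

    class UserAgentRobotFileParser(robotparser.RobotFileParser):
        """RobotFileParser that fetches robots.txt with a custom User-Agent."""

        def __init__(self, url='', user_agent='FriendlyBot/0.1'):
            robotparser.RobotFileParser.__init__(self, url)
            self.user_agent = user_agent

        def set_user_agent(self, user_agent):
            # remember the User-Agent string to send when reading robots.txt
            self.user_agent = user_agent

        def read(self):
            # fetch robots.txt with urllib2 so the request headers can be set
            request = urllib2.Request(self.url,
                                      headers={'User-Agent': self.user_agent})
            try:
                f = urllib2.urlopen(request)
            except urllib2.HTTPError as err:
                if err.code in (401, 403):
                    self.disallow_all = True
                elif err.code >= 400:
                    self.allow_all = True
                return
            self.parse(f.read().splitlines())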

2. This problem is slightly more complicated. Please check the robots.txt file from MathWorld:

   http://mathworld.wolfram.com/robots.txt

It contains two "User-agent: *" lines, i.e. two separate records for the wildcard agent.

From http://www.robotstxt.org/norobots-rfc.txt:
These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfied either
condition, or no records are present at all, access is unlimited.

But it seems that our robotparser is obeying the second one. The problem occurs because robotparser assumes that no robots.txt will contain two "*" user-agent records. Strictly speaking a file should not have two such records, but in reality many sites do.
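
For illustration, a made-up robots.txt shaped like this (not the actual MathWorld file) shows the situation:

    User-agent: *
    Disallow: /private/

    User-agent: *
    Disallow: /tmp/

Per the RFC quoted above, only the first "*" record should apply to a robot that matches no named record.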

So I have changed the code as follows:

    def _add_entry(self, entry):
        if "*" in entry.useragents:
            # the default entry is considered last
            if self.default_entry is None:
                self.default_entry = entry
        else:
            self.entries.append(entry)

And at the end of the parse(self, lines) method:

        if state == 2:
            # self.entries.append(entry)
            self._add_entry(entry)

The _add_entry method and the _add_entry(entry) call are my additions; the commented-out append line shows what was there before.
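
With the change applied, a quick check along these lines (just a sketch; the robot name and URL are invented) should show the first wildcard record winning:

    import robotparser

    lines = [
        "User-agent: *",
        "Disallow: /private/",
        "",
        "User-agent: *",
        "Disallow: /tmp/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(lines)

    # only the first "*" record should apply, so /private/ is blocked
    # while /tmp/ is still allowed
    print rp.can_fetch("SomeBot", "http://example.com/private/page.html")  # expect False
    print rp.can_fetch("SomeBot", "http://example.com/tmp/file.html")      # expect True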

As I'm a very novice Python programmer, I would really like some expert comments on this matter.

I apologize again if I'm wasting your time.

Thanks in advance,
Taskinoor Hasan Sajid