need some debug-infos on a simple regex

MRAB python at mrabarnett.plus.com
Fri Nov 12 21:54:14 EST 2010


On 13/11/2010 01:21, Martin Kaspar wrote:
> hello dear list!
>
> I'm very new to programming and am teaching myself. I'm having a
> problem with a little project.
>
> I'm trying to perform a fetch process, but every time I try it I run
> into errors.
> I have read the Python documentation for more than ten hours now! And I
> have several books here,
> but they do not help at the moment. This code runs like a charm!!
>
>
> import urllib
> import urlparse
> import re
>
> url = "http://search.cpan.org/author/?W"
> html = urllib.urlopen(url).read()
> for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
>      alk = urlparse.urljoin(url, lk)
>
>      data = { 'url':alk, 'name':name, 'cname':capname }
>
>      phtml = urllib.urlopen(alk).read()
>      memail = re.search('<a href="mailto:(.*?)">', phtml)
>      if memail:
>          data['email'] = memail.group(1)
>
>      print data
>
> Note that the code above runs very well. All is nice. Now I
> want to apply it to a new target. I can learn a lot with this. Let us
> say this Swiss site, educa.ch:
>
> The aim: I want to adapt it to a new target to learn more about
> regexes and to do some homework (I work as a teacher and am
> collecting some data about colleagues). How should we fetch the pages?
> That is the problem. I want to learn while applying the
> code. What is necessary to apply the example to the target?
>
> the target:  http://www.educa.ch/dyn/79362.asp?action=search
>
> But the code (see below) does not run. I tried several things to
> debug it. Can you help me?
> BTW, should I fetch the pages and load them into an array, or should I
> loop over the
>
> http://www.educa.ch/dyn/79376.asp?id=2635
> http://www.educa.ch/dyn/79376.asp?id=3493
> and so on...
>
> See the code that does not work:
>
> import urllib
> import urlparse
> import re
>
> url = "http://www.educa.ch/dyn/"
> html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
> for capname, lk in re.findall('<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
>      alk = urlparse.urljoin(url, lk)
>
>      data = { 'url':alk, 'cname':capname }
>
>      phtml = urllib.urlopen(alk).read()
>      memail = re.search('<a href="mailto.*?)">', phtml)
>      if memail:
>          data['email'] = memail.group(1)
>
>      print data
>
> Looking forward to getting some starting points...
>
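Don't just say "does not run" or "does not work". That's not very
helpful. It's like saying "My car doesn't work. How should I fix it?".
:-) Say what actually happened: the exact error message or traceback, or
how the output differs from what you expected.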

When writing regexes, it's recommended that you use raw string literals
(r'...'), so that backslash sequences such as '\d' reach the regex
engine unchanged.
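
To illustrate why raw string literals matter, here is a small sketch. The
sample text and the variable names are made up for illustration and are
not taken from your script. The point is only that '\b' in a normal
literal is a backspace character, whereas in a raw literal it stays as
the regex word-boundary escape.

```python
import re

# In a normal string literal Python processes backslash escapes first,
# so '\b' becomes a single backspace character before re ever sees it.
plain = '\bcat\b'
# In a raw literal the backslash and the 'b' are kept as two characters,
# which re then interprets as a word boundary.
raw = r'\bcat\b'

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: each \b is two characters

text = 'a cat sat'  # made-up sample text
plain_match = re.search(plain, text)  # looks for literal backspaces
raw_match = re.search(raw, text)      # looks for the whole word 'cat'

print(plain_match)         # None
print(raw_match.group(0))  # cat
```

Escapes that Python does not recognise, such as '\d', happen to survive in
a normal literal, which is why your patterns get as far as they do, but
relying on that is fragile.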

Your first regex contains 'asp?', which says that the 'p' is optional.
I think you meant 'asp\?'. Also, '.' will match any character except
'\n'. If you want to match an actual '.', use '\.'.
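
A quick check of the '?' and '.' points, as a sketch. The target string
is one of the link URLs from your post with the host part stripped off;
the second test string is invented purely to show what the unescaped
pattern accepts instead.

```python
import re

target = "79376.asp?id=2635"  # from the post, host part removed

loose = r"\d+.asp?id=\d+"     # as posted: 'p' is optional, '.' matches anything
strict = r"\d+\.asp\?id=\d+"  # '.' and '?' escaped so they match literally

loose_hit = re.search(loose, target)
strict_hit = re.search(strict, target)

print(loose_hit)            # None: nothing in the pattern matches the literal '?'
print(strict_hit.group(0))  # 79376.asp?id=2635

# The unescaped pattern does match strings you almost certainly don't want.
odd = "79376Xasid=1"        # invented string, not from the site
odd_hit = re.search(loose, odd)
print(odd_hit.group(0))     # 79376Xasid=1
```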

Your second regex contains a closing parenthesis ')' but no opening
parenthesis '('.
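
The unbalanced parenthesis can be seen without fetching anything, because
the pattern fails to compile. In this sketch the corrected pattern is only
my assumption of what you intended, based on the 'mailto:(.*?)' pattern in
your working script, and the sample HTML is made up.

```python
import re

broken = r'<a href="mailto.*?)">'   # as posted: ')' without a matching '('
fixed = r'<a href="mailto:(.*?)">'  # assumed intent, as in the working script

# Compiling the posted pattern raises re.error instead of returning a pattern.
try:
    re.compile(broken)
    compile_error = None
except re.error as exc:
    compile_error = exc

print(compile_error is not None)  # True

# Made-up HTML fragment, only to show the capture group working.
sample = '<a href="mailto:someone@example.org">write to me</a>'
found = re.search(fixed, sample)
email = found.group(1) if found else None
print(email)  # someone@example.org
```

Running the script as posted should therefore stop with a traceback ending
in that error, which is the kind of detail worth including when asking for
help.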


