need some debug-infos on a simple regex
Martin Kaspar
martin.kaspar at campus-24.com
Fri Nov 12 20:21:04 EST 2010
Hello dear list!
I'm very new to programming and am teaching myself. I'm having a
problem with a little project.
I'm trying to perform a fetch process, but every time I try it I run
into errors.
I have read the Python documentation for more than ten hours now, and I
have several books here, but they are not helping at the moment.
This code runs like a charm:
import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
    alk = urlparse.urljoin(url, lk)
    data = { 'url':alk, 'name':name, 'cname':capname }
    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)
    print data
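
To see what the three groups in that findall pattern capture, the same pattern can be tried on a small string without touching the network. This is only a sketch: the sample markup and the helper name parse_authors are made up for illustration, and it is written for Python 3, where urljoin lives in urllib.parse rather than urlparse.

```python
import re
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

# Hypothetical markup shaped like what the pattern expects (not copied from the real site).
SAMPLE_HTML = (
    '<a href="/~alice/"><b>ALICE</b></a><br/><small>Alice Example</small>'
    '<a href="/~bob/"><b>BOB</b></a><br/><small>Bob Example</small>'
)

# Same pattern as in the script: link, capitalised name, full name.
AUTHOR_RE = re.compile(r'<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>')


def parse_authors(html, base_url):
    """Return one dict per author link found in html (hypothetical helper)."""
    records = []
    for link, capname, name in AUTHOR_RE.findall(html):
        # urljoin turns the site-relative link into an absolute URL.
        records.append({'url': urljoin(base_url, link), 'name': name, 'cname': capname})
    return records


authors = parse_authors(SAMPLE_HTML, "http://search.cpan.org/author/?W")
for record in authors:
    print(record)
```

Because the pattern has three groups, findall returns a list of 3-tuples, which is why the loop can unpack three names at once.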
Note that the code above runs very well. Now I want to apply it to a
new target; I can learn a lot from this. Let us say this Swiss site,
educa.ch.
The aim: I want to adapt it to a new target to learn more about
regex and to do some homework (I work as a teacher and am
collecting some data about colleagues). How should we fetch the pages?
That is the problem. I want to learn while applying the
code. What is necessary to apply the example to the target?
the target: http://www.educa.ch/dyn/79362.asp?action=search
But the code (see below) does not run. I tried several things to
debug it. Can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over URLs like
http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on...
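
One way to picture the "loop" option is a generator that hands over one page at a time, so nothing has to be held in an array. This is a rough sketch for Python 3, not a tested scraper: the names detail_urls, fetch_each and fake_fetch are invented here, and fake_fetch only stands in for a real download so the example runs offline.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.educa.ch/dyn/"


def detail_urls(ids, base_url=BASE_URL):
    """Build detail-page URLs of the form 79376.asp?id=N (hypothetical helper)."""
    for page_id in ids:
        yield urljoin(base_url, "79376.asp?id=%d" % page_id)


def fetch_each(urls, fetch):
    """Yield (url, html) pairs one page at a time instead of storing every page."""
    for page_url in urls:
        yield page_url, fetch(page_url)


def fake_fetch(page_url):
    # Hypothetical stand-in for a real download, so the sketch needs no network.
    return "<html>page for %s</html>" % page_url


seen = []
for page_url, html in fetch_each(detail_urls([2635, 3493]), fake_fetch):
    # Each page could be parsed here and then discarded.
    seen.append(page_url)
    print(page_url, len(html))
```

Swapping fake_fetch for a function that really downloads the page would keep the loop unchanged, and only one page is in memory at any moment.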
Here is the code that does not work:
import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
for capname, lk in re.findall('<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
    alk = urlparse.urljoin(url, lk)
    data = { 'url':alk, 'cname':capname }
    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)
    print data
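
Two details in the patterns of this second script can be checked without the network. The sketch below (Python 3) is not a confirmed diagnosis, since it assumes nothing about the real markup on educa.ch. It only tests the patterns as posted: the mailto pattern has a ")" with no matching "(", and the link pattern leaves "." and "?" unescaped, so "p?" means "an optional p" rather than a literal question mark. The sample strings are made up.

```python
import re

# 1) The mailto pattern as posted: ")" without a matching "(".
try:
    re.compile('<a href="mailto.*?)">')
    mailto_error = None
except re.error as exc:
    mailto_error = exc
print("mailto pattern compiles:", mailto_error is None)

# 2) The link part of the findall pattern, tried on a URL from the post.
link = "79376.asp?id=2635"
unescaped = re.search(r'\d+.asp?id=\d+', link)    # "?" makes the "p" optional
escaped = re.search(r'\d+\.asp\?id=\d+', link)    # "\." and "\?" are literal
print("unescaped pattern matches:", unescaped is not None)
print("escaped pattern matches:", escaped is not None)

# 3) A balanced mailto pattern (same as in the working script) on a made-up snippet.
MAILTO_RE = re.compile(r'<a href="mailto:(.*?)">')
sample = '<a href="mailto:someone@example.org">mail</a>'
email = MAILTO_RE.search(sample).group(1)
print(email)
```

If the unescaped link pattern never matches, findall returns an empty list and the loop body never runs, which would explain a script that prints nothing at all.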
I look forward to getting some starting points.
Thanks, matze