webspider, regexp not working, why?
notnorwegian at yahoo.se
Fri May 23 12:42:57 EDT 2008
import re

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?")
Why isn't this url regexp catching something like:
<link rel="alternate" type="application/rss+xml" title="Python Screencasts" href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />
import urllib

site = urllib.urlopen("http://www.python.org")
for row in site:
    obj = url.search(row)   # look for a URL in this row
    if obj is not None:
        print "url: ", obj.group()
I know the regexp works, because it catches www.hello.com in a text file,
and I can catch the email addresses on websites with another regexp.
search and match yield the same results.
But when you put something like href= in front of the URL, it doesn't work.
I see now that it has to match at the beginning of the row or something,
because:
hi www.google.com
doesn't match, but
www.google.com hi
matches.
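The observation above is consistent with the leading ^ in the pattern: ^ only matches at the start of the string (re.MULTILINE is not set), so search() cannot start a match anywhere else and ends up behaving like match(). A minimal sketch, assuming that anchor is the cause. The short host-name pattern and the first_hit helper are simplified stand-ins made up for illustration, not the original regexp:

```python
import re

# Hypothetical, much shorter stand-in for the URL pattern: it only
# recognises a dotted host name such as www.google.com.
HOST = r"[a-zA-Z][\w\-]*(\.[\w\-]+)+"

anchored = re.compile("^" + HOST)   # same idea as the original: starts with ^
unanchored = re.compile(HOST)       # identical, but without the ^


def first_hit(pattern, text):
    """Return the first matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None


print(first_hit(anchored, "www.google.com hi"))    # www.google.com
print(first_hit(anchored, "hi www.google.com"))    # None
print(first_hit(unanchored, "hi www.google.com"))  # www.google.com
```

If the anchor is indeed the problem, dropping the ^ from the original pattern should let search() find URLs that sit behind href= as well.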
I thought a regexp would search a row/file and report an occurrence
wherever it finds one, so a regexp of "lo" would match in lopez.
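For comparison, this is roughly the behaviour described in that last sentence, as the re module provides it when a pattern has no anchor: search() scans the string for the first occurrence, match() only tries the very start, and findall() reports every occurrence. The sample strings are made up for illustration:

```python
import re

lo = re.compile("lo")

# search() scans forward and reports the first occurrence, wherever it is.
print(lo.search("lopez").span())        # (0, 2)
print(lo.search("hello lopez").span())  # (3, 5)

# match() only tries the very start of the string.
print(lo.match("hello lopez"))          # None

# findall() collects every non-overlapping occurrence.
print(lo.findall("hello lopez"))        # ['lo', 'lo']
```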