Help on regular expression match

Johnny Lee johnnyandfiona at
Fri Sep 23 09:09:17 CEST 2005

Fredrik Lundh wrote:
> ".*" gives the longest possible match (you can think of it as searching back-
> wards from the right end).  if you want to search for "everything until a given
> character", searching for "[^x]*x" is often a better choice than ".*x".
> in this case, I suggest using something like
>     print re.findall("href=\"([^\"]+)\"", text)
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>     from HTMLParser import HTMLParser
>     class MyHTMLParser(HTMLParser):
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
> see:
> </F>

Thanks for your help.
I found another solution: simply adding a '?' after ".*", which makes
it search for the shortest possible match for the regular expression.
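
To make the difference concrete, here is a small sketch comparing the greedy ".*", the non-greedy ".*?", and the "[^\"]+" form suggested above. The sample text and the example hosts are made up for illustration, and the snippet assumes Python 3 (print as a function) rather than the Python 2 used elsewhere in this thread.

```python
import re

# Made-up sample with two links on one line (not from the original page).
text = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: ".*" runs to the last double quote on the line.
greedy = re.findall(r'href="(.*)"', text)

# Non-greedy: ".*?" stops at the first double quote it can.
lazy = re.findall(r'href="(.*?)"', text)

# Negated character class: never crosses a double quote at all.
negated = re.findall(r'href="([^"]+)"', text)

print(greedy)
print(lazy)
print(negated)
```

On this input the greedy pattern returns a single string spanning both links, while the other two return the two URLs separately.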
As for the HTMLParser, there is another problem (taking my code as an example):

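import urllib
import htmllib      # missing from the snippet; needed for htmllib.HTMLParser
import formatter

parser = htmllib.HTMLParser(formatter.NullFormatter())
# assumed step: fetch baseUrl and feed the page so that anchorlist gets filled
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
for url in parser.anchorlist:
    if url.startswith("http://"):   # keep absolute http links only
        print url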

When baseUrl="", an HTMLParseError is raised because of the line
"<! Copyright IBM Corporation, 2001, 2002 !>". I found that this line
is inside <script> tags; maybe that is the cause?
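
For comparison, here is a minimal sketch of the same link extraction using Python 3's html.parser module, which, as far as I know, no longer raises HTMLParseError in current versions and passes <script> contents through as plain data. It does not settle whether the <script> placement is what trips the Python 2 parser. The class name LinkCollector, the sample markup and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute http:// links from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # value can be None for attributes written without a value
                if key == "href" and value and value.startswith("http://"):
                    self.links.append(value)


# Made-up page containing the problematic line inside <script> tags.
sample = (
    '<html><head><script>\n'
    '<! Copyright IBM Corporation, 2001, 2002 !>\n'
    '</script></head>\n'
    '<body><a href="http://www.example.com/">absolute</a>\n'
    '<a href="/relative">relative</a></body></html>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Only the absolute link is collected; the relative one is skipped by the startswith check, mirroring the url[0:7] test in my code.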
