Help on regular expression match

Fredrik Lundh fredrik at
Fri Sep 23 08:35:30 CEST 2005

Johnny Lee wrote:

>   I've met a problem in match a regular expression in python. Hope
> any of you could help me. Here are the details:
>   I have many tags like this:
>      xxx<a href="" xxx>xxx
>      xxx<a href="wap://" xxx>xxx
>      xxx<a href="" xxx>xxx
>      .....
>   And I want to find all the "" out, so I do it
> like this:
>      httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
>      result = httpPat.findall(data)
>   I use this to observe my output:
>      for i in result:
>         print i[2]
>   Surprisingly I will get some output like this:
>   In fact it's filtered from this kind of source:
>      <a href="">xxx</a>xxx"
>   But some result are right, I wonder how can I get the all the
> answers clean like ""? Thanks for your help.

".*" is greedy: it gives the longest possible match (you can think of it as
searching backwards from the right end), so here it runs past the first
closing quote to the last one on the line.  if you want to search for
"everything until a given character", searching for "[^x]*x" is often a
better choice than ".*x".
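
to see the difference side by side, here is a minimal sketch. it assumes
Python 3, and the sample line and its example.com URLs are made-up
placeholders, not data from the original post:

```python
import re

# Hypothetical sample line; the URLs are placeholders for illustration only.
data = ('<a href="http://example.com/a">xxx</a>xxx" '
        '<a href="wap://example.com/b">yyy</a>')

# Greedy: ".*" runs to the end of the line, then backs up to the LAST quote.
greedy = re.findall(r'href="(http://.*)"', data)

# Negated character class: "[^"]*" cannot cross a quote, so the match
# stops at the first closing quote after the URL.
bounded = re.findall(r'href="(http://[^"]*)"', data)

print(greedy)
print(bounded)
```

the greedy pattern returns one long string that swallows the markup between
the first and the last quote, while the bounded one returns only the URL.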

in this case, I suggest using something like

    print re.findall("href=\"([^\"]+)\"", text)

or, if you're going to parse HTML pages from many different sources, a
real parser:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    if key == "href":
                        print value

    p = MyHTMLParser()
    p.feed(text)   # text is the HTML source, as in the findall example
    p.close()
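
the code above is written for Python 2. on Python 3 the module is named
html.parser, and the same idea might look roughly like the sketch below.
the class name LinkCollector, the links attribute and the sample text are
assumptions made for illustration, not part of the original suggestion:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample input; the URLs are placeholders.
text = ('<p>xxx<a href="http://example.com/a" class="x">xxx</a> '
        '<a href="wap://example.com/b">yyy</a></p>')

p = LinkCollector()
p.feed(text)
p.close()

# Keep only the http:// links, which is what the question asked for.
http_links = [url for url in p.links if url.startswith("http://")]
print(http_links)
```

collecting the values in a list rather than printing them inside the
handler makes it easy to filter by scheme afterwards.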
