regular expressions: grabbing variables from multiple matches

Heather Lynn White hwhite at chiliad.com
Thu Jan 4 16:00:43 EST 2001


Fredrik,

Thank you for the suggestion.  I tried out your module in the
context of my scripts.  However, I find that sgmllib's parsing
takes much more processing time than I can afford.  I think it just
does a bit too much compared to what I need.  I think I will stick to
my own methods, which are more primitive, but much faster.  But thanks.

Heather
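
For reference, the regex approach discussed in this thread (searching
forward from the end of the previous match via the optional "pos"
argument) might look roughly like the sketch below.  The pattern is a
hypothetical stand-in, since the original expression is not shown in
this message; it assumes double-quoted attributes with name before
content, and it is written for current Python 3 rather than the
interpreter of the time.

```python
import re

# Hypothetical pattern (not the poster's own): matches
# <meta name="..." content="..."> with the attributes in that order only.
META_RE = re.compile(
    r'<meta\s+name\s*=\s*"([^"]*)"\s+content\s*=\s*"([^"]*)"',
    re.IGNORECASE,
)

def find_meta(text):
    """Collect (name, content) pairs by searching forward from the last match."""
    results = []
    pos = 0
    while True:
        # The second argument to search() is the index to start scanning from.
        m = META_RE.search(text, pos)
        if m is None:
            break
        results.append(m.groups())
        pos = m.end()  # resume right after this match
    return results

sample = ('<head><META NAME="keywords" CONTENT="python, regex">'
          '<meta name="author" content="someone"></head>')
pairs = find_meta(sample)
print(pairs)
```

In current Python the same loop can be written as
[m.groups() for m in META_RE.finditer(text)]; the explicit "pos" version
is shown only because it is the technique referred to in the thread.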

On Thu, 4 Jan 2001, Fredrik Lundh wrote:

> Heather Lynn White wrote:
> > Suppose I have a regular expression to grab all variations on a meta tag,
> > and I will want to extract from any matches the name and content values
> > for this tag.
> >
> > I use the following re
> 
> alex has already explained how to use the optional "pos"
> argument to search forward from the last match.
> 
> but supposing you really are out to extract meta tags from an
> HTML document, it might be a better idea to use the HTML/SGML
> parser in sgmllib:
> 
> # extract meta tags from an HTML document
> # (based on sgmllib-example-1 in the effbot guide)
> 
> import sgmllib
> 
> class ExtractMeta(sgmllib.SGMLParser):
> 
>     def __init__(self, verbose=0):
>         sgmllib.SGMLParser.__init__(self, verbose)
>         self.meta = []
> 
>     def do_meta(self, attrs):
>         name = content = None
>         for k, v in attrs:
>             if k == "name":
>                 name = v
>             if k == "content":
>                 content = v
>         if name and content:
>             self.meta.append((name, content))
> 
>     def end_title(self):
>         # ignore meta tags after </title>.  you
>         # can comment out this method if you
>         # want to parse the entire file
>         raise EOFError
> 
> def getmeta(file):
>     # extract meta tags from an HTML/SGML stream
>     p = ExtractMeta()
>     try:
>         p.feed(file.read())
>         p.close()
>     except EOFError:
>         pass
>     return p.meta
> 
> #
> # try it out
> 
> import urllib
> print getmeta(urllib.urlopen("http://www.python.org"))
> 
> Hope this helps!
> 
> Cheers /F
> 
> <!-- (the eff-bot guide to) the standard python library:
> http://www.pythonware.com/people/fredrik/librarybook.htm
> -->
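
The quoted example targets the Python of 2001; sgmllib was removed in
Python 3.  A rough equivalent using the standard html.parser module
could look like the sketch below.  It is an adaptation, not Fredrik's
code: the StopParsing exception is a name introduced here for
illustration (the original raised EOFError), and getmeta takes a string
rather than a file object.

```python
from html.parser import HTMLParser

class StopParsing(Exception):
    """Hypothetical helper: raised to abandon parsing after </title>."""

class ExtractMeta(HTMLParser):

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase
        if tag != "meta":
            return
        d = dict(attrs)
        name, content = d.get("name"), d.get("content")
        if name and content:
            self.meta.append((name, content))

    def handle_endtag(self, tag):
        # ignore meta tags after </title>; remove this method
        # to parse the entire document
        if tag == "title":
            raise StopParsing

def getmeta(text):
    # extract (name, content) pairs from an HTML string
    p = ExtractMeta()
    try:
        p.feed(text)
        p.close()
    except StopParsing:
        pass
    return p.meta

doc = ('<html><head><meta name="a" content="1">'
       '<title>t</title><meta name="b" content="2"></head></html>')
print(getmeta(doc))
```

As in the original, parsing stops at </title>, so only meta tags that
appear before the title are collected.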