regular expressions: grabbing variables from multiple matches
Heather Lynn White
hwhite at chiliad.com
Thu Jan 4 16:00:43 EST 2001
Fredrik,
Thank you for the suggestion. I tried out your module in the
context of my scripts. However, I find that sgmllib's parsing takes
much more processing time than I can afford. I think it just does a bit
too much compared to what I need. I think I will stick to my own methods,
which are more primitive, but much faster. But thanks.
Heather
On Thu, 4 Jan 2001, Fredrik Lundh wrote:
> Heather Lynn White wrote:
> > Suppose I have a regular expression to grab all variations on a meta tag,
> > and I will want to extract from any matches the name and content values
> > for this tag.
> >
> > I use the following re
>
> alex has already explained how to use the optional "pos"
> argument to search forward from the last match.
>
> but supposing you really are out to extract meta tags from an
> HTML document, it might be a better idea to use the HTML/SGML
> parser in sgmllib:
>
> # extract meta tags from a HTML document
> # (based on sgmllib-example-1 in the effbot guide)
>
> import sgmllib
>
> class ExtractMeta(sgmllib.SGMLParser):
>
>     def __init__(self, verbose=0):
>         sgmllib.SGMLParser.__init__(self, verbose)
>         self.meta = []
>
>     def do_meta(self, attrs):
>         name = content = None
>         for k, v in attrs:
>             if k == "name":
>                 name = v
>             if k == "content":
>                 content = v
>         if name and content:
>             self.meta.append((name, content))
>
>     def end_title(self):
>         # ignore meta tags after </title>.  you
>         # can comment away this method if you
>         # want to parse the entire file
>         raise EOFError
>
> def getmeta(file):
>     # extract meta tags from an HTML/SGML stream
>     p = ExtractMeta()
>     try:
>         p.feed(file.read())
>         p.close()
>     except EOFError:
>         pass
>     return p.meta
>
> #
> # try it out
>
> import urllib
> print getmeta(urllib.urlopen("http://www.python.org"))
>
> Hope this helps!
>
> Cheers /F
>
> <!-- (the eff-bot guide to) the standard python library:
> http://www.pythonware.com/people/fredrik/librarybook.htm
> -->
>
>
>
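For reference, the "pos" approach mentioned in the quoted message could
look roughly like the sketch below. The regular expression from the
original question is not shown in this thread, so META_RE here is a
hypothetical stand-in that only handles name="..." followed by
content="..." in that order; find_meta and sample are likewise made-up
names for illustration. It relies only on the optional pos argument of a
compiled pattern's search method.

```python
import re

# Hypothetical pattern (NOT the one from the original post): matches
# <meta name="..." content="..."> with the attributes in that order only.
META_RE = re.compile(
    r'<meta\s+name\s*=\s*"([^"]*)"\s+content\s*=\s*"([^"]*)"',
    re.IGNORECASE,
)

def find_meta(text):
    """Collect (name, content) pairs, searching forward from the last match."""
    found = []
    pos = 0
    while True:
        m = META_RE.search(text, pos)  # start looking at offset pos
        if m is None:
            break
        found.append(m.groups())       # (name, content) for this tag
        pos = m.end()                  # resume just past this match
    return found

# Made-up sample input for illustration.
sample = (
    '<html><head>'
    '<meta name="author" content="someone">'
    '<META NAME="keywords" CONTENT="python, regex">'
    '<title>t</title></head></html>'
)
pairs = find_meta(sample)
```

Each pass through the loop yields the captured groups of one match, so
the name and content values can be pulled out tag by tag without
running a full parser over the document.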