Something faster than sgmllib for sucking out URLs

Fredrik Lundh fredrik at
Thu Jun 13 12:24:18 CEST 2002

Martin v. Loewis wrote:
> > I'm working on a webspider to fit my sick needs. The profiler
> > tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> > solely for extracting URLs. I'm looking for a faster way of doing
> > this. Regular expressions, string searches? What's the way to go? I'm
> > not a python purist. Calling some fast C program with the html as
> > argument and getting back a list of URLs would be fine by me.
> I recommend using sgmlop, which is distributed both as part of PyXML,
> and separately by Fredrik Lundh. It is the fastest SGML/XML parser I
> know of, for use within Python.

the latest version (1.1a3) is available here:

here's a code snippet that extracts A HREF anchors
from a webpage:

import sgmlop
import urllib

class AnchorHandler:
    def __init__(self):
        self.anchors = []
    def finish_starttag(self, tag, attrs):
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)

def getanchors(page):
    handler = AnchorHandler()
    parser = sgmlop.SGMLParser()
    parser.register(handler) # route parser events to our handler
    parser.feed(urllib.urlopen(page).read())
    parser.close() # we're done
    return handler.anchors

print getanchors("")
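for comparison, the regular-expression route raised in the original
question can be sketched roughly like this (a naive approximation, not
part of sgmlop: it assumes quoted attribute values and will miss or
mangle anchors that a real parser handles correctly):

```python
import re

# naive A HREF extractor; assumes the href value is enclosed
# in single or double quotes, and ignores entities entirely
href_pattern = re.compile(
    r'<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def getanchors_re(html):
    # return every quoted href value found in A tags
    return href_pattern.findall(html)

html = '<p><a href="http://example.com/">link</a> <a name="x">no href</a></p>'
print(getanchors_re(html))  # -> ['http://example.com/']
```

this tends to be fast for well-behaved pages, but a stream parser
like sgmlop stays correct on markup the pattern above chokes on.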


More information about the Python-list mailing list