Using Beautiful Soup to entangle bookmarks.html

Claudio Grondi claudio.grondi at
Thu Sep 7 20:03:43 CEST 2006

Diez B. Roggisch wrote:
> Francach wrote:
>> Hi,
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
> Why do you use BeautifulSoup on that? It's generated content, and I 
> suppose it is well-formed, most probably even XML. So use a standard 
> parser here, or better yet something like lxml/elementtree.
> Diez

Some time ago I wrote some code on this subject for my own purposes, so
maybe it can be used as a starter (tested a bit, but consider its status
to be a kind of alpha release):

from urllib  import urlopen
from sgmllib import SGMLParser

class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
# provides only HREFs <a href="someURL"> for links to other pages, skipping
# references to:
#   - internal links on same page :  "#..."
#   - email addresses             :  "mailto:..."
# and skipping part with appended internal link info, so that e.g.:
#   - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only
# ---
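   # reset() overrides SGMLParser.reset(), which SGMLParser.__init__() calls;
   # the base class version must still run so the parser state is initialised
   def reset(self):
     SGMLParser.reset(self)
     self.A_HREFs = []
   #: def reset(self)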

   # start_a() is a handler hook: SGMLParser looks it up by name and calls
   # it each time it detects an <a ...> tag within the feed(ed) HTML document:
   def start_a(self, tagAttributes_asListOfNameValuePairs):
     for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
       if attrName=='href':
         # skip empty values, same-page anchors and e-mail links
         if attrValue and attrValue[0] != '#' and attrValue[:7] !='mailto:':
           if attrValue.find('#') >= 0:
             attrValue = attrValue[:attrValue.find('#')]
           #: if
           self.A_HREFs.append(attrValue)
         #: if
       #: if
     #: for
   #: def start_a(self, tagAttributes_asListOfNameValuePairs)
#: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)
# ---
# Execution block:
fileLikeObjFrom_urlopen = urlopen('') # set URL
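mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()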

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
   print href
#: for
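
For anyone reading this on Python 3, where sgmllib is no longer part of the
standard library, roughly the same idea can be sketched with html.parser.
This is only a sketch and not the code above: the names HrefCollector,
extract_hrefs and sample are made up for illustration, and the sample text
is a hand-written fragment in the style of bookmarks.html, not real
Firefox output.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags (hypothetical name, for illustration)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...>
        # arrives here as tag 'a' with an attribute named 'href'.
        if tag != 'a':
            return
        for name, value in attrs:
            if name != 'href' or not value:
                continue
            # skip same-page anchors and e-mail links, as in the code above
            if value.startswith('#') or value.startswith('mailto:'):
                continue
            # urldefrag() splits off the "#fragment" part of a URL
            self.hrefs.append(urldefrag(value)[0])


def extract_hrefs(html_text):
    """Return the list of collected hrefs for a string of HTML."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hand-written sample resembling a bookmarks.html fragment (an assumption).
sample = '''<DL><p>
<DT><A HREF="http://example.com/page#section">Example</A>
<DT><A HREF="#top">Top</A>
<DT><A HREF="mailto:someone@example.com">Mail</A>
<DT><A HREF="http://example.org/">Other</A>
</DL>'''

for href in extract_hrefs(sample):
    print(href)
```

For a local file, passing the text read from bookmarks.html to
extract_hrefs() should work the same way, assuming the file is opened with
the right encoding.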

Claudio Grondi
