Using Beautiful Soup to entangle bookmarks.html

Claudio Grondi claudio.grondi at freenet.de
Thu Sep 7 20:03:43 CEST 2006


Diez B. Roggisch wrote:
> Francach schrieb:
> 
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
> 
> 
> Why do you use BeautifulSoup on that? It's generated content, and I 
> suppose it is well-formed, most probably even xml. So use a standard 
> parser here, better yet something like lxml/elementtree
> 
> Diez

Some time ago I wrote some code on this subject for my own purposes, 
so maybe it can serve as a starting point (tested a bit, but consider 
it a kind of alpha release):

<code>
from urllib  import urlopen
from sgmllib import SGMLParser

class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
# Provides only the HREFs of <a href="someURL"> links to other pages, skipping
# references to:
#   - internal links on the same page :  "#..."
#   - email addresses                 :  "mailto:..."
# and dropping any appended internal link info, so that e.g.:
#   - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only
# ---
   # reset() overrides SGMLParser.reset(), which is called on construction,
   # so the list of HREFs starts out empty for each new parser object
   def reset(self):
     SGMLParser.reset(self)
     self.A_HREFs = []
   #: def reset(self)

   # start_a() is not defined in the SGMLParser class itself; SGMLParser
   # looks up a start_a() method and calls it each time it detects an
   # <a ...> tag within the feed(ed) HTML document:
   def start_a(self, tagAttributes_asListOfNameValuePairs):
     for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
       if attrName=='href':
         # the leading truth test skips an empty href="" (attrValue[0] would fail)
         if attrValue and attrValue[0] != '#' and attrValue[:7] != 'mailto:':
           if attrValue.find('#') >= 0:
             attrValue = attrValue[:attrValue.find('#')]
           #: if
           self.A_HREFs.append(attrValue)
         #: if
       #: if
     #: for
   #: def start_a(self, tagAttributes_asListOfNameValuePairs)
#: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)
# ------------------------------------------------------------------------------
# ---
# Execution block:
# set URL (the scheme is required; without one, urlopen() looks for a local file)
fileLikeObjFrom_urlopen = urlopen('http://www.google.com/')
mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
   print href
#: for
</code>
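
The sgmllib module is no longer available in Python 3, so here is a 
minimal sketch of the same idea using html.parser from the standard 
library, applied to bookmark-style text instead of a fetched page. The 
names HrefCollector and extract_hrefs and the sample text are made up 
for illustration; a real bookmarks.html may differ in detail.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags, skipping same-page and mailto: links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF="...">
        # arrives here as tag 'a' with an attribute named 'href'.
        if tag != 'a':
            return
        for name, value in attrs:
            if name != 'href' or not value:
                continue  # not an href, or an empty/missing value
            if value.startswith('#') or value.lower().startswith('mailto:'):
                continue  # same-page link or email address
            # Drop an appended "#internalLinkID", keeping only the link itself.
            self.hrefs.append(value.split('#', 1)[0])


def extract_hrefs(html_text):
    """Return the list of collected hrefs found in html_text."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Made-up sample in the style of a browser bookmarks export (an assumption,
# not copied from a real file).
sample = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="http://www.python.org/" ADD_DATE="1157652223">Python</A>
    <DT><A HREF="http://example.com/doc.html#section2">Docs</A>
    <DT><A HREF="mailto:someone@example.com">Mail</A>
    <DT><A HREF="#top">Top</A>
</DL><p>
"""

for href in extract_hrefs(sample):
    print(href)
```

To try it on a real export, pass the file's text to extract_hrefs(), 
for example the result of open('bookmarks.html', encoding='utf-8').read() 
(the encoding is an assumption and may need adjusting).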

Claudio Grondi



More information about the Python-list mailing list