[omaha] Parsing bad html

Eli Criffield elicriffield at gmail.com
Wed Dec 12 06:25:43 CET 2007


I love BeautifulSoup. It takes any random crap html you throw at it
and turns it into a beautiful pythonic object with methods for everything
you would want to do with it. I even use it to "prettify"
(yep, that's a method on a Soup object too) some of my html before I
publish.

For example, the heart of my Sipie (a player for Sirius online radio in
Python) uses BeautifulSoup:

def getAsxURL(self):
    ''' Simplified for viewing; for the real code see
    http://sipie.sourceforge.net/ '''

    # Do I have a valid stream/channel already set in self.__stream?
    self.validateStream()

    # self.token is set when you run self.auth(), as are the
    # cookies you need
    post = {'activity': 'selectStream', 'stream': self.__stream,
            'token': self.token}

    url = 'http://%s/sirius/servlet/MediaPlayer' % self.host
    data = self.__getURL(url, post).read()
    soup = BeautifulSoup(data)
    try:
        asxURL = soup.find('param', {'name': 'FileName'})['value']
    except TypeError:  # find() returned None, so indexing it raised TypeError
        # You must not have the right cookies; you probably didn't
        # log in yet
        raise AuthError

    return asxURL

Once you have the asx URL, it contains a hash key that's good for one
connection. You can feed that asxURL to mplayer, totem, VLC or Windows
Media Player and it'll play. It's all very simple thanks to Python and
BeautifulSoup's parsing.
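
For anyone who wants to try the same lookup without installing anything,
here is a rough sketch that swaps BeautifulSoup for the standard library's
html.parser (Python 3 assumed). The sample page, the ParamFinder class and
the find_asx_url name are made up for illustration, and AuthError is only a
stand-in for the exception raised in the method above; none of this is the
real Sipie code.

```python
from html.parser import HTMLParser


class ParamFinder(HTMLParser):
    """Remember the value attribute of the first <param> with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag != 'param' or self.value is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get('name') == self.wanted_name:
            self.value = attr_map.get('value')


class AuthError(Exception):
    """Stand-in for the AuthError raised in the method above."""


def find_asx_url(page):
    finder = ParamFinder('FileName')
    finder.feed(page)
    if finder.value is None:
        # Same situation as the TypeError branch: the param is not there
        raise AuthError('no FileName param found; probably not logged in')
    return finder.value


# Made-up sample page; the unclosed <b> mimics sloppy markup
sample = ('<html><b><object><param name="FileName" '
          'value="http://example.invalid/stream.asx?key=abc"></object></html>')
print(find_asx_url(sample))
```

The standard parser is far less forgiving than BeautifulSoup on truly
mangled pages, so treat this as a way to see the idea, not a replacement.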

Eli Criffield


On Dec 11, 2007 9:36 PM, Jeff Hinrichs - DM&T <jeffh at dundeemt.com> wrote:
> One of the reasons I like Python:
>
> I had to reformat some html today to take it from a poorly hand coded
> page and get it in to a wiki.
>
> Here is an example nugget of the raw source html ( this isn't the
> worst - which had mangled mismatched tags)
> <a href="http://www.newscientist.com/">NS+</a>
>        -- <a href="http://www.adquest3d.com/">Classified Ads </a>-- <a
> href="http://www.radio-locator.com/cgi-bin/home">Radio
>         </a>-- <a href="http://www.bookbrowser.com/Resources/Index.html">Book
>         Links</a></b></font></font><b><font
> face="Arial,Helvetica,Monaco"><font size="1"><a
> href="http://www.ceoexpress.com/"> </a>--
>         <a href="http://www.obscurestore.com/">Obscure</a> -- <a
> href="http://www.ebay.com/">eBAY</a>
>        -- <a href="http://www.online-pr.com/">Online PR</a> -- <a
> href="http://catalogs.google.com/">Catalogs</a>
>        -- <a href="http://www.nytimes.com/books/first/first-nonfiction.html">FirstChaps</a>
>        -- <a href="http://www.loc.gov/">LOC</a> -- <a
> href="http://www.ac6v.com/swl1.htm#WEBRADIO">WebRadio</a>
>
> I needed to get it into a form like "* [[ TITLE | URL ]]"
>
> Well, I've used BeautifulSoup
> (http://www.crummy.com/software/BeautifulSoup/) before, but it's been a
> while, so the exact way to do it was not in my L1 cache<g>.  I knew I
> didn't have the module on this machine, so I had to get it loaded and
> start from there.  I searched the docs for "links" and found
> http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document
> -- midway through the section is an example that is darn near exactly what
> I want.
>
> What follows is what I did next:
>
> jlh at jlh-d520:~$ sudo easy_install beautifulsoup
> [sudo] password for jlh:
> Searching for beautifulsoup
> Reading http://cheeseshop.python.org/pypi/beautifulsoup/
> Couldn't find index page for 'beautifulsoup' (maybe misspelled?)
> Scanning index of all packages (this may take a while)
> Reading http://cheeseshop.python.org/pypi/
> Reading http://cheeseshop.python.org/pypi/BeautifulSoup/3.0.4
> Reading http://www.crummy.com/software/BeautifulSoup/
> Reading http://www.crummy.com/software/BeautifulSoup/download/
> Best match: BeautifulSoup 3.0.4
> Downloading http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup-3.0.4.tar.gz
> Processing BeautifulSoup-3.0.4.tar.gz
> Running BeautifulSoup-3.0.4/setup.py -q bdist_egg --dist-dir
> /tmp/easy_install-Ihuiu5/BeautifulSoup-3.0.4/egg-dist-tmp-gKUTwa
> zip_safe flag not set; analyzing archive contents...
> Adding BeautifulSoup 3.0.4 to easy-install.pth file
>
> Installed /usr/lib/python2.5/site-packages/BeautifulSoup-3.0.4-py2.5.egg
> Processing dependencies for beautifulsoup
> Finished processing dependencies for beautifulsoup
> jlh at jlh-d520:~$ python
> Python 2.5.1 (r251:54863, Oct  5 2007, 13:36:32)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = """
> ... <a href="http://www.google.com/">Google</a>
> ...        -- <a href="http://www.alltheweb.com/">FAST</a> --<a
> href="http://www.profusion.com/"> Prof</a>
> ...        -- <a href="http://www.ftpsearchengines.com/">FTP</a> -- <a
> href="http://dogpile.com/">Dogpile</a> --<a
> href="http://www.beaucoup.com/">Beaucoup</a>
> </b></font></font><b><font face="Arial,Helvetica,Monaco"><font
> size="1">--
> ...         <a href="http://www.findarticles.com/PI/index.jhtml">Articles</a>
> --<a href="http://www.archive.org/"> Archives</a>
> ...        -- <a href="http://www.allacademic.com/">Academic</a> -- <a
> href="http://www.kartoo.com/">Kartoo</a>
> ...        -- <a href="http://clusty.com/">Clusty </a>-- <a
> href="http://www.teoma.com/">Teoma
> ...         </a>-- <a href="http://beta.search.msn.com/">MSN</a> --<a
> href="http://www.cranky.com"><font color="RED"> Cranky</font></a>
> ...        -- <a href="http://discussion.lycos.com/">Discussions</a> --</font>
> ... """
> >>> from BeautifulSoup import BeautifulSoup, SoupStrainer
> >>>
> >>>
> >>> links = SoupStrainer('a')
> >>> thelinks = [tag for tag in BeautifulSoup(s, parseOnlyThese=links)]
> >>> for el in thelinks:
> ...     print el
> ...
> <a href="http://www.google.com/">Google</a>
> <a href="http://www.alltheweb.com/">FAST</a>
> <a href="http://www.profusion.com/"> Prof</a>
> <a href="http://www.ftpsearchengines.com/">FTP</a>
> <a href="http://dogpile.com/">Dogpile</a>
> <a href="http://www.beaucoup.com/">Beaucoup</a>
> <a href="http://www.findarticles.com/PI/index.jhtml">Articles</a>
> <a href="http://www.archive.org/"> Archives</a>
> <a href="http://www.allacademic.com/">Academic</a>
> <a href="http://www.kartoo.com/">Kartoo</a>
> <a href="http://clusty.com/">Clusty </a>
> <a href="http://www.teoma.com/">Teoma
>         </a>
> <a href="http://beta.search.msn.com/">MSN</a>
> <a href="http://www.cranky.com"><font color="RED"> Cranky</font></a>
> <a href="http://discussion.lycos.com/">Discussions</a>
> >>> dir(thelinks)
> ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
> '__delslice__', '__doc__', '__eq__', '__ge__', '__getattribute__',
> '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__',
> '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
> '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
> '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
> '__setslice__', '__str__', 'append', 'count', 'extend', 'index',
> 'insert', 'pop', 'remove', 'reverse', 'sort']
> >>> dir(thelinks[0])
> ['XML_SPECIAL_CHARS_TO_ENTITIES', '__call__', '__contains__',
> '__delitem__', '__doc__', '__eq__', '__getattr__', '__getitem__',
> '__init__', '__iter__', '__len__', '__module__', '__ne__',
> '__nonzero__', '__repr__', '__setitem__', '__str__', '__unicode__',
> '_findAll', '_findOne', '_getAttrMap', '_lastRecursiveChild',
> 'append', 'attrs', 'childGenerator', 'containsSubstitutions',
> 'contents', 'extract', 'fetch', 'fetchNextSiblings', 'fetchParents',
> 'fetchPrevious', 'fetchPreviousSiblings', 'fetchText', 'find',
> 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
> 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
> 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
> 'findPreviousSiblings', 'first', 'firstText', 'get', 'has_key',
> 'hidden', 'insert', 'isSelfClosing', 'name', 'next', 'nextGenerator',
> 'nextSibling', 'nextSiblingGenerator', 'parent', 'parentGenerator',
> 'parserClass', 'prettify', 'previous', 'previousGenerator',
> 'previousSibling', 'previousSiblingGenerator',
> 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'setup',
> 'string', 'substituteEncoding', 'toEncoding']
> >>> thelinks[0].attrs
> [(u'href', u'http://www.google.com/')]
> >>> thelinks[0].attrs[1]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: list index out of range
> >>> thelinks[0].attrs[0][1]
> u'http://www.google.com/'
> >>> thelinks[0].fetchText
> <bound method Tag.fetchText of <a href="http://www.google.com/">Google</a>>
> >>> thelinks[0].fetch
> <bound method Tag.findAll of <a href="http://www.google.com/">Google</a>>
> >>> thelinks[0].name
> u'a'
> >>> thelinks[0].setup
> <bound method Tag.setup of <a href="http://www.google.com/">Google</a>>
> >>> thelinks[0].extract
> <bound method Tag.extract of <a href="http://www.google.com/">Google</a>>
> >>> thelinks[0].extract()
> >>> thelinks[0].contents
> [u'Google']
> >>> thelinks[0].attrs[0][1]
> u'http://www.google.com/'
> >>>
>
> dir(something) in the interactive interpreter lists all of the
> properties and methods available for a given object.  From the
> dir() output for "thelinks" it was obviously a list, or some other
> object that was implemented as a list.  So then I needed to figure out
> what the list elements were, as trying to .strip() them was resulting
> in a TypeError.  A quick dir(thelinks[0]) showed that the elements
> were not simple strings or lists of strings but a more complicated
> object.  Then I just needed to find what would return the URL and
> Title.  I didn't care to read more documentation, so you'll see my
> attempts above before I found the two necessary properties: .contents
> and .attrs.  So I ended up with the following script to hammer my way
> through a few hundred links:
>
> from BeautifulSoup import BeautifulSoup, SoupStrainer
>
>
> links = SoupStrainer('a')
> thelinks = [tag for tag in BeautifulSoup(s, parseOnlyThese=links)]
> for el in thelinks:
>     try:
>         print ' * [[%s|%s]]' % ((el.contents[0]).strip(),el.attrs[0][1])
>     except TypeError:
>         print ' * [[%s|%s]]' % (el.contents[0],el.attrs[0][1])
>
> You'll notice the try:except block.  I needed that when .contents
> returned a more complicated element than a string (e.g. <font
> color="red">stuff</font>).  That is an html element object, and it doesn't
> take kindly to my trying to perform a .strip() (string object) method on
> something that doesn't support it.  For those, I just flattened it out
> and hand edited the output.  Total time for researching, experimenting
> and implementing: about 20 minutes.  Compared with the time it was
> taking to hand-edit the source html snippets into wiki links, that was a
> huge savings.
>
> Not a fancy script by any stretch of the imagination -- but a decent
> example of using interactive python to your advantage and letting me
> remain the lazy guy that I am.  The other thing to remember: when
> forced to parse html of questionable quality, BeautifulSoup is your
> friend.
>
>
> --
> Jeff Hinrichs
> Dundee Media & Technology, Inc
> jeffh at dundeemt.com
> 402.218.1473
>
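
The quoted script falls back to hand editing whenever a link's first child
is a nested tag such as <font>. One possible way around that is to collect
all the text inside each <a>, however deeply nested, and collapse the
whitespace. The sketch below does this with the standard library's
html.parser instead of BeautifulSoup (Python 3 assumed); LinkCollector and
to_wiki are hypothetical names, and the snippet is cut down from the html in
the quoted message.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather (title, url) pairs for every <a href=...> in the input."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <font> also arrives here
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Stray closers like </b> or </font> are simply ignored
        if tag == 'a' and self._href is not None:
            title = ' '.join(''.join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def to_wiki(html):
    collector = LinkCollector()
    collector.feed(html)
    return [' * [[%s|%s]]' % (title, url) for title, url in collector.links]


snippet = '''<a href="http://www.teoma.com/">Teoma
        </a>-- <a href="http://www.cranky.com"><font color="RED"> Cranky</font></a>
</b></font> -- <a href="http://clusty.com/">Clusty </a>'''

for line in to_wiki(snippet):
    print(line)
```

Joining every text fragment means the <font>-wrapped "Cranky" link comes out
like the plain ones, so no try:except is needed for that case.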

