[Tutor] BeautifulSoup - deleting tags

Kent Johnson kent37 at tds.net
Tue Mar 28 12:50:02 CEST 2006


jonasmg at softhome.net wrote:
> Is it possible to delete all tags from a text, and if so, how?
> 
> e.g.:
> 
> qwe='<td><a href="..." title="...">foo bar</a>;<br />
> <a href="..." title="...">foo2</a> <a href="..." title="...">bar2</a></td>' 
> 
> so that I would get only: foo bar, foo2, bar2

How about this?

In [1]: import BeautifulSoup

In [2]: s=BeautifulSoup.BeautifulSoup('''<td><a href="..." title="...">foo bar</a>;<br />
   ...: <a href="..." title="...">foo2</a> <a href="..." title="...">bar2</a></td>''')

In [4]: ' '.join(i.string for i in s.fetch() if i.string)
Out[4]: 'foo bar foo2 bar2'


Here are a couple of tag strippers that don't use BS:
http://www.aminus.org/rbre/python/cleanhtml.py
http://www.oluyede.org/blog/2006/02/13/html-stripper/
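
For comparison, a tag stripper can also be sketched with nothing but the standard library. The following is a minimal sketch for Python 3 (the session above is Python 2), built on html.parser.HTMLParser. It is not the code behind either link, and the names TextExtractor and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the non-empty text chunks of an HTML fragment, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The fragment from the original question.
qwe = ('<td><a href="..." title="...">foo bar</a>;<br />\n'
       '<a href="..." title="...">foo2</a> <a href="..." title="...">bar2</a></td>')

print(strip_tags(qwe))
```

Unlike the BeautifulSoup one-liner above, this keeps every piece of text, including the ";" that sits between the first link and the <br />. If only the link text is wanted, the chunks would have to be filtered afterwards, or the parser extended to track whether it is inside an <a> element.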

Kent


