[Tutor] Removing tags with BeautifulSoup

Sebastien sebastien at solutions-linux.org
Wed Aug 8 19:22:54 CEST 2007


I'm in the process of cleaning some html files with BeautifulSoup and
I want to remove all traces of the tables. Here is the bit of the code
that deals with tables:

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        for tag in contents:
            # assumed body (line missing in the archived copy):
            # re-attach each child to the parent
            parent.append(tag)

remove(soup, "table")
remove(soup, "tr")
remove(soup, "td")

It works fine but leaves an empty table structure at the end of the
soup.

And the extract method of BeautifulSoup seems to extract only what is
inside the tags.

So I'm just looking for a quick and dirty way to remove this table
structure at the end of the documents. I'm thinking of using re, but
there must be a way to do it with BeautifulSoup; maybe I'm missing
something.
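
One possible way to do this, sketched under the assumption that the newer
bs4 package is available (the code above uses the older findAll API): bs4
tags have an unwrap() method that replaces a tag with its own children, so
no empty shell is left behind. The helper name strip_tags and the sample
HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_tags(soup, *tagnames):
    """Remove the named tags but keep whatever they contain."""
    # find_all accepts a list of tag names; list() takes a snapshot of the
    # matches before the tree is modified.
    for tag in list(soup.find_all(list(tagnames))):
        tag.unwrap()  # replace the tag with its children
    return soup


# Hypothetical sample document
html = (
    "<div><p>intro</p>"
    "<table><tr><td>cell <b>one</b></td><td>cell two</td></tr></table>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
strip_tags(soup, "table", "tr", "td")
result = str(soup)
print(result)
```

If the goal is instead to drop a tag together with everything inside it,
calling extract() (or decompose()) on the tag itself should do that.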

Another thing puzzles me. This code:

    for script in soup("script"):
        soup.script.extract()  # inferred; line missing in the archived copy

Works fine and removes the script tags, but:

    for table in soup("table"):
        soup.table.extract()  # inferred from the error; line missing in the archived copy

Raises AttributeError: 'NoneType' object has no attribute 'extract'
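
A plausible explanation, assuming the loop body looks up soup.table on each
pass and the documents contain nested tables: soup("table") collects every
table up front, inner ones included, so once the outer table has been
extracted the inner one is gone from the tree too, and the next soup.table
lookup returns None. Calling extract() on the loop variable avoids the
lookup. The sketch below uses the newer bs4 package and made-up HTML, both
assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one table nested inside another
html = (
    "<div><table><tr><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table><p>keep</p></div>"
)

# Variant 1: look up soup.table on every pass
soup = BeautifulSoup(html, "html.parser")
error = None
try:
    for table in soup("table"):   # two matches: the outer and the inner table
        soup.table.extract()      # first pass removes the outer table and the inner one with it
except AttributeError as exc:
    error = exc                   # second pass: soup.table is None

# Variant 2: extract the loop variable itself
soup = BeautifulSoup(html, "html.parser")
for table in soup("table"):
    table.extract()               # works even if the tag is no longer attached to the soup
cleaned = str(soup)
print(cleaned)
```

Script tags normally do not nest, which would explain why the same pattern
works for them.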

Oh, and BTW, when I extract script tags this way, the whole tag is gone,
as I want; it doesn't remove only the content of the tag.

Thanks in advance 
