Removing certain tags from html files

Stefan Behnel stefan.behnel-n05pAM at web.de
Sat Jul 28 06:44:06 CEST 2007


sebzzz at gmail.com wrote:
> I'm doing a little script with the help of the BeautifulSoup HTML
> parser and uTidyLib (HTML Tidy wrapper for Python).
> 
> Essentially what it does is fetch all the html files in a given
> directory (and its subdirectories), clean the code with Tidy (removes
> deprecated tags, changes the output to xhtml), and then BeautifulSoup
> removes a couple of things that I don't want in the files (because I'm
> stripping the files to the bare bones, just keeping layout information).
> 
> Finally, I want to remove all traces of layout tables (because the new
> layout will use css for positioning). Now, there are tables that lay out
> things on the page and tables that represent tabular data, but I think
> it would be too hard to make a script that tells the difference.
> 
> My question, since I'm quite new to Python, is about what tool I
> should use to remove the table, tr and td tags, but not what's
> enclosed in them. I think BeautifulSoup isn't good for that because it
> removes what's enclosed as well.

Use lxml.html. Honestly, you can't have HTML cleanup simpler than that.

It's not released yet (lxml is, but lxml.html is only close to release), but
you can build it from an SVN branch:

http://codespeak.net/svn/lxml/branch/html/

Looks like you're on Linux, so that's a simple run of setup.py.

Then, use the dedicated "clean" module for your job. See the "Cleaning up
HTML" section in the docs for some examples:

http://codespeak.net/svn/lxml/branch/html/doc/lxmlhtml.txt

and the docstring of the Cleaner class to see all the available options:

http://codespeak.net/svn/lxml/branch/html/src/lxml/html/clean.py
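
As a rough illustration of the "remove the tags, keep what's enclosed" idea,
here is a minimal sketch. It assumes a released lxml that ships lxml.html and
uses the element method drop_tag() rather than the Cleaner class (whose
remove_tags option does the same job together with other cleanups). The
function name strip_layout_tables and the default tag list are made up for
this example, and it treats every table as a layout table.

```python
import lxml.html


def strip_layout_tables(html_text, tags=("table", "tbody", "tr", "td", "th")):
    """Remove the given tags but keep their text and child elements.

    Hypothetical helper for illustration; it does not try to tell layout
    tables apart from tables holding real tabular data.
    """
    root = lxml.html.fromstring(html_text)
    # Collect the matches first so the tree is not modified while iterating.
    for element in list(root.iter(*tags)):
        # drop_tag() needs a parent, so a matching root element is left alone.
        if element.getparent() is not None:
            # Pulls the element's text and children up into its parent.
            element.drop_tag()
    return lxml.html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<div><table><tr><td>Hello <b>world</b></td></tr></table></div>"
    print(strip_layout_tables(sample))
```

Trimming the tags argument (for instance, leaving out "th") lets you keep
part of the table markup if that turns out to be useful.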

In case you still prefer BeautifulSoup for parsing (that is, if you're dealing
not with HTML-like pages but with real tag soup), you can also use the
ElementSoup parser:

http://codespeak.net/svn/lxml/branch/html/src/lxml/html/ElementSoup.py

but lxml is generally quite good at dealing with broken HTML already.

Have fun,
Stefan


