[Tutor] Remove certain tags in html files

Sebastien Noel sebastien at solutions-linux.org
Fri Jul 27 19:38:56 CEST 2007


Hi,

I'm writing a small script with the help of the BeautifulSoup HTML parser 
and uTidyLib (an HTML Tidy wrapper for Python).

Essentially, it fetches all the HTML files in a given directory (and 
its subdirectories) and cleans the code with Tidy (which removes 
deprecated tags and changes the output to XHTML). Then BeautifulSoup 
removes a couple of things that I don't want in the files (because I'm 
stripping the files down to the bare bones, keeping just the layout 
information).
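
The file-collection step could look roughly like the sketch below. It 
only covers walking the directory tree; the Tidy and BeautifulSoup steps 
are left out. The name find_html_files and the choice of the .html and 
.htm extensions are assumptions made for illustration.

```python
import os

def find_html_files(root):
    """Yield the paths of HTML files under root, subdirectories included."""
    # os.walk visits root and every directory below it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: HTML files end in .html or .htm (any letter case).
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)
```

Each path it yields can then be handed to the cleaning steps in turn, 
for example with "for path in find_html_files(some_directory): ...".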

Finally, I want to remove all traces of layout tables (because the new 
layout will use CSS for positioning). Now, there are tables that lay out 
things on the page and tables that represent tabular data, but I think it 
would be too hard to write a script that tells the difference.

My question, since I'm quite new to Python, is which tool I should 
use to remove the table, tr and td tags, but not what's enclosed in them. 
I think BeautifulSoup isn't good for that because it removes what's 
enclosed as well.

Is re the right module for that? Basically, would it be a good idea to 
loop over the text and match every occurrence of a given regular 
expression?

Now, I'm quite new to the concept of regular expressions, but would it 
resemble something like this: re.compile("<table.*>")?
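
To make the idea concrete, here is a minimal sketch of the 
regular-expression approach. It is one possible way to do it, not a 
tested solution for real-world pages. The names LAYOUT_TAG and 
strip_layout_tags and the sample string are made up for illustration. 
Note that ".*" is greedy, so "<table.*>" can run on to the last ">" on 
the line and swallow the content too; the sketch uses "[^>]*" instead, 
which stops at the first ">".

```python
import re

# Matches opening and closing table, tr and td tags, with or without
# attributes. \b keeps "td" from matching inside longer tag names, and
# [^>]* stops at the first ">" instead of running on like a greedy ".*".
LAYOUT_TAG = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)

def strip_layout_tags(html):
    """Remove table, tr and td tags but keep whatever they enclose."""
    return LAYOUT_TAG.sub("", html)

# Hypothetical sample input.
sample = '<table width="100%"><tr><td><p>Hello</p></td></tr></table>'

print(strip_layout_tags(sample))         # prints <p>Hello</p>
print(repr(re.sub("<table.*>", "", sample)))  # prints '' (content lost)
```

This treats every table as a layout table, so tables holding tabular 
data would lose their tags as well, and an attribute value that 
contains a ">" character would confuse the pattern.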

Thanks for the help.

