[Tutor] Remove certain tags in html files

Eric Brunson brunson at brunson.com
Fri Jul 27 20:03:48 CEST 2007


Sebastien Noel wrote:
> Hi,
>
> I'm doing a little script with the help of the BeautifulSoup HTML parser 
> and uTidyLib (an HTML Tidy wrapper for Python).
>
> Essentially what it does is fetch all the html files in a given 
> directory (and its subdirectories), clean the code with Tidy (removes 
> deprecated tags, changes the output to xhtml), and then BeautifulSoup 
> removes a couple of things that I don't want in the files (because I'm 
> stripping the files to the bare bones, just keeping layout information).
>
> Finally, I want to remove all traces of layout tables (because the new 
> layout will use css for positioning). Now, there are tables that lay out 
> things on the page and tables that represent tabular data, but I think it 
> would be too hard to make a script that can tell the difference.
>
> My question, since I'm quite new to python, is about what tool I should 
> use to remove the table, tr and td tags, but not what's enclosed in them. 
> I think BeautifulSoup isn't good for that because it removes what's 
> enclosed as well.
>   

You want to look at htmllib:  http://docs.python.org/lib/module-htmllib.html

If you've used a SAX parser for XML, it's similar.  Your parser parses 
the file and every time it hits a tag, it runs a callback which you've 
defined.  You can assign a default callback that simply prints out the 
tag as parsed, then a custom callback for each tag you want to clean up.
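
For a concrete picture of the callback model, here is a minimal sketch.  
It assumes Python 3, where htmllib no longer exists, so it swaps in 
html.parser.HTMLParser from the standard library, which follows the same 
callback idea.  The names TagStripper, STRIP_TAGS and strip_tags() are 
hypothetical, chosen only for this illustration, and the list of tags to 
drop is an assumption about what counts as layout markup.

```python
from html.parser import HTMLParser

# Assumption: these are the tags to drop; their contents are kept.
STRIP_TAGS = {"table", "tbody", "thead", "tfoot", "tr", "td", "th"}


class TagStripper(HTMLParser):
    """Echo the input unchanged except for tags listed in STRIP_TAGS."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach the
        # handlers below instead of being decoded into plain text.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in STRIP_TAGS:
            # get_starttag_text() returns the tag as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br />, common in Tidy's XHTML output.
        if tag not in STRIP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in STRIP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_tags(html):
    """Return html with the STRIP_TAGS markup removed, contents kept."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, strip_tags('<table><tr><td><p>Hi</p></td></tr></table>') 
should give back '<p>Hi</p>'.  Every handler except the three tag 
callbacks plays the role of the "default callback" that just echoes what 
was parsed.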

It took me a little time to wrap my head around it the first time I used 
it, but once you "get it" it's *really* powerful and really easy to 
implement.

Read the docs and play around a little bit, then if you have questions, 
post back and I'll see if I can dig up some examples I've written.

e.

> Is re the right module for that? Basically, if I write a loop that 
> scans the text and tries to match every occurrence of a given regular 
> expression, would that be a good idea?
>
> Now, I'm quite new to the concept of regular expressions, but would it 
> resemble something like this: re.compile("<table.*>")?
>
> Thanks for the help.
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>   
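
On the regular expression quoted above: "<table.*>" is greedy, so on a 
single line it runs from "<table" to the last ">" on that line and takes 
the enclosed content with it.  The sketch below contrasts it with a 
tighter pattern.  The sample string and the names greedy, tight and 
sample are made up for illustration, and the tighter pattern is only one 
possible choice; it can still be fooled by a ">" inside an attribute 
value, which is one reason a real parser is the safer tool.

```python
import re

# Hypothetical one-line sample of layout markup.
sample = "<table border='0'><tr><td><b>keep me</b></td></tr></table>"

# The pattern from the question: ".*" grabs as much as it can, so the
# match extends to the last ">" on the line.
greedy = re.compile(r"<table.*>")
greedy_result = greedy.sub("", sample)

# A tighter pattern: an opening or closing table/tbody/tr/td/th tag,
# where "[^>]*" stops at the first ">" and never crosses into content.
tight = re.compile(r"</?(?:table|tbody|tr|td|th)\b[^>]*>", re.IGNORECASE)
tight_result = tight.sub("", sample)
```

Here greedy_result ends up as an empty string, while tight_result is 
"<b>keep me</b>".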


