[Tutor] Remove certain tags in html files
Eric Brunson
brunson at brunson.com
Fri Jul 27 20:03:48 CEST 2007
Sebastien Noel wrote:
> Hi,
>
> I'm writing a little script with the help of the BeautifulSoup HTML
> parser and uTidyLib (an HTML Tidy wrapper for Python).
>
> Essentially what it does is fetch all the html files in a given
> directory (and its subdirectories), clean the code with Tidy (which
> removes deprecated tags and changes the output to xhtml), and then
> BeautifulSoup removes a couple of things that I don't want in the files
> (because I'm stripping the files to the bare bones, just keeping layout
> information).
>
> Finally, I want to remove all traces of layout tables (because the new
> layout will use css for positioning). Now, there are tables that lay
> out things on the page and tables that represent tabular data, but I
> think it would be too hard to make a script that tells the difference.
>
> My question, since I'm quite new to python, is about what tool I should
> use to remove the table, tr and td tags, but not what's enclosed in them.
> I think BeautifulSoup isn't good for that because it removes what's
> enclosed as well.
>
You want to look at htmllib: http://docs.python.org/lib/module-htmllib.html
If you've used a SAX parser for XML, it's similar. Your parser parses
the file and every time it hits a tag, it runs a callback which you've
defined. You can assign a default callback that simply prints out the
tag as parsed, then a custom callback for each tag you want to clean up.
It took me a little time to wrap my head around it the first time I used
it, but once you "get it" it's *really* powerful and really easy to
implement.
Read the docs and play around a little bit, then if you have questions,
post back and I'll see if I can dig up some examples I've written.
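To give a rough idea of the callback style, here is a minimal sketch. One caveat: htmllib only exists in Python 2, so this sketch swaps in html.parser.HTMLParser from Python 3, which follows the same model. The names TagStripper and strip_tags are made up for illustration, and the set of tags to drop is simply the one from the question. Every callback echoes what it was given, except the tag callbacks, which stay silent for table, tr and td.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Echo the input, leaving out the tags listed in `strip`
    (hypothetical helper, not part of any library)."""

    def __init__(self, strip=("table", "tr", "td")):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.strip = set(strip)
        self.out = []

    # Tag callbacks: skip unwanted tags, echo everything else.
    def handle_starttag(self, tag, attrs):
        if tag not in self.strip:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br />
        if tag not in self.strip:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.strip:
            self.out.append("</%s>" % tag)

    # Default-style callbacks: pass the content through untouched.
    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_tags(html):
    """Return `html` with the table, tr and td tags removed but
    their contents kept."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<table><tr><td><p class="x">Hello</p></td></tr></table>'))
```

Note that this drops every table, tr and td tag it sees, so tables holding real tabular data lose their markup too. Telling the two kinds apart is a separate problem.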
e.
> Is re the right module for that? Basically, if I write a loop that
> scans the text and tries to match every occurrence of a given regular
> expression, would that be a good idea?
>
> Now, I'm quite new to the concept of regular expressions, but would it
> resemble something like this: re.compile("<table.*>")?
>
> Thanks for the help.