I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>…

Stéphane Klein stephane at harobed.org
Mon Mar 29 10:12:09 CEST 2010


I am working on an HTML cleaner.

I export OpenOffice.org documents to HTML.
Next, I would like to clean these exported HTML files:

* remove comments
* remove styles
* remove dispensable tags
* ...

Some difficulties:

* convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
* convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

To do this, I use lxml and pyquery.

Question:

* Are there any XML helper tools in Python for this kind of processing? I've 
looked on PyPI and found nothing.

If you can confirm that such tools don't exist, I may publish a helper 
package to do this "clean" processing.

Thanks for your help,

More information about the Python-list mailing list