Re: I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>…

Stefan Behnel stefan_ml at behnel.de
Mon Mar 29 16:14:22 EDT 2010


Stéphane Klein, 29.03.2010 10:12:
> I am working on an HTML cleaner.
>
> I export OpenOffice.org documents to HTML.
> Next, I would like to clean these exported HTML files:
>
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...
>
> Some difficulties:
>
> * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
> * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
>
> To do this, I use lxml and pyquery.

lxml.html has tools for that in the 'clean' module. Just specify the list 
of tags that you want to discard.


> * Are there any XML helper tools in Python for this process? I've
> looked on PyPI and found nothing about it.

The HTML tools in the standard library are close to non-existent. You can
achieve some things with the built-in tools, but if they fail for a
particular input document, there's little you can do.
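
To illustrate the "some things" part, here is a minimal sketch using only the standard library's html.parser module (its Python 3 name, which is an assumption about your setup). The names TagStripper and strip_tags are made up for this example. It re-emits the input while skipping selected tags and all comments; it makes no attempt to repair broken markup, and it leaves the contents of style elements in place, which is exactly the kind of limitation meant above.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping selected tags (their text is kept) and comments."""

    def __init__(self, drop=("span", "font")):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, without an end tag.
        if tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.drop:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    # handle_comment is not overridden, so comments are silently dropped.


def strip_tags(html, drop=("span", "font")):
    parser = TagStripper(drop)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(strip_tags("<h1><span><font>my title</font></span></h1>"))
```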


> If you confirm that these tools don't exist, I may publish a helper
> package to do this "clean" processing.

Take a look at lxml.html.clean first.

Stefan



