[XML-SIG] HTML Processing

Ar18 at comcast.net Ar18 at comcast.net
Tue May 8 03:54:02 CEST 2007


I would like to investigate the possibility of using Python to process HTML pages, and possibly implement it.

The actual work would look something like this:
* Retrieve pages from the net that are in any number of formats, such as XML, XHTML, HTML, and HTML with major errors in it.
* Create a usable DOM for the files (bearing in mind that they may contain malformed HTML), OR extract what I need directly from the potentially malformed HTML.
* If the DOM route is used, I would then need a way to retrieve content from specific areas of the DOM.
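
To make the "extract directly" route concrete, here is a rough sketch of what I have in mind. It assumes the standard library's html.parser module (as found in current Python 3) is tolerant enough for the pages involved, which I have not verified against really bad markup. The names fetch, LinkExtractor, extract_links and SAMPLE are made up for illustration, and pulling out links is just a stand-in for whatever actually needs extracting.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Hypothetical helper: download a page and decode it leniently."""
    with urlopen(url) as response:
        # Assumption: UTF-8 with replacement is acceptable for undeclared encodings.
        return response.read().decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags without requiring well-formed input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the text to the parser and return the hrefs in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately broken markup: unclosed <p>, <b> and <a>, an unquoted
# upper-case attribute, and no closing </html>.
SAMPLE = '<html><body><p>One <a href="/a">first<p>Two <b><a HREF=b.html>second</body>'
links = extract_links(SAMPLE)
```

The parser here only reports events (start tags, end tags, text) and never insists on balanced tags, so nothing resembling a DOM is built. For the DOM route, something would still have to decide how to repair the nesting before a tree can be queried.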
Additional features needed:

I wonder, is this a good place to talk about this?

I know the goal is XML, but I think this still fits.  What libraries should I be looking into to do things like this?  I would prefer to look at all the options, if possible.

