[XML-SIG] HTML Processing
Ar18 at comcast.net
Tue May 8 03:54:02 CEST 2007
I would like to investigate, and possibly implement, the use of Python for processing HTML pages.
The actual work would look something like this:
* Retrieve pages from the net in any of several formats, such as XML, XHTML, HTML, or HTML with major errors in it.
* Create a usable DOM for the files (bearing in mind that they may contain malformed HTML), OR extract the stuff I need directly from the potentially malformed HTML.
* If the DOM route is used, then I would need something to retrieve stuff from certain areas of the DOM.
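To make the "extract directly" route concrete, here is a rough sketch using only the standard library's html.parser module, which tends to carry on past unclosed tags and unquoted attributes instead of raising an error. The names LinkExtractor, extract_links and fetch are made up for this illustration, and pulling out links is only an assumed example of "the stuff I need"; it is not meant as a definitive approach.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A new <a> implicitly ends a previous one that was never closed.
            self._flush()
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def _flush(self):
        # Store the pending link, if there is one, and reset the state.
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def close(self):
        super().close()
        # Keep a link whose </a> was missing at the end of the input.
        self._flush()


def extract_links(markup):
    """Return (href, text) pairs found in a possibly malformed HTML string."""
    parser = LinkExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page and decode it, replacing undecodable bytes.

    Assumes UTF-8 when the server does not declare a charset.
    """
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Deliberately broken markup: unclosed <a>, <p> and <b>, unquoted attribute.
broken = '<html><body><p>Intro<a href="a.html">First<p><a href=b.html>Second</a><b>unclosed'
print(extract_links(broken))
```

A page from the net would go through the same path with extract_links(fetch(url)). If a full DOM is needed rather than a few extracted values, the same callbacks could build a tree instead of a flat list, at the cost of deciding how to repair badly nested tags.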
Additional features needed:
I wonder, is this a good place to talk about this?
I know the goal is XML, but I think this still fits. What libraries should I be looking into to do things like this? I would prefer to look at all the options, if possible.