[XML-SIG] HTML Processing

Tue May 8 04:35:50 CEST 2007

Ar18 at comcast.net wrote:
> I would like to investigate (and possibly implement) the possibility of using Python for processing HTML pages.
>
> The actual work would look something like this:
> * Retrieve pages from the net that are in any number of formats, such as XML, XHTML, HTML, and HTML with major errors in it
> * Create a usable DOM for the files (considering the fact that they may have malformed html) OR...  extract the stuff I need directly from the potentially malformed html.
> * If the DOM route is used, then I would need something to retrieve stuff from certain areas of the DOM.
> Additional features needed:
>
> I wonder, is this a good place to talk about this?
>
> I know the goal is XML, but I think this still fits.  What libraries should I be looking into to do things like this?  I would prefer to look at all the options, if possible.
>   
I wrote an application to do just this. I found that the existing 
xml.dom module had some serious bugs, had not been touched since 2004, 
and offered no easy way to create and insert subtrees in the DOM, or to 
work with subsets of the DOM. It looks like it was written, then 
abandoned for some reason. Not sure why.
I tried to use ElementTree from effbot, but also with no success. It 
is not DOM compliant, and its nesting is odd. For example, text 
appearing after a <p>...</p> element on the same line is stuffed into 
the 'tail' attribute of that node, instead of being made into a sibling 
node of the <p> node. I found it very odd, and not useful for DOM 
manipulation at all. I wrote to Mr. Lundh, and got an indifferent response.
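
The 'tail' behaviour described above can be reproduced with a short sketch. It assumes a Python version in which ElementTree ships in the standard library as xml.etree.ElementTree, and it uses xml.dom.minidom only as a point of comparison. The SNIPPET string and both helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample markup: text follows the closing </p> tag.
SNIPPET = "<div><p>first</p>trailing text</div>"


def tail_of_first_child(markup):
    """Return the ElementTree 'tail' of the root's first child element."""
    root = ET.fromstring(markup)
    # ElementTree stores text that follows an element's end tag on that
    # element's .tail attribute, not as a separate node.
    return root[0].tail


def dom_child_node_names(markup):
    """Return the node names of the root's children in a minidom tree."""
    doc = minidom.parseString(markup)
    # In a DOM tree the same text becomes a sibling text node of <p>.
    return [node.nodeName for node in doc.documentElement.childNodes]


if __name__ == "__main__":
    print(tail_of_first_child(SNIPPET))
    print(dom_child_node_names(SNIPPET))
```

With ElementTree the <div> has a single child, and the trailing text hangs off the <p> element. With minidom the <div> has two children, the <p> element and a text node.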

I ended up writing my own DOM tree manager, which is DOM 2 compliant for 
the most part. A range() interface still needs to be fully written, 
which will allow it to reference anywhere in the tag structure 
arbitrarily. Right now I limit my DOM referencing to well-defined 
components of the tags and elements. I have not yet written the code to 
allow completely unrestricted referencing of content in any node and 
across any range. Once that is added to my module, it will be complete 
and even more DOM 2 compliant. But that functionality is not required for 
my app, so I may not get the chance to write it. It has the ability to 
work with any subtree and insert it using array syntax. The nesting is 
exactly what you'd expect in a DOM structure.

If you want this module, and you reach the point where you can help me 
debug and improve it, then contact me and we shall talk about the 
details. Based on this e-mail, it sounds like you're not there yet. For 
serving a RESTful front-end app, I highly recommend CherryPy.

Best,
Gloria