[XML-SIG] HTML Processing
strangest at comcast.net
Tue May 8 04:35:50 CEST 2007
Ar18 at comcast.net wrote:
> I would like to investigate (and possibly implement) using Python for processing HTML pages.
> The actual work would look something like this:
> * Retrieve pages from the net that are in any number of formats, such as XML, XHTML, HTML, and HTML with major errors in it
> * Create a usable DOM for the files (considering the fact that they may have malformed HTML) OR... extract the stuff I need directly from the potentially malformed HTML.
> * If the DOM route is used, then I would need something to retrieve stuff from certain areas of the DOM.
> Additional features needed:
> I wonder, is this a good place to talk about this?
> I know the goal is XML, but I think this still fits. What libraries should I be looking into to do things like this? I would prefer to look at all the options, if possible.
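For the "extract directly" route described in the quoted message, one option
is an event-based parser that tolerates broken markup. The following is only
a minimal sketch, assuming the standard library's html.parser module is
acceptable; the LinkCollector class and the BROKEN sample string are
hypothetical names made up for illustration, not anything from the original
message.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately malformed sample: unclosed <p> and <a>, a stray </b>,
# and no closing </html>.
BROKEN = (
    '<html><body><p>unclosed paragraph <a href="/one">one'
    "<p><a href='/two'>two</b></body>"
)

parser = LinkCollector()
parser.feed(BROKEN)
parser.close()
links = parser.links
print(links)
```

This approach never builds a tree, so it only suits cases where the wanted
data can be recognized from the tag events themselves.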
I wrote an application to do just this. I found that the existing
xml.dom module had some serious bugs, had not been touched since 2004,
and had no easy way of creating and inserting subtrees in the DOM, or
working with subsets of the DOM. It looks like it was written, then
abandoned for some reason. Not sure why.
I tried to use ElementTree from effbot, but also with no success. It
is not DOM compliant, and its nesting is odd. For example, text
appearing after a <p>...</p> element on the same line is stuffed into a
'tail' attribute of the same node, instead of being made into a sibling
node of the <p> node. I found it very odd, and not useful for DOM
manipulation at all. I wrote to Mr. Lundh, and got an indifferent response.
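The 'tail' behaviour described above can be seen with a small comparison.
This is only an illustrative sketch: it assumes the xml.etree.ElementTree
and xml.dom.minidom modules from the standard library, and the SOURCE
string is a made-up sample.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Made-up sample: text follows the closing </p> inside the same parent.
SOURCE = "<div><p>first</p> trailing text<b>bold</b></div>"

# ElementTree: the trailing text is stored on the <p> element itself,
# in its .tail attribute, so <div> has only two children.
root = ET.fromstring(SOURCE)
p = root.find("p")
et_children = [child.tag for child in root]
print(repr(p.text), repr(p.tail), et_children)

# DOM (minidom): the same text becomes a separate Text node that is a
# sibling of <p>, so <div> has three child nodes.
doc = minidom.parseString(SOURCE)
dom_children = [node.nodeName for node in doc.documentElement.childNodes]
print(dom_children)
```

In the ElementTree model the parent has two children and the text hangs off
the first one; in the DOM model the parent has three children, with the text
node sitting between the two elements.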
I ended up writing my own DOM tree manager, which is DOM 2 compliant for
the most part. A range() interface still needs to be fully written,
which will allow it to reference anywhere in the tag structure
arbitrarily. Right now I limit my DOM referencing to well-defined
components of the tags and elements. I have not yet written the code to
allow completely unrestricted referencing of content in any node, and
across any range. Once that is added to my module, it will be complete
and even more DOM 2 compliant. But that functionality is not required for
my app, so I may not get the chance to write it. The module can
work with any subtree and insert it using array syntax. The nesting is
exactly what you'd expect in a DOM structure.
If you want this module, and you reach the point where you can help me
debug and improve it, then contact me and we shall talk about the
details. Based on this e-mail, it sounds like you're not there yet. For
serving a RESTful front-end app, I highly recommend CherryPy.