lisp is winner in DOM parsing contest! 8-]

Paul Rubin
Mon Jul 12 14:44:03 EDT 2004


tfb+google at tfeb.org (Tim Bradshaw) writes:
> I had a theory of doing this all lazily, so you wouldn't have to do
> the (slow) parsing step up-front but would just lie and say `OK, I
> parsed it', then actually doing the work only on demand.

1) I think you do need to do the parsing up front, since you have to
provide basically random access to the contents of the tree.

2) The parsing needn't be slow.  It should just make a very fast
linear scan over the document, emitting an in-memory record every
time it sees a tag that starts a new tree node.  It would keep a
stack recording where it is in the tree structure as it scans, and
whenever it found a tag that closes off a subtree (e.g. </table>),
it could record pointers to that subtree's parent and previously
seen sibling, and update the parent to know its last-seen child.
Finally, it could make another pass over the in-memory structure to
vectorize the list of children at each node.

This could be done with even less memory by making two passes over
the document (one to count the children of each node and one to
build the tree with a vector of children at each node, allocated to
the now-known size), at the expense of some speed.  The point is
that not much processing needs to be done during the scan.  It
should certainly be possible to parse a 3 MB document in under a
second (I assume we're talking about a C extension).  It just
baffles me why the libraries that are out there are so much slower.
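
To make that concrete, here is a rough pure-Python sketch of the
single-pass, stack-based scan plus the vectorizing second pass.  The
Node class, the regex tokenizer, and parse() are my own illustration,
not any existing library's API, and a real parser would also have to
cope with attributes, entities, comments, and broken markup; a C
extension would follow the same outline, just faster:

import re

# A minimal sketch of the stack-based linear scan described above.
# The regex tokenizer is deliberately naive: it only recognizes plain
# open/close tags and runs of text.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)')

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent        # pointer back to the parent node
        self.prev_sibling = None    # previously seen sibling at this level
        self.last_child = None      # last-seen child, updated during the scan
        self.children = []          # filled in ("vectorized") by the second pass
        self.text = []

def parse(document):
    root = Node('#document')
    stack = [root]                  # remembers where we are in the tree
    for m in TAG_RE.finditer(document):
        closing, tag, text = m.groups()
        if text is not None:
            stack[-1].text.append(text)
        elif not closing:           # open tag: emit a record for a new node
            parent = stack[-1]
            node = Node(tag, parent)
            node.prev_sibling = parent.last_child
            parent.last_child = node
            stack.append(node)
        else:                       # close tag (e.g. </table>): pop the subtree
            if any(n.tag == tag for n in stack[1:]):
                while stack[-1].tag != tag:
                    stack.pop()     # crude recovery for unclosed inner tags
                stack.pop()
    # Second pass: follow the sibling chain at each node to vectorize
    # its children into an ordinary list.
    todo = [root]
    while todo:
        node = todo.pop()
        chain, child = [], node.last_child
        while child is not None:
            chain.append(child)
            child = child.prev_sibling
        node.children = chain[::-1] # the chain runs newest-first
        todo.extend(node.children)
    return root

if __name__ == '__main__':
    tree = parse('<table><tr><td>a</td><td>b</td></tr></table>')
    table = tree.children[0]
    print([c.tag for c in table.children[0].children])   # ['td', 'td']

Even in pure Python the scan is a single cheap pass over the text; the
parent/previous-sibling pointers mean it never has to grow per-node
child lists while scanning, which is what keeps the vectorizing second
pass trivial.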


