lisp is winner in DOM parsing contest! 8-]
tfb+google at tfeb.org
Mon Jul 12 12:44:50 CEST 2004
Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote in message news:<7xllhpc3yx.fsf at ruckus.brouhaha.com>...
> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.
I dunno if this has been done recently, but this is the sort of thing
that people used to do for very large SGML documents. I forget the
details, but I remember documents of some hundreds of MB (parsed or
unparsed, I'm not sure) being written out in a parsed form to large
files, which could then be manipulated as if the whole object were in
memory. Of course no one would care about a few hundred MB of memory
now, but they did then (this was 91-92, I think).
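Roughly, and glossing over real XML details (entities, CDATA, comments,
attributes with '>' in them), the scheme Paul describes might be sketched
like this in Python — the class name is made up, and the regex "parser"
is only good enough for illustration:

```python
import mmap
import re

class LazyDoc:
    """Index element offsets up front; fetch the text only on demand."""

    # Toy tag matcher for illustration only -- not a real XML parser.
    TAG = re.compile(rb'<(/?)([A-Za-z][\w.-]*)[^>]*>')

    def __init__(self, path):
        self._f = open(path, 'rb')
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self.index = []                  # (name, start, end) per element
        stack = []
        for m in self.TAG.finditer(self._mm):
            if m.group(1):               # closing tag: record the span
                name, start = stack.pop()
                self.index.append((name.decode(), start, m.end()))
            elif not m.group(0).endswith(b'/>'):
                stack.append((m.group(2), m.start()))

    def text(self, i):
        """Pull one element's raw bytes off the mmap'ed file on demand."""
        name, start, end = self.index[i]
        return self._mm[start:end].decode()
```

The index is small (three values per element) while the document text
stays on disk until a `text()` call slices it out of the mmap.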
I had a theory of doing this all lazily, so you wouldn't have to do
the (slow) parsing step up front but would just lie and say `OK, I
parsed it', then actually do the work only on demand.
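In Python terms, the crudest version of that lie is a proxy that claims
to hold a parsed tree but defers the real parse until someone touches
it — the `LazyTree` name is my own, and this defers the whole parse
rather than individual subtrees:

```python
import xml.etree.ElementTree as ET

class LazyTree:
    """Say 'OK, I parsed it' up front; actually parse on first access."""

    def __init__(self, path):
        self._path = path
        self._root = None            # nothing parsed yet

    @property
    def root(self):
        if self._root is None:       # pay the parsing cost only on demand
            self._root = ET.parse(self._path).getroot()
        return self._root
```

Constructing a `LazyTree` is instant regardless of file size; the slow
step happens on the first `root` access and is cached thereafter.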