[XML-SIG] building XML docs using ?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 14 May 2001 22:19:42 +0200

> I am converting many large "legacy" text files to XML.  Some of the
> original text files are upwards of 100 MB.  What is the most efficient,
> using the speed/memory metrics, way to convert these text files to XML?

The less markup, the less the memory overhead, and the faster the
processing. So if you have a plain text file with contents XXX, the
most efficient XML document you could get (from the viewpoint of
parsing speed) is

<a>XXX</a>

Provided there is no markup in XXX, this is also the smallest XML
document storing all bytes of XXX :-)

> Currently, I parse through the text files and create a DOM Document
> representation.

Ah, so you are apparently bound by some DTD. In that case, it very
much depends on how complex the transformation is.

>     node = doc.createElement(...)
>     node.setAttribute(...)
>     node.appendChild(...)
>     docelement.appendChild(node)

So you create one element per line, in a single pass over the file?
That is quite a simple conversion procedure.

> Should I forgo the ease of using the DOM objects by simply
> outputting "hand-generated" markup?

Yes, definitely.

> I was doing this previously, it's efficient, but definitely not as
> nice/clean as it could be...

Why is that? If you create the right template for a single line, e.g.

template = '<elem attr1="%d" attr2="%s">%s</elem>'

then a simple print statement would suffice to fill out this template.
This also makes for a nice separation of structure and content.
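
A minimal sketch of the template approach, assuming one element per
input line. The names `doc`, `elem`, `attr1`, `attr2`, and the
`convert` function are placeholders for illustration, not anything
from the original script. The one thing hand-generated markup needs
beyond the template is escaping of `<`, `&` and quotes, which
xml.sax.saxutils provides:

```python
import io
from xml.sax.saxutils import escape, quoteattr

# quoteattr() returns the value with its own surrounding quotes,
# so the template has no quotes around the attr2 slot.
TEMPLATE = '<elem attr1="%d" attr2=%s>%s</elem>\n'


def convert(lines, out, source_name="legacy"):
    """Write one <elem> per input line to `out`, one line at a time."""
    out.write("<doc>\n")
    for number, line in enumerate(lines, 1):
        text = line.rstrip("\n")
        out.write(TEMPLATE % (number, quoteattr(source_name), escape(text)))
    out.write("</doc>\n")


# Usage: any iterable of lines works, including an open file object.
buf = io.StringIO()
convert(["a < b\n", "c & d\n"], buf)
```

Because each line is written as soon as it is read, memory use stays
flat regardless of the size of the input file.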

> So basically, is there a lightweight XML module which provides for (as a
> graphics programmer would say) "immediate mode" output, with as nice an
> interface as the DOM modules?  

You could use the SAX interfaces, essentially implementing a Reader
class, and using an xml.sax.saxutils.XMLGenerator as the content handler.
Then, you'd do proper startElement and endElement calls; the
XMLGenerator will do immediate output.
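
A rough sketch of that arrangement, under some assumptions: the
`LineReader` class and the `doc`/`elem`/`attr1` names are made up for
illustration, and `parse()` here takes any iterable of text lines
rather than a proper SAX InputSource.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl, XMLReader


class LineReader(XMLReader):
    """Hypothetical reader: reports each text line as one <elem> element."""

    def parse(self, source):
        # Assumption: `source` is an iterable of lines, not an InputSource.
        handler = self.getContentHandler()
        handler.startDocument()
        handler.startElement("doc", AttributesImpl({}))
        handler.characters("\n")
        for number, line in enumerate(source, 1):
            attrs = AttributesImpl({"attr1": str(number)})
            handler.startElement("elem", attrs)
            # XMLGenerator escapes <, > and & in character data.
            handler.characters(line.rstrip("\n"))
            handler.endElement("elem")
            handler.characters("\n")
        handler.endElement("doc")
        handler.endDocument()


# Usage: XMLGenerator writes each event to `buf` as soon as it arrives.
buf = io.StringIO()
reader = LineReader()
reader.setContentHandler(XMLGenerator(buf, encoding="utf-8"))
reader.parse(["a < b\n", "c & d\n"])
```

The reader never builds a tree, so this keeps the streaming behaviour
of the template approach while leaving escaping and tag balancing to
the generator.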

> Oh, and BTW, can XML solve all my problems???  ;-)

Almost. To get rich quick, you still need to write chain letters :-)