[XML-SIG] building XML docs using ?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 14 May 2001 22:19:42 +0200


> I am converting many large "legacy" text files to XML.  Some of the
> original text files are upwards of 100 MB.  What is the most efficient,
> using the speed/memory metrics, way to convert these text files to XML?

The less markup, the less the memory overhead, and the faster the
processing. So if you have a plain text file with contents XXX, the
most efficient XML document you could get (from the viewpoint of
parsing speed) is

<plaintext>
XXX
</plaintext>

Provided there is no markup in XXX, this is also the smallest XML
document storing all bytes of XXX :-)

> Currently, I parse through the text files and create a DOM Document
> representation.

Ah, so you are apparently bound by some DTD. In that case, it very
much depends on how complex the transformation is.

>     node = doc.createElement(...)
>     node.setAttribute(...)
>     node.appendChild(...)
>     docelement.appendChild(node)

So you create one element per line, in a single pass over the file?
That is quite a simple conversion procedure.

> Should I forgo the ease of using the DOM objects by simply generating
> outputting "hand-generated" markup?  

Yes, definitely.

> I was doing this previously, it's efficient, but definitely not as
> nice/clean as it could be...

Why is that? If you create the right template for a single line, e.g.

template = '<elem attr1="%d" attr2="%s">%s</elem>'

then a simple print statement would suffice to fill out this template.
This also makes for a nice separation of structure and content.
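
As a rough sketch of the template approach: the element and attribute
names below are placeholders, and the 'legacy' label and the <doc>
wrapper are assumptions made only for illustration. The one thing the
bare template does not handle is escaping, so the sketch passes the
content through escape() and the attribute value through quoteattr()
from xml.sax.saxutils.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical names. quoteattr() returns the value with its own
# surrounding quotes, so attr2 carries no quotes in the template.
TEMPLATE = '<elem attr1="%d" attr2=%s>%s</elem>'

def format_line(number, label, text):
    """Fill the template for one input line, escaping &, < and >."""
    return TEMPLATE % (number, quoteattr(label), escape(text))

def convert(lines, out):
    """Write one element per input line to out, in a single pass."""
    out.write('<doc>\n')
    for number, line in enumerate(lines, 1):
        out.write(format_line(number, 'legacy', line.rstrip('\n')) + '\n')
    out.write('</doc>\n')
```

Since convert() only ever holds one line at a time, memory use stays
flat regardless of the input size; pass it an open file object for
lines and another for out.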

> So basically, is there a lightweight XML module which provides for (as a
> graphics programmer would say) "immediate mode" output, with as nice an
> interface as the DOM modules?  

You could use the SAX interfaces, essentially implementing a Reader
class, and using an xml.sax.saxutils.XMLGenerator as the content
handler. Then, you'd do proper startElement and endElement calls; the
XMLGenerator will do immediate output.
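
A minimal sketch of that idea, simplified to drive the XMLGenerator
directly instead of wrapping the loop in a Reader class. The function
name stream_lines and the element and attribute names are made up for
illustration.

```python
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def stream_lines(lines, out):
    """Emit one element per input line; nothing is buffered as a tree."""
    gen = XMLGenerator(out, encoding='utf-8')
    gen.startDocument()
    gen.startElement('doc', AttributesImpl({}))
    for number, line in enumerate(lines, 1):
        gen.characters('\n')
        # Attribute values must be strings; the generator quotes them.
        gen.startElement('elem', AttributesImpl({'attr1': str(number)}))
        # characters() escapes &, < and > in the text.
        gen.characters(line.rstrip('\n'))
        gen.endElement('elem')
    gen.characters('\n')
    gen.endElement('doc')
    gen.endDocument()
```

Each call writes to out right away, so this keeps the streaming
behaviour of hand-written markup while leaving quoting and escaping
to the generator.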

> Oh, and BTW, can XML solve all my problems???  ;-)

Almost. To get rich quick, you still need to write chain letters :-)

Regards,
Martin