[XML-SIG] building XML docs using ?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Mon, 14 May 2001 22:19:42 +0200
> I am converting many large "legacy" text files to XML. Some of the
> original text files are upwards of 100 MB. What is the most efficient,
> using the speed/memory metrics, way to convert these text files to XML?
The less markup, the less the memory overhead, and the faster the
processing. So if you have a plain text file with contents XXX, the
most efficient XML document you could get (from the viewpoint of
parsing speed) is
<plaintext>
XXX
</plaintext>
Provided there is no markup in XXX, this is also the smallest XML
document storing all bytes of XXX :-)
> Currently, I parse through the text files and create a DOM Document
> representation.
Ah, so you are apparently bound by some DTD. In that case, it very
much depends on how complex the transformation is.
> node = doc.createElement(...)
> node.setAttribute(...)
> node.appendChild(...)
> docelement.appendChild(node)
So you create one element per line, in a single pass over the file?
That is quite a simple conversion procedure.
> Should I forgo the ease of using the DOM objects and simply output
> "hand-generated" markup?
Yes, definitely.
> I was doing this previously, it's efficient, but definitely not as
> nice/clean as it could be...
Why is that? If you create the right template for a single line, e.g.
template = '<elem attr1="%d" attr2="%s">%s</elem>'
then a simple print statement would suffice to fill out this template.
This also makes for a nice separation of structure and content.
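A minimal sketch of the template approach, written for current Python
rather than the print statement. It assumes each input line looks like
"number|name|text"; that layout and the names convert_line and convert
are hypothetical, chosen only for illustration. The values go through
escape() and quoteattr() from xml.sax.saxutils, so that any "<", "&" or
quote characters in the legacy text cannot break the output.

```python
from xml.sax.saxutils import escape, quoteattr

# quoteattr() supplies the surrounding quotes itself, so attr2 has none here.
TEMPLATE = '<elem attr1="%d" attr2=%s>%s</elem>'


def convert_line(line):
    """Turn one 'number|name|text' line (assumed layout) into one element."""
    num, name, text = line.rstrip("\n").split("|", 2)
    return TEMPLATE % (int(num), quoteattr(name), escape(text))


def convert(lines, out):
    """Write one element per input line; only one line is held in memory."""
    out.write('<?xml version="1.0"?>\n<doc>\n')
    for line in lines:
        out.write(convert_line(line) + "\n")
    out.write("</doc>\n")
```

Since an open file object can be passed as lines, a 100 MB input is
streamed line by line and never loaded as a whole.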
> So basically, is there a lightweight XML module which provides for (as a
> graphics programmer would say) "immediate mode" output, with as nice an
> interface as the DOM modules?
You could use the SAX interfaces, essentially implementing a Reader
class, and using an xml.sax.saxutils.XMLGenerator as the content handler.
Then, you'd do proper startElement and endElement calls; the
XMLGenerator will do immediate output.
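A rough sketch of the XMLGenerator route, under some assumptions: it
skips the Reader class and calls the handler methods directly, and the
element names "doc" and "elem", the attribute "n", and the function name
convert are made up for illustration. The generator writes each event to
the output stream as soon as it is called, and it escapes character data
and attribute values itself.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def convert(lines, out):
    """Emit one <elem> per input line through SAX events (no DOM is built)."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("doc", AttributesImpl({}))
    gen.characters("\n")
    for number, line in enumerate(lines, 1):
        # Attribute values must be strings; "n" is a hypothetical attribute.
        gen.startElement("elem", AttributesImpl({"n": str(number)}))
        gen.characters(line.rstrip("\n"))
        gen.endElement("elem")
        gen.characters("\n")
    gen.endElement("doc")
    gen.endDocument()


if __name__ == "__main__":
    buf = io.StringIO()
    convert(["a < b\n", "plain text\n"], buf)
    print(buf.getvalue())
```

Compared with the string template, this costs a few more calls per line
but leaves quoting and escaping to the library.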
> Oh, and BTW, can XML solve all my problems??? ;-)
Almost. To get rich quick, you still need to write chain letters :-)
Regards,
Martin