XML overuse? (was Re: Python to XML to Python conversion)

Fri Jul 12 09:44:10 EDT 2002

On 12/7/2002 9:37, in article HzwX8.71016$vm5.2601431 at news2.tin.it, "Alex
Martelli" <aleax at aleax.it> wrote:

> Jonathan Hogg wrote:
>       ...
>> I'm not sure it is possible to "overuse" XML.
> 
> It is -- easily.  My pet peeve is the idea of using XML files for
> tasks that obviously need a real database, preferably a relational one.
> 
> I think some people never really GOT relational databases, no matter
> that they've been around for decades and are so widespread, and they're
> now turning to *overusing* XML to cover up for that:-).

OK, I'll narrow my earlier statement to:

    I'm not sure, when discussing file formats for off-line structured
    data, that it is possible to "overuse" XML.

Clearly when the data has to be accessed on-line (by this I mean that the
file is left open and used throughout the lifetime of the application) a
database makes much more sense. I was assuming my statement would be read
within the context of the thread (Pickling data).

>> If you need to read and
>> write structured data, why bother coming up with your own format? (see:
>> the entire contents of /etc) Or why use something that is proprietary to a
>> particular language or system? (see: Pickle)
> 
> Speed, size, and convenience are possible reasons.  If the structure
> is highly repetitious, and the amount of data is very large, then
> repeating the tags identically a zillion times can impose substantial
> overhead of space and time. [...]

I like to consider these as special cases where I have no choice but to use
a compact file format. Even there, if space is the issue (rather than time),
running the file through a good compressor/decompressor as it is
written/read is likely to result in much better savings than trying to think
of a super-compact binary layout.

> The need to search "random-access" wise, or "keyed"-wise, is often
> an excellent reason to avoid a format that requires reading though
> all of a file to get at a particular piece of data.  dbm variants,
> shelve, and relational databases, can be huge wins here (compared to
> XML, pickle, or any other choice requiring whole-file reloading).

As above, a specialised database will always make more sense here. Though
arguably, for small quantities of data that can be slurped completely into
memory, it may not make any difference. Also, XML databases can be used and
make a lot of sense if the data can be arbitrarily structured. People are
often very poor at designing relational schemas.

> While XML is reasonably human-editable, there may well be formats
> that are more convenient than it for this purpose, avoiding the
> need of special-purpose XML-oriented editors and allowing the use
> of any good old text editor with maximal ease.  This is a good
> reason to keep a human-editable configuration file in non-XML
> form, in my opinion.

I just don't think that this argument is strong. XML is readable enough and
most decent editors will provide at least rudimentary syntax highlighting
for XML. With the addition of DTDs or Schemas it is easy for a generic XML
editor to validate the format of the file - very useful for complex
configuration files.

I think the most value in using XML comes in the associated standards such
as XSLT, XPointer, XPath, etc. Consider a recent question on obtaining a
list of nameservers on a machine. The answer on UNIX was variously given as:
"read /etc/resolv.conf pulling out the lines beginning with 'nameserver' and
grabbing the part after that".

Consider if resolv.conf used XML - perhaps something looking like:

    <resolve-config>
      <searchpath>
        <domain> python.org </domain>
      </searchpath>
      <nameservers>
        <nameserver>123.45.67.89</nameserver>
        <nameserver>87.65.43.21</nameserver>
      </nameservers>
    </resolve-config>

You might say that this is verbose and monstrous, but it's readable and
fairly obvious in meaning. However, going back to the question, we can now
reference the information required using a simple XPath query:

>>> from xml.dom.minidom import parse
>>> from xml.xpath import Evaluate
>>> 
>>> config = parse( 'resolve-config.xml' )
>>> query = '/resolve-config/nameservers/nameserver/text()'
>>> for node in Evaluate( query, config ):
...     print node.nodeValue
... 
123.45.67.89
87.65.43.21

Now someone can follow up to this post with a similarly small program that
would have pulled the information out of the normal resolve.conf file
faster, but this is probably the simplest format of any configuration file
used in a UNIX system. Try getting anything useful out of an Apache config
file.

If Apache used XML, then I could write an XPath query that would pull out
the document root of a particular virtual server with a single query
something like:

 //virtual[normalize-space(server-name)='www.python.org']/document-root

A good XML editor would be able to point out errors in my configuration file
as I make them, and I could write an XSLT transformer to convert the
configuration file into a nice XHTML document, or more importantly use an
XSLT transformer to *create* the configuration file from XML website
specifications conveniently stored in a central XML configuration database.

> Let's try to avoid pro-XML hype in an attempt to counter the
> anti-XML hype that's suddenly burst on this group...:-)

Agreed :-)

But on the other hand, I don't think this is hype. This is stuff I have to
deal with all the time. My life would be about an order of magnitude easier
if everything used XML for configuration and interchange.

And the great thing? It's already happening, and I already see things being
made easier for me.

Jonathan