XML

Mon Jun 23 07:50:57 EDT 2003

Paul Boddie wrote:

>> Indeed. The assertion that "7-bit plain text ASCII" is even a
>> meaningful format is highly dubious, at least when it comes to
>> understanding the information presented in that format.

Erik Max Francis wrote:

> 7-bit ASCII is pretty clear, it's just ASCII (which only defines the
> lower 7 bits).

I think we could find our discussion going around in circles here, because one
person's encoding is another person's format. I think it is appropriate to
define some terminology for further discussion:

Encoding: (Roughly) A mapping from byte values to character glyphs or
references.

Structure: (Roughly) A sequence of (possibly nested) content areas, with
embedded character tokens delimiting areas of content, e.g. "\n". It should be
possible to write a deterministic processor which parses a structure.

Format: A given structure, which may be stored in a file in one or more
encodings.
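
To make the distinction concrete, here is a minimal sketch in Python. The
record layout, the sample data and the names `record` and `parse_record` are
all made up for illustration. It stores one structure (fields delimited by ";"
and terminated by "\n") in two different encodings:

```python
# One structure: fields delimited by ";" and terminated by "\n".
record = "Ren\u00e9;Dublin;42\n"   # hypothetical sample data

# Two encodings of that same structure produce different byte sequences.
as_latin1 = record.encode("iso-8859-1")
as_utf8 = record.encode("utf-8")

def parse_record(raw, encoding):
    """Decode the bytes first, then apply the structure's delimiter rules."""
    text = raw.decode(encoding)
    return text.rstrip("\n").split(";")

fields_latin1 = parse_record(as_latin1, "iso-8859-1")
fields_utf8 = parse_record(as_utf8, "utf-8")
```

The bytes on disk differ, but once the receiver knows the encoding, the same
parser recovers the same fields. Note that nothing in this made-up format tells
the receiver which encoding was used; that has to be agreed separately.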

I believe that some of the proponents of ASCII as a format are confusing
encoding with file structure. Yes, ASCII can be used to represent MIME
documents, for example, but MIME also implies a complex structure, which must be
parsed. This posting is encoded in 7-bit ASCII, but it's not a MIME document
(although it will get wrapped in a MIME document and unwrapped before you read
it).

I propose that having to write special case parsers for every novel format that
an app designer might devise for their application is a significant burden.
Deciding your delimiters and grammar rules is usually non-trivial. Encoding that
grammar is also non-trivial. There is undoubtedly a benefit to be had from
following that process: clarity of expression (e.g. "python" v. "xython"), speed
and memory efficiency, etc. But if you're just shunting basic data around (e.g.
the classic "purchase order" example), then maybe this extra effort is not worth
it. Even more so when you're exchanging data with other people's systems: you
then have to implement your format processors on their platform too, e.g. in
Java.

With app-specific formats, you most often have to use app-specific data models.
This is a good thing when there is a clear and specific advantage to the data
model, such as more efficient querying, or memory efficient representation. But
you still have to write bespoke class hierarchies and query processors for the
data model. If you're just sending and receiving purchase orders, then maybe a
proprietary format is not worth the effort.

What XML gives in relation to all the above is:

1. It deals with (almost) all data interchange encoding problems, in a manner
that is cleaner than other formats. The user should declare a character encoding
for the file, and if they don't specify one, then they get the default "utf-8"
(or UTF-16, if the file starts with a byte order mark). The recipient can easily
decipher what they've received, without guessing or sophisticated processing.

2. It deals with all format issues, permitting you to represent most imaginable
data relatively cleanly, without having to decide delimiters, tokens, grammar
rules, etc.

3. It provides a very wide range of interoperable software that gives a full
range of choice when making trade-offs of speed vs resources vs usefulness, etc.
By this I mean that, for example, you could process with SAX2 and build your own
object model, or you could just build a DOM, thus saving yourself a stack of
work but consuming a fair percentage of your system memory (which you may or may
not care about).

4. Building an instance of a data model for an XML file is trivial, even for the
beginner:

import xml.dom.minidom
mydom = xml.dom.minidom.parse('myfile.xml')

I wonder how the average newbie would get on with, say, SPARK or PLY, parsing
something like URIs, HTTP or MIME?

5. XML offers a number of specialised query languages, which are suitable for a
wide range of use cases. Not having to write your own query processors is a
significant time saving. Furthermore, the query strings will be completely
platform-independent (assuming compliance of the processor with standards).

6. XML offers a number of schema languages which allow you to constrain the content
of a document to a fine degree. In a proprietary grammar scenario, you would get
this kind of structure validation for "free" as a result of implementing your
grammar. With XML, it's an optional extra step that you can take, meaning that
newbies can get started easily, and advance their knowledge as and when they
need to. The schema documents are also completely platform independent (though I
don't dare to think about how large the test suite is for XML Schema. TREX and
RELAX are superior, IMHO).
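
As a rough illustration of points 1 and 3, here is a sketch using only the
Python standard library. The purchase order document, its element and attribute
names, and the `QtyCounter` class are invented for the example. The parser
picks up the encoding from the XML declaration, and the same bytes can be
processed either as a DOM tree or as a stream of SAX events:

```python
import xml.dom.minidom
import xml.sax

# Hypothetical purchase order stored as ISO-8859-1 bytes; the XML
# declaration tells the recipient's parser which encoding was used.
order_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    '<order customer="Ren\u00e9">'
    '<item sku="A1" qty="2"/>'
    '<item sku="B7" qty="5"/>'
    '</order>'
).encode("iso-8859-1")

# DOM: one call builds the whole tree in memory.
dom = xml.dom.minidom.parseString(order_bytes)
customer = dom.documentElement.getAttribute("customer")
dom_total = sum(int(item.getAttribute("qty"))
                for item in dom.getElementsByTagName("item"))

# SAX: events are streamed to a handler, which keeps only what it needs.
class QtyCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.total = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; only <item> elements are counted.
        if name == "item":
            self.total += int(attrs["qty"])

counter = QtyCounter()
xml.sax.parseString(order_bytes, counter)
sax_total = counter.total
```

Both routes arrive at the same total. The DOM version is less code, while the
SAX version never holds the whole document in memory, which is the trade-off
described in point 3.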

I'm not putting down proprietary formats. When XSL started out, XPath had not
yet been separated from XSLT (which had not yet been separated from XSL-FO).
Some attempts at enabling node selection from a tree structure involved writing
matching rules that were expressed as XML documents. They were highly
unreadable. Thankfully, the XPath people decided to adopt a non-XML grammar, and
XPath queries are now very simple and powerful. I've taught several beginner
programmers XPath, and they've all got most of it very quickly. If the queries
were expressed as XML documents, I'm sure they would have found it more
difficult to learn.
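
To give a flavour of how compact such queries are, here is a small sketch. It
assumes the `xml.etree.ElementTree` module from the current Python standard
library, which implements only a limited subset of XPath; the document and its
element names are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order, made up for this example.
order = ET.fromstring(
    '<order>'
    '<item sku="A1" qty="2"/>'
    '<item sku="B7" qty="5"/>'
    '</order>'
)

# Select the <item> child whose sku attribute equals "B7".
b7 = order.find("./item[@sku='B7']")
b7_qty = b7.get("qty")

# Select every <item> child, in document order.
skus = [item.get("sku") for item in order.findall("./item")]
```

The query string "./item[@sku='B7']" is not tied to Python; a compliant XPath
processor on another platform should accept the same expression.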

And then, of course, there is the best example of all: our beloved Python. I
shudder to think how horrible Python would look if the syntax were XML :-O~

And lastly, there are URIs. I'm pretty certain that if URIs were expressed in
XML, we wouldn't be seeing them on advertising billboards and taxi cabs.

-- 
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/mailto/alan