[XML-SIG] ANN: SLiP and SLIDE - a quick XML shorthand syntax and tool for editing

Bud P. Bruegger
Mon, 19 Aug 2002 19:28:19 +0200


hello,

(A little late,) I noticed the announcement of SLiP and the follow-up
discussion on XML shorthand on this list.  Have you guys followed up on
the topic, and are you working on a joint specification/implementation?
I would be interested in joining.  Below I describe some ideas.

In 1999, I did some work on an XML shorthand syntax in the context of a
successor to Ian Clatworthy's SDF (see
http://www.cpan.org/modules/by-authors/id/IANC/sdf-2.000.readme).  Some
of the ideas may be quite applicable to the discussion here.  Here is a
modernized/simplified version (that takes your postings into account):

[see also examples below]

I think of the XML document in terms of DOM nodes, where nodes are
either elements, attributes, text, or comments.  (Maybe processing
instructions, entities, etc. should be added as well.)

Generally, each node starts on a line of its own and is often contained
in a single line.  (Note that PYX uses a similar approach; see for
example http://www.xml.com/pub/a/2000/03/15/feature/index.html.)  This
makes parsing easy, makes the structure more explicit, and possibly
allows ad-hoc analysis using tools such as grep.  Most importantly,
however, this approach also makes it much easier to mix text nodes with
element nodes, as in the following example:
"this is a <strong>bold</strong>word".

The hierarchy of nodes is defined by (pythonic) indentation.
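
As a rough illustration of indentation-defined nesting, here is a
minimal Python sketch.  It assumes spaces only (no tabs) and treats the
content of each line as an opaque string; the function name
nest_by_indentation is made up for this example and is not part of any
existing tool.

```python
def nest_by_indentation(lines):
    """Turn indented lines into a nested list of (text, children) pairs."""
    root = []
    # Stack of (indent_width, children_list); the sentinel -1 holds the top level.
    stack = [(-1, root)]
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no structure in this sketch
        indent = len(line) - len(line.lstrip(" "))
        # Close every open level that is at least as deep as the current line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = (line.strip(), [])
        stack[-1][1].append(node)
        stack.append((indent, node[1]))
    return root


tree = nest_by_indentation([
    "*root",
    "  *address",
    "    *street",
    "  *note",
])
# tree == [("*root", [("*address", [("*street", [])]), ("*note", [])])]
```

Any consistent indentation width works here, since only the relative
depth of consecutive lines matters.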

An element has the format 
<indentation>*<optionalNamespace:><name>

An attribute has the format
<indentation>@<optionalNamespace:><name><optionalWS>"<value>"
(WS=white space without newline)

A comment has the format
<indentation>#<optionalWS>"<value>"
(WS=white space without newline)
Note that this does not seem to conflict with the use of '#' in URLs.

A text node has the format 
<indentation>"<singleLineText>" 

or on multiple lines:

<indentation>"""
<indentation><firstLineOfText>
...
<indentation><lastLineOfText>"""

details:
  Note that occurrences of '"' or '"""' after the opening '"' or '"""'
  should be escaped with '\'.  Similarly, the special characters "*",
  "@", "#", and "%" as first characters of a multi-line text need to be
  quoted with "\".  Also, it is on purpose that the multi-line text
  starts only on the line below the '"""'.  I like the indentation
  better that way, particularly for pre-formatted things.

  Note also that instead of (single or triple) double quote characters
  ('"'), single quote characters ("'") could be used equivalently.

  Also, the indentation is stripped off in the equivalent XML document.
  This prevents multi-line (possibly pre-formatted) text from breaking
  the visual structure of the document.
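
To make the stripping and escaping rules a bit more concrete, here is a
small Python sketch of how the lines following an opening '"""' might be
turned into text.  It rests on two assumptions that the description
above leaves open: the indentation of the first text line is the amount
stripped from every line, and a leading "\" before "*", "@", "#", or "%"
is checked per line.  The helper names are invented for the example.

```python
import re

QUOTE_ESCAPE = re.compile(r"""\\(["'])""")   # \" or \' anywhere in a line
MARKER_ESCAPE = re.compile(r"^\\([*@#%])")   # \*, \@, \# or \% at line start


def strip_indent(line, indent):
    # Remove at most `indent` leading spaces so deeper indentation survives.
    n = 0
    while n < indent and n < len(line) and line[n] == " ":
        n += 1
    return line[n:]


def read_multiline_text(lines, quote='"""'):
    """Return the text given the lines that follow the opening quote line."""
    # Assumption: the first text line defines the indentation to strip.
    indent = len(lines[0]) - len(lines[0].lstrip(" "))
    body = [strip_indent(line, indent) for line in lines]
    if not body[-1].endswith(quote):
        raise ValueError("multi-line text is not closed")
    body[-1] = body[-1][:-len(quote)]
    # Undo the backslash quoting of markers and quote characters.
    body = [QUOTE_ESCAPE.sub(r"\1", MARKER_ESCAPE.sub(r"\1", line)) for line in body]
    return "\n".join(body)


text = read_multiline_text([
    "      Please leave packages with Grouch in",
    '      garbage can next door."""',
])
# text == "Please leave packages with Grouch in\ngarbage can next door."
```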

For rapid authoring, a text node is not a strictly literal text node:
the common special characters are automatically replaced by their
character entities.  For example, "1 > 0" really represents
"1 &gt; 0".
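
In Python, this automatic substitution could lean on the standard
library.  The sketch below is only one possible reading of "common
character entities" (namely &, <, and >); the wrapper name
normal_text_to_xml is hypothetical.

```python
from xml.sax.saxutils import escape


def normal_text_to_xml(text):
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    return escape(text)


print(normal_text_to_xml("1 > 0"))         # 1 &gt; 0
print(normal_text_to_xml("AT&T <rocks>"))  # AT&amp;T &lt;rocks&gt;
```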

Also, in single-line text, if line breaks need to be expressed
precisely, using '\n' to represent them is highly beneficial (see
example 2 below).

If entity substitution (quoting) is not desired, and for many other
uses, variations of the plain text node can be used.  They have the
following format:

<indentation>%<textType><optionalWS>"<singleLineText>" 

or in the case of multiple lines:

<indentation>%<textType><optionalWS>""" ...

While the approach really offers unlimited possibilities, I have thought
of the following possible text types:

  * normal:  i.e., equivalent to stating no %<textType>
  
  * raw:     no quotation, or a different substitution.  This also makes
             it possible to embed xml into the shorthand document

  * structuredText or st:  This makes it easy to write lists, emphasized
             words, etc.

  * stNG:    the same in a different version

  * sh:      the resulting text returned by some shell command

  * py:      the resulting text returned by some python expression

  * incl:    inclusion of the content of an external file (could be done
             with sh)

  * img:     translates to an html image element but smartly computes
             size, changes format, creates thumbnail, or similar.

  * table:   some smart way of making tables (see for example the
             several approaches in SDF that were quite easy and
             successful; I always used a fixed-format one)

Obviously, there are unlimited possibilities, and the shorthand package
could come with a small predefined library and a simple mechanism to add
one's own.  (I implemented a web template system many years ago in perl
that used the same approach very successfully.)  While I haven't thought
this through in detail, the text processing code could either expect a
certain formatting convention (as in ordered lists or tables) and/or use
attributes that are associated with the text node.  (Note that this is
an extension of the DOM model, where text nodes cannot have attributes.)
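
One way such a "simple mechanism" might look in Python is a registry
that maps a text type name to a handler function.  This is a sketch
only: the names TEXT_TYPES, text_type, and render_text are invented
here, and only the two simplest types from the list above are filled in.

```python
from xml.sax.saxutils import escape

# Hypothetical registry: text type name -> function turning source text into XML text.
TEXT_TYPES = {}


def text_type(*names):
    """Decorator that registers a handler under one or more type names."""
    def register(func):
        for name in names:
            TEXT_TYPES[name] = func
        return func
    return register


@text_type("normal")
def normal(text):
    return escape(text)  # substitute &, < and >


@text_type("raw")
def raw(text):
    return text  # pass through untouched, e.g. for embedded xml


def render_text(text, type_name="normal"):
    try:
        handler = TEXT_TYPES[type_name]
    except KeyError:
        raise ValueError(f"unknown text type: {type_name!r}") from None
    return handler(text)


print(render_text("1 > 0"))            # 1 &gt; 0
print(render_text("<b>x</b>", "raw"))  # <b>x</b>
```

Aliases such as "structuredText" and "st" would simply be two names
passed to the same decorator call.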

Some useful shortcuts:
  While I proposed to start each DOM node on a separate line, here is a
  possible exception that makes it possible to be terser (but makes it
  impossible to analyse the document using easy grepping).  In the case
  where an element has no attributes and contains a single, single-line
  text node, the text can be added to the same line as the node:

  <indentation>*<optionalNS:><elementName><optionalWS>"<text>"

  Similarly, in the case of a multi-line text following an
  attribute-less element, one could write:

  <indentation>*<optionalNS:><elementName><optionalWS>"""
  <incrementedIndentation><firstTextLine>
  ...
  <incrementedIndentation><lastTextLine>"""
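
A parser could treat the single-line shortcut as pure sugar and expand
it back into the one-node-per-line form before doing anything else.  The
following Python sketch shows the idea; the regular expression, the
function name expand_shortcut, and the fixed indentation step of two
spaces are all assumptions, and the multi-line variant is deliberately
left untouched.

```python
import re

SHORTCUT = re.compile(
    r"^(?P<indent> *)\*(?P<name>[\w:.-]+)[ \t]*"  # indentation, '*', element name
    r"(?!\"\"\"|''')"                              # not the start of a multi-line text
    r"(?P<text>\".*\"|'.*')$"                      # a quoted single-line text
)


def expand_shortcut(line, step=2):
    """Split '*name "text"' into an element line and an indented text line."""
    m = SHORTCUT.match(line)
    if m is None:
        return [line]  # not a shortcut line: leave it as it is
    indent = m.group("indent")
    return [
        f"{indent}*{m.group('name')}",
        f"{indent}{' ' * step}{m.group('text')}",
    ]


print(expand_shortcut('    *street "123 Sesame Street"'))
# ['    *street', '      "123 Sesame Street"']
```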

The implementation seems rather straightforward.  In particular, the
type of node can be detected by a parser simply looking at the
beginning of each line.
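
For instance, a first-character dispatch along these lines would do in
Python.  It is a sketch with invented names (MARKERS, classify), and it
ignores one complication: lines inside a multi-line text must not be
classified at all, so a real parser has to track whether it is currently
inside a '"""' block.

```python
# First non-blank character of a line -> kind of node it starts.
MARKERS = {
    "*": "element",
    "@": "attribute",
    "#": "comment",
    "%": "typed-text",
    '"': "text",
    "'": "text",
}


def classify(line):
    """Return (indentation width, node kind) for one shorthand line."""
    content = line.lstrip(" ")
    indent = len(line) - len(content)
    kind = MARKERS.get(content[:1], "unknown")
    return indent, kind


print(classify('    @type "home"'))  # (4, 'attribute')
print(classify("*root"))             # (0, 'element')
```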

One source of ambiguity that has to be resolved is how to differentiate
empty elements from non-empty elements without content.  For example,
how should the shorthand differ for <empty/> and <nonempty></nonempty>?
One possible solution would be to use "*/" instead of "*" to prefix
empty elements.

While converting shorthand to xml is trivial, more challenges are to be
expected when going from xml back to shorthand.  There is an ambiguity
about which types of text to use and how to choose among them.  If it is
possible to come up with simple rules for the kind of text to choose
(always normal except for UL, OL, and TABLE elements), it is easy.
Normal text may optionally be wrapped differently.  Since I propose an
open framework, the mapping is not necessarily always that well defined.
Maybe each text module needs to define some rules for when it can be
applied?
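
To show what the simplest rule (always normal text) could look like,
here is a Python sketch built on the standard ElementTree module.  It is
not a complete converter: it uses the unabbreviated one-node-per-line
form, drops whitespace-only text, ignores comments and namespaces, and
assumes that a literal backslash is written as "\\", which the
description above does not specify.  The names quote and to_shorthand
are made up for the example.

```python
import xml.etree.ElementTree as ET


def quote(text):
    # Escape backslashes (an assumption) and double quotes; spell line breaks as \n.
    text = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{text}"'


def to_shorthand(elem, depth=0, step=2):
    """Return the shorthand lines for an element, using only normal text nodes."""
    pad = " " * (depth * step)
    inner = " " * ((depth + 1) * step)
    lines = [f"{pad}*{elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{inner}@{name} {quote(value)}")
    if elem.text and elem.text.strip():
        lines.append(inner + quote(elem.text))
    for child in elem:
        lines.extend(to_shorthand(child, depth + 1, step))
        # Text after a child's end tag belongs to the parent's level.
        if child.tail and child.tail.strip():
            lines.append(inner + quote(child.tail))
    return lines


doc = ET.fromstring(
    '<p class="x">Some words are <strong>bold</strong> and some are not.</p>'
)
print("\n".join(to_shorthand(doc)))
# *p
#   @class "x"
#   "Some words are "
#   *strong
#     "bold"
#   " and some are not."
```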

One strength of the approach seems to be that, in the case of
document-centric xml, it can precisely preserve line breaks and white
space that, for example, make a difference in how web browsers render
(x)html.

Anyhow, I hope this is of interest, and I would be happy to discuss more
or participate in an implementation.

kind regards

--bud



----------- example 1 -------------------

*root
  # "this is an example of my xml shorthand ideas"
  *address
    @type "home"
    *street "123 Sesame Street"
    *city "Wonderland"
    *state "CA"
    *zipCode "90012"
    *comment """
      Please leave packages with Grouch in
      garbage can next door."""

----------- example 2 -------------------

*root
  @attr1 "cool"
  @attr2 'moose'
  *budNS:someElement
    "Some words are "
    *strong "bold"
    " and some are not.\n"
    #"""
    note that this is a single line consisting of
    two text nodes that surround a _strong_ element"""
  *someOtherElement
    %structuredText'''
    It would be rather more cumbersome to write

    * nested unordered lists
       * such as this
       * and others
    * and ordered lists or tables

    without the use of _structured text_.'''
    *someNestedElement '''
      note that the flexible use of single or double quote characters 
      makes quoting of " and ' easier even when they are tripled as in
      """ or \'''.'''
    *foo
      "this is equivalent to*"
    *foo "this format here"
    *note
       """
       this is the same (format) equivalence
       example on multiple lines"""
    *note """
       with two lines
       of text here"""

	
/-----------------------------------------------------------------
| Bud P. Bruegger, Ph.D. 
| Sistema (www.sistema.it)
| Via U. Bassi, 54
| 58100 Grosseto, Italy
| +39-0564-411682 (voice and fax)
\-----------------------------------------------------------------