[XML-SIG] Re: PyDoc/XML?

John Day jday@csihq.com
Wed, 29 Sep 1999 11:20:45 -0400


Day:
>> So this was my original question: Is it possible to define a
>> markup-language like this in XML?
Uche:
>I don't get it.  Presumably your objection to XML-in-the-midst-of-code is
>all the palaver with angle brackets, beginning and end-tags, character
>entities and all that.  In that case, how can we possibly come up with an
>XML-based language that is not problematic?

Exactly. That's _my_ question. I guess you are confirming my hunch that XML
has some limitations for creating markup-languages that are 'friendly' and 
'readable' enough for humans to manipulate. 
>
>So I'm quite in the dark at this point.  I think it might help if you could 
>provide examples, even poor ones.
>
OK, I will illustrate what I'm working on.

Years ago I created a markup language called RML (Report Markup Language)
to allow documents to be marked up as 'boiler plate' for automatic report
generation. The markup language was very flexible and both declarative and
procedural in nature (you could choose mostly declarative or mostly
procedural, or anything in between). It was very friendly because you could
use WYSIWYG features of the word processor (DecWrite) as markups, or you
could insert RML markups (.BOLD, .CENTER, etc.). The RML parser read and interpreted
the document's structure along with the RML markups.

One of the first issues I had to resolve was this: Is RML embedded in the
document or is the document embedded in RML? The answer is: neither. A marked
up document is a _union_ of two languages, RML and the native structure of
the document. Together they form a single language which can be interpreted
and used to generate virtually any kind of document.

RML has evolved beyond DecWrite and report generation and is now a scripting
language superficially (quite coincidentally) resembling Python (without the
nasty indenting ;)

I have written a simple XML parser for RML, attempting to conform to the 1.0
Spec for well-formedness. (It also does HTML as a selectable option). I use
the parser exclusively through a SAX-like event interface. (Will do some 
DOM stuff eventually). 
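
For readers unfamiliar with a SAX-like event interface, here is a minimal
sketch using Python's standard xml.sax module rather than the RML parser
(which is not shown in this post). The TagCounter class, the count_tags
function, and the sample document are hypothetical names chosen only for
illustration.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Receives parser events; counts start tags and collects text."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # Called for character data between tags; skip pure whitespace.
        if content.strip():
            self.text.append(content.strip())


def count_tags(document):
    """Parse an XML string and return (tag counts, list of text chunks)."""
    handler = TagCounter()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.counts, handler.text


SAMPLE = "<doc><author>John Day</author><purpose>demo</purpose></doc>"
counts, text = count_tags(SAMPLE)
print(counts)
print(text)
```

The application never sees a tree; it only reacts to events as the parser
reports them, which is the style the RML script at the end of this message
also uses.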

Going back to my insight that a marked-up document is a union of two
languages: a programming language source file is an ideal candidate for my
kind of RML processing because both languages are highly structured and
machine readable.

My goals for producing reference documents from source files are:
1. Avoid marking up anything that can be deduced by syntactic recognition
   of the language itself. Thus it is foolish to make lists of function names,
   variables, data types etc., since that can all be figured out with more or
   less perfect accuracy by a parser.
2. Look for markups in the comments that reflect some aspect of the document
   organization (Author, dates, purpose). Usually a set of pretty tags would
   be helpful.
3. Automatically (or with slight assist from a pretty tag) recognize free-form
   comments that describe the functionality of the code and organize them
   in some sensible way. Allow structural tags to be embedded in this
   free-form stuff.
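
As an illustration of goal 1, here is a minimal sketch that recovers a
function list purely by syntactic recognition. It assumes the source file is
Python and uses the standard ast module as a stand-in for whatever parser a
real tool would use; list_functions and the SOURCE sample are hypothetical.

```python
import ast

SOURCE = '''
def foo(goo):
    """Makes foo out of goo."""
    return goo

def bar():
    pass
'''


def list_functions(source):
    """Return (name, first docstring line or '') for each top-level function."""
    tree = ast.parse(source)
    result = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node) or ""
            summary = doc.splitlines()[0] if doc else ""
            result.append((node.name, summary))
    return result


for name, summary in list_functions(SOURCE):
    print(name, "-", summary or "(no description)")
```

Nothing in the source had to be marked up by hand: the function names come
from the parser, and an undocumented function such as bar still appears in
the list.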

My hunch is that language coders will eventually produce good documentation
when they look at the (say) HTML-generated reference entities created by
their own comments. If it looks like something's missing or out of place they
will eventually learn how to write comments so that they look 'pretty' on the
doc page. But even if they do little or nothing, some useful documentation
(function list, variable names etc) will still be produced and displayed
in a useful manner.

The fuzzy part is the human-written documentation residing in the comment
blocks. But even this often has a regular or predictable structure. For 
example, I use the following template in all of my programs, regardless of
the language:
/*-----------------------------------------------------------------
 |
 |      FILE: template.c
 |
 |   PURPOSE: This module contains dummy routines for illustration
 |            of the ATK documentation format
 |
 |  ROUTINES:
 |            foo()              makes foo out of goo
 |
 |    AUTHOR: John Day/CSI
 |
 |      DATE: 1-Apr-98
 |
 |  MODIFICATIONS
 |
 |  2-Apr-98 Day  Added, changed, fixed etc.
 |
 +---------------------------------------------------------------*/

I would like to use tokens like 'AUTHOR:' and 'FILE:' as actual markups
mixed with other free-style comments.
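
A rough sketch of how tokens such as 'AUTHOR:' and 'FILE:' could be picked
out of a header comment. It is written in Python, not RML, and rests on
assumptions that are not part of the template: a field is an upper-case word
followed by a colon, and an indented line after a field continues its value.
The names FIELD and parse_header are hypothetical.

```python
import re

HEADER = """\
/*-----------------------------------------------------------------
 |
 |      FILE: template.c
 |
 |   PURPOSE: This module contains dummy routines for illustration
 |            of the ATK documentation format
 |
 |    AUTHOR: John Day/CSI
 |
 |      DATE: 1-Apr-98
 |
 +---------------------------------------------------------------*/
"""

# Assumed convention: optional '|' border, an upper-case keyword, a colon.
FIELD = re.compile(r"^\s*\|?\s*([A-Z]+):\s*(.*)$")


def parse_header(text):
    """Return a dict mapping field names to their (possibly multi-line) values."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
            continue
        body = line.strip().lstrip("|").strip()
        if not body:
            current = None  # an empty border line ends the current field
        elif current and not body.startswith(("/*", "+-")):
            fields[current] += " " + body  # continuation of the previous field
    return fields


for key, value in parse_header(HEADER).items():
    print(key, "=", value)
```

Free-style comment text that does not follow a keyword is simply ignored
here; a real tool would have to decide where to file it.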

So this is the auto-doc markup language for creating manuals out of source
code that I have in mind. And I want XML to play a role here, but it is not
clear to me what that role is. (A 'solution' in search of a 'problem'.) It looks
like using XML to create the actual markup language is not possible;
rather I will need an external program to extract/create the structure
and semantics of the goal document, then use XML as a representation language,
from which HTML, PostScript, RTF, or whatever can be generated. That sounds 
feasible.
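
A minimal sketch of that pipeline, assuming an external extractor has already
produced the header fields and the routine list. It uses Python's standard
xml.etree.ElementTree; the element names (module, routines, routine) and the
functions to_xml and to_html are invented for illustration and do not come
from any existing DTD or tool.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_xml(fields, functions):
    """Build an XML tree from extracted header fields and (name, summary) pairs."""
    module = ET.Element("module", name=fields.get("FILE", ""))
    for key in ("PURPOSE", "AUTHOR", "DATE"):
        if key in fields:
            ET.SubElement(module, key.lower()).text = fields[key]
    routines = ET.SubElement(module, "routines")
    for name, summary in functions:
        routine = ET.SubElement(routines, "routine", name=name)
        routine.text = summary
    return module


def to_html(module):
    """Render one possible output format (HTML) from the XML representation."""
    rows = "".join(
        "<li><b>%s</b>: %s</li>" % (escape(r.get("name")), escape(r.text or ""))
        for r in module.iter("routine")
    )
    return "<h1>%s</h1><ul>%s</ul>" % (escape(module.get("name")), rows)


doc = to_xml(
    {"FILE": "template.c", "AUTHOR": "John Day/CSI"},
    [("foo", "makes foo out of goo")],
)
xml_text = ET.tostring(doc, encoding="unicode")
html_text = to_html(doc)
print(xml_text)
print(html_text)
```

The XML tree is the single intermediate form; a PostScript or RTF writer
would be another function alongside to_html reading the same tree.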

I am trying to learn XML by developing my own tools. This has introduced
me to some of the more subtle aspects of XML and has caused me to revise
my opinion of what XML is. (Actually, I think my XML 'evangelists' don't
really know exactly what XML is and are abusing it by proposing it for
object databases and other somewhat inappropriate uses). I see XML
strictly as a way of marking up a 'document' to expose its structure and
semantics. Sure, documents are trees, like databases, but that doesn't
necessarily imply XML is a good way to implement a database. (Basic
necessities such as query languages don't exist yet.)

So I've been using my XML parser in HTML mode, while trying to think of
a good XML app.

Hope that's enough detail :)
Thanks, Uche, for all that you've done for XML and Python.

John Day
Staff Scientist
CSI Inc. Melbourne, FL


FYI, here's a little RML snippet that grabs a list of key 'related search'
phrases from AltaVista, just to give you an idea of what RML is (and how
it resembles Python):

#!./rml -s
# FILE: about (RML 'bot' for retrieving keyword phrases)
import 'url.rml';

global Count=0,Search=off,Indef=false,Related=false;
{*-------------- Set traps for starting tags --------------------------*}

def startElement(h, event, tbl)
    if (event == 'table')
        Search = on;
    elseif (event == 'td')
        Indef = true;
    endif
enddef {* startElement() *}


{*-------------- Filter phrases from trapped tags --------------------*}

def endElement(h, event, tbl)
    if (event == 'table')
        Search = off;
        Related = false;
    elseif (event == 'td')
        Indef = false;
        text = EXPLODE(IMPLODE(LOOKUP_TABLE(tbl, 'pc$data'), ' '),':-');
        if ('Related' in text(1))
            Related = true;
            if ( (LENGTH(text) >= 2))
                Count = Count + 1;
                log Count+". '"+REST(text(2))+"'";
            endif
            if ( (LENGTH(text) >= 3))
                Count = Count + 1;
                log Count+". '"+REST(text(3))+"'";
            endif
        endif
    endif


    if (Search and Indef and Related and (event in ['a' 'A']))
        Count = Count + 1;
        log Count+". '"+IMPLODE(LOOKUP_TABLE(tbl, 'pc$data'), ' ')+"'";
    endif
enddef {* endElement *}

{*--------------------- Run-time stub ------------------------------------*}

ifdef run$flag
  if ($argc==0)
    log 'usage: about <keyword>';
    return;
  endif
  url = 'www.altavista.com/cgi-bin/query?pg=q&kl=XX&stype=stext&q='+$1;

  fname = 'a_bout.html'; # save copy of last HTML file accessed

  parseURL(url,fname);
endif

#sample run: get phrases related to 'soup'
rml/work> about soup
1. 'Chicken Soup'
2. 'Cabbage Soup Diet'
3. 'Campbell Soup'
4. 'Soup Nazi'
5. 'tomato soup'
6. 'Talk Soup'
7. 'French Onion Soup'
8. 'Cabbage Soup'
9. 'Chicken Noodle Soup'
10. 'Pumpkin Soup'
11. 'Stone Soup'