[Doc-SIG] formalizing StructuredText

Edward D. Loper edloper@gradient.cis.upenn.edu
Wed, 21 Mar 2001 12:48:02 EST


> Paragraph (not distinguished normally from the other sorts, which *also*
> have special names). If I had to distinguish this, I'd probably call it
> a "paragraph with a blank line before it" (remember, that *might*
> include the other sorts of thing, too).

I think it may be useful to distinguish this.  And "paragraph with a
blank line before it" is definitely *not* what you want: it leaves out
what I would call a paragraph at the beginning of a document, and it
could include any other basic block that happens to have a blank line
before it (which is required for headings, etc.).

> >     * basic block = paragraph or list item or heading or label (or
> >       table?)
> 
> Paragraph (see above)

I think this is somewhat misleading/confusing, but I guess that's
up to you to decide.

> > > 3. trailing whitespace is thrown away
> >
> > Trailing whitespace for the string as a whole?  For each basic
> > block?  For each line?  Is this true in literal blocks?
> 
> For each line. True in all places (you can't, in general, see them, so
> there we go).
>
> For literal blocks, newlines are preserved, but I can't see any obvious
> point in preserving trailing spaces.

I guess that seems reasonable.  Within paragraphs, do you collapse
multiple spaces into one space?
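
For concreteness, here is how I read the per-line rule, as a rough
Python sketch.  The function name is made up for illustration, and it
deliberately does *not* collapse interior spaces, since that question
is still open:

```python
def strip_trailing_whitespace(text):
    """Drop trailing whitespace from every line, keeping the newlines.

    Hypothetical helper: it applies the same rule everywhere, including
    inside literal blocks, and leaves leading indentation untouched.
    """
    return "\n".join(line.rstrip() for line in text.split("\n"))
```

Indentation survives because only the right-hand end of each line is
touched.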

> > Agreed.  Although how do you put something at zero indentation?
> > Maybe indent from 1 space over from the preceding paragraph?
> 
> You don't. I've never wanted to (my problems with HTML normally come
> from trying to do the opposite).

Hm.  I'm not sure I agree with this, but I don't think it's important
enough to get hung up on.  (I would argue that you should be able
to put things in column 0, but that the HTML renderer should probably
indent preformatted regions relative to everything else.)

> > > >     "the following is not a url":<hi>
> 
> That's right. In this instance.

So does it get rendered as is (i.e., with two quote signs, one colon,
a less-than sign, and a greater-than sign)?

> I can't see, in docutils (STminus is another kettle of fish) that error
> detection (apart from paragraph indentation and paragraph label
> detection) is other than a bunch of heuristics, almost certainly one or
> more REs, that point out *possible* problems to a user wanting
> validation. So it becomes a matter of identifying the set of REs we want
> to warn about.

As I (think I) said earlier, it should be possible to do error detection
in a principled way, given a formal definition of ST.  We should be able
to print out *all* problems, not just *possible* problems, if the user
really wants us to.  This seems very important to me if we want to allow
for the possibility of competing implementations of ST.

> > > >     Do *quotes "have to* nest" properly with coloring?
> >
> > But from the point of view of formalizing things, I have two
> > choices here:
> >    1. say that it contains a bold region, and the quotes are just
> >       rendered as quotes
> >    2. say that it's undefined (i.e., an invalid string).
> 
> Undefined isn't invalid - it's undefined. At least to me, even in a
> formal context, that's true (i.e., not "I don't know" but "I shan't
> decide").

I'm calling undefined a subset of invalid (invalid = illegal + undefined).

> On the other hand, once I'm sure I've got the order of
> markup/colourising correct, I'll be happy to regard it as so, and then
> you could "freeze" it. But is that a good approach?

The markup-nesting problem doesn't actually seem that difficult to me,
in principle.  I propose that we allow anything to nest within anything,
with the restrictions:
  1. nothing can nest inside a literal, inline, or href url
  2. nothing can nest within itself (even with intervening levels)

So the legal nestings are shown in this tree:

  * literal
  * inline
  * emph
     * literal
     * inline
     * strong
         * literal
         * inline
         * href name
             * inline
             * literal
         * href url
     * href name
         * strong
         * literal
         * inline
     * href url
  * strong
     * literal
     * inline
     * emph
         * literal
         * inline
         * href name
             * inline
             * literal
         * href url
     * href name
         * emph
         * literal
         * inline
     * href url
  * href name
     * literal
     * inline
     * strong
         * literal
         * inline
         * emph
             * inline
             * literal
     * emph
         * strong
         * literal
         * inline
  * href url
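
The two restrictions can also be checked mechanically.  Here is a
minimal Python sketch; the function name and the list-of-strings
representation of a nesting path are just assumptions for illustration.
It encodes only the two rules, so it may accept a few paths that the
tree above does not spell out:

```python
# Rule 1: nothing can nest inside these markup types.
LEAVES = {"literal", "inline", "href url"}

def is_legal_nesting(path):
    """Check a nesting path, given from outermost to innermost type.

    Hypothetical helper; e.g. ["emph", "strong", "literal"] means a
    literal inside a strong region inside an emph region.
    """
    seen = set()
    for depth, kind in enumerate(path):
        # Rule 2: nothing nests within itself, even with
        # intervening levels.
        if kind in seen:
            return False
        # Rule 1: the enclosing type must not be a leaf type.
        if depth > 0 and path[depth - 1] in LEAVES:
            return False
        seen.add(kind)
    return True
```

So is_legal_nesting(["emph", "strong", "href name", "literal"]) holds,
while ["emph", "strong", "emph"] fails rule 2 and ["literal", "emph"]
fails rule 1.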

Also, spaces must come between * and ** delimiters, so you 
can't say ***this***.
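
A validator could flag this case with a very simple check.  The sketch
below is Python with a made-up function name, and it assumes that a run
of three or more asterisks is the only way the two delimiters can touch.
It does not know about literals, where such a run would presumably be
fine:

```python
import re

# Three or more consecutive asterisks mean a * delimiter is touching
# a ** delimiter with no space in between.
_STAR_RUN = re.compile(r"\*{3,}")

def has_touching_star_delimiters(text):
    """Return True if text contains a run like *** (hypothetical check)."""
    return _STAR_RUN.search(text) is not None
```

This flags "***this***" but not "*a **b** c*", where the delimiters are
separated by spaces and text.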

(Footnote markers [like_this] would probably pattern like href urls)

-Edward