[Doc-SIG] formalizing StructuredText

Edward D. Loper edloper@gradient.cis.upenn.edu
Fri, 16 Mar 2001 12:58:51 EST

Tibs noted:
> We need to be careful about that word "whitespace" (which I note you
> sometimes still use to mean "blank lines" as well).

Yeah, I've been playing a bit fast and loose with terminology in my
emails.. :)  Speaking of terminology, I want to make sure that we're
using somewhat consistent terminology.  In particular, I think my
use of the following terms may not coincide with what you call
things.  What are your terms for the following?  

    * inline = region marked with #hashes#.
    * paragraph = a text paragraph; not a list item or a heading or
      a label
    * basic block = paragraph or list item or heading or label (or
    * blank line = (S* NL) | (S* EOS)
    * literal block = region following a '::'.
    * invalid string = string that is not given a meaning by an ST
      variant.  (in the terms used by the STminus proposal, strings
      that are not assigned a structure by a language).
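
To make the "blank line" definition concrete, here is a minimal
sketch of it as a Python regular expression.  It assumes S means a
single space character (tabs already expanded), NL means "\n", and
EOS means end of string; the name is_blank_line is made up for
illustration and isn't taken from any existing ST code.

```python
import re

# Assumed reading of the definition: (S* NL) | (S* EOS), where
# S = one space, NL = "\n", EOS = end of string.
BLANK_LINE = re.compile(r" *(?:\n|\Z)")


def is_blank_line(line):
    """Return True if `line` is only spaces, optionally ended by a newline."""
    return BLANK_LINE.fullmatch(line) is not None
```

Under this reading the empty string counts as a blank line (it
matches S* EOS), and a line containing a tab does not.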

Tibs continued:
> When I am talking, I have some assumptions (which, of course, may not be
> evident):
> 1. by the time discourse occurs, all tabs have gone away

Agreed.  We should probably also discard/transform any whitespace
that isn't space or newline (e.g., form feed, carriage return).
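
Something like the following preprocessing pass is what I have in
mind.  It's only a sketch: the function name, the tab size of 8, and
the choice to turn form feeds and vertical tabs into spaces (rather
than discarding them) are all assumptions, not anything we've agreed
on.

```python
def normalize_whitespace(text, tabsize=8):
    """Leave only spaces and newlines as whitespace in `text`."""
    # Carriage returns: treat "\r\n" and a lone "\r" as a newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Tabs: expand to spaces (tabsize is an assumed default).
    text = text.expandtabs(tabsize)
    # Form feed / vertical tab: mapped to a space here; discarding
    # them would be an equally defensible choice.
    return text.translate({0x0C: " ", 0x0B: " "})
```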

> 2. blank lines are blank lines - white space in them is ignored
>    thrown away (lost for good)

Is this true in literal blocks?
Also, I'm guessing you collapse multiple consecutive blank lines 
into one.

> 3. trailing whitespace is thrown away

Trailing whitespace for the string as a whole?  For each basic
block?  For each line?  Is this true in literal blocks?

> 4. literal paragraphs retain leading whitespace following "the
>    rules" (which say they are actually indented relative to the
>    preceding non-literal paragraph - this makes much more sense
>    in ST than "with respect to the left margin").

Agreed.  But then how do you put something at zero indentation?
Maybe measure the indent from 1 space to the right of the preceding
paragraph?
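
Here's a rough sketch of that "1 space over" idea, so we can see
what it would mean in practice.  The function name and the decision
to raise an error for under-indented lines are my own assumptions;
paragraph_indent is the column at which the preceding non-literal
paragraph starts.

```python
def literal_block_lines(lines, paragraph_indent):
    """Strip the indentation that a literal block owes to its paragraph.

    Zero indentation is taken to be one space to the right of the
    preceding paragraph (an assumed rule, not settled).
    """
    base = paragraph_indent + 1
    result = []
    for line in lines:
        if not line.strip(" "):
            # Blank lines carry no indentation information.
            result.append("")
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent < base:
            raise ValueError("literal line is not indented enough: %r" % line)
        result.append(line[base:])
    return result
```

With a paragraph at column 2, a literal line starting at column 3
comes out at zero indentation, and anything further right keeps its
extra spaces.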

> So, at the end of that, the term "whitespace" should be replaced by the
> term "spaces". Newlines (sometimes I call them "line breaks", which may
> be a better term) are a different thing.

So we won't use the term whitespace.  Instead, we'll use the terms
space, newline, and blank line.

> Clearly for a string literal that does not contain a newline, spaces are
> to be transcribed to spaces (probably - flag a rendering issue as to
> whether they're *hard* spaces (the correct number) or *unbreakable*
> spaces (the correct number AND no newlines)).

I vote for unbreakable, but it may be possible to persuade me.

> Equally clearly, if one does not allow newlines in string literals,
> that's the end of the matter. We've done our job.

Which is what I vote for. :)

> Unfortunately for simplicity, I saw that I could choose to lose newlines
> *if I so wished*, and after a bit of thinking I decided I did so wish,
> for the reasons I gave. In *that* case, one has to consider what the
> sequence::
> 	<newline> <indentation>
> means within a literal string, and clearly the only consistent thing
> *for* it to mean is a single space.

Well, it could also mean a single newline (or <BR> in html).  But we 
shouldn't even go there. :)
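
For comparison, here are the two behaviours side by side as a small
sketch.  Both function names are hypothetical; the first follows
the "newline plus indentation means a single space" reading, the
second is the stricter "no newlines in literals" position.

```python
import re


def join_literal_lines(literal):
    """Lenient reading: <newline> <indentation> becomes one space."""
    return re.sub(r"\n *", " ", literal)


def check_single_line(literal):
    """Strict reading: a literal containing a newline is an error."""
    if "\n" in literal:
        raise ValueError("newline inside literal: %r" % literal)
    return literal
```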

> > Hm.  ick.  I don't like that.
> Yes, well, that's the problem, and I need to think how much I *do* like
> it, and then argue it out.

Here's my current take on linebreaks in literals.  Feel free to

Advantages of allowing linebreaks in literals:
  * you can have longer literals
  * you can press alt-q in emacs to have it re-word-wrap your
    paragraph, and not think about it (as much; you still have
    to worry about list items, labels (in the future), and maybe
    other things).
  * implementation reasons
  * the meaning of spaces and newlines in literals is not obvious
    to the un-initiated (no matter which meaning we choose).

Advantages of not allowing linebreaks in literals:
  * you force people to use shorter literals
  * spaces in literals are meaningful in an obvious way
  * you're more likely to catch errors, because you're keeping
    things local.
  * it's conceptually simpler (i.e., easier to explain).

Of course, if we say that linebreaks are not allowed in literals,
docutils can still go on allowing them there, while just saying that
it's "making a best guess", whereas a parser I wrote would probably
flag a warning/error.

> > I don't see why someone would ever really need a very long literal..
> > And if they don't mind it being broken up, they can split it up
> > themselves..
> Hmm. I have done in the past (but as ever, can't remember detailed
> examples).

It seems to me that either:
  1. it's a literal that you don't mind having broken up, so you can
     break it up yourself (although then I question if it's really 
     a single literal?)
  2. it's a literal that you think shouldn't be broken up, so you
     shouldn't break it up in the plaintext -- put it on one line, and
     readers will have to understand that it's more than 75 characters
     because it shouldn't be broken up.

> > Used to.  Doesn't now.  Who knows if/when/how it'll change. :)
> Oh dear.

I find myself saying that a lot when I play with STNG. :)  Hopefully
getting a formal definition will start to change that..

> > Hm.  So no roman numerals in STpy?  ok.
> Aagh - no, you're right. My mistake.
> (although, at a different tack, I don't think "e" is a roman numeral?)

No, it's not, but you get the point. :)  I used that example because
STNG currently allows *any* sequence of letters followed by a dot.

> I'm not yet convinced about individual alphabetics - I *do* tend to use
> that style myself quite a lot.

I think that simplicity should be an important design goal for ST.
But I might let single letters followed by a dot slide..  Esp. if
parsers could give warnings when ordered lists were not ordered in a 
sensible way, as in::

   This is not intended to be an ordered list, says
   I.  But it starts with "I" instead of "A", so it will
   flag an error.
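
A check along these lines would be enough to catch that case.  This
is just a sketch for single-letter labels: the function name, the
rule that a list must start at "A" or "a", and the rule that labels
must be consecutive are assumptions about what "sensible" means.

```python
def check_letter_list(labels):
    """Return warnings for a sequence of single-letter list labels.

    `labels` holds the letters only, without the trailing dot,
    e.g. ["A", "B", "C"].
    """
    warnings = []
    if not labels:
        return warnings
    first = labels[0]
    if first not in ("A", "a"):
        warnings.append("list starts with %r instead of 'A' or 'a'" % first)
    # Each label should be the letter right after the previous one.
    for prev, cur in zip(labels, labels[1:]):
        if ord(cur) != ord(prev) + 1:
            warnings.append("label %r does not follow %r" % (cur, prev))
    return warnings
```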

> >     "Here the *name* 'contains' markup":url
> Aagh - it's an order thing, 'cos at the moment URL recognition is done
> by colourising. Given I don't want to worry about "internal markup" yet,
> that *may* mean URLs must be done immediately after literals, and before
> other markup.

Hm.. I'm confused.  So you would get::

  <a href="url">Here the *name* 'contains' markup</a>

?  Or::

  "Here the <I>name</I> <CODE>contains</CODE> markup":url

?  Or something else?  But at any rate, my question was more one of
what ST "should" do, not what it does do..  One other case to 
consider is::

  *"I would prefer this":url* to "*this*":url

> >     "This name spans multiple
> >     lines":url
> Remember my code doesn't see newlines any more inside paragraphs, so
> that's no problem...

But if we decide that literals/code can't span newlines...? :)  Still
seems to me that names should be able to span newlines, though.

> >     "the following is not a url":<hi>
> If it happens to parse as a URL, it is, if it happens not to parse as a
> URL, it isn't - either way, it's the writer's fault for doing daft
> things.

Yes, but do we get an error because we used '":' in a silly context
(if we're asking the parser to tell us about errors)?

> >     Do *quotes "have to* nest" properly with coloring?
> No, and I don't expect ever to try to make the code worry about it -
> that would get grabbed as one or the other, under *any* scheme I'm ever
> likely to write.

But from the point of view of formalizing things, I have two
choices here:
   1. say that it contains a bold region, and the quotes are just
      rendered as quotes
   2. say that it's undefined (i.e., an invalid string).

> And regardless of whether one *should* be able to have a dot at the end
> of a URL, I think in practice we may need to forbid it so we can have a
> fullstop there instead (as I said, I think STClassic may do that, and
> STNG certainly *did*).

Ok.  But that will need special mention somewhere.  So we don't
include the final dot if it's followed by a space, end of line, or
end of string, right?  But what about ".."?  This seems like it
will be very messy to formalize.. :(
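
The simple half of the rule is easy enough to sketch.  This assumes
the candidate URL has already been cut out as a run of non-space
characters, so whatever follows it is a space, end of line, or end
of string.  The function name is hypothetical, and stripping only
one dot from ".." is a guess, since that case is still open.

```python
def strip_sentence_dot(token):
    """Split a space-delimited token into (url, trailing_punctuation).

    A single final dot is treated as a full stop, not part of the URL.
    """
    if token.endswith("."):
        # Only one dot is removed; what ".." should mean is undecided.
        return token[:-1], "."
    return token, ""
```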

> > > 2. URLs will not be allowed to span multiple lines, 
> > > [...]
> > Agreed..
> (although, of course, as I said above, I currently have this blindness
> about linebreaks - but you may argue me out of that yet 

Of course, since URLs shouldn't have spaces in them anyway, this
isn't a problem.