[Doc-SIG] Structuring: a summary; and an attempt at EBNF..

Edward D. Loper edloper@gradient.cis.upenn.edu
Wed, 18 Apr 2001 15:30:00 EDT


I figured I'd give a summary of all the structuring features that
I think we've agreed on, so we can tentatively take those as
a given.  If anyone objects, please say so.

  1. Paragraphs are left-justified and separated by blank lines.

  2. Literal blocks start with a paragraph that ends with "::"
     and continue to the next line whose indentation is equal to or
     less than that of the paragraph that started them.  Literal
     blocks should be indented and separated by blank lines.

  3. Doctest blocks start with ">>> " and continue to the next blank
     line.  Doctest blocks should be indented and separated by blank
     lines. 

  4. Lists should be indented and separated by blank lines.  List
     items within a list don't need to be separated by blank lines.
     List items start with bullets, which are either "-" or a single
     number followed by a period, like "1." or "12.".

  5. The second and subsequent lines of a list item are indented.
     This includes list items with multiple paragraphs, sublists, etc.

  6. Sections begin with headings, which are underlined with "=", "-", 
     or "~" (for level 1, 2, or 3 headings, respectively).

  7. Colorizing takes place entirely within paragraphs, and does not
     interact with structuring.
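
As a rough illustration of rules 4 and 6, bullets and heading underlines
could be recognized along the following lines.  This is only a sketch: the
helper names `split_bullet` and `heading_level` are hypothetical, and the
choice to require whitespace (or end of line) after a bullet is an
assumption that the summary does not spell out.

```python
import re

# Hypothetical helpers; only the bullet forms ("-", "1.", "12.") and the
# underline characters ("=", "-", "~") come from the summary above.
# Assumption: a bullet must be followed by whitespace or the end of the line.
BULLET_RE = re.compile(r"(-|\d+\.)(?:\s+|$)")
HEADING_LEVELS = {"=": 1, "-": 2, "~": 3}


def split_bullet(line):
    """Return (bullet, rest) if the line starts with a bullet, else None."""
    stripped = line.strip()
    match = BULLET_RE.match(stripped)
    if match is None:
        return None
    return match.group(1), stripped[match.end():]


def heading_level(underline):
    """Return 1, 2, or 3 for a heading underline, or None otherwise."""
    text = underline.strip()
    # An underline is a non-empty run of one repeated character.
    if text and len(set(text)) == 1:
        return HEADING_LEVELS.get(text[0])
    return None
```

Note that a line consisting of a single "-" passes both tests; the sketch
leaves that ambiguity to the caller.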

In my mind, the major questions left to resolve are:
  1. how to do colorizing?  Two main proposals: like C{this} and like
     `this`/*this*. 
  2. how to do escaping?
  3. do we need any other structuring constructs (e.g., fields,
     directives, footnotes, etc)?  If so, which ones, and how
     should we add them?

=====

Below is my first attempt at an EBNF-like formalism for these rules.
You should probably pay more attention to the "one-minute summary"
above than to the rules below -- I almost certainly didn't get
the rules below quite right (although if you want to point out
ways that I got it wrong, please do! :) ).

IND and DED are indent and dedent (by a single space); I use
the notation IND[n] to mean n IND tokens.  Note that the rule::

   x = a IND[n] b DED[n] c

is really just shorthand for::

   x = a y c
   y = IND y DED | b

However, I also use the foo[n] notation in one place where it can't
be simplified.  That's because in list items like:

   - this is a list
     item.

     Here's a second paragraph.

there are crossing dependencies.  In particular, the IND/DED need
to match up, but assuming that we want "this is a list item" to
be the result of a "paragraph" production, they can't.  Don't worry
if you don't understand what I just said; I think it should still
be relatively easy to understand the EBNF below.

I assume that, as part of the preprocessing, all indents/dedents
have been changed to IND/DED tokens.  This process ignores blank
lines, which are simply reduced to be empty.
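
The preprocessing step described above might be sketched in Python roughly
as follows.  The function name `tokenize_indentation` and the token shapes
("IND", "DED", "NL", and ("LINE", text) tuples) are hypothetical, and the
sketch assumes that indentation uses spaces only and that any indentation
still open at the end of the text is closed with DED tokens.

```python
def tokenize_indentation(text):
    """Return a list of "IND", "DED", "NL", and ("LINE", content) tokens.

    Assumptions (for illustration only): indentation uses spaces, each
    IND/DED stands for one space, and NL separates consecutive lines.
    """
    tokens = []
    level = 0  # current indentation, in spaces
    for i, raw in enumerate(text.split("\n")):
        if i > 0:
            tokens.append("NL")
        stripped = raw.strip()
        if not stripped:
            # Blank lines are reduced to empty and never change the level.
            tokens.append(("LINE", ""))
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        if indent > level:
            tokens.extend(["IND"] * (indent - level))
        else:
            tokens.extend(["DED"] * (level - indent))
        level = indent
        tokens.append(("LINE", stripped))
    # Close any indentation that is still open at the end of the text.
    tokens.extend(["DED"] * level)
    return tokens
```

Because one token is emitted per space, the number of IND tokens always
equals the number of DED tokens in the output.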

================================================================
The top-level production::
 #  pytext = (BlankLine NL)*
 #           IND[n]
 #             (Para | List | Section | DocTestBlk)
 #             ((COLON COLON NL LitBlk) |
 #              (NL BlankLine NL (Para | List | Section | DocTestBlk)))*
 #           DED[n]
 #           (NL BlankLine)*

(pytext is just a convenient name; we'll probably want another.)  This
production assumes that the first-line-might-not-be-indented problem
has already been taken care of.  It says that a formatted docstring
consists of any number of blank lines, followed by an indented section
containing at least one paragraph, list, section, or doctest block,
followed by zero or more literal blocks, paragraphs, lists, sections,
or doctest blocks.  And there can be extra blank lines at the end.
The productions "Para", "List", "Section", etc. generally do *not*
include their trailing NL, because that makes it easier to detect
paragraphs that end with COLON COLON.

Some useful types of lines are:
  - BlankLine: consists only of spaces.
  - TextLine: non-blank line.
    - StartLine: doesn't start with a Python prompt or a bullet
    - ContLine: anything
    - EndLine: doesn't end with "::"; doesn't include trailing spaces?
    - StartEndLine: doesn't start with PyPrompt or Bullet, and 
                    doesn't end with "::".

We can define them as::
 # BlankLine = (empty)
 # TextLine = [^ NL IND DED]+
 # StartLine = (?! PyPrompt | Bullet) TextLine
 # EndLine = [^ NL IND DED]* [^ NL IND DED COLON] [^ NL IND DED] |
 #           [^ NL IND DED]* [^ NL IND DED] [^ NL IND DED COLON] |
 #           [^ NL IND DED]
 # StartEndLine = (?! PyPrompt | Bullet) EndLine
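
The line types above could also be expressed as simple Python predicates.
This is a sketch, not a definitive reading of the rules: the function names
are hypothetical, lines are assumed to have had their indentation already
converted to IND/DED tokens, and trailing spaces are ignored when checking
for "::" (one possible answer to the open question on EndLine).

```python
import re

PY_PROMPT = ">>> "  # doctest prompt, from rule 3 of the summary
# Bullet forms from rule 4; requiring whitespace after them is an assumption.
BULLET_RE = re.compile(r"(-|\d+\.)(\s|$)")


def is_blank_line(line):
    """BlankLine: consists only of spaces (or is empty)."""
    return line.strip(" ") == ""


def is_start_line(line):
    """StartLine: a non-blank line not starting with a prompt or a bullet."""
    return (not is_blank_line(line)
            and not line.startswith(PY_PROMPT)
            and BULLET_RE.match(line) is None)


def is_end_line(line):
    """EndLine: a non-blank line that does not end with "::"."""
    # Assumption: trailing spaces are ignored before checking for "::".
    return not is_blank_line(line) and not line.rstrip(" ").endswith("::")


def is_start_end_line(line):
    """StartEndLine: satisfies both StartLine and EndLine."""
    return is_start_line(line) and is_end_line(line)
```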

As I said above, paragraphs don't include the trailing newline.
Paragraphs ending in "::" don't include the "::".::
 # SimplePara = StartLine (NL ContLine)* EndLine |
 #              StartEndLine

Lists are indented (n>1)::
 # List = IND[n] LI (BlankLine+ LI)* DED[n]

We need special list-starting paragraphs.  These don't include
trailing newlines, either::
 # LS_IndPara[n] = ContLine NL IND[n] ContLine (NL ContLine)*
 # LS_OneLinePara = EndLine

There are 3 types of list item::
 # LI = LI1 | LI2 | LI3

This production gives the contents of a list item, *after* its first
paragraph::
 # LI_Rest = ((COLON COLON NL LitBlk) |
 #            (NL BlankLine NL (Para | List | DocTestBlk)))+

List Item, form 1: start with a one-line paragraph, then indentation,
contents, and corresponding dedents.  The indentation/contents/dedent
is optional, so this also covers list items with just a one-line para
(no indent)::
 # LI1 = Bullet LS_OneLinePara
 #         (IND[n]
 #            (BlankLine+ (Para | List | DocTestBlk | LitBlk))+
 #          DED[n])?

List Item, form 2: start with a paragraph containing indentation, then
contents, then corresponding dedent::
 # LI2 = Bullet LS_IndPara[n]
 #              (BlankLine+ (Para | List | DocTestBlk | LitBlk))+
 #              DED[n]

List Item, form 3: this is used when the bullet's on a line by itself::
 # LI3 = Bullet NL
 #         (IND[n]
 #            (BlankLine+ (Para | List | DocTestBlk | LitBlk))+
 #          DED[n])?

Sections consist of a heading, followed by an indented section
that can contain anything (i.e., pytext)::
 # Section = Heading NL pytext

DocTestBlocks are terminated by blank lines.  They must be indented::
 # DocTestBlk = IND[n] PyPrompt (ContLine NL)+ DED[n]

Literal blocks.  Within the literal block, all indents/dedents must be 
matched::
 # LitBlk = IND LitBlkContents DED
 # LitBlkContents = [^ IND DED]+ | IND LitBlkContents DED
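
The matched-indentation requirement for literal blocks can be checked with
a simple depth counter, sketched below.  The function name
`match_literal_block` is hypothetical, tokens are assumed to be plain
strings with "IND" and "DED" reserved, and the sketch takes a slightly
looser reading than the production above: it allows text and nested
indentation to be mixed freely inside the block.

```python
def match_literal_block(tokens, start=0):
    """Return the index just past a literal block, or None if none matches.

    The block must open with IND at tokens[start], contain at least one
    token other than IND/DED, and end at the DED that balances the first
    IND.  (Looser than the EBNF: content and nesting may be interleaved.)
    """
    if start >= len(tokens) or tokens[start] != "IND":
        return None
    depth = 0
    has_content = False
    for i in range(start, len(tokens)):
        tok = tokens[i]
        if tok == "IND":
            depth += 1
        elif tok == "DED":
            depth -= 1
            if depth == 0:
                # Back at the starting level: the block ends here.
                return i + 1 if has_content else None
        else:
            has_content = True
    # Ran out of tokens before the opening IND was balanced.
    return None
```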

================================================================
I'm sure I didn't get that quite right, but it's a
start, anyway.

-Edward