[Doc-SIG] Problems With StructuredText

David Goodger dgoodger@bigfoot.com
Fri, 24 Nov 2000 23:08:03 -0500


==============================
 Problems With StructuredText
==============================
David Goodger (mailto:dgoodger@bigfoot.com)
2000-11-24

StructuredText_ is a great idea, but it does have flaws. Many of these
flaws go back to the original Setext_ (structure enhanced text)
specification (interesting reading!).

.. _StructuredText:
   http://dev.zope.org/Members/jim/StructuredTextWiki/FrontPage
.. _Setext: http://www.bsdi.com/setext

There are several problems, unresolved issues, and areas of controversy
within StructuredText. In order to resolve all these issues, I'd like to
bring them all out into the open, enumerate all the alternatives, and
propose solutions.

Problems below are labelled C for Classic StructuredText, NG for Next
Generation.

 1. No formal specification_. The code *is* the standard. (C, NG)
 2. Difficult to [understand and extend]_. (C, NG)
 3. Block/section structure via indentation_. (C, NG)
 4. No [escaping mechanism]_. (C, NG)
 5. Awkward [bullet list markup]_: 'o'. (C, NG)
 6. Problematic [enumerated list markup]_. (C)
 7. Ambiguous markup for [code blocks]_. (C, NG)
 8. Tables_. (C, NG)
 9. Awkward [inline code]_ markup. (C, NG)
10. Awkward [hyperlink markup]_. (C, NG)
11. Markup must start with whitespace_. (C, NG)

1. Formal Specification
=======================
.. _specification:

The description in the original StructuredText.py has been criticized for
being vague. "The code *is* the standard." Tony "Tibs" Ibbs has been working
on deducing a detailed description from the documentation and code of
StructuredTextNG_. His notes are available at:

    http://www.tibsnjoan.demon.co.uk/STNG-format.html

.. _StructuredTextNG:
   http://dev.zope.org/Members/jim/StructuredTextWiki/StructuredTextNG

The specification should always precede the code. Otherwise, StructuredText
is a moving target which can never be adopted as a standard.

2. Understanding and Extending the Code
=======================================
.. _understand and extend:

The original StructuredText is a dense mass of sparsely commented code and
inscrutable regular expressions. It was not designed to be extended and is
very difficult to understand. StructuredTextNG has been designed to allow
input (syntax) and output extensions, but its documentation (both internal
[comments & docstrings], and external) is inadequate.

I would like to see Structured Text become truly useful, perhaps even
joining Python's standard library. Therefore it must have clear,
understandable documentation and implementation code.

3. Structure via Indentation
============================
.. _indentation:

Setext_ required that body text be indented by 2 spaces. The original
StructuredText_ and StructuredTextNG_ require that section structure be
indicated through indentation, as 'inspired by Python'. For certain
structures (lists, code blocks, block quotes) indentation naturally
indicates structure/hierarchy. For section structure, indentation is
unnatural, wasteful of horizontal space, and awkward. Rather, the style of
the section title usually indicates its structure.

In the original StructuredText, sections consist of title paragraphs
followed by indented paragraphs and other body elements. Using indentation
is:

- Unnatural -- Most published works use title style (type size, face,
  weight, and position) rather than indentation to indicate hierarchy. When
  indentation is used, it is usually the formatted end-result and is there
  for aesthetic rather than structural purposes.

- Wasteful -- As the left indent is increased, the amount of horizontal
  space available for text decreases, unnecessarily extending the vertical
  length of a document.

- Awkward -- One must think about the formatting as the text is keyed in.
  And when structural changes are made (it is very common during the
  composition of a document to rearrange sections and their hierarchy) we
  must use block-indent and -unindent functions. In order to input documents
  using indentation, relatively advanced text editors must be used.

Python's significant whitespace is a wonderful innovation (even if it wasn't
original to Python), but applying indentation to ordinary written text is an
overgeneralization.

Instead, section structure through title style (as exemplified by this
document) is far more natural. In fact, it is already in widespread use in
plain text documents.

4. Escaping Mechanism
=====================
.. _escaping mechanism:

StructuredText needs a mechanism to treat markup-significant characters as
the characters themselves. Currently there is no such mechanism (although
ZWiki uses '!'). What are the candidates?

1. ! (http://dev.zope.org/Members/jim/StructuredTextWiki/NGEscaping)
2. \
3. ~
4. any others?

I believe the best choice for this is the backslash (\). It's the single
most popular escaping character in the world, therefore familiar. Since
characters only need to be escaped under special circumstances, which are
typically those explaining technical programming issues, the use of the
backslash is natural and understandable. Python docstrings can be raw
(prefixed with an 'r', as in 'r""'), which would obviate the need for
gratuitous doubling-up of backslashes.

The rule would be: A backslash followed by any character escapes the
character. The escaped character represents the character itself, and is
prevented from playing a role in any markup interpretation. The backslash is
removed from the output. A literal backslash is represented by two
backslashes in a row.
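
The rule above can be sketched in a few lines of Python. This is only an
illustration of the proposal, not code from any StructuredText
implementation; the function names are hypothetical, and keeping a lone
trailing backslash as a literal is an assumption the rule does not cover.

```python
def unescape(text):
    """Split text into (character, escaped) pairs per the proposed rule.

    A backslash followed by any character escapes that character and the
    backslash itself is dropped. Two backslashes yield one literal
    backslash, flagged as escaped.
    """
    result = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # The next character is literal and must not act as markup.
            result.append((text[i + 1], True))
            i += 2
        else:
            # Assumption: a lone backslash at the very end is kept as-is.
            result.append((ch, False))
            i += 1
    return result


def strip_escapes(text):
    """Return the output text with the escaping backslashes removed."""
    return "".join(ch for ch, _ in unescape(text))
```

A markup recognizer would consult the ``escaped`` flag and refuse to treat
flagged characters as start- or end-strings.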

XXX Allow backslashes preceding non-markup characters to remain in the
output? That might make describing regexes much easier.

5. Bullet List Markup
=====================
.. _bullet list markup:

StructuredText includes 'o' as a bullet character. This is dangerous and
counter to the language-independent nature of the markup. There are many
languages in which 'o' is a word. For example, in Spanish:

    Llamame a la casa
    o al trabajo.

    (Call me at home or at work.)

And in Japanese (when romanized):

    Senshuu no doyoubi ni tegami
    o kakimashita.

    ([I] wrote a letter on Saturday last week.)

If a paragraph containing an 'o' word wraps such that the 'o' is the first
text on a line, it could be misinterpreted as a bullet list.

I recommend omitting 'o' as a bullet character. '+' could be used instead.
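
To make the hazard concrete, here is a small Python sketch. The two patterns
are assumptions written for illustration (they are not taken from the
StructuredText source): one accepts '-', '*', and 'o' as bullets, the other
swaps 'o' for '+'.

```python
import re

# Assumed approximation of the classic bullet rule: '-', '*', or 'o'.
CLASSIC_BULLET = re.compile(r"^\s*[-*o]\s+")
# The recommendation above: drop 'o', allow '+'.
PROPOSED_BULLET = re.compile(r"^\s*[-*+]\s+")


def is_bullet(pattern, line):
    """Report whether a line would be read as the start of a bullet item."""
    return pattern.match(line) is not None


# The wrapped second line of the Spanish example.
wrapped = "o al trabajo."
print(is_bullet(CLASSIC_BULLET, wrapped))   # misread as a bullet item
print(is_bullet(PROPOSED_BULLET, wrapped))  # left alone as ordinary text
```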

6. Enumerated List Markup
=========================
.. _enumerated list markup:

StructuredText enumerated lists are allowed to begin with a number (sequence
of digits) followed by whitespace. This could have consequences for line
wrapping and writing styles::

    "That bird wouldn't *voom* if you put
    10000 volts through it!"

    1 is all I need.

I recommend requiring something after the number, a period ('.'), a colon
(':'), a dash ('-'), a space and a dash (' -'), a right-parenthesis (')'),
or surrounded with parentheses ('()'). Perhaps this list is excessive. But
forgiving is better than restrictive.
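
One way to express the recommended enumerator forms is a single regular
expression. The sketch below is an assumption about how such a rule might be
written, with hypothetical names; it handles only digit enumerators.

```python
import re

ENUMERATOR = re.compile(
    r"""^\s*
    (?:
        \((?P<paren>\d+)\)              # (1)
      |
        (?P<plain>\d+)(?:[.:)-]|\ -)    # 1.  1:  1)  1-  1 -
    )
    \s+                                 # whitespace before the item text
    """,
    re.VERBOSE,
)


def enumerator_number(line):
    """Return the item number if the line starts an enumerated item."""
    match = ENUMERATOR.match(line)
    if match is None:
        return None
    return int(match.group("paren") or match.group("plain"))
```

With this rule, neither ``10000 volts through it!"`` nor ``1 is all I need.``
is taken for a list item, because a bare number followed by whitespace no
longer qualifies.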

Should the digits/letters/numerals themselves be interpreted, allowing
nested enumerated lists to be created without indentation?

How about nested enumerated lists without indentation via compound
enumerators? Simply count the 'length' (number of sub-enumerators) of the
compound enumerator::

   1. one
   1.a. two
   1.a.I. three
   2.a. two
   2.b.I. three
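
Counting sub-enumerators could look like the following Python sketch. The
function name and the set of accepted sub-enumerators (digits, a single
letter, or Roman numerals) are assumptions. Note that an abbreviation such as
"e.g." would still be counted, so a real implementation would need more
context than a single line.

```python
import re

# Assumed sub-enumerator forms: digits, one letter, or Roman numerals.
SUB_ENUMERATOR = re.compile(r"\d+|[A-Za-z]|[IVXLCDM]+|[ivxlcdm]+")


def enumerator_depth(line):
    """Return the nesting depth implied by a compound enumerator, or 0."""
    parts = line.split(None, 1)
    if not parts:
        return 0
    enumerator = parts[0]
    if not enumerator.endswith("."):
        return 0
    # '1.a.I.' -> ['1', 'a', 'I']
    subs = enumerator[:-1].split(".")
    if not all(SUB_ENUMERATOR.fullmatch(sub) for sub in subs):
        return 0
    return len(subs)
```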

7. Code Blocks
==============
.. _code blocks:

The StructuredText specification has example code blocks indicated by
'example', 'examples', or '::' ending the preceding paragraph. STNG only
recognizes '::'; 'example'/'examples' are not implemented. This is good; it
fixes a language-dependent feature. The problem is what to do with the '::'.

I propose that '::' at the end of a paragraph indicate that subsequent
*indented* blocks are treated as example code. No further markup
interpretation is done within code blocks (not even backslash-escapes). If
the '::' is preceded by whitespace, '::' is omitted from the output; if
'::' was the sole content of a paragraph, the entire paragraph is removed
(no 'empty' paragraph remains). If '::' is preceded by a non-whitespace
character, '::' is replaced by ':' (i.e., the extra colon is removed).
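
The three cases can be sketched as a small Python function. The name and the
return convention are hypothetical, and the triple-colon variation is not
handled.

```python
def strip_code_marker(paragraph):
    """Apply the proposed '::' rule to one paragraph.

    Returns (output_text, introduces_code_block). output_text is None
    when the whole paragraph should be removed.
    """
    text = paragraph.rstrip()
    if not text.endswith("::"):
        return paragraph, False
    body = text[:-2]
    if not body.strip():
        # '::' was the sole content: no empty paragraph remains.
        return None, True
    if body[-1].isspace():
        # '::' preceded by whitespace: omit it from the output.
        return body.rstrip(), True
    # '::' preceded by a non-whitespace character: keep a single colon.
    return body + ":", True
```

For example, ``strip_code_marker("as follows::")`` keeps the paragraph with a
single trailing colon and signals that a code block follows.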

Thus, a section could begin with a code block as follows::

    Section Title
    -------------
    ::
        print "this is example code"

One possible variation is for meta-documentation (perhaps an extension?):
use triple-colons (':::') to indicate 'take the following code block, mark
it up as a code block, then copy it and mark it up as if it weren't a code
block'. The implementation may insert text in-between, such as 'Marked up
as:', or may alter the formatting (different font, set in a colored box,
whatever).

8. Tables
=========
.. _tables:

The table markup scheme in classic StructuredText was horrible. Its omission
from StructuredTextNG is welcome, and I will not dignify the markup by
repeating it here. However, tables themselves are useful in documentation.
Alternatives:

1. This format is the most natural and obvious. I came up with it (no great
   feat of creation!), and later discovered that it is the format supported
   by the [Emacs table mode]_::

       +------------+------------+---------------------------+
       |  Column 1  |  Column 2  | Column 3 & 4 span (Row 1) |
       +------------+------------+------------+--------------+
       |    Column 1 & 2 span    |  Column 3  | - Column 4   |
       +------------+------------+------------+ - Row 2 & 3  |
       |      1     |      2     |      3     | - span       |
       +------------+------------+------------+--------------+

   Tables are described with a visual outline made up of the characters '-',
   '|', and '+'. The hyphen ('-') is used for horizontal lines (row
   separators), the vertical bar ('|') for vertical lines (column
   separators), and the plus sign ('+') for intersections of horizontal and
   vertical lines. Row and column spans are possible simply by omitting the
   column or row separators, respectively. Each cell contains body elements,
   and may have multiple paragraphs, lists, etc.

.. _Emacs table mode:
   ftp://archive.cis.ohio-state.edu/pub/emacs-lisp/archive/table.el

2. Below is a minimalist possibility. It may be better suited to manual
   input than alternative #1, but there is no Emacs editing mode available.
   One disadvantage is that it resembles section titles; a one-column table
   would look exactly like section titles. It could be a directive-driven
   (extra syntax) extension. ::

         Column 1     Column 2    Column 3 & 4 span (Row 1)
       ============ ============ ===========================
           Column 1 & 2 span       Column 3    - Column 4
       ------------------------- ------------  - Row 2 & 3
             1            2            3       - span
       ============ ============ ============ ==============

   Each row is underlined. The head row is underlined with '=', with spaces
   at column boundaries. If there is no head row, the table begins with a
   top border of equals signs with spaces at column boundaries. Internal row
   separators are underlines of '-', with spaces at column boundaries.
   Column spans have no spaces. Row spans simply lack an underline at the
   row boundary. The bottom boundary of the table consists of '='
   underlines. A blank line is required following a table.

9. Inline Code
==============
.. _inline code:

The current markup for inline code (text left as-is, verbatim, usually in a
monospaced font; HTML <TT>) is single quotes ('code'). The problem with
single quotes is that they are too often used for other purposes, like
apostrophes, quoting text, and string literals.

Alternatives::
    'code'    \'code\'    ''code''    "code"    \"code\"    ""code""
    #code#     @code@      `code`     ^code^    ``code''

The examples below contain inline code, quoted text, and apostrophes. Each
example should evaluate to the following HTML::

    <P>Some <TT>code</TT>, with a 'quote', "double",
    ain't it grand?</P>

    0. Some code, with a quote, double, ain't it grand?
    1. Some \'code\', with a 'quote', "double", ain't it grand?
    2. Some 'code', with a \'quote\', "double", ain\'t it grand?
    3. Some ''code'', with a 'quote', "double", ain't it grand?
    4. Some "code", with a 'quote', \"double\", ain't it grand?
    5. Some \"code\", with a 'quote', "double", ain't it grand?
    6. Some ""code"", with a 'quote', "double", ain't it grand?
    7. Some #code#, with a 'quote', "double", ain't it grand?
    8. Some @code@, with a 'quote', "double", ain't it grand?
    9. Some `code`, with a 'quote', "double", ain't it grand?
    10. Some ^code^, with a 'quote', "double", ain't it grand?
    11. Some ``code'', with a 'quote', "double", ain't it grand?

A more complicated piece of inline code::

    <P>Does <TT>a[b] = 'c' + "d" + `2^3`</TT> work?</P>

    0. Does a[b] = 'c' + "d" + `2^3` work?
    1. Does \'a[b] = 'c' + "d" + `2^3`\' work?
    2. Does 'a[b] = \'c\' + "d" + `2^3`' work?
    3. Does ''a[b] = 'c' + "d" + `2^3`'' work?
    4. Does "a[b] = 'c' + "d" + `2^3`" work?
    5. Does \"a[b] = 'c' + "d" + `2^3`\" work?
    6. Does ""a[b] = 'c' + "d" + `2^3`"" work?
    7. Does #a[b] = 'c' + "d" + `2^3`# work?
    8. Does @a[b] = 'c' + "d" + `2^3`@ work?
    9. Does `a[b] = 'c' + "d" + \`2^3\`` work?
    10. Does ^a[b] = 'c' + "d" + `2\^3`^ work?
    11. Does ``a[b] = 'c' + "d" + `2^3`'' work?

Backquotes (#9) seem to be the best choice. They are unobtrusive and
relatively rarely used (more rarely than ' or ", anyhow). Backquotes have
the connotation of 'quotes', which other options (like carets, #10) don't.
When used within code, they can be escaped (\`).
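
As a rough illustration of backquoted inline code with backslash-escaped
backquotes, consider the Python sketch below. The pattern and function name
are assumptions made for this example only.

```python
import re

# A backquote, then escaped pairs or ordinary characters, then a backquote.
INLINE_CODE = re.compile(r"`((?:\\.|[^`\\])+)`")


def inline_code_spans(text):
    """Return the inline code spans in text, with escapes removed."""
    return [
        re.sub(r"\\(.)", r"\1", match.group(1))
        for match in INLINE_CODE.finditer(text)
    ]


# Example 9 from the more complicated case above.
sample = r'''Does `a[b] = 'c' + "d" + \`2^3\`` work?'''
print(inline_code_spans(sample))
```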

Alternative choices are carets (#10) and TeX-style quotes (#11). For
examples of TeX-style quoting, see:

    
    http://www.zope.org/Members/jim/StructuredTextWiki/CustomizingTheDocumentProcessor

The only uses of backquotes I know are:
(A) As a synonym for repr() in Python.
(B) For command-interpolation in shell scripts.
(C) Used as open-quotes in TeX code (and carried over into plaintext by
    TeXies).

The backslash-escape mechanism would allow A & B inside inline code. TeX
quotes outside inline code (``like this'') could be a special case,
interpreted and marked up as proper quotes. That leaves TeX quotes inside
inline code, which (although ugly) could be handled by escaping with
backslashes::

    like `\`\`this''`!

Let's face it, no mechanism for inline code is perfect, just as no escaping
mechanism is perfect. No matter what we use, complicated expressions will
end up looking ugly. We can only choose the least ugly option.

10. HyperLinks
==============
.. _hyperlink markup:

There are three forms of hyperlink currently in StructuredText_:

1. (Absolute & relative URLs.) Text enclosed by double quotes followed by a
   colon, a URL, and concluded by punctuation plus white space, or just
   white space, is treated as a hyperlink::

       "Python":http://www.python.org/

2. (Absolute URLs only.) Text enclosed by double quotes followed by a comma,
   one or more spaces, an absolute URL and concluded by punctuation plus
   white space, or just white space, is treated as a hyperlink::

       "mail me", mailto:me@mail.com

3. (Endnotes.) Text enclosed by brackets links to an endnote at the end of
   the document: at the beginning of the line, two dots, a space, and the
   same text in brackets, followed by the end note itself::

       Please refer to the fine manual [GVR2000].

       .. [GVR2000] Python Documentation, van Rossum, Drake, et al.,
          http://www.python.org/doc/

The problem with forms 1 and 2 is that they are neither intuitive nor
unobtrusive (they break Goal 2). The brackets in form 3 are too common in
ordinary text (such as [nested] asides and Python lists like [12]).

Alternatives:

0. Have no special markup for hyperlinks.

   A. Except for #1 below?

1. Interpret and mark up hyperlinks as any contiguous text containing '://'
   or ':...@' after an alphanumeric word (absolute URL; exact specification
   to be looked up). To de-emphasize the URL, simply enclose it in
   parentheses:

       Python (http://www.python.org/)

   A. Leave special hyperlink markup as a domain-specific extension.
      Ordinary Structured Text documents would be required to have inline
      hyperlinks. Processed hyperlinks (with the URL hidden) may be
      important for Zope and ZWiki pages, but are they important for general
      uses? I suspect yes.

2. The original Setext_ introduced a mechanism of indirect hyperlinks. A
   source link word ('hot word') in the text is given a trailing
   underscore::

       Here is some text with a hyperlink_ built in.

   The hyperlink itself appears at the end of the document on a line by
   itself, beginning with two dots, a space, the link word with a leading
   underscore, whitespace, and the URL itself::

       .. _hyperlink http://www.123.xyz

   This has the advantage of being readable and relatively unobtrusive.
   Since each source link must match up to a target, the odd variable ending
   in an underscore can be spared being marked up (no such target). The only
   disadvantage is that phrase-links aren't possible without some obtrusive
   syntax. Setext used 'underscores_instead_of_spaces_' for phrase links.

   We could achieve phrase-links if we enclose the link text in double
   quotes ('"like this"_') or in brackets ('[like this]_'). We get obtrusive
   markup, but that is unavoidable. I prefer the bracketed syntax as
   reminiscent of links on many web pages.

   The same markup can also be used for footnotes, removing the problem with
   ordinary bracketed text and Python lists::

       Please refer to the fine manual [GVR2000]_.

       .. _[GVR2000] Python Documentation, van Rossum, Drake, et al.,
          http://www.python.org/doc/

   The two-dots-and-a-space syntax was generalized by Setext for comments,
   which are removed from the processed text. In order to eliminate
   ambiguity with comments and footnotes, I propose that a colon always
   follow the target link word/phrase in indirect hyperlinks (denoting 'maps
   to'). There is no reason to restrict target links to the end of the
   document; they could just as easily be interspersed.

   Internal hyperlinks (hyperlinks from one point to another within a single
   document) can be expressed by a source link as before, and a target link
   with a colon but no URL.

   As an added bonus, we now have a perfect candidate for Structured Text
   directives, a simple extension mechanism: a comment containing a single
   word followed by two colons and whitespace. The interpretation of
   subsequent data on the directive line or following is directive- and/or
   implementation-dependent.

   To summarize::

       .. This is a comment.
       .. version:: 1
       .. The line above is an example of a directive.

       This internal hyperlink will take us to the footnotes_.

       Here is a one-word_ indirect hyperlink.

       Here is [an indirect hyperlink phrase]_.

       This is a footnote [1]_.

       .. _footnotes:
       .. _one-word: http://www.123.xyz
       .. _an indirect hyperlink phrase: http://www.123.xyz
       .. _[1] Footnote text goes here.

   The presence or absence of a colon after the target link differentiates
   an indirect hyperlink from a footnote, respectively. Brackets around a
   target link word or phrase are optional as long as the phrase does not
   contain a colon.
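
The distinctions proposed in alternative 2 (comment, directive, hyperlink
target, footnote) can be sketched as a line classifier in Python. The
patterns, the function name, and the tuple layout are all hypothetical; the
sketch looks at single lines only and ignores continuation lines.

```python
import re

# '.. word:: data' is a directive.
DIRECTIVE = re.compile(r"\.\. (\w+)::(?:\s+(.*))?$")
# '.. _[label]' is a footnote; with a trailing colon it is a target.
BRACKETED = re.compile(r"\.\. _\[([^\]]+)\](:?)(?:\s+(.*))?$")
# '.. _name:' is a target; an empty URL means an internal target.
PLAIN_TARGET = re.compile(r"\.\. _([^:\[]+):(?:\s+(.*))?$")


def classify(line):
    """Return (kind, name, data) for one line of text."""
    line = line.strip()
    match = DIRECTIVE.match(line)
    if match:
        return ("directive", match.group(1), match.group(2) or "")
    match = BRACKETED.match(line)
    if match:
        kind = "target" if match.group(2) else "footnote"
        return (kind, match.group(1), match.group(3) or "")
    match = PLAIN_TARGET.match(line)
    if match:
        return ("target", match.group(1), match.group(2) or "")
    if line.startswith(".. "):
        return ("comment", line[3:], "")
    return ("text", line, "")
```

Run over the summary example, this classifies ``.. version:: 1`` as a
directive, ``.. _footnotes:`` as an internal target, and
``.. _[1] Footnote text goes here.`` as a footnote.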

The examples below contain links (URLs & references), and bracketed text. In
HTML, each example should evaluate to::

    <P>A <A HREF="http://spam.org">URL</A>, see <A HREF="#eggs2000">
    [eggs2000]</A> (in Bacon [Publisher]). Also see
    <A HREF="http://eggs.org">http://eggs.org</A>.</P>

    0. A URL http://spam.org, see eggs2000 (in Bacon [Publisher]).
       Also see http://eggs.org.
    1. A "URL":http://spam.org, see [eggs2000] (in Bacon [Publisher]).
       Also see "http://eggs.org":http://eggs.org.
    2. A "URL", http://spam.org, see [eggs2000] (in Bacon [Publisher]).
       Also see "http://eggs.org", http://eggs.org.
    3. A URL_, see [eggs2000]_ (in Bacon [Publisher]).
       Also see http://eggs.org.

The bracketed text '[Publisher]' may be problematic with syntax 1 & 2.
Syntax 3 is definitely the most readable.

Here is the endnote/footnote itself. In HTML, each example should evaluate
to::

    <P><A NAME="eggs2000">[eggs2000]</A> "Spam, Spam, Spam, Eggs, Bacon,
    and Spam"</P>

    0. eggs2000 "Spam, Spam, Spam, Eggs, Bacon, and Spam"
    1. .. [eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"
    2. .. [eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"
    3. .. _[eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"

For style #3, the indirect hyperlink would be entered as follows::

    .. _URL: http://spam.org

11. Whitespace Delimitation of Markup
=====================================
.. _whitespace:

StructuredText specifies that inline markup begin with whitespace,
precluding such constructs as parenthesized or quoted emphatic text::

    "**What?**" she cried. (*exit stage left*)

The specification for how markup is detected should be refined to allow for
such constructs.
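
One possible refinement, sketched in Python: allow a start-string after
opening punctuation as well as whitespace, and an end-string before closing
punctuation. The exact character sets and names below are assumptions for
illustration, not part of any specification.

```python
import re

INLINE = re.compile(
    r"""
    (?<![^\s'"(\[{<])        # preceded by start, whitespace, or an opener
    (?P<mark>\*\*|\*)        # strong or emphasis start-string
    (?=[^\s*])               # content must not begin with whitespace or '*'
    (?P<text>.+?)
    (?<=[^\s*])              # content must not end with whitespace or '*'
    (?P=mark)                # matching end-string
    (?![^\s'")\]}>.,:;!?-])  # followed by end, whitespace, or a closer
    """,
    re.VERBOSE,
)


def inline_spans(text):
    """Return (kind, content) pairs for emphasis and strong markup."""
    return [
        ("strong" if m.group("mark") == "**" else "emphasis", m.group("text"))
        for m in INLINE.finditer(text)
    ]


print(inline_spans('"**What?**" she cried. (*exit stage left*)'))
```

Under these assumptions both constructs in the example are recognized, while
an expression such as ``2 * 3 * 4`` is left alone because the asterisks are
followed by whitespace.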