[Doc-SIG] Problems With StructuredText
Sun, 03 Jun 2001 10:34:24 -0400
Problems With StructuredText
Author: David Goodger
Status: second draft
There are several problems, unresolved issues, and areas of controversy
within StructuredText_ (Classic and Next Generation). To resolve them, this
analysis brings each issue out into the open, enumerates the alternatives,
and proposes solutions to be incorporated into the reStructuredText_
specification.
1. No formal specification_. The code *is* the standard.
2. Code is difficult to `understand and extend`_.
3. Section structure via indentation_.
4. No `character escaping mechanism`_.
5. `Blank lines in lists`_.
6. `Bullet list markup`_.
7. Problematic `enumerated list markup`_.
8. Awkward `definition list markup`_.
9. Ambiguous markup for `literal blocks`_.
10. Tables.
11. `Delimitation of inline markup`_ (must start with whitespace).
12. Awkward underlining_ markup.
13. Awkward `inline literals`_ markup.
14. Awkward `hyperlink markup`_.
1. Formal Specification
The description in the original StructuredText.py has been criticized for
being vague. For practical purposes, "the code *is* the spec." Tony Ibbs
has been working on deducing a `detailed description`_ from the
documentation and code of StructuredTextNG_. Edward Loper's STMinus_ is
another attempt to formalize a spec.
For this kind of project, the specification should always precede the
code. Otherwise, the markup is a moving target which can never be adopted
as a standard. Of course, a specification may be revised during the lifetime of
the code, but without a spec there is no visible control and thus no
.. _understand and extend:
2. Understanding and Extending the Code
The original StructuredText_ is a dense mass of sparsely commented code and
inscrutable regular expressions. It was not designed to be extended and is
very difficult to understand. StructuredTextNG_ has been designed to allow
input (syntax) and output extensions, but its documentation (both internal
[comments & docstrings], and external) is inadequate for the complexity of
the code itself.
For reStructuredText to become truly useful, perhaps even part of Python's
standard library, it must have clear, understandable documentation and
implementation code. For the implementation of reStructuredText to be taken
seriously, it must be a sterling example of the potential of docstrings;
the implementation must practice what the specification preaches.
3. Structure via Indentation
Setext_ required that body text be indented by 2 spaces. The original
StructuredText_ and StructuredTextNG_ require that section structure be
indicated through indentation, as "inspired by Python". For certain
structures (outlines, lists, literal blocks, block quotes) indentation
naturally indicates structure or hierarchy. For section structure,
indentation is unnatural and awkward. Rather, the style of the section
title should indicate its structure.
In the original StructuredText, sections consist of one-line title
paragraphs followed by indented paragraphs and other body elements. Using
- Unnatural. Most published works use title style (type size, face, weight,
and position) and/or section/subsection numbering rather than indentation
to indicate hierarchy. When indentation is used, it is usually the
formatted end-result and is there for aesthetic rather than structural
- Awkward. One must think about the formatting as the text is keyed in. And
when structural changes are made (it is very common during the
composition of a document to rearrange sections and their hierarchy) we
must use block-indent and -unindent functions. In order to edit documents
using indentation, relatively advanced text editors must be used.
Python's significant whitespace is a wonderful innovation (even if not
original to Python), however applying indentation to ordinary written text
reStructuredText_ indicates section structure through title style (as
exemplified by this document). This is far more natural. In fact, it is
already in widespread use in plain text documents, including in Python's
standard distribution (such as the toplevel README_ file).
.. _character escaping mechanism:
4. Character Escaping Mechanism
No matter what characters are chosen for markup, some day someone will want
to write documentation *about* that markup or using markup characters in a
non-markup context. Therefore, any complete markup language must have an
escaping or encoding mechanism. For a lightweight markup system, encoding
mechanisms like SGML/XML's '&#42;' are out. So an escaping mechanism is in.
However, with carefully chosen markup, it should be necessary to use the
escaping mechanism only infrequently.
reStructuredText_ needs an escaping mechanism: a way to treat
markup-significant characters as the characters themselves. Currently there
is no such mechanism (although ZWiki uses '!'). What are the candidates?
1. ! (http://dev.zope.org/Members/jim/StructuredTextWiki/NGEscaping)
4. doubling of characters
The best choice for this is the backslash (\). It's "the single most
popular escaping character in the world!", therefore familiar and
unsurprising. Since characters only need to be escaped under special
circumstances, which are typically those explaining technical programming
issues, the use of the backslash is natural and understandable. Python
docstrings can be raw (prefixed with an 'r', as in 'r""'), which would
obviate the need for gratuitous doubling-up of backslashes.
The rule would be: An unescaped backslash followed by any markup character
escapes the character. The escaped character represents the character
itself, and is prevented from playing a role in any markup interpretation.
The backslash is removed from the output. A literal backslash is
represented by an "escaped backslash," two backslashes in a row.
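The rule above can be sketched in a few lines of Python. This is only an illustration, not part of any existing parser: the names ``split_escapes`` and ``render`` are hypothetical, and the sketch assumes that a backslash escapes whatever character follows it (markup or not), which is the simpler of the two readings discussed in this section.

```python
def split_escapes(text):
    """Return (char, escaped) pairs for `text`.

    Hypothetical helper: `escaped` is True for a character that was
    preceded by an unescaped backslash, so that a later markup pass
    can leave that character alone.
    """
    pairs = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # The backslash is dropped; the next character is kept verbatim.
            pairs.append((text[i + 1], True))
            i += 2
        else:
            # Ordinary character (or a lone trailing backslash).
            pairs.append((text[i], False))
            i += 1
    return pairs


def render(text):
    """Return the output text with the escaping backslashes removed."""
    return "".join(char for char, _ in split_escapes(text))
```

Keeping the ``escaped`` flag beside each character is one way to ensure that an escaped asterisk can never open or close emphasis, while two backslashes in a row still yield a single literal backslash.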
A carefully constructed set of recognition rules for inline markup will
obviate the need for backslash-escapes in almost all cases; see
`Delimitation of Inline Markup`_ below.
When an expression (requiring backslashes and other characters used for
markup) becomes too complicated and therefore unreadable, a literal block
may be used instead. Inside literal blocks, no markup is recognized,
therefore backslashes (for the purpose of escaping markup) become
We could allow backslashes preceding non-markup characters to remain in
the output. This would make describing regular expressions and other uses
of backslashes easier. However, this would complicate the markup rules and
would be confusing.
.. _Blank lines in lists:
5. Blank Lines in Lists
Oft-requested in Doc-SIG (the earliest reference is dated 1996-08-13) is
the ability to write lists without requiring blank lines between items. In
docstrings, space is at a premium. Authors want to convey their API or
usage information in as compact a form as possible. StructuredText_
requires blank lines between all body elements, including list items, even
when boundaries are obvious from the markup itself.
In reStructuredText, blank lines are optional between list items. However,
in order to eliminate ambiguity, a blank line is required before the first
list item and after the last. Nested lists also require blank lines before
the list start and after the list end.
.. _Bullet list markup:
6. Bullet List Markup
StructuredText_ includes 'o' as a bullet character. This is dangerous and
counter to the language-independent nature of the markup. There are many
languages in which 'o' is a word. For example, in Spanish:
Llamame a la casa
o al trabajo.
(Call me at home or at work.)
And in Japanese (when romanized):
Senshuu no doyoubi ni tegami
([I] wrote a letter on Saturday last week.)
If a paragraph containing an 'o' word wraps such that the 'o' is the first
text on a line, or if a paragraph begins with such a word, it could be
misinterpreted as a bullet list.
In reStructuredText_, 'o' is not used as a bullet character. '-', '*', and
'+' are the possible bullet characters.
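As a rough sketch of the difference, the check below accepts only the three reStructuredText bullet characters. The function name ``is_bullet_item`` is hypothetical, and the test (a bullet character followed by whitespace at the start of a line) is a simplifying assumption rather than the full recognition rule.

```python
BULLET_CHARS = "-*+"  # 'o' is deliberately not included


def is_bullet_item(line):
    """Return True if `line` looks like the first line of a bullet item."""
    stripped = line.lstrip()
    return (
        len(stripped) >= 2
        and stripped[0] in BULLET_CHARS
        and stripped[1].isspace()
    )
```

With this check, a wrapped line such as ``o al trabajo.`` stays ordinary paragraph text, whereas a recognizer that also accepted 'o' would treat it as a list item.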
.. _enumerated list markup:
7. Enumerated List Markup
StructuredText enumerated lists are allowed to begin with numbers and
letters followed by a period or right-parenthesis, then whitespace. This
has surprising consequences for writing styles. For example, this is
recognized as an enumerated list item by StructuredText::
People will write enumerated lists in all different ways. It is folly to
try to come up with the "perfect" format for an enumerated list, and limit
the docstring parser's recognition to that one format only.
Rather, the parser should recognize a variety of enumerator styles, marking
each block as a potential enumerated list item (PELI), and interpret the
enumerators of adjacent PELIs to decide whether they make up a consistent
If a PELI is labeled with a "1.", and is immediately followed by a PELI
labeled with a "2.", we've got an enumerated list. Or "(A)" followed by
"(B)". Or "i)" followed by "ii)", etc. The chances of accidentally
recognizing two adjacent and consistently labeled PELIs, are acceptably
For an enumerated list to be recognized, the following must be true:
- the list must consist of multiple adjacent list items (2 or more)
- the enumerators must all have the same format
- the enumerators must be sequential
It is also recommended that the enumerator of the first list item be
ordinal-1 ('1', 'A', 'a', 'I', or 'i'), as output formats may not be able
to begin a list at an arbitrary enumeration.
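The three conditions above can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, only Arabic numerals and single letters are handled (Roman numerals such as "ii)" are left out for brevity), and the input is the list of enumerator labels taken from adjacent candidate items.

```python
import re

# Optional "(", then digits or one letter, then "." or ")".
_ENUMERATOR = re.compile(r"^(\(?)([0-9]+|[A-Za-z])([.)])$")


def parse_enumerator(label):
    """Return ((prefix, kind, suffix), ordinal) or None if not recognized."""
    match = _ENUMERATOR.match(label)
    if match is None:
        return None
    prefix, body, suffix = match.groups()
    if prefix == "(" and suffix != ")":
        return None  # "(1." is not a consistent enumerator
    if body.isdigit():
        kind, ordinal = "arabic", int(body)
    elif body.islower():
        kind, ordinal = "lower", ord(body) - ord("a") + 1
    else:
        kind, ordinal = "upper", ord(body) - ord("A") + 1
    return (prefix, kind, suffix), ordinal


def is_enumerated_list(labels):
    """Apply the three conditions: 2+ items, same format, sequential."""
    parsed = [parse_enumerator(label) for label in labels]
    if len(parsed) < 2 or any(item is None for item in parsed):
        return False
    if len({fmt for fmt, _ in parsed}) != 1:
        return False
    ordinals = [ordinal for _, ordinal in parsed]
    return all(b == a + 1 for a, b in zip(ordinals, ordinals[1:]))
```

A lone "1." therefore never forms a list on its own, which is what keeps a sentence that happens to start with a number or an initial from being misread.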
For the future: Should the digits/letters/numerals themselves be
interpreted, allowing nested enumerated lists to be created without
indentation? Via compound enumerators for example? Simply count the
'length' (number of sub-enumerators) of the compound enumerator::
.. _definition list markup:
8. Definition List Markup
StructuredText uses ' -- ' (whitespace, two hyphens, whitespace) on the
first line of a paragraph to indicate a definition list item. The ' -- '
serves to separate the term (on the left) from the definition (on the
Many people use ' -- ' as an em-dash in their text, conflicting with the
StructuredText usage. Although the Chicago Manual of Style says that spaces
should not be used around an em-dash, Peter Funk pointed out that this is
standard usage in German (according to the Duden, the official German
reference), and possibly in other languages as well. The widespread use of
' -- ' precludes its use for definition lists; it would violate the
A simpler, and at least equally visually distinctive construct (proposed by
Guido van Rossum, who incidentally is a frequent user of ' -- ') would do
just as well::
Definition 2, paragraph 1.
Definition 2, paragraph 2.
A reStructuredText definition list item consists of a term and a
definition. A term is a simple one-line paragraph. A definition is a block
indented relative to the term, and may contain multiple paragraphs and
other body elements. No blank line precedes a definition (this
distinguishes definition lists from block quotes).
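To make the distinction from block quotes concrete, here is a small sketch that finds definition list terms in a list of source lines. The function name ``find_definition_terms`` is hypothetical, and the sketch assumes that a term is a one-line, unindented paragraph immediately followed (with no blank line) by an indented line.

```python
def find_definition_terms(lines):
    """Return the terms of definition list items found in `lines`."""
    terms = []
    for i in range(len(lines) - 1):
        line, following = lines[i], lines[i + 1]
        # A term is a one-line paragraph, so it must start a paragraph.
        starts_paragraph = i == 0 or not lines[i - 1].strip()
        is_unindented = bool(line.strip()) and not line[0].isspace()
        # The definition follows directly, indented, with no blank line.
        definition_follows = bool(following.strip()) and following[0].isspace()
        if starts_paragraph and is_unindented and definition_follows:
            terms.append(line.strip())
    return terms
```

An indented block that is separated from the preceding paragraph by a blank line is not matched, so it remains available to be treated as a block quote.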
.. _literal blocks:
9. Literal Blocks
The StructuredText_ specification has literal blocks indicated by
'example', 'examples', or '::' ending the preceding paragraph. STNG only
recognizes '::'; 'example'/'examples' are not implemented. This is good; it
fixes an unnecessary language dependency. The problem is what to do with
the sometimes-unwanted '::'.
In reStructuredText_ '::' at the end of a paragraph indicates that
subsequent *indented* blocks are treated as literal text. No further markup
interpretation is done within literal blocks (not even backslash-escapes).
If the '::' is preceded by whitespace, '::' is omitted from the output; if
'::' was the sole content of a paragraph, the entire paragraph is removed
(no 'empty' paragraph remains). If '::' is preceded by a non-whitespace
character, '::' is replaced by ':' (i.e., the extra colon is removed).
Thus, a section could begin with a literal block as follows::
print "this is example literal"
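The three cases for the trailing '::' described above can be sketched as a small function. The name ``strip_literal_marker`` is hypothetical, and the sketch assumes it receives the full text of the paragraph that precedes the literal block.

```python
def strip_literal_marker(paragraph):
    """Return the paragraph text to output, or None if it is removed."""
    if not paragraph.endswith("::"):
        return paragraph  # no literal block marker at all
    head = paragraph[:-2]
    if not head.strip():
        return None  # '::' was the sole content: drop the paragraph
    if head[-1].isspace():
        return head.rstrip()  # '::' preceded by whitespace: omit it
    return head + ":"  # otherwise '::' collapses to a single ':'
```

For example, ``"as follows::"`` becomes ``"as follows:"``, ``"as follows ::"`` becomes ``"as follows"``, and a paragraph consisting only of ``"::"`` disappears from the output.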
10. Tables
The table markup scheme in classic StructuredText was horrible. Its
omission from StructuredTextNG is welcome, and its markup will not be
repeated here. However, tables themselves are useful in documentation.
1. This format is the most natural and obvious. It was independently
invented (no great feat of creation!), and later found to be the format
supported by the `Emacs table mode`_::
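    +------------+------------+------------+--------------+
    |  Header 1  |  Header 2  |  Header 3  |  Header 4    |
    +============+============+============+==============+
    |  Column 1  |  Column 2  | Column 3 & 4 span (Row 1) |
    +------------+------------+------------+--------------+
    |    Column 1 & 2 span    |  Column 3  | - Column 4   |
    +------------+------------+------------+ - Row 2 & 3  |
    |      1     |      2     |      3     | - span       |
    +------------+------------+------------+--------------+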
Tables are described with a visual outline made up of the characters
'-', '=', '|', and '+':
- The hyphen ('-') is used for horizontal lines (row separators).
- The equals sign ('=') is optionally used as a header separator (as of
version 1.5.24, this is not supported by the Emacs table mode).
- The vertical bar ('|') is used for vertical lines (column
- The plus sign ('+') is used for intersections of horizontal and
Row and column spans are possible simply by omitting the column or row
separators, respectively. The header row separator must be complete; in
other words, a header cell may not span into the table body. Each cell
contains body elements, and may have multiple paragraphs, lists, etc.
Initial spaces for a left margin are allowed; the first line of text in
a cell determines its left margin.
2. Below is a minimalist possibility. It may be better suited to manual
input than alternative #1, but there is no Emacs editing mode available.
One disadvantage is that it resembles section titles; a one-column table
would look exactly like section & subsection titles. ::
============ ============ ============ ==============
Column 1 Column 2 Column 3 & 4 span (Row 1)
============ ============ ===========================
Column 1 & 2 span Column 3 - Column 4
------------------------- ------------ - Row 2 & 3
1 2 3 - span
============ ============ ============ ==============
The table begins with a top border of equals signs with a space at each
column boundary (regardless of spans). Each row is underlined. Internal
row separators are underlines of '-', with spaces at column boundaries.
The last of the optional head rows is underlined with '=', again with
spaces at column boundaries. Column spans have no spaces in their
underline. Row spans simply lack an underline at the row boundary. The
bottom boundary of the table consists of '=' underlines. A blank line is
required following a table.
Alternative #1 is the choice adopted by reStructuredText.
.. _delimitation of inline markup:
11. Delimitation of Inline Markup
StructuredText specifies that inline markup must begin with whitespace,
precluding such constructs as parenthesized or quoted emphatic text::
"**What?**" she cried. (*exit stage left*)
The `reStructuredText markup specification`_ allows for such constructs and
disambiguates inline markup through a set of recognition rules. These
recognition rules define the context of markup start-strings and
end-strings, allowing markup characters to be used in non-markup contexts
without a problem (or even a backslash). So we can say, "Use asterisks (*)
around words or phrases to *emphasize* them." The '(*)' will not be
recognized as markup. This reduces the need for markup escaping to the
point where an escape character is *almost* (but not quite!) unnecessary.
12. Underlining
StructuredText uses '_text_' to indicate underlining. To quote David Ascher
in his 2000-01-21 Doc-SIG mailing list post, "Docstring grammar: a very
The tagging of underlined text with _'s is suboptimal. Underlines
shouldn't be used from a typographic perspective (underlines were
designed to be used in manuscripts to communicate to the
typesetter that the text should be italicized -- no well-typeset
book ever uses underlines), and conflict with double-underscored
Python variable names (__init__ and the like), which would get
truncated and underlined when that effect is not desired. Note
that while *complete* markup would prevent that truncation
('__init__'), I think of docstring markups much like I think of
type annotations -- they should be optional and above all do no
harm. In this case the underline markup does harm.
Underlining is not part of the reStructuredText specification.
.. _inline literals:
13. Inline Literals
StructuredText's markup for inline literals (text left as-is, verbatim,
usually in a monospaced font; as in HTML <TT>) is single quotes
('literals'). The problem with single quotes is that they are too often
used for other purposes:
- Apostrophes: "Don't blame me, 'cause it ain't mine, it's Chris'.";
- Quoting text:
First Bruce: "Well Bruce, I heard the prime minister use it. 'S'hot
enough to boil a monkey's bum in 'ere your Majesty,' he said, and she
smiled quietly to herself."
In the UK, single quotes are used for dialogue in published works.
- String literals: s = ''
'text' \'text\' ''text'' "text" \"text\" ""text""
#text# @text@ `text` ^text^ ``text'' ``text``
The examples below contain inline literals, quoted text, and apostrophes.
Each example should evaluate to the following HTML::
Some <TT>code</TT>, with a 'quote', "double", ain't it grand?
Does <TT>a[b] = 'c' + "d" + `2^3`</TT> work?
0. Some code, with a quote, double, ain't it grand?
Does a[b] = 'c' + "d" + `2^3` work?
1. Some 'code', with a \'quote\', "double", ain\'t it grand?
Does 'a[b] = \'c\' + "d" + `2^3`' work?
2. Some \'code\', with a 'quote', "double", ain't it grand?
Does \'a[b] = 'c' + "d" + `2^3`\' work?
3. Some ''code'', with a 'quote', "double", ain't it grand?
Does ''a[b] = 'c' + "d" + `2^3`'' work?
4. Some "code", with a 'quote', \"double\", ain't it grand?
Does "a[b] = 'c' + "d" + `2^3`" work?
5. Some \"code\", with a 'quote', "double", ain't it grand?
Does \"a[b] = 'c' + "d" + `2^3`\" work?
6. Some ""code"", with a 'quote', "double", ain't it grand?
Does ""a[b] = 'c' + "d" + `2^3`"" work?
7. Some #code#, with a 'quote', "double", ain't it grand?
Does #a[b] = 'c' + "d" + `2^3`# work?
8. Some @code@, with a 'quote', "double", ain't it grand?
Does @a[b] = 'c' + "d" + `2^3`@ work?
9. Some `code`, with a 'quote', "double", ain't it grand?
Does `a[b] = 'c' + "d" + \`2^3\`` work?
10. Some ^code^, with a 'quote', "double", ain't it grand?
Does ^a[b] = 'c' + "d" + `2\^3`^ work?
11. Some ``code'', with a 'quote', "double", ain't it grand?
Does ``a[b] = 'c' + "d" + `2^3`'' work?
12. Some ``code``, with a 'quote', "double", ain't it grand?
Does ``a[b] = 'c' + "d" + `2^3\``` work?
Backquotes (#9) are the best choice. They are unobtrusive and relatively
rarely used (more rarely than ' or ", anyhow). Backquotes have the
connotation of 'quotes', which other options (like carets, #10) don't. When
used within literals, they can be escaped (\`).
Analogously with *emph* & **strong**, double-backquotes (#12) could be used
for inline literals. If single-backquotes are used for 'interpreted text'
(context-sensitive domain-specific descriptive markup) such as function
name hyperlinks in Python docstrings, then double-backquotes could be used
for absolute-literals, wherein no processing whatsoever takes place. An
advantage of double-backquotes would be that backslash-escaping would no
longer be necessary for embedded single-backquotes; however, embedded
double-backquotes (in an end-string context) would be illegal.
Alternative choices are carets (#10) and TeX-style quotes (#11). For
examples of TeX-style quoting, see:
Some existing uses of backquotes:
1. As a synonym for repr() in Python.
2. For command-interpolation in shell scripts.
3. Used as open-quotes in TeX code (and carried over into plaintext by
The inline markup start-string and end-string recognition rules defined by
the `reStructuredText markup specification`_ would allow all of these cases
inside inline literals, with very few exceptions. As a fallback, literal
blocks could handle all cases.
Outside of inline literals, the above uses of backquotes would require
backslash-escaping. However, these are all prime examples of text that
should be marked up with inline literals.
If either backquotes or straight single-quotes are used as markup,
TeX-quotes are too troublesome to support, so no special-casing of
TeX-quotes should be done (at least at first). If TeX-quotes have to be
used outside of literals, a single backslash-escape would suffice: \``TeX
quote''. Ugly, true, but very infrequently used.
Using literal blocks is a fallback option which removes the need for
Here, we can do ``absolutely'' anything `'`'\|/|\ we like!
No mechanism for inline literals is perfect, just as no escaping mechanism
is perfect. No matter what we use, complicated inline expressions involving
the inline literal quote and/or the backslash will end up looking ugly. We
can only choose the least often ugly option.
.. _hyperlink markup:
14. Hyperlinks
There are three forms of hyperlink currently in StructuredText_:
1. (Absolute & relative URIs.) Text enclosed by double quotes followed by a
colon, a URI, and concluded by punctuation plus white space, or just
white space, is treated as a hyperlink::
2. (Absolute URIs only.) Text enclosed by double quotes followed by a
comma, one or more spaces, an absolute URI and concluded by punctuation
plus white space, or just white space, is treated as a hyperlink::
"mail me", mailto:email@example.com
3. (Endnotes.) Text enclosed by brackets links to an endnote at the end of
the document: at the beginning of the line, two dots, a space, and the
same text in brackets, followed by the end note itself::
Please refer to the fine manual [GVR2001].
.. [GVR2001] Python Documentation, Release 2.1, van Rossum, Drake,
et al., http://www.python.org/doc/
The problem with forms 1 and 2 is that they are neither intuitive nor
unobtrusive (they break design goals 5 & 2). They overload double-quotes,
which are too often used in ordinary text (potentially breaking design goal
4). The brackets in form 3 are also too common in ordinary text (such as
[nested] asides and Python lists like ).
1. Have no special markup for hyperlinks.
2. A. Interpret and mark up hyperlinks as any contiguous text containing
'://' or ':...@' (absolute URI) or '@' (email address) after an
alphanumeric word. To de-emphasize the URI, simply enclose it in
B. Leave special hyperlink markup as a domain-specific extension.
Hyperlinks in ordinary reStructuredText documents would be required
to be standalone (i.e. the URI text inline in the document text).
Processed hyperlinks (where the URI text is hidden behind the link)
are important enough to warrant syntax.
3. The original Setext_ introduced a mechanism of indirect hyperlinks. A
source link word ('hot word') in the text was given a trailing
Here is some text with a hyperlink_ built in.
The hyperlink itself appeared at the end of the document on a line by
itself, beginning with two dots, a space, the link word with a leading
underscore, whitespace, and the URI itself::
.. _hyperlink http://www.123.xyz
Setext used 'underscores_instead_of_spaces_' for phrase links.
With some modification, alternative 3 best satisfies the design goals. It
has the advantage of being readable and relatively unobtrusive. Since each
source link must match up to a target, the odd variable ending in an
underscore can be spared being marked up (although it should generate a "no
such link target" warning). The only disadvantage is that phrase-links
aren't possible without some obtrusive syntax.
We could achieve phrase-links if we enclose the link text:
1. in double quotes::
2. in brackets::
3. or in backquotes::
Each gives us somewhat obtrusive markup, but that is unavoidable. The
bracketed syntax (#2) is reminiscent of links on many web pages
(intuitive), although it is somewhat obtrusive. Alternative #3 is much less
obtrusive, and is not inconsistent with interpreted text: the trailing
underscore indicates the interpretation of the phrase, as a hyperlink. #3
also disambiguates hyperlinks from footnote references. Alternative #3
The same trailing underscore markup can also be used for footnotes,
removing the problem with ordinary bracketed text and Python lists::
Please refer to the fine manual [GVR2000]_.
.. _[GVR2000] Python Documentation, van Rossum, Drake, et al.,
The two-dots-and-a-space syntax was generalized by Setext for comments,
which are removed from the (visible) processed output. reStructuredText
uses this syntax for comments, footnotes, and link targets. For link
targets, in order to eliminate ambiguity with comments and footnotes,
reStructuredText specifies that a colon always follow the link target
word/phrase. The colon denotes 'maps to'. There is no reason to restrict
target links to the end of the document; they could just as easily be
Internal hyperlinks (links from one point to another within a single
document) can be expressed by a source link as before, and a target link
with a colon but no URI. In effect, these targets 'map to' the element
As an added bonus, we now have a perfect candidate for reStructuredText
directives, a simple extension mechanism: a comment containing a single
word followed by two colons and whitespace. The interpretation of
subsequent data on the directive line or following is directive-dependent.
.. This is a comment.
.. The line below is an example of a directive.
.. version:: 1
This is a footnote _.
This internal hyperlink will take us to the footnotes_ area below.
Here is a one-word_ indirect hyperlink.
Here is `an indirect hyperlink phrase`_.
.. _ Footnote text goes here.
.. indirect hyperlink target mappings:
.. _one-word: http://www.123.xyz
.. _an indirect hyperlink phrase: http://www.123.xyz
The presence or absence of a colon after the target link differentiates an
indirect hyperlink from a footnote, respectively. A footnote requires
brackets. Backquotes around a target link word or phrase are required if
the phrase contains a colon, optional otherwise.
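The way these constructs share the two-dots-and-a-space prefix can be illustrated with a small classifier. This is a sketch under assumptions, not the specification's grammar: the function name ``classify`` and the regular expressions are hypothetical, and they cover only the forms shown in this section (directive, bracketed footnote, link target with a colon, and plain comment).

```python
import re

# ".. word:: data" (a single word followed by two colons and whitespace)
_DIRECTIVE = re.compile(r"^\.\. (\w+)::(\s|$)")
# ".. _[label] footnote text" (brackets, no colon required)
_FOOTNOTE = re.compile(r"^\.\. _\[([^\]]+)\]\s")
# ".. _name: URI" or ".. _`phrase: with colon`: URI"; the URI may be absent
_TARGET = re.compile(r"^\.\. _(?:`([^`]+)`|([^:`]+)):(?:\s|$)")


def classify(line):
    """Classify one source line by the construct it appears to start."""
    if _DIRECTIVE.match(line):
        return "directive"
    # Footnotes are tested before targets so that a colon inside the
    # footnote text cannot be mistaken for the 'maps to' colon.
    if _FOOTNOTE.match(line):
        return "footnote"
    if _TARGET.match(line):
        return "target"
    if line == ".." or line.startswith(".. "):
        return "comment"
    return "text"
```

A target with a colon but no URI, such as ``.. _footnotes:``, is still classified as a target, which matches the description of internal hyperlinks above.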
Below are examples using no markup, the two StructuredText hypertext
styles, and the reStructuredText hypertext style. Each example contains an
indirect link, a direct link, a footnote/endnote, and bracketed text. In
HTML, each example should evaluate to::
<P>A <A HREF="http://spam.org">URI</A>, see <A HREF="#eggs2000">
[eggs2000]</A> (in Bacon [Publisher]). Also see
<P><A NAME="eggs2000">[eggs2000]</A> "Spam, Spam, Spam, Eggs, Bacon,
1. No markup::
A URI http://spam.org, see eggs2000 (in Bacon [Publisher]).
Also see http://eggs.org.
eggs2000 "Spam, Spam, Spam, Eggs, Bacon, and Spam"
2. StructuredText absolute/relative URI syntax ("text":http://www.url.org)::
A "URI":http://spam.org, see [eggs2000] (in Bacon [Publisher]).
Also see "http://eggs.org":http://eggs.org.
.. [eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"
Note that StructuredText does not recognize standalone URIs, forcing
doubling up as shown in the second line of the example above.
3. StructuredText absolute-only URI syntax ("text", mailto:firstname.lastname@example.org)::
A "URI", http://spam.org, see [eggs2000] (in Bacon [Publisher]).
Also see "http://eggs.org", http://eggs.org.
.. [eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"
4. reStructuredText syntax::
A URI_, see [eggs2000]_ (in Bacon [Publisher]).
Also see http://eggs.org.
.. _URI: http://spam.org
.. _[eggs2000] "Spam, Spam, Spam, Eggs, Bacon, and Spam"
The bracketed text '[Publisher]' may be problematic with StructuredText
(syntax 2 & 3).
reStructuredText's syntax (#4) is definitely the most readable. The text is
separated from the link URI and the footnote, resulting in cleanly readable
.. _Setext: http://www.bsdi.com/setext
.. _reStructuredText: http://structuredtext.sf.net
.. _detailed description: http://www.tibsnjoan.demon.co.uk/STNG-format.html
.. _STMinus: http://www.cis.upenn.edu/~edloper/pydoc/stminus.html
.. _Emacs table mode: http://table.sf.net/
.. _reStructuredText Markup Specification: