[Doc-SIG] New PEP: reStructuredText Standard Docstring Format

Tue, 26 Mar 2002 00:37:52 -0500

Please comment on this new PEP.  I intend to post it to comp.lang.python &
python-dev later this week, but wanted the Doc-SIG's reaction first.  Any
chance at consensus?  Do the "docstring format goals" truly reflect the
goals of the group?  Any questions or answers to add to the Q & A?  Any
glaring omissions?

Barry, could I get a PEP number please?

Thank you.

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net

PEP: 
Title: reStructuredText Standard Docstring Format
Version: $Revision$
Last-Modified: $Date$
Author: goodger@users.sourceforge.net (David Goodger)
Discussions-To: doc-sig@python.org
Status: Draft
Type: Informational
Created: 2002-03-25
Post-History: 

Abstract

    This PEP proposes that the reStructuredText [1]_ markup be adopted
    as the standard markup format for plaintext documentation in
    Python docstrings, and (optionally) for PEPs and ancillary
    documents as well.  reStructuredText is a rich and extensible yet
    easy-to-read, what-you-see-is-what-you-get plaintext markup
    syntax.

    Only the low-level syntax of docstrings is addressed here.  This
    PEP is not concerned with docstring semantics or processing at
    all.

Goals

    These are the generally accepted goals for a docstring format, as
    discussed in the Python Documentation Special Interest Group
    (Doc-SIG) [2]_:

    1. It must be easy to type with any standard text editor.

    2. It must be readable to the casual observer.

    3. It must not need to contain information which can be deduced
       from parsing the module.

    4. It must contain sufficient information (structure) so it can be
       converted to any reasonable markup format.

    5. It must be possible to write a module's entire documentation in
       docstrings, without feeling hampered by the markup language.

    [[Are these in fact the goals of the Doc-SIG members?  Anything to
    add?]]

    reStructuredText meets all of these goals, and sets additional,
    even more stringent goals of its own.  See "Features" below.

    The goals of this PEP are as follows:

    1. To establish a standard docstring format by attaining
       "accepted" status (Python community consensus; BDFL
       pronouncement). Once reStructuredText is a Python standard, all
       effort can be focused on tools instead of arguing for a
       standard.  Python needs a standard set of documentation tools.

    2. To address any related concerns raised by the Python community.

    3. To encourage community support.  As long as multiple competing
       markups are out there, the development community remains
       fractured.  Once a standard exists, people will start to use
       it, and momentum will inevitably gather.

    4. To consolidate efforts from related auto-documentation
       projects.  It is hoped that interested developers will join
       forces and work on a joint/merged/common implementation.

    5. (Optional.)  To adopt reStructuredText as the standard markup
       for PEPs.  One or both of the following strategies may be
       applied:

       a) Keep the existing PEP section structure constructs (one-line
          section headers, indented body text).  Subsections can either
          be forbidden or supported with underlined headers in the
          indented body text.

       b) Replace the PEP section structure constructs with the
          reStructuredText syntax.  Section headers will require
          underlines, subsections will be supported out of the box, and
          body text need not be indented (except for block quotes).

       Support for RFC822 headers will be added to the
       reStructuredText parser (unambiguous given a specific context:
       the first contiguous block of a PEP document).  It may be
       desired to concretely specify what over/underline styles are
       allowed for PEP section headers, for uniformity.

    6. (Optional.)  To adopt reStructuredText as the standard markup
       for README-type files and other standalone documents in the
       Python distribution.
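
    As a rough illustration of why RFC822 headers are unambiguous in
    that context, the sketch below isolates the first contiguous
    block of a PEP-like text and hands only that block to the Python
    standard library's email parser.  The sample text and the names
    ``pep_text`` and ``header_block`` are made up for this example;
    this is not how the reStructuredText parser is implemented.

```python
from email import message_from_string

# Hypothetical PEP-like text; only the first block holds RFC822 headers.
pep_text = """\
PEP: 9876
Title: Let's Hope We Never Get Here
Status: Draft
Type: Informational

Abstract

    Note: this line has a colon too, but it is body text.
"""

# The first contiguous block ends at the first blank line.
header_block = pep_text.split("\n\n", 1)[0]
headers = message_from_string(header_block)

print(headers["Title"])
print(headers.keys())
```

    Because only the leading block is parsed, the "Note:" line in
    the body is never mistaken for a header field.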

Rationale

    The __doc__ attribute is called a documentation string, or
    docstring.  It is often used to summarize the interface of the
    module, class or function.  The lack of a standard syntax for
    docstrings has hampered the development of standard tools for
    extracting docstrings and transforming them into documentation in
    standard formats (e.g., HTML, DocBook, TeX).  There have been a
    number of proposed markup formats and variations, and many tools
    tied to these proposals, but without a standard docstring format
    they have failed to gain a strong following and/or floundered
    half-finished.

    The adoption of a standard will, at the very least, benefit
    docstring processing tools by preventing further "reinventing the
    wheel".

    Throughout the existence of the Doc-SIG, consensus on a single
    standard docstring format has never been reached.  A lightweight,
    implicit markup has been sought, for the following reasons (among
    others):

    1. Docstrings written within Python code are available from within
       the interactive interpreter, and can be 'print'ed.  Thus the
       use of plaintext for easy readability.

    2. Programmers want to add structure to their docstrings, without
       sacrificing raw docstring readability.  Unadorned plaintext
       cannot be transformed ('up-translated') into useful structured
       formats.

    3. Explicit markup (like XML or TeX) is widely considered
       unreadable by the uninitiated.

    4. Implicit markup is aesthetically compatible with the clean and
       minimalist Python syntax.
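
    The first point can be seen directly in Python.  The following
    is a minimal sketch; the function ``area`` is invented for this
    example.  The docstring is reachable at run time through the
    ``__doc__`` attribute, and the standard library's
    ``inspect.getdoc()`` returns it with the common indentation
    removed.

```python
import inspect


def area(width, height):
    """
    Return the area of a rectangle.

    Use `width` and `height` in the same units.
    """
    return width * height


raw = area.__doc__            # the docstring as stored on the function
clean = inspect.getdoc(area)  # indentation and outer blank lines removed
print(clean)
```

    Whatever markup is chosen is therefore seen raw by anyone who
    prints a docstring, which is why readability of the source text
    matters.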

    Proposed alternatives have included:

    - XML [3]_, SGML [4]_, DocBook [5]_, HTML [6]_, XHTML [7]_

      XML and SGML are explicit, well-formed meta-languages suitable
      for all kinds of documentation.  XML is a variant of SGML.  They
      are best used behind the scenes, because they are verbose,
      difficult to type, and too cluttered to read comfortably as
      source.  DocBook, HTML, and XHTML are all applications of SGML
      and/or XML, and all share the same basic syntax and the same
      shortcomings.

    - TeX [8]_

      TeX is similar to XML/SGML in that it's explicit, not very easy
      to write, and not easy for the uninitiated to read.

    - Perl POD [9]_

      Most Perl modules are documented in a format called POD -- Plain
      Old Documentation.  This is an easy-to-type, very low level
      format with strong integration with the Perl parser.  Many tools
      exist to turn POD documentation into other formats: info, HTML
      and man pages, among others.  However, the POD syntax takes
      after Perl itself in terms of readability.

    - JavaDoc [10]_

      Special comments before Java classes and functions serve to
      document the code.  The javadoc program, part of the standard
      Java distribution, extracts these comments and turns them into
      HTML documentation.  However, the only output format
      that is supported is HTML, and JavaDoc has a very intimate
      relationship with HTML, using HTML tags for most markup.  Thus
      it shares the readability problems of HTML.

    - Setext [11]_, StructuredText [12]_

      Early on, variants of Setext (Structure Enhanced Text),
      including Zope Corp's StructuredText, were proposed for Python
      docstring formatting.  Hereafter these variants will
      collectively be called 'STexts'.  STexts have the advantage of
      being easy to read without special knowledge, and relatively
      easy to write.

      Although used by some (including in most existing Python
      auto-documentation tools), until now STexts have failed to
      become standard because:

      - STexts have been incomplete.  Lacking "essential" constructs
        that people want to use in their docstrings, STexts are
        rendered less than ideal.  Note that these "essential"
        constructs are not universal; everyone has their own
        requirements.

      - STexts have sometimes been surprising.  Bits of text are
        marked up unexpectedly, leading to user frustration.

      - SText implementations have been buggy.

      - Most STexts have had no formal specification except for
        the implementation itself.  A buggy implementation meant a
        buggy spec, and vice-versa.

      - There has been no mechanism to get around the SText markup
        rules when a markup character is used in a non-markup context.

    Proponents of implicit STexts have vigorously opposed proposals
    for explicit markup (XML, HTML, TeX, POD, etc.), and the debates
    have continued off and on since 1996 or earlier.

    reStructuredText is a complete revision and reinterpretation of
    the SText idea, addressing all of the problems listed above.

Features

    Rather than repeat or summarize the extensive reStructuredText
    spec here, this PEP refers readers to the originals, available
    from http://structuredtext.sourceforge.net/spec/ (.txt & .html
    files).  Reading the documents in the following order is
    recommended:

    - An Introduction to reStructuredText [13]_

    - Problems With StructuredText [14]_ (optional, if you've used
      StructuredText; it explains many markup decisions made)

    - reStructuredText Markup Specification [15]_

    - A Record of reStructuredText Syntax Alternatives [16]_ (explains
      markup decisions made independently of StructuredText)

    - reStructuredText Directives [17]_

    There is also a "Quick reStructuredText" user reference [18]_.

    A summary of features addressing often-raised docstring markup
    concerns follows:

    - A markup escaping mechanism.

      Backslashes (``\``) are used to escape markup characters when
      needed for non-markup purposes.  However, the inline markup
      recognition rules have been constructed in order to minimize the
      need for backslash-escapes.  For example, although asterisks are
      used for *emphasis*, in non-markup contexts such as "*" or "(*)"
      or "x * y", the asterisks are not interpreted as markup and are
      left unchanged.  For many non-markup uses of backslashes (e.g.,
      describing regular expressions), inline literals or literal
      blocks are applicable; see the next item.

    - Markup to include Python source code and Python interactive
      sessions: inline literals, literal blocks, and doctest blocks.

      Inline literals use ``double-backquotes`` to indicate program
      I/O or code snippets.  No markup interpretation (including
      backslash-escape [``\``] interpretation) is done within inline
      literals.

      Literal blocks (block-level literal text, such as code excerpts
      or ASCII graphics) are indented, and indicated with a
      double-colon ("::") at the end of the preceding paragraph (right
      here -->)::

          if literal_block:
              text = 'is left as-is'
              spaces_and_linebreaks = 'are preserved'
              markup_processing = None

      Doctest blocks begin with ">>> " and end with a blank line.
      Neither indentation nor literal block double-colons are
      required.  For example::

          Here's a doctest block:

          >>> print 'Python-specific usage examples; begun with ">>>"'
          Python-specific usage examples; begun with ">>>"
          >>> print '(cut and pasted from interactive sessions)'
          (cut and pasted from interactive sessions)

    - Markup that isolates a Python identifier: interpreted text.

      Text enclosed in single backquotes is recognized as "interpreted
      text", whose interpretation is application-dependent.  In the
      context of a Python docstring, the default interpretation of
      interpreted text is as Python identifiers.  The text will be
      marked up with a hyperlink connected to the documentation for
      the identifier given.  Lookup rules are the same as in Python
      itself: LGB namespace lookups (local, global, builtin).  The
      "role" of the interpreted text (identifying a class, module,
      function, etc.) is determined implicitly from the namespace
      lookup.  For example::

          class Keeper(Storer):

              """
              Extend `Storer`.  Class attribute `instances` keeps track
              of the number of `Keeper` objects instantiated.
              """

              instances = 0
              """How many `Keeper` objects are there?"""

              def __init__(self):
                  """
                  Extend `Storer.__init__()` to keep track of
                  instances.  Keep count in `self.instances` and data
                  in `self.data`.
                  """
                  Storer.__init__(self)
                  Keeper.instances += 1

                  self.data = []
                  """Store data in a list, most recent last."""

              def storedata(self, data):
                  """
                  Extend `Storer.storedata()`; append new `data` to a
                  list (in `self.data`).
                  """
                  self.data.append(data)

      Each piece of interpreted text is looked up according to the
      local namespace of the block containing its docstring.

    - Markup that isolates a Python identifier and specifies its type:
      interpreted text with roles.

      Although the Python source context reader is designed not to
      require explicit roles, they may be used.  To classify
      identifiers explicitly, the role is given along with the
      identifier in either prefix or suffix form::

          Use :method:`Keeper.storedata` to store the object's data in
          `Keeper.data`:instance_attribute:.

      The syntax chosen for roles is verbose, but necessarily so (if
      anyone has a better alternative, please post it to the Doc-SIG).
      The intention of the markup is that there should be little need
      to use explicit roles; their use is to be kept to an absolute
      minimum.

    - Markup for "tagged lists" or "label lists": field lists.

      Field lists represent a mapping from field name to field body.
      These are mostly used for extension syntax, such as
      "bibliographic field lists" (representing document metadata such
      as author, date, and version) and extension attributes for
      directives (see below).  They may be used to implement docstring
      semantics, such as identifying parameters, exceptions raised,
      etc.; such usage is beyond the scope of this PEP.

      A modified RFC822 syntax is used, with a colon *before* as well
      as *after* the field name.  Field bodies are more versatile as
      well; they may contain multiple body elements (even nested field
      lists).  For example::

          :Date: 2002-03-22
          :Version: 1
          :Authors:
              - Me
              - Myself
              - I

      Standard RFC822 header syntax cannot be used for this construct
      because it is ambiguous.  A word followed by a colon at the
      beginning of a line is common in written text.  However, with
      the addition of a well-defined context, such as when a field
      list invariably occurs at the beginning of a document (e.g.,
      PEPs and email messages), standard RFC822 header syntax can be
      used.

    - Markup extensibility: directives and substitutions.

      Directives are used as an extension mechanism for
      reStructuredText, a way of adding support for new block-level
      constructs without adding new syntax.  Directives for images,
      admonitions (note, caution, etc.), and table of contents
      generation (among others) have been implemented.  For example,
      here's how to place an image::

          .. image:: mylogo.png

      Substitution definitions allow the power and flexibility of
      block-level directives to be shared by inline text.  For
      example::

          The |biohazard| symbol must be used on containers used to
          dispose of medical waste.

          .. |biohazard| image:: biohazard.png

    - Section structure markup.

      Section headers in reStructuredText use adornment via underlines
      (and possibly overlines) rather than indentation.  For example::

          This is a Section Title
          =======================

          This is a Subsection Title
          --------------------------

          This paragraph is in the subsection.

          This is Another Section Title
          =============================

          This paragraph is in the second section.
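
    To make the interpreted-text feature above more concrete, here
    is a small sketch of how a tool might pick out single-backquoted
    names while skipping inline literals, and then resolve them with
    a local, global, builtin lookup.  The regular expression and the
    helpers ``interpreted_text`` and ``lookup`` are hypothetical
    simplifications written for this example; they are not taken
    from the reStructuredText parser and ignore its full inline
    markup recognition rules.

```python
import builtins
import re

# Hypothetical pattern: `text` in single backquotes, but not part of
# a ``double-backquoted`` inline literal.
INTERPRETED = re.compile(r"(?<!`)`([^`\s][^`]*?)`(?!`)")


def interpreted_text(docstring):
    """Return the interpreted-text spans found in a docstring."""
    return INTERPRETED.findall(docstring)


def lookup(name, local_ns, global_ns):
    """Resolve a dotted name: local, then global, then builtin."""
    head, *rest = name.rstrip("()").split(".")
    for namespace in (local_ns, global_ns, vars(builtins)):
        if head in namespace:
            obj = namespace[head]
            break
    else:
        raise NameError(head)
    for attr in rest:
        obj = getattr(obj, attr)
    return obj


class Storer:
    """Hypothetical base class, echoing the example above."""

    def storedata(self, data):
        pass


docstring = "Extend `Storer.storedata()`; see also ``x * y`` and `len`."
for name in interpreted_text(docstring):
    print(name, "->", lookup(name, {}, globals()))
```

    The role of each name (method, builtin function, and so on) can
    then be inferred from the object the lookup returns, which is
    the sense in which explicit roles are rarely needed.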

Questions & Answers

    Q: Is reStructuredText rich enough?

    A: Yes, it is for most people.  If it lacks some construct that is
       required for a specific application, it can be added via the
       directive mechanism.  If a common construct has been
       overlooked and a suitably readable syntax can be found, it can
       be added to the specification and parser.

    Q: Is reStructuredText *too* rich?

    A: No.

       Since the very beginning, whenever a markup syntax has been
       proposed on the Doc-SIG, someone has complained about the lack
       of support for some construct or other.  The reply was often
       something like, "These are docstrings we're talking about, and
       docstrings shouldn't have complex markup."  The problem is that
       a construct that seems superfluous to one person may be
       absolutely essential to another.

       reStructuredText takes the opposite approach: it provides a
       rich set of implicit markup constructs (plus a generic
       extension mechanism for explicit markup), allowing for all
       kinds of documents.  If the set of constructs is too rich for a
       particular application, the unused constructs can either be
       removed from the parser (via application-specific overrides) or
       simply omitted by convention.

    Q: Why not use indentation for section structure, like
       StructuredText does?  Isn't it more "Pythonic"?

    A: Guido van Rossum wrote the following in a 2001-06-13 Doc-SIG
       post:

           I still think that using indentation to indicate sectioning
           is wrong.  If you look at how real books and other print
           publications are laid out, you'll notice that indentation
           is used frequently, but mostly at the intra-section level.
           Indentation can be used to offset lists, tables,
           quotations, examples, and the like.  (The argument that
           docstrings are different because they are input for a text
           formatter is wrong: the whole point is that they are also
           readable without processing.)

           I reject the argument that using indentation is Pythonic:
           text is not code, and different traditions and conventions
           hold.  People have been presenting text for readability for
           over 30 centuries.  Let's not innovate needlessly.

       See "Section Structure via Indentation" in "Problems With
       StructuredText" [14]_ for further elaboration.

    Q: Why use reStructuredText for PEPs?  What's wrong with the
       existing standard?

    A: The existing standard for PEPs is very limited in terms of
       general expressibility, and referencing is especially lacking
       for such a reference-rich document type.  PEPs are currently
       converted into HTML, but the results (mostly monospaced text)
       are less than attractive, and most of the value-added potential
       of HTML is untapped.

       Making reStructuredText the standard markup for PEPs will
       enable much richer expression, including support for section
       structure, inline markup, graphics, and tables.  In several
       PEPs there are ASCII graphics diagrams, which are all that
       plaintext documents can support.  Since PEPs are made available
       in HTML form, the ability to include proper diagrams would be
       immediately useful.

       Current PEP practices allow for reference markers in the form
       "[1]" in the text, and the footnotes/references themselves are
       listed in a section toward the end of the document.  There is
       currently no hyperlinking between the reference marker and the
       footnote/reference itself (it would be possible to add this to
       pep2html.py, but the "markup" as it stands is ambiguous and
       mistakes would be inevitable).  A PEP with many references
       (such as this one ;-) requires a lot of flipping back and
       forth.  When revising a PEP, often new references are added or
       unused references deleted.  It is painful to renumber the
       references, since it has to be done in two places and can have
       a cascading effect (insert a single new reference 1, and every
       other reference has to be renumbered; always adding new
       references to the end is suboptimal).  It is easy for
       references to go out of sync.

       PEPs use references for two purposes: simple URL references and
       footnotes.  reStructuredText differentiates between the two.  A
       PEP might contain references like this::

           Abstract

               This PEP proposes adding frungible doodads [1] to the
               core.  It extends PEP 9876 [2] via the BCA [3]
               mechanism.

           References and Footnotes

               [1] http://www.doodads.org/frungible.html

               [2] PEP 9876, Let's Hope We Never Get Here
                   http://www.python.org/peps/pep-9876.html

               [3] "Bogus Complexity Addition"

       Reference 1 is a simple URL reference.  Reference 2 is a
       footnote containing text and a URL.  Reference 3 is a footnote
       containing text only.  Rewritten using reStructuredText, this
       PEP could look like this::

           Abstract
           ========

           This PEP proposes adding `frungible doodads`_ to the
           core.  It extends PEP 9876 [#pep9876]_ via the BCA [#]_
           mechanism.

           .. _frungible doodads:
              http://www.doodads.org/frungible.html

           .. [#pep9876] `PEP 9876`__, Let's Hope We Never Get Here

           __ http://www.python.org/peps/pep-9876.html

           .. [#] "Bogus Complexity Addition"

       URLs and footnotes can be defined close to their references if
       desired, making them easier to read in the source text, and
       making the PEPs easier to revise.  The "References and
       Footnotes" section can be auto-generated with a document tree
       transform.  Footnotes from throughout the PEP would be gathered
       and displayed under a standard header.  If URL references
       should likewise be written out explicitly (in citation form),
       another tree transform could be used.

       URL references can be named ("frungible doodads"), and can be
       referenced from multiple places in the document without
       additional definitions.  When converted to HTML, references
       will be replaced with inline hyperlinks (HTML <A> tags).  The
       two footnotes are automatically numbered, so they will always
       stay in sync.  The first footnote also contains an internal
       reference name, "pep9876", so it's easier to see the connection
       between reference and footnote in the source text.  Named
       footnotes can be referenced multiple times, maintaining
       consistent numbering.

       The "#pep9876" footnote could also be written in the form of a
       citation::

           It extends PEP 9876 [PEP9876]_ ...

           .. [PEP9876] `PEP 9876`_, Let's Hope We Never Get Here

       Footnotes are numbered, whereas citations use text for their
       references.
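
       The automatic numbering described above can be sketched in a
       few lines of Python.  This is a deliberately simplified
       model: the helper ``number_footnotes`` is hypothetical, it
       numbers references in order of first appearance, and it does
       not reproduce the actual reStructuredText footnote
       resolution rules.

```python
import re

# Matches auto-numbered footnote references: [#]_ and [#label]_.
FOOTNOTE_REF = re.compile(r"\[#([A-Za-z0-9_-]*)\]_")


def number_footnotes(text):
    """Replace [#]_ and [#label]_ references with [1]_, [2]_, ..."""
    numbers = {}  # label -> number already assigned
    counter = 0

    def replace(match):
        nonlocal counter
        label = match.group(1)
        if label and label in numbers:
            # A named footnote referenced again keeps its number.
            return "[%d]_" % numbers[label]
        counter += 1
        if label:
            numbers[label] = counter
        return "[%d]_" % counter

    return FOOTNOTE_REF.sub(replace, text)


source = "It extends PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism."
print(number_footnotes(source))
```

       Inserting a new reference ahead of the others changes only
       the generated numbers, never the source text, which is the
       maintenance benefit claimed above.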

    Q: Wouldn't it be better to keep the docstring and PEP proposals
       separate?

    A: The PEP markup proposal is an optional part of this PEP.  It
       may be removed if it is deemed that there is no need for PEP
       markup, or it could be made into a separate PEP if necessary.
       If accepted, PEP 1, PEP Purpose and Guidelines [19]_,
       and PEP 9, Sample PEP Template [20]_, will be updated.

       It seems natural to adopt a single consistent markup standard
       for all uses of plaintext in Python.

    Q: The existing pep2html.py script converts the existing PEP
       format to HTML.  How will the new-format PEPs be converted to
       HTML?

    A: One of the deliverables of the Docutils project [21]_ will be a
       new version of pep2html.py with integrated reStructuredText
       parsing.  The Docutils project will support PEPs with a "PEP
       Reader" component, including all functionality currently in
       pep2html.py (auto-recognition of PEP & RFC references).

    Q: Who's going to convert the existing PEPs to reStructuredText?

    A: A call for volunteers will be put out to the Doc-SIG and
       greater Python communities.  If insufficient volunteers are
       forthcoming, I (David Goodger) will convert the documents
       myself, perhaps with some level of automation.  A transitional
       system whereby both old and new standards can coexist will be
       easy to implement (and I pledge to implement it if necessary).

    Q: Why use reStructuredText for README and other ancillary files?

    A: The same reasoning used for PEPs above applies to README and
       other ancillary files.  By adopting a standard markup, these
       files can be converted to attractive cross-referenced HTML and
       put up on python.org.  Developers of Python projects can also
       take advantage of this facility for their own documentation.

References and Footnotes

    [1] http://structuredtext.sourceforge.net/

    [2] http://www.python.org/sigs/doc-sig/

    [3] http://www.w3.org/XML/

    [4] http://www.oasis-open.org/cover/general.html

    [5] http://docbook.org/tdg/en/html/docbook.html

    [6] http://www.w3.org/MarkUp/

    [7] http://www.w3.org/MarkUp/#xhtml1

    [8] http://www.tug.org/interest.html

    [9] http://www.perldoc.com/perl5.6/pod/perlpod.html

    [10] http://java.sun.com/j2se/javadoc/

    [11] http://docutils.sourceforge.net/mirror/setext.html

    [12] http://dev.zope.org/Members/jim/StructuredTextWiki/FrontPage

    [13] An Introduction to reStructuredText
         http://structuredtext.sourceforge.net/spec/introduction.txt

    [14] Problems With StructuredText
         http://structuredtext.sourceforge.net/spec/problems.txt

    [15] reStructuredText Markup Specification
         http://structuredtext.sourceforge.net/spec/reStructuredText.txt

    [16] A Record of reStructuredText Syntax Alternatives
         http://structuredtext.sourceforge.net/spec/alternatives.txt

    [17] reStructuredText Directives
         http://structuredtext.sourceforge.net/spec/directives.txt

    [18] Quick reStructuredText
         http://structuredtext.sourceforge.net/docs/quickref.html

    [19] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
         http://www.python.org/peps/pep-0001.html

    [20] PEP 9, Sample PEP Template, Warsaw
         http://www.python.org/peps/pep-0009.html

    [21] http://docutils.sourceforge.net/

    [22] PEP 216, Docstring Format, Zadka
         http://www.python.org/peps/pep-0216.html

Copyright

    This document has been placed in the public domain.

Acknowledgements

    Some text is borrowed from PEP 216, Docstring Format, by Moshe
    Zadka [22]_.

    Special thanks to all members past & present of the Python Doc-SIG.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: