[Doc-SIG] PEP: DPS Generic Implementation Details

David Goodger dgoodger@bigfoot.com
Sun, 03 Jun 2001 10:32:21 -0400

PEP: ???
Title: DPS Generic Implementation Details
Version: $Revision$
Author: dgoodger@bigfoot.com (David Goodger)
Discussions-To: doc-sig@python.org
Status: Draft
Type: Standards Track
Created: 31-May-2001


    This PEP documents generic implementation details for a Python
    Docstring Processing System (DPS).


    This document has been placed in the public domain.


    This document borrows ideas from the archives of the Python Doc-SIG
    [1]. Thanks to all members past & present.

Project Website

    A SourceForge project has been set up for this work at


    Docstring Extraction Rules

    1. If the '__all__' variable is present in the module being documented,
       only identifiers listed in '__all__' are examined for docstrings. In
       the absence of '__all__', all identifiers are examined, except those
       whose names are private (names begin with '_' but don't begin and
       end with '__').

    2. Docstrings are string literal expressions, and are recognized in the
       following places within Python modules:

       a) At the beginning of a module, class definition, or function
          definition, after any comments. This is the standard for Python
          __doc__ attributes.

       b) Immediately following a simple assignment at the top level of a
          module, class definition, or __init__ method definition, after
          any comments. See "Attribute Docstrings" below.

       c) Additional string literals found immediately after the docstrings
          in (a) and (b) will be recognized, extracted, and concatenated.
          See "Additional Docstrings" below.

    3. Python modules must be parsed by the docstring processing system,
       not imported. There are security reasons for not importing untrusted
       code. Also, docstrings are to be recognized in places where the
       bytecode compiler ignores string literal expressions (2b and 2c
       above), meaning importing the module will lose these docstrings. Of
       course, standard Python parsing tools such as the 'parser' library
       module should be used.

    Since attribute docstrings and additional docstrings are not recognized
    by the Python bytecode compiler, no namespace pollution or performance
    degradation will result from their use. (The initial parsing of a
    module may take a slight performance hit.)
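
    As an illustration of rules 1, 2a and 3, the following minimal sketch
    selects names and reads their standard docstrings by parsing source
    text rather than importing it. It substitutes the standard 'ast' module
    of later Python versions for the 'parser' module named above. The
    helper names ('is_private', 'find_all', 'extract_docstrings') and the
    '<module>' key are hypothetical, and '__all__' is assumed to be a
    literal list of strings.

```python
import ast

SOURCE = '''
"""Module docstring."""

__all__ = ['public_function']

def public_function():
    """Documented and listed in __all__."""

def unlisted_function():
    """Not listed in __all__, so skipped."""
'''


def is_private(name):
    """Rule 1: private names begin with '_' but are not '__dunder__' names."""
    is_dunder = name.startswith('__') and name.endswith('__')
    return name.startswith('_') and not is_dunder


def find_all(tree):
    """Return the names in a literal top-level '__all__', or None if absent."""
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == '__all__':
                    # Assumption: '__all__' is a literal list of strings.
                    return list(ast.literal_eval(node.value))
    return None


def extract_docstrings(source):
    """Map names to their __doc__-style docstrings without importing."""
    tree = ast.parse(source)  # rule 3: parse, never import
    exported = find_all(tree)
    docs = {'<module>': ast.get_docstring(tree)}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            if exported is not None:
                wanted = node.name in exported
            else:
                wanted = not is_private(node.name)
            if wanted:
                docs[node.name] = ast.get_docstring(node)
    return docs


docs = extract_docstrings(SOURCE)
```

    Only top-level functions and classes are visited here; a fuller
    extractor would also descend into class bodies.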

    Attribute Docstrings

    XXX A description of attribute docstrings would be appropriate in the
    Docstring Conventions PEP.

    (This is a simplified version of PEP 224 [2] by Marc-Andre Lemburg.)

    A string literal immediately following an assignment statement is
    interpreted by the docstring extraction machinery as the docstring of
    the target of the assignment statement, under the following conditions:

    1. The assignment must be in one of the following contexts:

       a) At the top level of a module (i.e., not inside a loop or
          conditional): a module attribute.

       b) At the top level of a class definition: a class attribute.

       c) At the top level of a class's '__init__' method definition: an
          instance attribute.

       Since each of the above contexts is at the top level (i.e., just
       inside the outermost suite of a definition), it may be necessary to
       place dummy assignments for attributes assigned conditionally or in
       a loop. Blank lines may be used after attribute docstrings to
       emphasize the connection between the assignment and the docstring.

    2. The assignment must be to a single target, not to a list or a tuple
       of targets.

    3. The form of the target:

       a) For contexts 1a and 1b above, the target must be a simple
          identifier (not a dotted identifier, a subscripted expression, or
          a sliced expression).

       b) For context 1c above, the target must be of the form
          'self.attrib', where 'self' matches the '__init__' method's first
          parameter (the instance parameter) and 'attrib' is a simple
          identifier as in 3a.
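
    The conditions above can be sketched with the standard 'ast' module
    (an assumption; this PEP does not prescribe a parsing tool). The
    helper names and the dotted result keys such as 'AClass.self.i' are
    hypothetical. Chained assignments ('a = b = 1') are treated as having
    multiple targets and are skipped, which is one possible reading of
    condition 2.

```python
import ast

SOURCE = '''
g = 'module attribute (global variable)'
"""This is g's docstring."""

class AClass:

    c = 'class attribute'
    """This is AClass.c's docstring."""

    def __init__(self):
        self.i = 'instance attribute'
        """This is self.i's docstring."""
'''


def string_value(stmt):
    """Return the text of a bare string-literal statement, else None."""
    if (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)
            and isinstance(stmt.value.value, str)):
        return stmt.value.value
    return None


def target_name(stmt, self_name=None):
    """Return the documented name for a qualifying assignment, else None."""
    if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
        return None  # condition 2: exactly one target
    target = stmt.targets[0]
    if self_name is None:
        # Condition 3a: a simple identifier only.
        return target.id if isinstance(target, ast.Name) else None
    # Condition 3b: 'self.attrib', where 'self' is the first parameter.
    if (isinstance(target, ast.Attribute)
            and isinstance(target.value, ast.Name)
            and target.value.id == self_name):
        return self_name + '.' + target.attr
    return None


def attribute_docstrings(body, prefix='', self_name=None):
    """Pair each qualifying top-level assignment with the string after it."""
    found = {}
    for stmt, following in zip(body, body[1:]):
        name = target_name(stmt, self_name)
        doc = string_value(following)
        if name is not None and doc is not None:
            found[prefix + name] = doc
    return found


def extract(source):
    """Collect attribute docstrings for contexts 1a, 1b and 1c."""
    tree = ast.parse(source)
    docs = attribute_docstrings(tree.body)                      # context 1a
    for node in tree.body:
        if not isinstance(node, ast.ClassDef):
            continue
        prefix = node.name + '.'
        docs.update(attribute_docstrings(node.body, prefix))    # context 1b
        for item in node.body:
            if (isinstance(item, ast.FunctionDef)
                    and item.name == '__init__' and item.args.args):
                first = item.args.args[0].arg
                docs.update(                                    # context 1c
                    attribute_docstrings(item.body, prefix, first))
    return docs


docs = extract(SOURCE)
```

    Because only the outermost suite of each definition is scanned,
    assignments inside loops or conditionals are never paired with a
    docstring, matching the "top level" restriction.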


        g = 'module attribute (global variable)'
        """This is g's docstring."""

        class AClass:

            c = 'class attribute'
            """This is AClass.c's docstring."""

            def __init__(self):
                self.i = 'instance attribute'
                """This is self.i's docstring."""

    Additional Docstrings

    XXX A description of additional docstrings would be appropriate in the
    Docstring Conventions PEP.

    Many programmers would like to make extensive use of docstrings for API
    documentation. However, docstrings do take up space in the running
    program, so some of these programmers are reluctant to 'bloat up' their
    code. Also, not all API documentation is applicable to interactive
    environments, where __doc__ would be displayed.

    The docstring processing system's extraction tools will concatenate all
    string literal expressions which appear at the beginning of a
    definition or after a simple assignment. Only the first strings in
    definitions will be available as __doc__, and can be used for brief
    usage text suitable for interactive sessions; subsequent string
    literals and all attribute docstrings are ignored by the Python
    bytecode compiler and may contain more extensive API information.


        def function(arg):
            """This is __doc__, function's docstring."""
            """
            This is an additional docstring, ignored by the bytecode
            compiler, but extracted by the docstring processing system.
            """
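
    A minimal sketch of the concatenation step, assuming the standard
    'ast' and 'inspect' modules. The helper name 'leading_strings' is
    hypothetical, and joining the pieces with a blank line is an
    assumption; this PEP does not say how the strings are joined.

```python
import ast
import inspect

SOURCE = '''
def function(arg):
    """This is __doc__, function's docstring."""
    """
    This is an additional docstring, ignored by the bytecode
    compiler, but extracted by the docstring processing system.
    """
    return arg
'''


def leading_strings(body):
    """Collect consecutive string-literal statements at the start of body."""
    strings = []
    for stmt in body:
        if (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            # cleandoc() strips the common indentation and blank edges.
            strings.append(inspect.cleandoc(stmt.value.value))
        else:
            break  # the first non-string statement ends the run
    return strings


func = ast.parse(SOURCE).body[0]
parts = leading_strings(func.body)
brief = parts[0]                 # what __doc__ would hold at run time
full = '\n\n'.join(parts)        # assumption: join with a blank line
```

    Here 'brief' matches what the bytecode compiler keeps as __doc__,
    while 'full' also carries the additional docstring.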

    Issue: This breaks 'from __future__ import' statements in Python 2.1 for
    multiple module docstrings. Resolution?

    1. Should we search for docstrings after a __future__ statement? Very

    2. Redefine __future__ statements to allow multiple preceding string

    3. Or should we not even worry about this? There shouldn't be
       __future__ statements in production code, after all. Modules with
       __future__ statements will have to put up with the single-docstring

    Choice of Docstring Format

    Rather than force everyone to use a single docstring format, multiple
    input formats are allowed by the processing system. A special variable,
    __docformat__, may appear at the top level of a module before any
    function or class definitions. Over time or through decree, a standard
    format or set of formats should emerge.

    The __docformat__ variable is a string containing the name of the
    format being used, a case-insensitive string matching the input
    parser's module or package name (i.e., the same name as required to
    'import' the module or package), or a registered alias. If no
    __docformat__ is specified, the default format is 'plaintext' for now;
    this may be changed to the standard format once determined.

    The __docformat__ string may contain an optional second field,
    separated from the format name (first field) by a single space: a
    case-insensitive language identifier as defined in RFC 1766 [3]. A
    typical language identifier consists of a 2-letter language code from
    ISO 639 [4] (3-letter codes used only if no 2-letter code exists; RFC
    1766 is currently being revised to allow 3-letter codes). If no
    language identifier is specified, the default is 'en' for English. The
    language identifier is passed to the parser and can be used for
    language-dependent markup features.
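
    The two-field layout described above could be split as in the
    following sketch. The function name 'parse_docformat' is hypothetical,
    lower-casing is one way to get case-insensitive matching, and alias
    lookup is omitted because no registry is defined yet. 'MyFormat' is a
    made-up format name used only for illustration.

```python
def parse_docformat(value=None):
    """Split a __docformat__ string into (format_name, language)."""
    if not value:
        # No __docformat__ given: the defaults stated in this PEP.
        return ('plaintext', 'en')
    fields = value.split(' ')
    # Assumption: exactly one space separates the fields, so empty
    # fields (from doubled or stray spaces) are rejected.
    if len(fields) > 2 or not all(fields):
        raise ValueError('malformed __docformat__: %r' % (value,))
    name = fields[0].lower()
    language = fields[1].lower() if len(fields) == 2 else 'en'
    return (name, language)


default = parse_docformat()
with_language = parse_docformat('MyFormat en-GB')
```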

    DPS Structure

    - package 'dps'

      - function 'dps.main()' (in 'dps/__init__.py')

      - package 'dps.parsers'

        - module 'dps.parsers.model'; see 'Input Parser API' below.

      - package 'dps.formatters'

        - module 'dps.formatters.model'; see 'Output Formatter API' below.

      - package 'dps.languages'

        - module 'dps.languages.en' (English)

        - others to be added

      - utility modules: 'dps.statemachine'

    Command-Line Interface

    XXX To be determined.

    System Python API

    XXX To be determined.

    Input Parser API

    Each input parser is a module or package exporting a 'Parser' class,
    with the following interface:

        class Parser:

            def __init__(self, inputstring, errors='warn', language='en'):
                """Initialize the Parser instance."""

            def parse(self):
                """Return a DOM tree, the parsed input string."""

    XXX This needs a lot of work. What is required for this API?

    A model 'Parser' class implementing the full interface along with
    utility functions can be found in the 'dps.parsers.model' module.
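
    To make the interface concrete, here is a minimal sketch of a parser
    that satisfies it, assuming 'xml.dom.minidom' for the DOM tree. The
    element names 'document' and 'paragraph' are placeholders, not taken
    from the DTD, and the 'errors' and 'language' arguments are stored
    but unused.

```python
from xml.dom import minidom


class Parser:
    """Toy plaintext parser: one 'paragraph' element per text block."""

    def __init__(self, inputstring, errors='warn', language='en'):
        """Initialize the Parser instance."""
        self.inputstring = inputstring
        self.errors = errors        # stored only; unused in this sketch
        self.language = language    # stored only; unused in this sketch

    def parse(self):
        """Return a DOM tree, the parsed input string."""
        impl = minidom.getDOMImplementation()
        dom = impl.createDocument(None, 'document', None)
        root = dom.documentElement
        for block in self.inputstring.split('\n\n'):
            text = ' '.join(block.split())  # normalize whitespace
            if text:
                para = dom.createElement('paragraph')
                para.appendChild(dom.createTextNode(text))
                root.appendChild(para)
        return dom


tree = Parser('First paragraph,\nwrapped.\n\nSecond paragraph.').parse()
paragraphs = [p.firstChild.data
              for p in tree.getElementsByTagName('paragraph')]
```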

    Output Formatter API

    Each output formatter is a module or package exporting a 'Formatter'
    class, with the following interface:

        class Formatter:

            def __init__(self, domtree, language='en', showwarnings=0):
                """Initialize the Formatter instance."""

            def format(self):
                """Return a formatted string representation of the DOM tree."""

    XXX This also needs a lot of work. What is required for this API?

    A model 'Formatter' class implementing the full interface along with
    utility functions can be found in the 'dps.formatters.model' module.
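
    A matching sketch for the formatter side, again assuming
    'xml.dom.minidom' and the placeholder element names 'document' and
    'paragraph' (neither is taken from the DTD). It renders each paragraph
    as plain text; 'language' and 'showwarnings' are stored but unused.

```python
from xml.dom import minidom


class Formatter:
    """Toy plaintext formatter: paragraphs separated by blank lines."""

    def __init__(self, domtree, language='en', showwarnings=0):
        """Initialize the Formatter instance."""
        self.domtree = domtree
        self.language = language          # stored only; unused here
        self.showwarnings = showwarnings  # stored only; unused here

    def format(self):
        """Return a formatted string representation of the DOM tree."""
        paragraphs = []
        for node in self.domtree.getElementsByTagName('paragraph'):
            text = ''.join(child.data for child in node.childNodes
                           if child.nodeType == child.TEXT_NODE)
            paragraphs.append(text)
        return '\n\n'.join(paragraphs) + '\n'


domtree = minidom.parseString(
    '<document><paragraph>First.</paragraph>'
    '<paragraph>Second.</paragraph></document>')
output = Formatter(domtree).format()
```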

    Language Module API

    Language modules will contain language-dependent strings and mappings.
    They will be named for their language identifier (as defined in 'Choice
    of Docstring Format' above), converting dashes to underscores.

    XXX Specifics to be determined.
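
    The naming rule could look like the sketch below. The function name
    'language_module_name' is hypothetical, and lower-casing is an
    assumption based on language identifiers being case-insensitive.

```python
def language_module_name(language):
    """Map a language identifier to a module name under dps.languages."""
    # Dashes are not valid in Python identifiers, hence the underscores.
    return 'dps.languages.' + language.lower().replace('-', '_')


english = language_module_name('en')
british = language_module_name('en-GB')
```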

    Intermediate Data Structure

    A single intermediate data structure is used internally by the
    docstring processing system. This data structure is a DOM tree whose
    schema is documented in an XML DTD (eXtensible Markup Language Document
    Type Definition), which comes in three parts:

    - the Python Plaintext Document Interface DTD, ppdi.dtd [5],

    - the Generic Plaintext Document Interface DTD, gpdi.dtd [6],

    - and the OASIS Exchange Table Model, soextblx.dtd [7].

    The DTD defines a rich set of elements, suitable for any input syntax
    or output format. The input parser and the output formatter share the
    same intermediate data structure. The processing system may do
    transformations on the data from the input parser before passing it on
    to the output formatter. The DTD retains all information necessary to
    reconstruct the original input text, or a reasonable facsimile thereof.

    XXX Specifics (about the DOM tree) to be determined.

    Output Management

    XXX To be determined.

    Type of output: filesystem only, or in-memory data structure too?
    File/directory naming & structure conventions. In-memory data structure
    should follow filesystem naming; file/directory == leaf/node. Use a
    directory hierarchy rather than long file names (long file names were
    one of the reasons pythondoc couldn't run on MacOS).
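
    One possible reading of the directory-hierarchy idea is sketched
    below: each dotted name maps to nested directories with a short leaf
    file name. The function name 'output_path' and the '.txt' suffix are
    hypothetical, since nothing here is decided yet.

```python
from pathlib import PurePosixPath


def output_path(dotted_name, suffix='.txt'):
    """Map a dotted name to a nested path instead of one long file name."""
    parts = dotted_name.split('.')
    # Every component but the last becomes a directory (node);
    # the last becomes the file (leaf).
    return PurePosixPath(*parts[:-1], parts[-1] + suffix)


nested = str(output_path('dps.parsers.model'))
top_level = str(output_path('dps'))
```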

References and Footnotes

    [1] http://www.python.org/sigs/doc-sig/

    [2] http://python.sf.net/peps/pep-0224.html

    [3] http://www.rfc-editor.org/rfc/rfc1766.txt

    [4] http://lcweb.loc.gov/standards/iso639-2/englangn.html

    [5] http://docstring.sf.net/spec/ppdi.dtd

    [6] http://docstring.sf.net/spec/gpdi.dtd

    [7] http://docstring.sf.net/spec/soextblx.dtd

Local Variables:
mode: indented-text
indent-tabs-mode: nil