[Doc-SIG] PEP: DPS Generic Implementation Details
Sun, 03 Jun 2001 10:32:21 -0400
Title: DPS Generic Implementation Details
Author: email@example.com (David Goodger)
Type: Standards Track
This PEP documents generic implementation details for a Python
Docstring Processing System (DPS).
This document has been placed in the public domain.
This document borrows ideas from the archives of the Python Doc-SIG.
Thanks to all members past & present.
A SourceForge project has been set up for this work at
Docstring Extraction Rules
1. If the '__all__' variable is present in the module being documented,
only identifiers listed in '__all__' are examined for docstrings. In
the absence of '__all__', all identifiers are examined, except those
whose names are private (names begin with '_' but don't begin and
end with '__').
2. Docstrings are string literal expressions, and are recognized in the
following places within Python modules:
a) At the beginning of a module, class definition, or function
definition, after any comments. This is the standard for Python
__doc__ attributes.
b) Immediately following a simple assignment at the top level of a
module, class definition, or __init__ method definition, after
any comments. See "Attribute Docstrings" below.
c) Additional string literals found immediately after the docstrings
in (a) and (b) will be recognized, extracted, and concatenated.
See "Additional Docstrings" below.
3. Python modules must be parsed by the docstring processing system,
not imported. There are security reasons for not importing untrusted
code. Also, docstrings are to be recognized in places where the
bytecode compiler ignores string literal expressions (2b and 2c
above), meaning importing the module will lose these docstrings. Of
course, standard Python parsing tools such as the 'parser' library
module should be used.
Since attribute docstrings and additional docstrings are not recognized
by the Python bytecode compiler, no namespace pollution or performance
degradation will result from their use. (The initial parsing of a
module may take a slight performance hit.)
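
Rules 1 and 3 can be illustrated with a minimal sketch. It assumes the
standard 'ast' module as the parsing tool (a stand-in for the 'parser'
library module named above), and the helper names 'is_private' and
'documentable_names' are hypothetical. Only top-level functions, classes,
and simple assignments are considered, and '__all__' is understood only
when it is a literal list or tuple of strings.

```python
import ast


def is_private(name):
    """True for names like '_x', but not for '__dunder__' names."""
    return name.startswith('_') and not (
        name.startswith('__') and name.endswith('__'))


def documentable_names(source):
    """Return the top-level names that rule 1 says should be examined.

    The source text is parsed, never imported (rule 3).
    """
    tree = ast.parse(source)
    defined = []
    exported = None
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.append(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if not isinstance(target, ast.Name):
                    continue
                if target.id == '__all__':
                    # Assumption: '__all__' is a literal list or tuple.
                    exported = list(ast.literal_eval(node.value))
                else:
                    defined.append(target.id)
    if exported is not None:
        # '__all__' present: only the listed identifiers are examined.
        return [name for name in defined if name in exported]
    # '__all__' absent: everything except private names is examined.
    return [name for name in defined if not is_private(name)]
```

Because the module is never executed, this approach is safe for untrusted
code, at the cost of missing names that are only created at run time.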
Attribute Docstrings
XXX A description of attribute docstrings would be appropriate in the
Docstring Conventions PEP.
(This is a simplified version of PEP 224 by Marc-Andre Lemburg.)
A string literal immediately following an assignment statement is
interpreted by the docstring extraction machinery as the docstring of
the target of the assignment statement, under the following conditions:
1. The assignment must be in one of the following contexts:
a) At the top level of a module (i.e., not inside a loop or
conditional): a module attribute.
b) At the top level of a class definition: a class attribute.
c) At the top level of a class' '__init__' method definition: an
instance attribute.
Since each of the above contexts is at the top level (i.e., just
inside the outermost suite of a definition), it may be necessary to
place dummy assignments for attributes assigned conditionally or in
a loop. Blank lines may be used after attribute docstrings to
emphasize the connection between the assignment and the docstring.
2. The assignment must be to a single target, not to a list or a tuple
of targets.
3. The form of the target:
a) For contexts 1a and 1b above, the target must be a simple
identifier (not a dotted identifier, a subscripted expression, or
a sliced expression).
b) For context 1c above, the target must be of the form
'self.attrib', where 'self' matches the '__init__' method's first
parameter (the instance parameter) and 'attrib' is a simple
identifier as in 3a.

    g = 'module attribute (global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            self.i = 'instance attribute'
            """This is self.i's docstring."""

Additional Docstrings
XXX A description of additional docstrings would be appropriate in the
Docstring Conventions PEP.
Many programmers would like to make extensive use of docstrings for API
documentation. However, docstrings do take up space in the running
program, so some of these programmers are reluctant to 'bloat up' their
code. Also, not all API documentation is applicable to interactive
environments, where __doc__ would be displayed.
The docstring processing system's extraction tools will concatenate all
string literal expressions which appear at the beginning of a
definition or after a simple assignment. Only the first strings in
definitions will be available as __doc__, and can be used for brief
usage text suitable for interactive sessions; subsequent string
literals and all attribute docstrings are ignored by the Python
bytecode compiler and may contain more extensive API information.

    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the bytecode
        compiler, but extracted by the docstring processing system.
        """
        pass

Issue: This breaks 'from __future__ import' statements in Python 2.1 for
multiple module docstrings. Resolution?
1. Should we search for docstrings after a __future__ statement? Very
2. Redefine __future__ statements to allow multiple preceding string
literals?
3. Or should we not even worry about this? There shouldn't be
__future__ statements in production code, after all. Modules with
__future__ statements will have to put up with the single-docstring
limitation.
Choice of Docstring Format
Rather than force everyone to use a single docstring format, multiple
input formats are allowed by the processing system. A special variable,
__docformat__, may appear at the top level of a module before any
function or class definitions. Over time or through decree, a standard
format or set of formats should emerge.
The __docformat__ variable is a string containing the name of the
format being used, a case-insensitive string matching the input
parser's module or package name (i.e., the same name as required to
'import' the module or package), or a registered alias. If no
__docformat__ is specified, the default format is 'plaintext' for now;
this may be changed to the standard format once determined.
The __docformat__ string may contain an optional second field,
separated from the format name (first field) by a single space: a
case-insensitive language identifier as defined in RFC 1766. A
typical language identifier consists of a 2-letter language code from
ISO 639 (3-letter codes used only if no 2-letter code exists; RFC
1766 is currently being revised to allow 3-letter codes). If no
language identifier is specified, the default is 'en' for English. The
language identifier is passed to the parser and can be used for
language-dependent markup features.
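
The two-field layout and its defaults can be sketched as a small helper.
The function name 'parse_docformat' is hypothetical, and normalizing both
fields to lower case is one possible reading of "case-insensitive".

```python
def parse_docformat(value=None):
    """Split a __docformat__ string into (format name, language)."""
    if not value:
        # No __docformat__ given: fall back to the stated defaults.
        return 'plaintext', 'en'
    fields = value.split(' ', 1)
    name = fields[0].lower()
    language = fields[1].strip().lower() if len(fields) > 1 else ''
    return name, language or 'en'


print(parse_docformat('PlainText FR'))
```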
- package 'dps'
    - function 'dps.main()' (in 'dps/__init__.py')
    - package 'dps.parsers'
        - module 'dps.parsers.model'; see 'Input Parser API' below.
    - package 'dps.formatters'
        - module 'dps.formatters.model'; see 'Output Formatter API' below.
    - package 'dps.languages'
        - module 'dps.languages.en' (English)
        - others to be added
    - utility modules: 'dps.statemachine'
XXX To be determined.
System Python API
XXX To be determined.
Input Parser API
Each input parser is a module or package exporting a 'Parser' class,
with the following interface:

    class Parser:

        def __init__(self, inputstring, errors='warn', language='en'):
            """Initialize the Parser instance."""

        def parse(self):
            """Return a DOM tree, the parsed input string."""

XXX This needs a lot of work. What is required for this API?
A model 'Parser' class implementing the full interface along with
utility functions can be found in the 'dps.parsers.model' module.
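
As an illustration only, a trivial plaintext parser with this constructor
signature might look like the sketch below. The method name 'parse', the
use of 'xml.dom.minidom', and the element names 'document' and
'paragraph' are all assumptions; they are not taken from the DTD or from
'dps.parsers.model'.

```python
from xml.dom import minidom


class Parser:
    """Hypothetical plaintext parser: one <paragraph> per text block."""

    def __init__(self, inputstring, errors='warn', language='en'):
        self.inputstring = inputstring
        self.errors = errors        # unused in this sketch
        self.language = language    # unused in this sketch

    def parse(self):
        """Return a DOM tree, the parsed input string."""
        document = minidom.Document()
        root = document.createElement('document')
        document.appendChild(root)
        # Blocks are separated by blank lines; whitespace is normalized.
        for block in self.inputstring.split('\n\n'):
            text = ' '.join(block.split())
            if text:
                paragraph = document.createElement('paragraph')
                paragraph.appendChild(document.createTextNode(text))
                root.appendChild(paragraph)
        return document


dom = Parser('First paragraph.\n\nSecond\nparagraph.').parse()
print(dom.documentElement.toxml())
```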
Output Formatter API
Each output formatter is a module or package exporting a 'Formatter'
class, with the following interface:

    class Formatter:

        def __init__(self, domtree, language='en', showwarnings=0):
            """Initialize the Formatter instance."""

        def format(self):
            """
            Return a formatted string representation of the DOM tree.
            """

XXX This also needs a lot of work. What is required for this API?
A model 'Formatter' class implementing the full interface along with
utility functions can be found in the 'dps.formatters.model' module.
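
A matching formatter could be as small as the following sketch. The
method name 'format', the 'paragraph' element name, and the plain-text
output layout are assumptions made for illustration; the real interface
is still to be determined.

```python
from xml.dom import minidom


class Formatter:
    """Hypothetical formatter: renders each <paragraph> as plain text."""

    def __init__(self, domtree, language='en', showwarnings=0):
        self.domtree = domtree
        self.language = language            # unused in this sketch
        self.showwarnings = showwarnings    # unused in this sketch

    def format(self):
        """Return a formatted string representation of the DOM tree."""
        paragraphs = []
        for node in self.domtree.getElementsByTagName('paragraph'):
            text = ''.join(child.data for child in node.childNodes
                           if child.nodeType == child.TEXT_NODE)
            paragraphs.append(text)
        # Paragraphs are separated by one blank line.
        return '\n\n'.join(paragraphs) + '\n'


tree = minidom.parseString(
    '<document><paragraph>Hello.</paragraph>'
    '<paragraph>World.</paragraph></document>')
print(Formatter(tree).format(), end='')
```

Because the formatter only sees the DOM tree, any parser that produces
the same element names could feed it without further changes.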
Language Module API
Language modules will contain language-dependent strings and mappings.
They will be named for their language identifier (as defined in 'Choice
of Docstring Format' above), converting dashes to underscores.
XXX Specifics to be determined.
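
The naming rule can be sketched as a one-line mapping. The function name
is hypothetical, and lower-casing the identifier is an assumption based
on language identifiers being case-insensitive.

```python
def language_module_name(identifier, package='dps.languages'):
    """Map a language identifier to its language module's dotted name."""
    # Dashes are not valid in module names, so they become underscores.
    return package + '.' + identifier.lower().replace('-', '_')


print(language_module_name('en'))
```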
Intermediate Data Structure
A single intermediate data structure is used internally by the
docstring processing system. This data structure is a DOM tree whose
schema is documented in an XML DTD (eXtensible Markup Language Document
Type Definition), which comes in three parts:
- the Python Plaintext Document Interface DTD, ppdi.dtd,
- the Generic Plaintext Document Interface DTD, gpdi.dtd,
- and the OASIS Exchange Table Model, soextbl.dtd.
The DTD defines a rich set of elements, suitable for any input syntax
or output format. The input parser and the output formatter share the
same intermediate data structure. The processing system may do
transformations on the data from the input parser before passing it on
to the output formatter. The DTD retains all information necessary to
reconstruct the original input text, or a reasonable facsimile thereof.
XXX Specifics (about the DOM tree) to be determined.
XXX To be determined.
Type of output: filesystem only, or in-memory data structure too?
File/directory naming & structure conventions. In-memory data structure
should follow filesystem naming; file/directory == leaf/node. Use a
directory hierarchy rather than long file names (long file names were
one of the reasons pythondoc couldn't run on MacOS).
References and Footnotes