[Python-Dev] Triple-quoted strings and indentation
Andrew Durdin
adurdin at gmail.com
Wed Jul 6 11:45:52 CEST 2005
Here's the draft PEP I wrote up:
Abstract
Triple-quoted string (TQS henceforth) literals in Python preserve
the formatting of the literal string including newlines and
whitespace. When a programmer desires no leading whitespace for
the lines in a TQS, he must align all lines but the first in the
first column, which differs from the syntactic indentation when a
TQS occurs within an indented block. This PEP addresses this
issue.
Motivation
TQS's are generally used in two distinct manners: as multiline
text used by the program (typically command-line usage information
displayed to the user) and as docstrings.
Here's a hypothetical but fairly typical example of a TQS as a
multiline string:
    if not interactive_mode:
        if not parse_command_line():
            print """usage: UTIL [OPTION] [FILE]...
    try `util -h' for more information."""
            sys.exit(1)
Here the second line of the TQS begins in the first column, which
at a glance appears to occur after the close of both "if" blocks.
This results in a discrepancy between how the code is parsed and
how the reader initially sees it, forcing the reader to clear the
mental hurdle of realising that the call to sys.exit() is actually
within the second "if" block.
Docstrings on the other hand are usually indented to be more
readable, which causes them to have extraneous leading whitespace
on most lines. To counteract the problem, PEP 257 [1] specifies a
standard algorithm for trimming this whitespace.
In the end, the programmer is left with a dilemma: either to align
the lines of his TQS to the first column, and sacrifice readability;
or to indent it to be readable, but have to deal with unwanted
whitespace.
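The dilemma can be seen in a small sketch. It uses Python 3 syntax rather than the Python 2 of this post, `usage_text` is a hypothetical helper, and `inspect.cleandoc` is a standard-library function added after this post was written; it stands in here for any run-time cleanup in the style of PEP 257.

```python
import inspect


def usage_text():
    # Indented to match the surrounding block, so the second line of the
    # literal carries four leading spaces inside the string itself.
    return """usage: UTIL [OPTION] [FILE]...
    try `util -h' for more information."""


text = usage_text()
second_line = text.split("\n")[1]
print(repr(second_line))       # shows the unwanted leading whitespace

# Run-time cleanup in the style of PEP 257 removes the common indentation.
print(inspect.cleandoc(text))
```

The readable layout costs either stray whitespace in the value or an extra cleanup call each time the string is used.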
This PEP proposes that TQS's should have a certain amount of
leading whitespace trimmed by the parser, thus avoiding the
drawbacks of the current behaviour.
Specification
Leading whitespace in TQS's will be dealt with in a similar manner
to that proposed in PEP 257:
"... strip a uniform amount of indentation from the second
and further lines of the [string], equal to the minimum
indentation of all non-blank lines after the first line. Any
indentation in the first line of the [string] (i.e., up to
the first newline) is insignificant and removed. Relative
indentation of later lines in the [string] is retained."
Note that a line within the TQS that is entirely blank or consists
only of whitespace will not count toward the minimum indent, and will
be retained as a blank line (possibly with some trailing whitespace).
There are several significant differences between this proposal and
PEP 257's docstring parsing algorithm:
* This proposal considers all lines to end at the next newline in
the source code (whether escaped or not); PEP 257's algorithm
only considers lines to end at the next (necessarily unescaped)
newline in the parsed string.
* Only literal whitespace is counted; an escape such as \x20
will not be counted as indentation.
* Tabs are not converted to spaces.
* Blank lines at the beginning and end of the TQS will *not* be
stripped.
* Leading whitespace on the first line is preserved, as is
trailing whitespace on all lines.
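The rules above can be sketched roughly as follows. This is not the implementation mentioned later in this post, only an illustration in Python 3 under several assumptions: the function works on the raw source text between the triple quotes, before escape processing; indentation is measured in literal space and tab characters and is assumed to be consistent; and the final source line, which holds the closing quotes, always counts toward the minimum (an interpretation based on Example 3). The names `leading_ws`, `trim_tqs` and `evaluate` are hypothetical.

```python
import ast


def leading_ws(line):
    """Number of literal space/tab characters at the start of a line."""
    return len(line) - len(line.lstrip(" \t"))


def trim_tqs(raw):
    """Strip the minimum indent from every source line after the first."""
    lines = raw.split("\n")      # escaped newlines also end a source line
    if len(lines) == 1:
        return raw               # single-line literal: nothing to trim
    # Whitespace-only lines are ignored when computing the margin, except
    # the last line, which in the source contains the closing quotes.
    counted = [ln for ln in lines[1:-1] if ln.strip(" \t")] + [lines[-1]]
    margin = min(leading_ws(ln) for ln in counted)
    # The first line is kept untouched; trailing whitespace is preserved.
    return "\n".join([lines[0]] + [ln[margin:] for ln in lines[1:]])


def evaluate(raw, quote="'''"):
    """Apply ordinary escape processing to the trimmed source text."""
    return ast.literal_eval(quote + trim_tqs(raw) + quote)


# Raw source text of a literal laid out like the proposed form of Example 3.
raw = (
    "\\\n"
    "      * first line\n"
    "        second line\\\n"
    "    "
)
print(repr(evaluate(raw)))
```

Because trimming happens before escapes are interpreted, an escape such as \x20 at the start of a line is never mistaken for indentation.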
Rationale
I considered several different ways of determining
the amount of whitespace to be stripped, including:
1. Determined by the column (after allowing for expanded tabs) of
the triple-quote:
    myverylongvariablename = """\
                               This line is indented,
                             But this line is not.
                             Note the trailing newline:
                             """
+ Easily allows all lines to be indented.
- Easily leads to problems due to re-alignment of all but
first line when mixed tabs and spaces are used.
- Forces programmers to use a particular level of
indentation for continuing TQS's.
- Unclear whether the lines should align with the triple-
quote or immediately after it.
- Not backward compatible with most non-docstrings.
2. Determined by the indent level of the second line of the
string:
    myverylongvariablename = """\
        This line is not indented (and has no leading newline),
          But this one is.
        Note the trailing newline:
        """
+ Allows for flexible alignment of lines.
+ Mixed tabs and spaces should be fine (as long as they're
consistent).
- Cannot support an indent on the second line of the
string (very bad!).
- Not backward compatible with most non-docstrings.
3. Determined by the minimum indent level of all lines after the
first:
    myverylongvariablename = """\
          This line is indented,
        But this line is not.
        Note the trailing newline:
        """
+ Allows for flexible alignment of lines.
+ Mixed tabs and spaces should be fine (as long as they're
consistent).
+ Backward compatible with all docstrings and a majority of
non-docstrings.
- Support for indentation on all lines is not immediately
obvious.
Overall, solution 3 provided the best balance of features, and
(importantly) had the best backward compatibility. I thus
consider it the most suitable.
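The difference between options 2 and 3 can be illustrated with a short sketch in Python 3. The function names are hypothetical and the sample lines are only an assumed layout in which the second line of the string is indented two spaces further than the rest. Option 1 is left out because its margin depends on the column of the opening triple quote rather than on the string's own lines.

```python
def leading_ws(line):
    """Number of literal space/tab characters at the start of a line."""
    return len(line) - len(line.lstrip(" \t"))


def margin_second_line(later_lines):
    """Option 2: the second line of the string fixes the margin."""
    return leading_ws(later_lines[0])


def margin_minimum(later_lines):
    """Option 3: the least-indented non-blank later line fixes the margin."""
    return min(leading_ws(ln) for ln in later_lines if ln.strip())


# Lines of the literal after the first (assumed sample layout).
later_lines = [
    "      This line is indented,",
    "    But this line is not.",
    "    Note the trailing newline:",
]

for name, margin in [
    ("option 2", margin_second_line(later_lines)),
    ("option 3", margin_minimum(later_lines)),
]:
    kept = leading_ws(later_lines[0]) - margin
    print(name, "margin:", margin, "indent kept on second line:", kept)
```

Under option 2 the second line can never keep any indentation of its own, which is the drawback noted above; under option 3 it keeps whatever exceeds the minimum.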
Examples
The examples here are set out in pairs: the first of each pair
shows how the TQS must be currently written to avoid indentation
issues; the second shows how it can be written using this proposal
(although some variation is possible). All examples are taken or
adapted from the Python standard library or another real source.
1. Command-line usage information:
    def usage(outfile):
        outfile.write("""Usage: %s [OPTIONS] <file> [ARGS]
    Meta-options:
    --help                Display this help then exit.
    --version             Output version information then exit.
    """ % sys.argv[0])
    #------------------------#
    def usage(outfile):
        outfile.write("""Usage: %s [OPTIONS] <file> [ARGS]
            Meta-options:
            --help                Display this help then exit.
            --version             Output version information then exit.
            """ % sys.argv[0])
2. Embedded Python code:
    self.runcommand("""if 1:
        import sys as _sys
        _sys.path = %r
        del _sys
        \n""" % (sys.path,))
    #------------------------#
    self.runcommand("""\
        if 1:
            import sys as _sys
            _sys.path = %r
            del _sys
        \n""" % (sys.path,))
3. Unit testing:
    class WrapTestCase(BaseTestCase):
        def test_subsequent_indent(self):
            # Test subsequent_indent parameter
            expect = '''\
      * This paragraph will be filled, first
        without any indentation, and then
        with some (including a hanging
        indent).'''
            result = fill(self.text, 40,
                          initial_indent="  * ",
                          subsequent_indent="    ")
            self.check(result, expect)
    #------------------------#
    class WrapTestCase(BaseTestCase):
        def test_subsequent_indent(self):
            # Test subsequent_indent parameter
            expect = '''\
                  * This paragraph will be filled, first
                    without any indentation, and then
                    with some (including a hanging
                    indent).\
                '''
            result = fill(self.text, 40,
                          initial_indent="  * ",
                          subsequent_indent="    ")
            self.check(result, expect)
Example 3 illustrates how indentation of all lines (by 2 spaces)
is achieved with this proposal: the position of the closing
triple quote is used to determine the minimum indentation for the
whole string. To avoid a trailing newline in the string, the
final newline is escaped. Example 2 avoids the need for this
construction by placing the first line (which is not indented) on
the line after the triple-quote, and escaping the leading
newline.
Backwards Compatibility
Uses of TQS's fall into two broad categories: those where
indentation is significant, and those where it is not. Those in
the latter (larger) category, which includes all docstrings, will
remain effectively unchanged under this proposal. Docstrings in
particular are usually trimmed according to the rules in PEP 257
before their value is used; the trimmed strings will be the same
under this proposal as they are now.
Of the former category, the majority are those which have at least
one line beginning in the first column of the source code; these
will be entirely unaffected if left alone, but may be reformatted
to increase readability (see example 1 above). However a small
number of strings in this first category depend on all lines (or
all but the first) being indented. Under this proposal, these
will need to be edited to ensure that the intended amount of
whitespace is preserved. Examples 2 and 3 above show two
different ways to reformat the strings for these cases. Note that
in both examples, the overall indentation of the code is cleaner,
producing more readable code.
Some evidence may be desired to support the claims made above
regarding the distribution of the different uses of TQS's. I have
begun some analysis to produce some statistics for these; while
still incomplete, I have some initial results for the Python 2.4.1
standard library (these figures should not be off by more than a
small margin):
In the standard library (some 396,598 lines of Python code), there
are 7,318 occurrences of TQS's, an average rate of one per 54
lines. Of these, 6,638 (90.7%) are docstrings; the remaining 680
(9.3%) are not. A further examination shows that only 64 of these
non-docstrings (0.9% of all TQS's) have leading indentation on all
lines (the only case where the proposed solution is not backward
compatible). These must be manually checked to determine
whether they will be affected; such a check reveals only 7-15
TQS's (0.1%-0.2% of the total) that actually need to be edited.
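The script used to gather these figures is not shown in this post. One way such a survey could be approximated, as a sketch in Python 3 using the standard `tokenize` module, is to collect every triple-quoted literal and flag those whose later lines are all indented. The names `triple_quoted_bodies` and `needs_manual_check` are hypothetical, and the sketch makes no attempt to separate docstrings from other strings.

```python
import io
import tokenize


def triple_quoted_bodies(source):
    """Yield the raw text between the quotes of each triple-quoted literal."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        text = tok.string.lstrip("rRbBuUfF")   # drop any string prefix
        if text.startswith(('"""', "'''")):
            yield text[3:-3]


def needs_manual_check(body):
    """True if every non-blank line after the first starts with whitespace."""
    later = [ln for ln in body.split("\n")[1:] if ln.strip()]
    return bool(later) and all(ln[0] in " \t" for ln in later)


# A tiny stand-in for a module's source code.
source = 'a = """x\n  y\n  z"""\nb = """p\nq\n  r"""\nc = "plain"\n'

bodies = list(triple_quoted_bodies(source))
flagged = [body for body in bodies if needs_manual_check(body)]
print(len(bodies), "triple-quoted strings,", len(flagged), "to check by hand")
```

A string is flagged only when its minimum indent after the first line is non-zero, which is the one case where the proposed trimming would change its value.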
Although small, the impact of this proposal on compatibility is
still more than negligible; if accepted in principle, it might be
better suited to be initially implemented as a __future__ feature,
or perhaps relegated to Python 3000.
Implementation
An implementation for this proposal has been made; however I have
not yet made a patch file with the changes, nor do the changes yet
extend to the documentation or other affected areas.
References
[1] PEP 257, Docstring Conventions, David Goodger, Guido van Rossum
http://www.python.org/peps/pep-0257.html
Copyright
This document has been placed in the public domain.