[Doc-SIG] docstring markup: assorted thoughts..
Edward D. Loper
edloper@gradient.cis.upenn.edu
Fri, 30 Mar 2001 01:37:13 EST
I haven't been able to spend much time on doc-sig this week (and
probably won't be able to spend much time this weekend, either). I've
been reading, but having to hold myself back from responding, because
otherwise I'd never get my other work done. :) But I did want to
make my positions on a few things clear... All that follows is
just my opinion; feel free to disagree (preferably vocally rather
than silently :) ).
1. I strongly believe that docstring markup should not include
*any* "heuristic" rules. Put another way, I believe that if the
markup will distinguish case A from case B, there should be a
single, simple mechanism for distinguishing them. Examples of
heuristic rules are "paragraphs that don't have normal indentation
are treated as literals," and "ordered lists should be detected by
one of umpteen cases, and only recognized when 2 elements with
consecutive numbers occur in a row". These rules work on the
principle of making misuse "improbable." They are a Bad Idea,
because inevitably they will bite you. What should you do instead?
Say things like: "literal paragraphs are marked with form XYZ;
non-literal paragraphs must have normal indentation; otherwise, it's
an error, and tools should complain".. (There's nothing wrong
with defining syntax errors in a markup language!)
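To make the non-heuristic approach concrete, here is a minimal Python sketch. It assumes (purely for illustration) that "form XYZ" is a paragraph ending in '::' followed by exactly one indented paragraph; the names `MarkupError` and `classify_paragraphs` are hypothetical, not part of any existing tool. Anything that doesn't fit the rule is reported as an error rather than guessed at.

```python
class MarkupError(ValueError):
    """Raised when a docstring breaks the (hypothetical) markup rules."""


def classify_paragraphs(paragraphs, base_indent=0):
    """Label each paragraph 'text' or 'literal' without guessing.

    Assumed rule: a paragraph ending in '::' introduces exactly one
    literal paragraph, which must be indented more than base_indent.
    Every other paragraph must sit exactly at base_indent.
    """
    result = []
    expect_literal = False
    for para in paragraphs:
        if not para.strip():
            continue  # skip empty paragraphs
        lines = [line for line in para.splitlines() if line.strip()]
        indent = min(len(line) - len(line.lstrip()) for line in lines)
        if expect_literal:
            if indent <= base_indent:
                raise MarkupError("literal block after '::' must be indented")
            result.append(("literal", para))
            expect_literal = False
        else:
            if indent != base_indent:
                raise MarkupError("unexpected indentation: %r" % lines[0])
            result.append(("text", para))
            expect_literal = para.rstrip().endswith("::")
    if expect_literal:
        raise MarkupError("'::' at end of docstring with no literal block")
    return result
```

The point of the sketch is only that there is one way to get a literal paragraph; an oddly indented paragraph is a syntax error, not a silent literal.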
1.a. As a sidenote, this applies to URL detection, too. In
particular, I think that we should take one of two courses:
either give an explicit markup for URLs, or have the markup
language say *nothing* about how to detect URLs, except maybe
that tools may try to do it if they want. In particular, the
*markup language specification* should not say things like "A
url is anything matching big-ugly-regexp XYZ."
I *believe* from what Guido's been saying that he would agree with
me on this (that the markup should not have heuristic rules).. Care
to confirm that?
For me, context-sensitive use of punctuation is a borderline case..
For example, saying that '*' is a delimiter if it has whitespace
on one side and non-whitespace on the other, but that it's an
asterisk otherwise.. The most reasonable case for this can be
made for apostrophes, which are used for 'quoting' and for
contractions; 'but it's presumably easy to tell them apart.' I'm
not *terribly* happy with context-sensitive punctuation, but I
could certainly live with it.
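For reference, the '*' rule described above can be written down literally in a few lines of Python. This is only a sketch: the function name `classify_asterisks` is hypothetical, and treating the start and end of the string as whitespace is an assumption the rule itself doesn't state.

```python
def classify_asterisks(text):
    """Return (index, kind) for every '*' in text.

    kind is 'open' (whitespace before, non-whitespace after),
    'close' (non-whitespace before, whitespace after), or
    'asterisk' (anything else). String boundaries are treated
    as whitespace, which is an assumption of this sketch.
    """
    found = []
    for i, ch in enumerate(text):
        if ch != "*":
            continue
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        ws_before = before.isspace()
        ws_after = after.isspace()
        if ws_before and not ws_after:
            kind = "open"
        elif ws_after and not ws_before:
            kind = "close"
        else:
            kind = "asterisk"
        found.append((i, kind))
    return found
```

Read literally, the rule treats both stars in "(*any*)" as plain asterisks, since each has non-whitespace on both sides. Cases like that are part of why context-sensitive punctuation stays borderline.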
2. There's often an attitude of "let's start off with something
simple, and then add rules/heuristics later". This is a Bad
Idea. Once you establish rules, people start using them. That
makes it much harder to change them. For example, it's not nice to
first tell people that '#' is not a markup character, let them
write lots of docs, and then later tell them that you've decided
that it's a markup character after all.. When we design this
markup language, we should do it with the future in mind..
3. As I said in
<http://mail.python.org/pipermail/doc-sig/2001-March/001594.html>,
I believe that the most fundamental feature that the markup
language must have is the ability to distinguish natural language
text from other text. The way I have been envisioning ST fitting
into this is: ANYTHING that's not natural language should be
quoted. Thus, the fact that < and > and * etc. are used for markup
is not a problem, because they're never really used in natural
language. If you want to use them, you quote them, like:
'x*y>z'.
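A rough Python sketch of the "quote everything that isn't natural language" idea follows. It assumes a backquote as the quote character (just one of the candidates, chosen here so contractions don't get in the way) and ignores escaping entirely; `split_quoted` is a hypothetical name.

```python
QUOTE = "`"  # assumed quote character, for illustration only


def split_quoted(text):
    """Split text into ('nl', ...) and ('literal', ...) segments.

    Everything between a pair of quote characters is literal;
    everything else is natural language, where characters such
    as '<', '>' and '*' would be free to act as markup.
    Escaping the quote character is not handled.
    """
    parts = text.split(QUOTE)
    if len(parts) % 2 == 0:
        # An odd number of quote characters means one is unmatched.
        raise ValueError("unbalanced quote character")
    segments = []
    for i, part in enumerate(parts):
        if not part:
            continue
        # Odd-numbered pieces sit between a pair of quotes.
        segments.append(("literal" if i % 2 else "nl", part))
    return segments
```

With this, "If `x*y>z` holds" splits into a natural-language piece, the literal 'x*y>z', and another natural-language piece, so the markup characters inside the quotes never reach the markup parser.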
However, I get the impression that most people (including Guido?)
think that quoting everything that's not natural language is too
difficult. (There's also the somewhat orthogonal issue of how to
escape your quote character, but let's ignore that for now).
That's an opinion I can respect, although I *personally* would be
quite willing to quote all non-natural-language text when writing
docstrings.. And I *personally* really don't care what character
we use to do that quoting (apostrophe, backquote, hash, etc.), as
long as the contexts it's used in are not contexts that that
character would ever be used in for NL (=natural language).
Perhaps most people would argue that there are really 3 categories
here: natural language text; "literal text"; and everything
else.. Where "everything else" is not NL, but should still be
rendered in non-monospace, etc. With this view, it *does* make
more sense to try to severely limit how many "special" characters
there are, since "everything else" is likely to use every character
you can think of, and then some.
So I think that we should do one of the following:
* decide that we don't mind quoting all non-natural-language
text, and pick a quote character. Try to keep markup characters
to a minimum, but not worry too much if we end up with more
than 2 or 3 total.
* decide that we want the 3 categories, and much more carefully
pick a quote character, or maybe one or two "markup characters"
that will be used for all inline markup (e.g., '<>' in POD).
Note that, for the most part, we all agree that using '::' to mark
the beginning of literal blocks, and indentation to mark their end,
is acceptable.. So really we just need to worry about how to mark
inline non-natural-language text and/or inline literal text.
4. Once we figure out how to mark inline non-NL text, we can think
some about my second most fundamental feature: the ability to flag
semantic fields, like descriptions of specific parameters, or of
the return value, or of exceptions that can be raised. Obviously,
these semantic fields will be very useful to tools. Javadoc does
this with forms like::
    @param p A planet
    @return The size of the given planet
Most people think we should use trailing colons instead of leading
at-signs, so we might have something like::
    param q: The question of life, the universe and everything
    return: The answer to life, the universe, and everything
(or 'arg' or 'argument' or whatever we decide) Then there's just
the question of marking the scope of such expressions.. I believe
that the most reasonable thing to do is to use indentation, since
we're using "python style colons" anyway. So you can say either::
    param x: description of x...
or::
    param x:
        description of x...
        more description of x..
        * maybe even a bulleted list.
Of course, we'll have to make some decisions about where blank
lines are required, etc., and how to distinguish this from the use
of ':' in natural language.. (and no, I won't be happy with any
heuristic rules for doing this. :) ). But I believe that this is
on the right track. Another alternative, if we decide that we like
description lists, is something like::
    Parameters:
        x -- description of x
        y -- description of y
But I think that we should *only* do that if you can also do it in
text, because otherwise people will get confused.
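To show how a tool might consume colon-style fields scoped by indentation, here is a small Python sketch. The tag names ('param', 'return'), the regular expression, and the function `parse_fields` are all assumptions made for illustration. It is handed only the block of fields, so it sidesteps the open question of telling a field's ':' apart from a ':' in natural language.

```python
import re

# Hypothetical field syntax: "param NAME: text" or "return: text".
FIELD_RE = re.compile(r"^(param|return)(?:\s+(\w+))?:\s*(.*)$")


def parse_fields(block):
    """Parse a block of fields into (tag, name, description) tuples.

    An unindented line must start a field; indented lines belong
    to the most recent field. Anything else is an error rather
    than a guess. Blank lines are ignored in this sketch.
    """
    fields = []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if not fields:
                raise ValueError("indented text before any field: %r" % line)
            fields[-1][2].append(line.strip())
            continue
        match = FIELD_RE.match(line)
        if match is None:
            raise ValueError("not a field: %r" % line)
        tag, name, first = match.groups()
        fields.append((tag, name, [first] if first else []))
    return [(tag, name, " ".join(body)) for tag, name, body in fields]
```

Both the one-line form and the indented multi-line form of 'param' end up as the same kind of tuple, which is the sort of uniform structure a documentation tool would want.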
5. Currently, I'm leaning towards making the markup pretty
lightweight.. I.e., it wouldn't contain too many other features.
Features it might contain include lists, some sort of sectioning
mechanism, some sort of emphasis mechanism, some way of doing
endnotes, and some way of marking URLs.
-----
In case people (incl Guido?) are interested, the following page
contains some essays/etc. I wrote on formatted doc strings..
www.cis.upenn.edu/~edloper/pydoc
Also, the PEP I've been writing gives (what I hope is) a clear
definition of what we have been coming up with over the last
month.. Unfortunately, it's not complete (it's currently about 700
lines, and probably about 1/2 complete..), and I've put it on hold for
now, both because I've become busy, and because I think we may end up
wanting to change a lot of it. But, as I've said before, if anyone
wants a copy of what it says so far, send me email and I'll be happy
to send you a copy.