[Doc-SIG] docstring markup: assorted thoughts..

Edward D. Loper edloper@gradient.cis.upenn.edu
Fri, 30 Mar 2001 01:37:13 EST

I haven't been able to spend much time on doc-sig this week (and
probably won't be able to spend much time this weekend, either).  I've
been reading, but having to hold myself back from responding, because
otherwise I'd never get my other work done. :)  But I did want to
make my positions on a few things clear...  All that follows is 
just my opinion; feel free to disagree (preferably vocally rather 
than silently :) ).

1. I strongly believe that docstring markup should not include
   *any* "heuristic" rules.  Put another way, I believe that if the
   markup will distinguish case A from case B, there should be a
   single, simple mechanism for distinguishing them.  Examples of
   heuristic rules are "paragraphs that don't have normal indentation
   are treated as literals," and "ordered lists should be detected by
   one of umpteen cases, and only recognized when 2 elements with
   consecutive numbers occur in a row".  These rules work on the
   principle of making misuse "improbable."  They are a Bad Idea,
   because inevitably they will bite you.  What should you do instead?
   Say things like: "literal paragraphs are marked with form XYZ;
   non-literal paragraphs must have normal indentation; otherwise, it's
   an error, and tools should complain"..  (There's nothing wrong 
   with defining syntax errors in a markup language!)
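
   To make the "complain instead of guessing" idea concrete, here is
   a rough Python sketch.  It assumes (purely for illustration) that
   "form XYZ" is a preceding paragraph ending in '::', and that
   "normal indentation" means every line starts in column 0.  The
   names MarkupError and classify_paragraphs are made up for this
   example; none of this is a proposed spec.

```python
class MarkupError(Exception):
    """Raised when a docstring breaks the (hypothetical) markup rules."""


def classify_paragraphs(text):
    """Split text into ('text' | 'literal', paragraph) pairs, or raise."""
    result = []
    expect_literal = False
    for para in text.split("\n\n"):
        lines = [ln for ln in para.split("\n") if ln.strip()]
        if not lines:
            continue
        indents = [len(ln) - len(ln.lstrip(" ")) for ln in lines]
        if expect_literal:
            # A literal paragraph must be indented; it is kept verbatim.
            if min(indents) == 0:
                raise MarkupError("expected an indented literal paragraph")
            result.append(("literal", "\n".join(lines)))
            expect_literal = False
        else:
            # An ordinary paragraph must start in column 0 on every line.
            if any(indents):
                raise MarkupError("unexpected indentation: %r" % lines[0])
            result.append(("text", " ".join(lines)))
            # Assumed marker: a trailing '::' announces a literal paragraph.
            expect_literal = lines[-1].rstrip().endswith("::")
    if expect_literal:
        raise MarkupError("'::' is not followed by a literal paragraph")
    return result
```

   The point is only that an indented paragraph with no marker in
   front of it is an error here, not a silently guessed literal.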

   1.a. As a sidenote, this applies to URL detection, too.  In
        particular, I think that we should take one of two courses:
        either give an explicit markup for URLs, or have the markup
        language say *nothing* about how to detect URLs, except maybe
        that tools may try to do it if they want.  In particular, the
        *markup language specification* should not say things like "A
        url is anything matching big-ugly-regexp XYZ."

   I *believe* from what Guido's been saying that he would agree with
   me on this (that the markup should not have heuristic rules).. Care 
   to confirm that?

   For me, context-sensitive use of punctuation is a borderline case..
   For example, saying that '*' is a delimiter if it has whitespace
   on one side and non-whitespace on the other, but that it's an 
   asterisk otherwise..  The most reasonable case for this can be 
   made for apostrophes, which are used for 'quoting' and for 
   contractions; 'but it's presumably easy to tell them apart.'  I'm
   not *terribly* happy with context-sensitive punctuation, but I
   could certainly live with it.
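
   As a sketch of what the '*' rule above would mean if taken
   literally, the following Python labels each asterisk by looking
   only at its two neighbours.  The function name and the choice to
   treat the start and end of the string as whitespace are
   assumptions made for the example.

```python
def classify_asterisks(text):
    """Label each '*' in text as 'open', 'close', or 'literal'."""
    labels = []
    for i, ch in enumerate(text):
        if ch != "*":
            continue
        # Assumption: the start and end of the string count as whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isspace() and not after.isspace():
            labels.append((i, "open"))      # whitespace, then '*', then text
        elif not before.isspace() and after.isspace():
            labels.append((i, "close"))     # text, then '*', then whitespace
        else:
            labels.append((i, "literal"))   # a plain asterisk
    return labels
```

   Under this reading, "2 * 3" and "2*3" both contain a plain
   asterisk, while "a *b* c" contains an opening and a closing
   delimiter.  Note that in "*b*," the second asterisk is followed by
   a comma, so it comes out as a plain asterisk; a real rule would
   have to say something explicit about punctuation.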

2. There's often an attitude of "let's start off with something
   simple, and then add rules/heuristics later".  This is a Bad
   Idea.  Once you establish rules, people start using them.  That
   makes it much harder to change them.  For example, it's not nice to 
   first tell people that '#' is not a markup character, let them
   write lots of docs, and then later tell them that you've decided
   that it's a markup character after all..  When we design this
   markup language, we should do it with the future in mind..

3. As I said in
   I believe that the most fundamental feature that the markup
   language must have is the ability to distinguish natural language
   text from other text.  The way I have been envisioning ST fitting
   into this is: ANYTHING that's not natural language should be
   quoted.  Thus, the fact that < and > and * etc. are used for markup 
   is not a problem, because they're never really used in natural
   language.  If you want to use them, you quote them, like: 

   However, I get the impression that most people (including Guido?)
   think that quoting everything that's not natural language is too
   difficult.  (There's also the somewhat orthogonal issue of how to
   escape your quote character, but let's ignore that for now).
   That's an opinion I can respect, although I *personally* would be
   quite willing to quote all non-natural-language text when writing
   docstrings..  And I *personally* really don't care what character
   we use to do that quoting (apostrophe, backquote, hash, etc.), as
   long as the contexts it's used in are not contexts that that
   character would ever be used in for NL (=natural language).

   Perhaps most people would argue that there are really 3 categories
   here: natural language text; "literal text"; and everything
   else.. Where "everything else" is not NL, but should still be
   rendered in non-monospace, etc.  With this view, it *does* make
   more sense to try to severely limit how many "special" characters
   there are, since "everything else" is likely to use every character 
   you can think of, and then some.  

   So I think that we should do one of the following:
       * decide that we don't mind quoting all non-natural-language
         text, and pick a quote character.  Try to keep markup characters
         to a minimum, but not worry too much if we end up with more
         than 2 or 3 total.
       * decide that we want the 3 categories, and much more carefully 
         pick a quote character, or maybe one or two "markup characters"
         that will be used for all inline markup (e.g., '<>' in POD).

   Note that, for the most part, we all agree that using '::' to mark
   the beginning of literal blocks, and indentation to mark their end, 
   is acceptable.. So really we just need to worry about how to mark
   inline non-natural-language text and/or inline literal text.
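
   For the first option (quote all non-NL text), a minimal Python
   sketch of the inline rule might look like this.  The backquote is
   only a placeholder for whatever quote character gets picked, the
   function name is invented, and escaping the quote character is
   deliberately left out.

```python
def split_quoted(text, quote="`"):
    """Split text into ('nl', s) and ('literal', s) segments."""
    parts = text.split(quote)
    # An even number of quote characters yields an odd number of parts.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced quote character %r" % quote)
    segments = []
    for i, part in enumerate(parts):
        if not part:
            continue
        # Parts at odd positions sit between a pair of quote characters.
        segments.append(("literal" if i % 2 else "nl", part))
    return segments
```

   With this, characters like '<', '>' and '*' inside a quoted
   segment never reach the rest of the markup parser, and an
   unmatched quote character is reported as an error rather than
   guessed at.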

4. Once we figure out how to mark inline non-NL text, we can think
   some about my second most fundamental feature: the ability to flag
   semantic fields, like descriptions of specific parameters, or of
   the return value, or of exceptions that can be raised.  Obviously,
   these semantic fields will be very useful to tools.  Javadoc does
   this with forms like::

       @param p A planet
       @return The size of the given planet

   Most people think we should use trailing colons instead of leading
   at-signs, so we might have something like::

       param q: The question of life, the universe and everything
       return: The answer to life, the universe, and everything

   (or 'arg' or 'argument' or whatever we decide) Then there's just
   the question of marking the scope of such expressions..  I believe
   that the most reasonable thing to do is to use indentation, since
   we're using "python style colons" anyway.  So you can say either::

       param x: description of x...

   or::

       param x: 

           description of x...
           more description of x..
                * maybe even a bulleted list.

   of course, we'll have to make some decisions about where blank
   lines are required, etc., and how to distinguish this from the use
   of ':' in natural language..  (and no, I won't be happy with any
   heuristic rules for doing this. :) ).  But I believe that this is
   on the right track.  Another alternative, if we decide that we like
   description lists, is something like::

           x -- description of x
           y -- description of y

   But I think that we should *only* do that if you can also do it in
   text, because otherwise people will get confused.
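
   Here is a rough Python sketch of how a tool might read the
   colon-and-indentation form.  The tag names 'param' and 'return',
   the regular expression, and the function name are assumptions for
   the example.  It only handles a block of fields (no surrounding
   prose), and it treats anything that is neither a field nor an
   indented continuation as an error.

```python
import re

# Hypothetical field syntax: "param NAME: text" or "return: text".
FIELD_RE = re.compile(r"^(param|return)(?: (\w+))?:\s*(.*)$")


def parse_fields(lines):
    """Return (tag, name, body) tuples from a list of field lines.

    Lines indented more deeply than a field line belong to that field.
    """
    fields = []
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end a field in this sketch
        indent = len(line) - len(line.lstrip(" "))
        match = FIELD_RE.match(line) if indent == 0 else None
        if match:
            tag, name, first = match.groups()
            current = [tag, name, [first] if first else []]
            fields.append(current)
        elif indent > 0 and current is not None:
            current[2].append(stripped)  # continuation of the open field
        else:
            raise ValueError("neither a field nor a continuation: %r" % line)
    return [(tag, name, " ".join(body)) for tag, name, body in fields]
```

   Both the one-line form and the indented form from above come out
   the same way: a tag, an optional name, and the description text.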

5. Currently, I'm leaning towards making the markup pretty
   lightweight.. I.e., it wouldn't contain too many other features.
   Features it might contain include lists, some sort of sectioning 
   mechanism, some sort of emphasis mechanism, some way of doing
   endnotes, and some way of marking URLs.


In case people (incl Guido?) are interested, the following page
contains some essays/etc. I wrote on formatted doc strings..  


Also, the PEP I've been writing gives (what I hope is) a clear
definition of what we have been coming up with over the last
month.. Unfortunately, it's not complete (it's currently about 700
lines, and probably about 1/2 complete..), and I've put it on hold for
now, both because I've become busy, and because I think we may end up
wanting to change a lot of it.  But, as I've said before, if anyone
wants a copy of what it says so far, send me email and I'll be happy
to send you a copy.