Docstring grammar: a very revised proposal
Over the last week, I've been incorporating all of the comments that various folks made in the DOC-SIG about the docstring grammar proposal, and writing code to provide a reference implementation. It's been going well, and I was polishing up to post it.

In preparation of a discussion at IPC8's DevDay, and anticipating (perhaps mistakenly) some resistance to "yet another structured text format" from Jim Fulton, I went to look at StructuredText to find out what aspects of it I had problems with in the past.

Just as a point of historical interest, I have long been a fan of the idea behind StructuredText, and have used a fairly old version of it to generate much of my website automatically. I stopped doing so because I found it lacking for the things I was trying to get it to do, unintuitive at times, and too hard to modify for my (somewhat exotic) needs.

I just looked again at StructuredText's rules (not code) within the context of "Python code documentation" exclusively (a much more narrow domain than website management), and found it better-suited than I remembered, with three kinds of problems:

- StructuredText is too generic -- in markup terms, it's much like SGML or XML without a DTD (or with a very weak one). So while it's an OK markup, it's too weak as it stands.
- StructuredText has some markup rules which I think are wrong, in the sense that they fail the "naturalness" test which we set out to use in our alternative markup or they cause "bad" side-effects when a docstring is not fully marked up.
- Missing features (not surprising -- the aim is different)

I'll describe each of the kinds of problems in detail, but first, let's look at the rules which are followed by the *latest* version of StructuredText.py, which is available at the Zope CVS site. I include the docstring for StructuredText.py here:

Structured Text Manipulation

Parse a structured text string into a form that can be used with structured formats, like html.
Structured text is text that uses indentation and simple symbology to indicate the structure of a document.

A structured string consists of a sequence of paragraphs separated by one or more blank lines. Each paragraph has a level which is defined as the minimum indentation of the paragraph. A paragraph is a sub-paragraph of another paragraph if the other paragraph is the last preceding paragraph that has a lower level.

Special symbology is used to indicate special constructs:

- A single-line paragraph whose immediately succeeding paragraphs are lower level is treated as a header.
- A paragraph that begins with a '-', '*', or 'o' is treated as an unordered list (bullet) element.
- A paragraph that begins with a sequence of digits followed by a white-space character is treated as an ordered list element.
- A paragraph that begins with a sequence of sequences, where each sequence is a sequence of digits or a sequence of letters followed by a period, is treated as an ordered list element.
- A paragraph with a first line that contains some text, followed by some white-space and '--' is treated as a descriptive list element. The leading text is treated as the element title.
- Sub-paragraphs of a paragraph that ends in the word 'example' or the word 'examples', or '::' are treated as example code and are output as is.
- Text enclosed in single quotes (with white-space to the left of the first quote and whitespace or punctuation to the right of the second quote) is treated as example code.
- Text surrounded by '*' characters (with white-space to the left of the first '*' and whitespace or punctuation to the right of the second '*') is emphasized.
- Text surrounded by '**' characters (with white-space to the left of the first '**' and whitespace or punctuation to the right of the second '**') is made strong.
- Text surrounded by '_' underscore characters (with whitespace to the left and whitespace or punctuation to the right) is made underlined.
- Text enclosed by double quotes followed by a colon, a URL, and concluded by punctuation plus white space, *or* just white space, is treated as a hyper link. For example:

    "Zope":http://www.zope.org/ is ...

  Is interpreted as '<a href="http://www.zope.org/">Zope</a> is ....'
  Note: This works for relative as well as absolute URLs.

- Text enclosed by double quotes followed by a comma, one or more spaces, an absolute URL and concluded by punctuation plus white space, or just white space, is treated as a hyper link. For example:

    "mail me", mailto:amos@digicool.com.

  Is interpreted as '<a href="mailto:amos@digicool.com">mail me</a>.'

- Text enclosed in brackets which consists only of letters, digits, underscores and dashes is treated as hyper links within the document. For example:

    As demonstrated by Smith [12] this technique is quite effective.

  Is interpreted as '... by Smith <a href="#12">[12]</a> this ...'. Together with the next rule this allows easy coding of references or end notes.

- Text enclosed in brackets which is preceded by the start of a line, two periods and a space is treated as a named link. For example:

    .. [12] "Effective Techniques" Smith, Joe ...

  Is interpreted as '<a name="12">[12]</a> "Effective Techniques" ...'. Together with the previous rule this allows easy coding of references or end notes.

- A paragraph that has blocks of text enclosed in '||' is treated as a table. The text blocks correspond to table cells and table rows are denoted by newlines. By default the cells are center aligned. A cell can span more than one column by preceding a block of text with an equivalent number of cell separators '||'. Newlines and '|' cannot be a part of the cell text.
  For example:

    |||| **Ingredients** ||
    || *Name* || *Amount* ||
    ||Spam||10||
    ||Eggs||3||

  is interpreted as::

    <TABLE BORDER=1 CELLPADDING=2>
     <TR>
      <TD ALIGN=CENTER COLSPAN=2> <strong>Ingredients</strong> </TD>
     </TR>
     <TR>
      <TD ALIGN=CENTER COLSPAN=1> <em>Name</em> </TD>
      <TD ALIGN=CENTER COLSPAN=1> <em>Amount</em> </TD>
     </TR>
     <TR>
      <TD ALIGN=CENTER COLSPAN=1>Spam</TD>
      <TD ALIGN=CENTER COLSPAN=1>10</TD>
     </TR>
     <TR>
      <TD ALIGN=CENTER COLSPAN=1>Eggs</TD>
      <TD ALIGN=CENTER COLSPAN=1>3</TD>
     </TR>
    </TABLE>

-------

This markup has nice features:

- It's very similar to the core of what we were discussing (no big surprise, since my proposal was mostly stolen from earlier StructuredText =).
- The (new to me) mechanism for dealing with references to URLs and to other bits of text is quite elegant.

Ok, now onto the problems with the above markup rules, IMO of course.

StructuredText as XML without a DTD:

Unlike the grammar which was discussed on the doc-sig, StructuredText does not allow us to specify special syntax for special kinds of text, such as: signature blocks/tooltip descriptions, doctest.py code, and keyworded paragraphs. This is a lack of feature rather than a flaw, and could maybe be fixed by adding a postprocessor which would be Python-internal-doc specific.

StructuredText as a markup with some flaws for inline doc:

The use of single quotes to mark up inline code (as in 'x') can be surprising. Many current docstrings use 'x' to refer to the *string* containing the character x, not the variable x. In StructuredText, the quotes would disappear in the rendering. With practice, the current scheme could be used but users would have to learn to write '"x"' to have their intent carry through to the renderer. This is probably my biggest problem with StructuredText because I don't know how to fix it while maintaining compatibility. Related questions for Jim or Ken are:

1) How does StructuredText parse ''?
2) How can one have a single quote in verbatim text?
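To make the single-quote concern concrete, here is a minimal Python sketch of a renderer that follows the single-quote rule as worded in the docstring above. The pattern and the names `INLINE_CODE` and `render_inline_code` are hypothetical illustrations, not the expressions StructuredText.py actually uses, and treating the start of the text as equivalent to leading white-space is an assumption.

```python
import re

# Hypothetical approximation of the single-quote rule: white-space (or, by
# assumption, start of text) to the left of the first quote, and white-space,
# punctuation, or end of text to the right of the second quote.
# This is NOT the pattern used by StructuredText.py.
INLINE_CODE = re.compile(r"(?<!\S)'([^'\n]+)'(?=[\s.,;:!?]|$)")


def render_inline_code(text):
    """Render 'x' spans as <code>x</code>; the quotes themselves are dropped."""
    return INLINE_CODE.sub(r"<code>\1</code>", text)


# The author meant the one-character string, but the quotes vanish.
print(render_inline_code("Pass 'x' to the function."))

# The workaround mentioned above: nest double quotes inside the single quotes.
print(render_inline_code("""Pass '"x"' to the function."""))
```

Under these assumptions the first call yields `Pass <code>x</code> to the function.`, so a docstring author who meant the string loses the quotes, while the second keeps the double quotes inside the code span. The sketch leaves '' untouched, which says nothing about how StructuredText.py itself treats it.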
The tagging of underlined text with _'s is suboptimal. Underlines shouldn't be used from a typographic perspective (underlines were designed to be used in manuscripts to communicate to the typesetter that the text should be italicized -- no well-typeset book ever uses underlines), and conflict with double-underscored Python variable names (__init__ and the like), which would get truncated and underlined when that effect is not desired. Note that while *complete* markup would prevent that truncation ('__init__'), I think of docstring markups much like I think of type annotations -- they should be optional and above all do no harm. In this case the underline markup does harm.

The requirement that a paragraph end with the word example or examples or :: goes against my natural style, as I often do not want such word or punctuation before a "displayed" paragraph. Furthermore, the spec currently doesn't say how the renderer is supposed to process the :: -- is it displayed as two colons, one, or none? If the two colons are not displayed by the renderer, then my objection is diminished, although I would have preferred a markup which is local to the paragraph which is affected, not the previous one (cut and paste errors follow too easily). In some versions in the past at least, both colons were displayed. I'll leave that as an open question, as additional markup could provide an alternative which would suit me.

Missing features:

The definition of references is well-designed for referencing URLs. The docstring proposal needs to address referencing other code elements (methods of current class, other classes, other methods of other classes, builtin modules, imported modules, etc.) This is also more of a lack of feature than a real flaw, and defining the namespaces for lookup of 'reference targets' would probably go most of the way towards fixing the spec for our needs.

The DOC-SIG folks strongly wanted to be able to have list items not require blank lines in between them.
This is not hard to do from scratch (I've got code to do it). I suspect it could be added to StructuredText as a postprocessing step. (From experience, such list items should be required to be indented relative to the previous line, to avoid spurious bulletization.)

I think many folks liked the idea of "tagged" paragraphs, as in:

    Author: Guido van Rossum

and

    Release-History:
        0.1: June 1920
        0.2: July 1919

etc. Again, postprocessing smarts for this could be added while keeping backwards compatibility. Many of the features related to these tagged paragraphs (internationalization of keywords, etc.) could be dealt with in the postprocessor.

Technical points:

StructuredText.py uses the regex module, which is deprecated.

StructuredText.py currently rarely produces an exception. The output may not be what you expected, but it will produce output. For a rigorous docstring markup, I'd expect to see much stricter rules enforced, which raises technical questions (the line numbers for example are not maintained as part of the StructuredText.py parsing process, I suspect). This requirement would probably have the most impact on the current source.

Assuming this analysis is correct, I am inclined to scrap my original proposal, shelve my current prototype code, and work instead with Jim Fulton and whoever else to discuss whether and how to modify StructuredText's format and code to extend it to the needs which the DOC-SIG expressed. The value in this approach is:

- StructuredText.py (the code) exists and works. (prior art)
- There is a fair bit of code which uses StructuredText's markup. (user base)
- Much of the work would be postprocessing of StructuredText.py analysis. (modularity)
- The features of the current markup which are deemed problematic could be disabled with a runtime switch if the StructuredText community disagrees with my assessment.
  (configurability)

An alternative course of action is to take StructuredText.py and modify and rename it (aka code fork). This is of course less desirable, but would be the efficient way of dealing with either irreconcilable differences of opinion or differing needs. (It is possible that the Zope folks do not want to see their code burdened with e.g. line number maintenance code because of performance or other reasons.) We also need to know whether the Zope folks would *want* to push StructuredText.py into the Python standard library, which I have always assumed to be a goal for whatever tool we come up with.

A final alternative is for me to revise my current (unpublished) manuscript to incorporate some of the things I like about StructuredText which we didn't have (the reference naming scheme mostly), and keep a parallel track in spec and code.

Just as a point of note: I do maintain that keeping StructuredText as it currently is without postprocessing is inadequate for intelligent docstring markup, and do not consider that a solution to the problem at hand, although it's a fine everyday strategy in the absence of a solution.

I would like to discuss which approach should be followed at the DevDay session. Feel free to email or post feedback between now and the conference, whether or not you'll be there. Assuming a discussion does take place, I'll try to post a summary post-conference.

Cheers,

David Ascher
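As an illustration of the "tagged" paragraphs mentioned in the message above, here is a minimal Python sketch of the kind of postprocessing step it describes. Everything in it is an assumption made for the example: the message does not fix a tag syntax, so the rule used here (a keyword of letters and dashes followed by a colon at the start of a paragraph), the names `TAG` and `split_tagged_paragraphs`, and the sample docstring are all hypothetical.

```python
import re

# Hypothetical tag syntax: a keyword of letters and dashes, then a colon,
# at the very start of a paragraph. The proposal does not specify this.
TAG = re.compile(r"^([A-Za-z][A-Za-z-]*):[ \t]*(.*)$")


def split_tagged_paragraphs(docstring):
    """Split a docstring into (tags, body).

    tags maps each keyword to its list of value lines; body is the list of
    ordinary paragraphs, left untouched for the base markup to handle.
    """
    tags = {}
    body = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", docstring.strip()):
        lines = paragraph.splitlines()
        match = TAG.match(lines[0])
        if match:
            keyword, first_value = match.groups()
            values = [first_value] if first_value else []
            values.extend(line.strip() for line in lines[1:])
            tags[keyword] = values
        else:
            body.append(paragraph)
    return tags, body


# Made-up docstring reusing the two tagged paragraphs from the message.
doc = """Frobnicate the widget.

Author: Guido van Rossum

Release-History:
    0.1: June 1920
    0.2: July 1919
"""

tags, body = split_tagged_paragraphs(doc)
print(tags)
print(body)
```

Because the tags are peeled off before the remaining paragraphs are handed to the base markup, a step like this could in principle sit on top of an unmodified StructuredText parser. A real implementation would also need a rule for ordinary paragraphs that happen to start with a single word and a colon, which this sketch would misread as tags.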
[Mark Hammond writes back on Sat 22-Jan-00]
In preparation of a discussion at IPC8's DevDay,
Well, I think we all agree on how that went ;-)
and anticipating (perhaps mistakenly) some resistance to "yet another structured text format" from Jim Fulton, I
In the short discussion on that topic, it seemed that Jim had no great objections. So, as DevDay didn't achieve much, I think we should run with this grammar. My comments, generally ignoring what Ken already commented on:
StructuredText as XML without a DTD:
I'm happy to ignore StructuredText for out-of-line doc for now. The critical issue we have to deal with is docstrings. Did we ever determine what O'Reilly's position on XML and particular DTDs is?
The requirement that a paragraph end with the word example or examples or :: goes against my natural style, as I often do not
IMO, it's not too bad, as it is only used to introduce a single-paragraph code block (if I read the rules correctly?)
If the two colons are not displayed by the renderer, then my objection is diminished, although I would have preferred a markup which is local to the paragraph which is affected
Agreed.
Missing features:
The definition of references is well-designed for referencing URLs. The docstring proposal needs to address referencing other code elements (methods of current class, other classes, other methods of other classes, builtin modules, imported modules, etc.) This is also more of a lack of feature than a real flaw, and defining the namespaces for lookup of 'reference targets' would probably go most of the way towards fixing the spec for our needs.
Definitely agreed, and IMO quite a critical issue.

In the absence of any other complaints or comments about this proposal, I suggest that this proposal is what we run with for doc-strings. The key problem is getting tools and code to work with this revised proposal, and exactly what this means to StructuredText as used today by the Zope guys. The key feature of this proposal is that it can already give us something to work with. As mentioned a few times on DevDay, we need a blessed doc-string format so people can start writing docstrings in the knowledge that tools will follow.

So, does anyone else have any comments on David's proposal? Can we work towards a new set of "Structured Text for DocStrings" rules that people can start using for their docstrings?

This SIG has been going for ages, and has been amazingly short on results (a criticism that obviously includes myself!). It really is time to get something happening, and this is an excellent start!

Mark.
Did we ever determine what O'Reilly's position on XML and particular DTDs is?
I did ask Frank Willison (editor in chief of O'Reilly, and for those of you who don't know, a big Python fan) in the hall after the doc-sig extravaganza. He says that they think they have a working system now which works on DocBook, but that it was damn hard to set up the system. The problem was that the DocBook Book tried to tackle every corner of the spec, which was just really hard. This is from memory and was encoded when I was pretty exhausted, so I won't swear to the exactitude of the recall.

I am not expert enough on large-scale text processing to fully understand the implications of specific DTD choices, but I think that the impact of a specific DTD choice is a minor one in the absence of XML editors for the masses (i.e., Fred did the work before, and realistically, Fred's going to be doing the work in the foreseeable future at least -- Fred should choose what he wants to use). He's already done the hard part of regularizing the markup and I'm sorry I didn't get a chance to buy him a beer at IPC8.
So, does anyone else have any comments on David's proposal? Can we work towards a new set of "Structured Text for DocStrings" rules that people can start using for their docstrings?
I think I need to put up working code before folks really get a feel for it, along with a cleaned up description of the format (different strokes for different folks). Not everyone followed every thread of the last couple of months, and I apologize for not having put forth a revised proposal sooner.

I'll ping Ken to find out what the status of their internal reworking of StructuredText-the-code is. FWIW, I don't care much whether we use their code or not. I care more about having a common 'base grammar' and then specific extensions for docstrings (and they can have specific extensions for other things as they see fit).

--david

It's interesting how the percentage of time that I can allocate to Python has grown proportionally less than the number of Python-related items on my todo list =).
Mark Hammond writes:
I'm happy to ignore StructuredText for out-of-line doc for now. The critical
Structured text (as evolved by David Ascher & others in this forum) is for embedded documentation. I only intend to support one out-of-line format, which will be in SGML or XML (I'm leaning toward SGML today).
issue we have to deal with is docstrings. Did we ever determine what O'Reilly's position on XML and particular DTDs is?
Frank Willison told me that they like to see books written in some subset of DocBook, where the specific subset depends on the topic; they want the author to use semantic markup appropriate to the subject. My impression was that the *specific* subset wasn't a real concern. I don't know that this is of real concern to us. If someone wants to feed Python's standard documentation to a publisher, DocBook can be used as an output format.
This SIG has been going for ages, and has been amazingly short on results (a criticism that obviously includes myself!). It really is time to get something happening, and this is an excellent start!
Results? What's that? ;)

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
Corporation for National Research Initiatives