[Doc-SIG] DPS DTDs

Thu, 13 Sep 2001 01:00:52 -0400

Tony J Ibbs (Tibs) wrote:
> I'll try to explain my point of view, since I'm not sure I
> see yours(!).

Thanks, and 'back atcha'. Writing this intro after writing the bulk
below, I think we may simply be looking at this stuff from different
angles, seeing different silhouettes of the same thing.

Distilled, what I'm saying is:

    I see a fundamental difference between an object representing 'a
    module' and an object representing 'a module's documentation'. The
    trees of the different types of objects may resemble each other in
    shape at first, but the nature of the nodes is very different.

    The tree resulting from the analysis of Python source (the 'parse
    tree') is specific to the 'Python source' input mode of the DPS,
    and will not be seen outside of this context. Therefore there's no
    reason to codify the schema in the generic DTD; outside of the
    PySource input mode, it isn't useful. This parse tree's schema
    certainly should be documented, possibly as a DTD, but separately
    from the document tree DTD.

Please relax and enjoy this message, safe in the knowledge that it's
just idle discussion. Please don't let me stop you from doing your
thing in your own way. I'm sure it will be useful no matter how things
end up.

> To me it doesn't make sense to have an artificial boundary at any
> level - so a docstring that is not parsed would be <literal> text
> (is that the right tag?),

'literal_block' actually. 'literal' is the inline element.

If you represent docstrings this way, how will you distinguish real
literal_blocks from unparsed raw docstrings?

> Now, in ``dps/specs`` you have ``ppdi.dtd``, which certainly
> *looks* to me as if it is doing the same job as I am doing with my
> <py_xxx> tags - that is, extending the DPS nodes tree "outwards"
> from the docstring into the Python code.

ppdi.dtd is not meant to extend the DPS nodes tree outwards into the
Python code, but to provide specialized elements useful for
*documenting* Python code. It's a subtle distinction but important
IMO. Let's take a simple example::

    # module 'example.py'
    a = 1
    """alphas"""
    b = 2
    """betas"""
    def f(n):
        """Return the f of `n`."""
        return some_expression_involving(n)
    class A:
        """A classy class."""
        def __init__(self):
            """Set up instance attributes."""
            self.count = 0
            """Keep track of things."""

The parse tree might end up looking like this (using indentation to
show structure)::

    [module]
        [name 'example']
        [attribute]
            [name 'a']
            [value 1]
            [docstring """alphas"""]
        [attribute]
            [name 'b']
            [value 2]
            [docstring """betas"""]
        [function]
            [name 'f']
            [parameters]
                [parameter]
                    [name 'n']
            [docstring """Return the f of `n`."""]
        [class]
            [name 'A']
            [docstring """A classy class."""]
            [method]
                [name '__init__']
                [parameters]
                    [self-parameter]
                        [name 'self']
                [docstring """Set up instance attributes."""]
                [attribute]
                    [name 'count']
                    [value 0]
                    [docstring """Keep track of things."""]
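
A minimal sketch of how such a parse tree could be collected, assuming
the standard-library ``ast`` module as a stand-in for whatever the
PySource input mode really uses. The nested-list layout and the
``parse_tree`` name are illustrative only, and only top-level
attributes, functions and classes are handled:

```python
import ast

SOURCE = '''
a = 1
"""alphas"""
def f(n):
    """Return the f of `n`."""
    return n
class A:
    """A classy class."""
'''

def parse_tree(source, name):
    """Return a nested-list parse tree: [kind, child, ...] (hypothetical layout)."""
    body = ast.parse(source).body
    tree = ["module", ["name", name]]
    for i, node in enumerate(body):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            entry = ["attribute",
                     ["name", node.targets[0].id],
                     ["value", ast.literal_eval(node.value)]]
            # Treat a bare string literal right after the assignment
            # as that attribute's docstring.
            following = body[i + 1] if i + 1 < len(body) else None
            if (isinstance(following, ast.Expr)
                    and isinstance(following.value, ast.Constant)
                    and isinstance(following.value.value, str)):
                entry.append(["docstring", following.value.value])
            tree.append(entry)
        elif isinstance(node, ast.FunctionDef):
            parameters = ["parameters"]
            parameters += [["parameter", ["name", arg.arg]] for arg in node.args.args]
            tree.append(["function", ["name", node.name], parameters,
                         ["docstring", ast.get_docstring(node)]])
        elif isinstance(node, ast.ClassDef):
            tree.append(["class", ["name", node.name],
                         ["docstring", ast.get_docstring(node)]])
    return tree
```

Calling ``parse_tree(SOURCE, "example")`` yields a list whose first
entries are ``"module"`` and ``["name", "example"]``, followed by one
entry per attribute, function and class.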

This parse tree gets transformed into the following document tree
(again using indentation, so we can omit many end-tags)::

    <document>
        <title>Module <module>example.py</>
        <section>
            <title>Module Attributes
            <module_attribute_section>
                <module_attribute>a
                <initial_value>1
                <paragraph>alphas
            <module_attribute_section>
                <module_attribute>b
                <initial_value>2
                <paragraph>betas
        <section>
            <title>Functions
            <function_section>
                <function>f
                <parameter_list>
                    <parameter_item>
                        <parameter>n
                <paragraph>Return the f of <parameter>n</>.
    etc.

None of the parse tree objects survive intact to the document tree.
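
To make that concrete, here is a rough sketch of the transformation for
module attributes only. It assumes both trees are plain nested lists of
the form ``[tag, child, ...]``; the function names are hypothetical and
nothing here is the real dps.nodes API:

```python
def attribute_to_doc(attribute):
    """Turn one ['attribute', ...] parse node into a doc-tree element."""
    fields = {child[0]: child[1] for child in attribute[1:]}
    return ["module_attribute_section",
            ["module_attribute", fields["name"]],
            ["initial_value", repr(fields["value"])],
            ["paragraph", fields.get("docstring", "")]]

def module_to_doc(module):
    """Build a new document tree from a ['module', ...] parse tree."""
    name = module[1][1]
    section = ["section", ["title", "Module Attributes"]]
    for node in module[2:]:
        if node[0] == "attribute":
            section.append(attribute_to_doc(node))
    return ["document", ["title", "Module ", ["module", name + ".py"]], section]

parse = ["module", ["name", "example"],
         ["attribute", ["name", "a"], ["value", 1], ["docstring", "alphas"]],
         ["attribute", ["name", "b"], ["value", 2], ["docstring", "betas"]]]
doc = module_to_doc(parse)
```

Every node in ``doc`` is freshly built; the parse nodes only supply the
names, values and docstring text.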

The parse tree objects allow us to group together the appropriate
docstrings, and give us further Python-specific information. That
information is then transformed into a DPS nodes tree. If you think
of the original docstrings on the parse tree as 'fruit', then the
collation process is like the fruit growing into trees of their own,
getting nutrients (stuff like attribute names and default values)
from the 'roots'. Think of the roots as the parse tree upside-down.
The trunk of the doc tree meets the top of the parse tree; the
parse tree nourishes and generates the doc tree.

Kinda cool analogy!

(The tree above is just my preliminary idea of what the final DPS
tree should look like for a Python module. For instance, the
'<section><title>Module <module>xxx' could easily become
'<module_section><module>xxx'. In the end, these specialized elements
may disappear, leaving generic sections and titles in their wake.)

(Hmm. Since the .pformat() of DPS trees uses indentation also, we
could omit the end-tags. Would shorten the test data considerably, and
reduce confusion with XML, which is good. I like this. Implementing
it... now.)
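
A sketch of what an indentation-only pretty-printer might look like,
assuming nodes are nested lists of the form ``[tag, child, ...]`` with
plain strings as text. This is not the actual ``.pformat()`` in the
DPS, just an illustration of the idea:

```python
def pformat(node, indent="    ", level=0):
    """Render a nested-list tree using indentation instead of end-tags."""
    if isinstance(node, str):
        return indent * level + node + "\n"
    tag, *children = node
    text = indent * level + "<" + tag + ">\n"
    for child in children:
        text += pformat(child, indent, level + 1)
    return text

example = ["section",
           ["title", "Functions"],
           ["paragraph", "Return the f of ", ["parameter", "n"], "."]]
output = pformat(example)
```

Each start-tag and each text run gets its own line, so the nesting
depth alone shows where an element ends.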

> My point was simply that I am not, particularly, following that DTD -
> but I obviously (trivially, since I can output XML!) am following
> *some* (virtual) DTD. And it would be nice to write that down (be it
> as DTD or XML Schema or whatever) at some point.

Any tree-shaped data structure (among others) can be represented in
XML and therefore be described by a DTD.

Sure, write it down, even as a DTD if you like, but I don't see it
going into the existing DTD in dps/spec/, because it's not general
enough. It's internal documentation for the pysource mode. *That's*
the point I was trying to make that started this discussion.

Perhaps it's just a question of degree. I'm seeing the tree closer to
the final generic document representation, you're seeing it closer to
the original parse tree. Sound about right?

> (hmm - and I just realised *why* - if the two components (inner and
> outer, for want of better term) are *discontinuous* in structure,
> then it makes it harder to write a Formatter/Writer - it would need
> to know about the Python bits and the docstring bits independently

I don't think the output formatter should ever see any evidence of the
parse tree. (I must explain that I'm seriously considering a fourth
component, the 'style' for lack of a better term, that takes the
output of the input mode and parser and transforms it into the final
doc tree. The input mode and output style may require more than what
dps.nodes provides. The output styles for an input mode may be so
tightly coupled as to be specific to that input mode.) By the time
the doc tree gets to the formatter, it's a simple 'take this
dps.nodes doc tree structure and change it to your native format'. No
serious transformations involved.

Not having even *begun* to implement any of this, I don't know if
this idea is reasonable or feasible.
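
Purely as a thought experiment, the division of labour could be
sketched like this. All three function names are made up, the data is
nested lists rather than dps.nodes objects, and the input mode is
reduced to a ready-made list of (kind, name, docstring) triples:

```python
def parse_docstring(text):
    """Parser: turn raw docstring text into generic body elements."""
    blocks = [block.strip() for block in text.split("\n\n")]
    return [["paragraph", block] for block in blocks if block]

def apply_style(items):
    """Style: assemble the input mode's output into the final doc tree."""
    document = ["document"]
    for kind, name, text in items:
        title = ["title", kind + " " + name]
        document.append(["section", title] + parse_docstring(text))
    return document

def format_text(node):
    """Formatter: a plain walk over the doc tree; it knows nothing of Python."""
    tag, *children = node
    if tag == "title":
        return children[0] + "\n" + "-" * len(children[0]) + "\n"
    if tag == "paragraph":
        return children[0] + "\n"
    return "".join(format_text(child) for child in children)

items = [("Function", "f", "Return the f of n.")]
doc = apply_style(items)
text = format_text(doc)
```

Only ``apply_style`` sees anything Python-specific; ``format_text``
handles nothing but generic titles, sections and paragraphs.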

> > The two types of tree represent fundamentally different
> > information.
> 
> I see we disagree - it's all document (erm - serialisation of
> information).

Yeah, but serialisation of Python code vs. serialisation of the
*documentation* of Python code.

> I really think we might be talking past each other,

Probably :-)

> because what I'm doing is so simple and obvious that I find it
> hard to call it hypergeneralisation - I'm not losing anything, and
> I'm gaining quite a lot.
>
> I'm using the compiler parse tree to hold the parse tree, and
> generating documentation (as part of a DPS node tree) from it. That's
> obvious to me. Are we just confusing each other with words?

Could be!

> > In dps/spec/ppdi.dtd you'll see the "Additional Structural
> > Elements"...
> 
> They are clearly a start on holding the information one needs
> to report on in a document. They didn't do enough for me, which is
> why I'm not using them.

Fair enough. It seems to me you're representing an intermediate stage
between the parse tree and the final document tree. I just don't see
the need.

> And you had some, erm, interesting function definitions.

Oh, I see what you mean, ones like this? ::

    def standalone_uri(self, text, lineno, pattern=inline.patterns.uri,
                       whole=inline.groups.uri.whole,
                       email=inline.groups.uri.email):

I was using the 'Stuff' class to hierarchically group related
constants without polluting the namespace. This 'Stuff'
dotted-attribute collection idiom is useful and, I think, successful.
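
For readers who haven't seen the idiom, a minimal sketch follows. The
``Stuff`` class here is a guess at the general shape, not the class
from the DPS source, and the regular expression and group names are
invented for the example:

```python
import re

class Stuff:
    """Bare attribute container: Stuff(a=1).a == 1 (illustrative only)."""
    def __init__(self, **keywordargs):
        self.__dict__.update(keywordargs)

# Related constants grouped hierarchically under a single module-level name.
inline = Stuff(
    patterns=Stuff(
        uri=re.compile(r"(?P<whole>[a-z]+://\S+|(?P<email>\S+@\S+))")),
    groups=Stuff(
        uri=Stuff(whole="whole", email="email")),
)

def standalone_uri(text, pattern=inline.patterns.uri,
                   whole=inline.groups.uri.whole,
                   email=inline.groups.uri.email):
    """Return (kind, matched text) for the first URI in `text`, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    kind = "email" if match.group(email) else "uri"
    return kind, match.group(whole)
```

Binding the dotted lookups as default arguments turns them into local
names inside the function, which is one plausible reason for the long
signature.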

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net