[Doc-SIG] How to traverse a document object

Wed, 24 Oct 2001 23:46:10 -0400

[David]
>> The document tree is meant to be a specific document/DTD/schema
>> implementation only, not a generic DOM.

[Paul]
> I'm not sure I understand what you mean here. ...
> (what do you mean by a "generic DOM"?)

DOM is a generic XML data structure. It contains an ``Element`` class (among
others), whose instances represent all elements. If you want to store a
``list`` element, it would be an ``Element`` instance whose ``tagName``
attribute was set to "list". It's not very useful from an object-oriented
programming point of view; you have to switch on the ``tagName`` attribute
instead of using polymorphism.
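
As a rough sketch of that difference, the snippet below handles the same
two-item list both ways. The ``Node``, ``ItemNode`` and ``ListNode`` classes
are made up for illustration and are not the actual dps.nodes API; only the
``xml.dom.minidom`` calls are real.

```python
from xml.dom import minidom


# Generic DOM: every element is an instance of the same Element class,
# so behaviour has to be selected by inspecting tagName.
def describe_dom(element):
    if element.tagName == "list":
        return "a list with %d item(s)" % len(element.childNodes)
    elif element.tagName == "item":
        return "a list item"
    return "an unknown element"


doc = minidom.parseString("<list><item/><item/></list>")
dom_description = describe_dom(doc.documentElement)


# Application-specific classes (hypothetical): the behaviour lives in
# each class, so no switch on a tag name is needed.
class Node:
    def __init__(self, *children):
        self.children = list(children)

    def describe(self):
        return "an unknown element"


class ItemNode(Node):
    def describe(self):
        return "a list item"


class ListNode(Node):
    def describe(self):
        return "a list with %d item(s)" % len(self.children)


tree_description = ListNode(ItemNode(), ItemNode()).describe()
```

Both descriptions come out the same; what differs is where the per-element
logic has to live.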

Another way to put it is to use XML itself as a model. A proper XML fragment
might look like this::

    <list>
        <item>
            <paragraph>
                Item one.
            </paragraph>
        </item>
    </list>

You could just as easily represent the above with a single element,
"element", in this abomination::

    <element tagName="list">
        <element tagName="item">
            <element tagName="paragraph">
                Item one.
            </element>
        </element>
    </element>

Think of each element as a class instance in the data structure, where the
tag name is equivalent to the class, and you'll see the difference. Proper
XML is to the abomination what an application-specific class library is to
DOM.

DOM *is* useful though. It *is* used precisely because it *is* generic.
You don't have to write up an application-specific class library just to
represent an arbitrary data structure. The dps.nodes classes can only
represent a DPS doc tree, nothing else. DOM can represent *any* XML
instance.

> Question: This document model isn't a "real" XML DOM (by my reading of
> your comments). So we end up reinventing technologies like DOM tree
> walkers, etc. My naive reaction is "why aren't we using the XML DOM,
> then, so we get this sort of thing for free"?

It's free, yes, but the cost is too high. It depends on how you want to
build the data structure, and what you want to do with the data structure
once it's complete. In most XML-processing applications, you parse an
already-existing XML file to a data structure, for which DOM is a valid
choice. The reStructuredText parser is *building* a document tree piecemeal,
and it's easier and more powerful to say ``node = nodes.list()`` than it is
to say ``node = minidom.Element("list")``, especially when you can customize
the ``nodes.list`` class with specialized behaviour.
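
To illustrate what "specialized behaviour" could mean, here is a minimal
sketch that builds the same one-item list piecemeal in both styles. The
custom classes and the rule that a list accepts only items are assumptions
for the example, not the real dps.nodes classes.

```python
from xml.dom import minidom

# Generic DOM: every node is created through the document factory and
# identified only by a tag-name string.
doc = minidom.Document()
dom_list = doc.createElement("list")
dom_item = doc.createElement("item")
dom_item.appendChild(doc.createTextNode("Item one."))
dom_list.appendChild(dom_item)


# Custom node classes (hypothetical): each element type is its own class
# and can carry its own rules.
class Node:
    def __init__(self, text=""):
        self.text = text
        self.children = []

    def append(self, child):
        self.children.append(child)
        return child


class ItemNode(Node):
    pass


class ListNode(Node):
    def append(self, child):
        # Specialized behaviour: a list refuses anything but items.
        if not isinstance(child, ItemNode):
            raise TypeError("a list may only contain items")
        return super().append(child)


tree_list = ListNode()
tree_list.append(ItemNode("Item one."))
```

With the generic DOM, a check like the one in ``ListNode.append`` would have
to sit outside the node, in whatever code builds the tree.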

As for processing the data structure once complete, I haven't done much yet
but I'm sure there will be advantages if it's made up of custom objects.

> (The biggest sign of a problem, I suspect, would be if the asdom() method was
> heavily used, implying that people habitually generated a "real" DOM to handle
> this thing - presumably because of failings in the DPS DOM).

(Let's call it the DPS doc tree, to avoid misunderstandings.)

Even if ``asdom()`` is called every time, it's still a win as far as I'm
concerned. It's dirt easy to turn a DPS doc tree into a DOM tree, and the
effort involved in coding the ``asdom()`` transformation has paid off many
times over in the simplicity of the doc tree creation code, like ``node =
nodes.list()``.
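
For a sense of how little such a conversion needs, the sketch below shows one
way an ``asdom()`` method could be written. It is an assumption-laden
illustration, not the actual DPS code: the class names, the ``tagname``
attribute and the convention that text is stored as plain strings are all
invented here.

```python
from xml.dom import minidom


class Node:
    # Hypothetical base class: each subclass names its own tag.
    tagname = "node"

    def __init__(self, *children):
        self.children = list(children)

    def asdom(self, document=None):
        # Recursively build a DOM element that mirrors this node.
        if document is None:
            document = minidom.Document()
        element = document.createElement(self.tagname)
        for child in self.children:
            if isinstance(child, str):
                element.appendChild(document.createTextNode(child))
            else:
                element.appendChild(child.asdom(document))
        return element


class ParagraphNode(Node):
    tagname = "paragraph"


class ItemNode(Node):
    tagname = "item"


class ListNode(Node):
    tagname = "list"


tree = ListNode(ItemNode(ParagraphNode("Item one.")))
xml_text = tree.asdom().toxml()
```

One recursive method on the base class covers every element type, so any
DOM-based tool can still be applied to the converted tree.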

> I view blockquotes as basically a way of displaying a quotation, or something
> similar, without the "..." around it.

Correct.

> As such, it would be a pretty rare thing.

I've used block quotes many times. For example, see my 2001-10-17 Doc-SIG
post, "horizontal rules & text divisions", where I used block quotes twice.

> This isn't the sort of thing I'd want to use often...

I decided to include block quotes in reStructuredText early on.
StructuredText and Setext didn't have them; they both used simple
indentation for structural purposes (sections). I believe block quotes are a
generally useful construct. I'm *always* quoting stuff. I may not be a
typical user, but then again I have an "in" with the guy who wrote the spec.
;-)

> Maybe the blockquote element in the DPS model should be redefined (or
> just better defined) to clarify the intended use.

I'm starting to write a document defining the roles of each of the DPS doc
tree elements, independently of the markup syntax. I've just barely begun.
It's available at http://docstring.sourceforge.net/spec/doctree.txt.

> I can see a number of possibilities:
> 
> - It is intended for block quotations, and so the HTML <blockquote> and
>   LaTeX quote environments are appropriate. I can't see this form
>   getting much use.

I can. And that is the intended role.

> - It is for general text which is indented on both left and right. ...
> - It is for text indented on the left only. ...

These are presentation issues, not descriptive markup ones.

> One thing that has already become clear to me is that it will be *far*
> easier to write output processors for "structural" markup languages
> (HTML, (La)TeX, DocBook, Texinfo, etc) than for "layout" oriented
> languages (PDF, PostScript, etc). Sufficiently so that I doubt it will
> ever be realistic to go direct to such formats - you'd have to implement
> line and page breaking algorithms, etc, etc.

If we have a TeX Writer, PostScript and PDF are almost free. Can't wait!

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net