Last week I promised on the Python list to describe the current status of the conversion to SGML/XML. Here it is!

I'm currently able to parse all the LaTeX markup and generate either XML or SGML. The structure of the output is very similar to the input structure, but a number of minor improvements are made. The improvements are mostly very localized and have more to do with keeping the markup from getting too complex and disjointed, and eliminating some bogosities.

I am not at all decided on a DTD to use. I see three options:

1. DocBook -- this has been developed and heavily use-tested by a number of corporate users, and is known to be good for technical documentation. There are tools and stylesheets available to convert from DocBook to HTML and printed formats. We'd probably need to specialize it, but it's designed for that. Konrad Hinsen has already developed one customization that he's using to document Python modules, and there's an initiative to create a common extension for documenting OO constructs. I've asked Konrad for some sample documentation so I can see how it actually works out. My concern with DocBook is that the markup may be a bit on the "heavy" side; I don't want the document source to be so markup-heavy that I'm the only one to work on them.

2. Create something similar to what we had in LaTeX, but with fewer warts. This is appealing because the conversion would be done sooner. However, new stylesheets would be needed, slowing down the usefulness of the result. It would also be the easiest to adopt for people already familiar with the current markup.

3. Create something entirely new and specific to Python. Clearly, this offers a lot of work to all the volunteers. We'd need requirements analysis, DTD design, stylesheets, and probably lots of things I haven't thought of. However, it also means we can limit the weight of the markup in the source, which might be a major advantage in getting people to use it. But *everyone* would have to learn it (well, everyone that writes documentation at any rate). This offers a great deal of opportunity to "get it right" for Python, but also a lot of rope. (You know what rope is used for, right?)

I'd like to see some discussion on what should be done and what needs to be done. From where I sit, the most important thing is to make sure we can maintain a high level of semantic markup (hopefully making further improvements over what we already have), with generation of hypertext (HTML, info, whatever) being the next most important thing. Typeset documents are a requirement, but aren't as high up the list.

I'm not terribly concerned about how XML/SGML-->foo conversion processes are implemented, with the caveat being that I need to be able to understand them without a massive learning curve. Clearly, Python code is a major option for tools (surprised?), but I can easily deal with using Java tools (with or without JPython), DSSSL processors (just don't expect me to maintain Jade/OpenJade!), XSL, CSS, and whatnot. I'd like to get away from having any Perl scripts involved, not because I think Perl is Evil, but because I'm not a Perl hacker. (Don't get me wrong; I make no claim that Perl is not Evil! ;)

Comments, suggestions, volunteers?

-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
On Thu, Aug 26, 1999 at 05:14:29PM -0400, Fred L. Drake, Jr. wrote:
Last week I promised on the Python list to describe the current status of the conversion to SGML/XML. Here it is!
I'm currently able to parse all the LaTeX markup and generate either XML or SGML. The structure of the output is very similar to the input structure, but a number of minor improvements are made. The improvements are mostly very localized and have more to do with keeping the markup from getting too complex and disjointed, and eliminating some bogosities.
Excellent!
I am not at all decided on a DTD to use. I see three options:
1. DocBook -- this has been developed and heavily use-tested by a number of corporate users, and is known to be good for technical documentation. There are tools and stylesheets available to convert from DocBook to HTML and printed formats. We'd probably need to specialize it, but it's designed for that. Konrad Hinsen has already developed one customization that he's using to document Python modules, and there's an initiative to create a common extension for documenting OO constructs. I've asked Konrad for some sample documentation so I can see how it actually works out. My concern with DocBook is that the markup may be a bit on the "heavy" side; I don't want the document source to be so markup-heavy that I'm the only one to work on them.
I personally am not a fan of this, since it seems like it could limit the contributors to those willing to learn DocBook, which, at a glance, looks much more complicated than learning a standard way to produce Python docs.
2. Create something similar to what we had in LaTeX, but with fewer warts. This is appealing because the conversion would be done sooner. However, new stylesheets would be needed, slowing down the usefulness of the result. It would also be the easiest to adopt for people already familiar with the current markup.
This sounds appealing. [...]
I'd like to see some discussion on what should be done and what needs to be done. From where I sit, the most important thing is to make sure we can maintain a high level of semantic markup (hopefully making further improvements over what we already have), with generation of hypertext (HTML, info, whatever) being the next most important thing. Typeset documents are a requirement, but aren't as high up the list.
From my perspective, what's most important is a *simple*, well-documented and authoritative documentation markup. The more people who can easily produce docs for new code, the more documentation there will be, and a standard would facilitate sharing more documentation in everyone's favorite formats. With some kind of flexible-but-not-too-complex DTD, I'd probably work on producing Python docs in all the formats that I'd like to see, such as vim tags and man pages (not that I liked the recent rant about the latter on c.l.p, but I would like and use and produce or help produce these formats if the DTD structure is simple and the authoritative text easy to parse).

scott
I'm currently able to parse all the LaTeX markup and generate either XML or SGML. The structure of the output is very similar to
Excellent!
I am not at all decided on a DTD to use. I see three options:
I'm pretty ignorant about this. I did look into the DocBook DTD, and it is indeed complex. However, it also appears that much of it is optional.

It seems that if a reasonable subset of the DocBook features could be used and documented for use in Python it would be simpler to use, and save reinventing the wheel.

A big benefit of DocBook that I see is that it may be possible to get professional printers to print hard-copies. However, our own custom DTD would also be fine, as in reality it is only the Python community that will use it, and also provide the tools.

An interesting possibility is to use the new PDF routines developed by Andy et al. In conjunction with the XML tools, I believe it would be fairly simple to generate a very pretty PDF version of the docs - which would be very cool.

Further, a standard DTD would definitely encourage me (and hopefully others) to use it. I have been thinking about this for some time, and I feel confident that I could massage my documentation tools to generate output for whatever DTD we decide on. This would provide advantages to all users, as a single suite of tools could be used to provide consistent and professional documentation for many extensions...

I just realised I haven't said much at all - really just offering encouragement that this is great news, definitely the right direction, and I will definitely utilize this for my own stuff.

Mark.
Mark Hammond <mhammond@skippinet.com.au> wrote:
I am not at all decided on a DTD to use. I see three options:
I'm pretty ignorant about this. I did look into the DocBook DTD, and it is indeed complex. However, it also appears that much of it is optional.
It seems that if a reasonable subset of the DocBook features could be used and documented for use in Python it would be simpler to use, and save reinventing the wheel.
me too! </F>
On Thu, 26 Aug 1999, Fred L. Drake, Jr. wrote:
I am not at all decided on a DTD to use. I see three options: <DocBook, isomorphic to current LaTeX or a new one>
I want to suggest a thesis that the markup used (Language + XML DTD/LaTeX style) has little effect on the ease of writing, as long as

a. There are few optional features
b. There are good templates ready

Personally, when I started to write Python docs, I knew LaTeX but not the specific Python style. I started from the templates, and looked for similar things in other docs. My LaTeX knowledge confused me, actually: I used math too heavily for the HTML conversion to work well.

This shows that DocBook is a bad idea /because/ people know it, and would have /too much/ freedom for any hope of uniformity. I vote for a roll-our-own style. As soon as we can get a conversion ready, there will be plenty of templates ready. More, a roll-our-own, as opposed to the LaTeX style, could reflect the structure of a Python source file more easily (for example, not separating the __init__ method from the rest of the methods, and putting the generic class description in it). This is also a bit of an egoism, since it would make the vapourware PythonML->POD easier.

Just my 0.02$
--
Moshe Zadka <mzadka@geocities.com>.
INTERNET: Learn what you know.
Share what you don't.
... offering encouragement that this is great news, definitely the right direction, and I will definitely utilize this for my own stuff.
Three Cheers for Fred !

As a matter of clarification ... supposing we take any one of the three options (DocBook, inertia, revolution), they'll all be parsed down to XML, so: to what extent can we rely on being able to generate DocBook *from* the XML we've produced using either of the other options ? After all, XML to XML translations are supposed to all be natural and easy. If that'll be easy, we can get all of the benefits of DocBook out of either of the others.

Me, I'm naturally predisposed towards interesting times, so I go for revolution every time: and I trust this community to produce a nice simple way of marking up text that I'll be happy to use as soon as I have some documentation to write.

KISS,
Eddy.
--
Keep It Straightforward, Simpleton.
Edward Welbourne writes:
As a matter of clarification ... supposing we take any one of the three options (DocBook, inertia, revolution), they'll all be parsed down to XML, so: to what extent can we rely on being able to generate DocBook *from* the XML we've produced using either of the other options ? After all, XML to XML translations are supposed to all be natural and easy.
Edward, I don't see any technical limitations, but I'd be very wary of assuming it would be "easy" or that *I* would implement the transformations. Writing good XSL won't be any easier than writing good LaTeX styles, and you'd have to write it in XML to boot. (Ever played with XSL? It's powerful... but tedious.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives
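To make the question concrete, here is a minimal sketch of the simplest kind of XML-to-XML translation: renaming elements one for one. The source element names (`p`, `code`, `em`) are hypothetical and stand in for whatever a Python-specific DTD might use; only the DocBook names on the right-hand side (`para`, `literal`, `emphasis`) are real. It is written with the standard `xml.etree.ElementTree` module rather than XSL, and it deliberately ignores the harder cases (restructuring, splitting or merging elements), which is where Fred's caution applies.

```python
import xml.etree.ElementTree as ET

# Hypothetical source names on the left; DocBook element names on the right.
TAG_MAP = {
    "p": "para",
    "code": "literal",
    "em": "emphasis",
}


def to_docbook(element):
    """Rename every mapped element in place and return the same tree."""
    for node in element.iter():
        # Unmapped tags are left untouched so nothing is silently dropped.
        node.tag = TAG_MAP.get(node.tag, node.tag)
    return element


source = ET.fromstring("<p>Call <code>foo()</code> <em>once</em>.</p>")
print(ET.tostring(to_docbook(source), encoding="unicode"))
# <para>Call <literal>foo()</literal> <emphasis>once</emphasis>.</para>
```

A pure rename like this only works while the two vocabularies have the same shape; as soon as one side nests things differently, the translation stops being a lookup table.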
Moshe> This shows that DocBook is a bad idea /because/ people know it,
Moshe> and would have /too much/ freedom for any hope of uniformity.

Moshe> I vote for a roll-our-own style.

Well, how about we roll our own and it just happens to be a strict subset of DocBook? You document it as the Pythonic Way To Write Documentation, and buried deep in some appendix it says (in six-point font):

    This DTD is a subset of DocBook.

That said, can you just start whacking useless appendages off of DocBook?

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/~skip/
847-971-7098   | Python: Programming the way Guido indented...
On Fri, 27 Aug 1999, Skip Montanaro wrote:
Moshe> I vote for a roll-our-own style.
Well, how about we roll our own and it just happens to be a strict subset of DocBook? You document it as the Pythonic Way To Write Documentation, and buried deep in some appendix it says (in six-point font):
This DTD is a subset of DocBook.
Not good, for exactly the reason I outlined earlier: some bozo will try to write DocBook, just like this bozo tried to write LaTeX.

We'll need to extend DocBook anyway, for primitives like <class>, <function>, etc. Personally, I do not want anything like <chapter>, <section>, or any such cruft in /library docs/ (obviously, these are needed for other kinds of docs: more on this later). So, the only thing we will be left with from DocBook will be things like (don't know the exact names, guessing...) <emph>, etc.

So, it's better to roll our own, stealing from DocBook whatever we can. Thus, we get (as much as possible) easy conversion for both old Python-doccers, and old DocBook-heads.

That said, I think we need a completely different system for the rest of the docs:

1. The tutorial is simply a book about Python, and as such should be written like every other technical book. Moreover, Guido is (currently) the sole maintainer so he has last say.

2. The extending/embedding manual is similar.

3. The Python/C API needs a much better solution anyway: while the basic API is good, the documentation is pretty horrible. I do think we might need a specific XML DTD for that document, but once again, Guido has final say because he'll (probably) be writing it.

However, most module documentations will /not/ be written by Guido. In fact, the main goal should be that a module and the documentation are written by the same guy at the same time. Hence, the tool to write the library reference has the following design goals:

1. Low barrier for entry: not higher than for writing Python modules.

2. Tools to help with it: syntax checkers, and maybe even creators. I dream of a program which will turn the following code

    class MyClass:
        def __init__(self, n):
            self.n = n
        def foo(self):
            print(self.n)

   Into the following document

    <class name="MyClass">
    XXX Describe class here!
    <methods>
    <method name="__init__">
    <arguments>
    <argument name="n"> XXX Insert description here!
    </arguments>
    XXX Insert description here!
    </method>
    .
    .
    </class>

3. A formidable array of 2XXX convertors: 2html, 2txt, 2man, 2windowshelp, 2info, 2docbook<0.5 wink>

I think a new Pythonic-one-way-to-do-it minimalistic DTD is the way to go.
That said, can you just start whacking useless appendages off of DocBook?
<hang-head-in-shame> Where can I get the DTD? Only heard about it, never saw it... </hang-head-in-shame> -- Moshe Zadka <mzadka@geocities.com>. INTERNET: Learn what you know. Share what you don't.
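The skeleton generator Moshe dreams of in this message can be sketched roughly as follows. This is an illustration only, assuming the element names from his example (`class`, `methods`, `method`, `arguments`, `argument`); no such DTD has been agreed on in this thread. It uses the modern standard-library modules `inspect` and `xml.etree.ElementTree`, and unlike the hand-written example it emits well-formed XML with every element closed.

```python
import inspect
import xml.etree.ElementTree as ET

PLACEHOLDER = "XXX Insert description here!"


def skeleton_for_class(cls):
    """Build a <class> element with one <method> per function defined on cls."""
    root = ET.Element("class", name=cls.__name__)
    root.text = "XXX Describe class here!"
    methods = ET.SubElement(root, "methods")
    # vars(cls) only sees names defined directly on cls, in definition order.
    for name, func in vars(cls).items():
        if not inspect.isfunction(func):
            continue
        method = ET.SubElement(methods, "method", name=name)
        arguments = ET.SubElement(method, "arguments")
        params = list(inspect.signature(func).parameters)
        for param in params[1:]:  # skip the leading "self"
            argument = ET.SubElement(arguments, "argument", name=param)
            argument.text = PLACEHOLDER
        # Text after </arguments> is the placeholder for the method itself.
        arguments.tail = PLACEHOLDER
    return root


class MyClass:
    def __init__(self, n):
        self.n = n

    def foo(self):
        print(self.n)


print(ET.tostring(skeleton_for_class(MyClass), encoding="unicode"))
```

The author then only has to replace the XXX placeholders, which keeps the barrier for entry close to that of writing the module itself.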
Moshe Zadka writes:
Not good, for exactly the reason I outlined earlier: some bozo will try to write DocBook, just like this bozo tried to write LaTeX.
It's hard to predict what's needed for good documentation; I am *not* of a mind to avoid having support for very general documentation constructs. We want to have a single DTD to keep the learning curve and tool support under control, so we can't really be too stingy in designing the markup.
We'll need to extend DocBook anyway, for primitives like <class>, <function>, etc. Personally, I do not want anything like <chapter>, <section>, or any such cruft in /library docs/ (obviously, these are
There's more in the library documentation than module sections; this even gets me in trouble sometimes. But it is *very* important to keep in mind that library documentation can and should contain much more than basic reference material.
So, it's better to roll our own, stealing from DocBook whatever we can. Thus, we get (as much as possible) easy conversion for both old Python-doccers, and old DocBook-heads.
That said, I think we need a completely different system for the rest of the docs:
1. The tutorial is simply a book about Python, and as such should be written like every other technical book. Moreover, Guido is (currently) the sole maintainer so he has last say.
Guido has the last say about everything he does, of course. On the other hand, he's not the only person who maintains the documentation. He's certainly not the one who does the most of the work on it. This makes it sound like a DocBook project.
2. The extending/embedding manual is similar.
DocBook, with appropriate OO extensions, would be a very good match for the extending & embedding manual as well.
3. The Python/C API needs a much better solution anyway: while the basic API is good, the documentation is pretty horrible. I do think we might need a specific XML DTD for that document, but once again, Guido has final say because he'll (probably) be writing it.
Guido is the author of the original version of the document, but he is not the maintainer. That seems to be my job (which I consider a good thing ;).

This is very much a kind of document that DocBook was designed to handle. The OO support needs to be present, but that should be doable as a normal DocBook extension.

The organizational and completeness problems with the API reference are orthogonal to the DTD issue; we just haven't had the time. I try to add to and enhance the document as specific questions come up, but can't seem to find enough time. (Things should get better once the conversion is done, but not by a whole lot!)
However, most module documentations will /not/ be written by Guido. In fact, the main goal should be that a module and the documentation are written by the same guy at the same time. Hence, the tool to write the library reference has the following design goals:
Yes; this is one of the two most important issues. The other (which is somewhat at odds with this) is that whatever DTD we select be usable for very high grade documentation that's much more elaborate than basic module documentation.
1. Low barrier for entry: not higher than for writing Python modules.
This is unattainable. The biggest barriers to entry for documentation writing are motivation and natural language. Few people are really good with their own native language, esp. in its written form. Explaining things to others through the written word is very difficult. Python is much easier to learn!
2. Tools to help with it: syntax checkers, and maybe even creators. I dream of a program which will turn the following code
This is relatively easy once you have a format, and I fully intend to do something like this. Konrad Hinsen has done some work with Daniel Larson's pythondoc to generate DocBook with his own Python extension; I'm sure something similar could be done with whatever form we choose.
3. A formidable array of 2XXX convertors: 2html, 2txt, 2man, 2windowshelp, 2info, 2docbook<0.5 wink>
Yes. Again, this is relatively easy. I'd like to point out that to make a switch, the only output we need to care about is HTML. All others will follow as they are needed, so a handful should be available quickly. To call the XML version of the documentation the reference version, I will require an HTML conversion and one typeset version.
I think a new Pythonic-one-way-to-do-it minimalistic DTD is the way to go.
A DTD that's too minimal will not be strong enough for writing the documentation. A good DTD that's workable for all the documents is my personal requirement: only one DTD. More than one increases the learning curve for all authors and maintainers. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives
On Fri, 27 Aug 1999, Fred L. Drake, Jr. wrote:
It's hard to predict what's needed for good documentation; I am *not* of a mind to avoid having support for very general documentation constructs. We want to have a single DTD to keep the learning curve and tool support under control, so we can't really be too stingy in designing the markup.
I don't think I agree here. Look at POD: it's a wonderful form of documentation for the CPAN modules, and it's a very minimalistic markup. (Of course, it will never be the case that all modules have high quality documentation: we can only solve the technical problems.)
There's more in the library documentation than module sections; this even gets me in trouble sometimes. But it is *very* important to keep in mind that library documentation can and should contain much more than basic reference material.
OK, you're right here. So let me put my original point: A /module/ documentation should have a simple DTD...etc. The library reference should have a more general DTD, which will include a <module> element. Thus we get the best of all worlds: a general documentation format of the sections of the library reference, and a simple format for the every day module documentation. Of course, we'd need <example> and <walk-through> elements, but we'll get to that when we design the DTD.
1. The tutorial is simply a book about Python, and as such should be written like every other technical book. Moreover, Guido is (currently) the sole maintainer so he has last say.
Guido has the last say about everything he does, of course. On the other hand, he's not the only person who maintains the documentation. He's certainly not the one who does the most of the work on it. This makes it sound like a DocBook project.
Of course, that should be re.sub("Guido", "Guido/Fred"), but this doesn't detract from my point: while many people will offer suggestions, having a central maintainer (and a high barrier for entry) is /not/ a bottleneck. I have no problem with that being a DocBook project, especially as I will never have to write anything there<0.5 wink>. Ditto for the next 2 [about the bad Python/C API docs]
The organizational and completeness problems with the API reference are orthogonal to the DTD issue;
Of course: they would have been better as text<wink>. But I do think that part of the problem is organizational, so it's not completely orthogonal to the DTD issue.
we just haven't had the time. I try to add to and enhance the document as specific questions come up, but can't seem to find enough time. (Things should get better once the conversion is done, but not by a whole lot!)
We all appreciate your work, of course.
1. Low barrier for entry: not higher than for writing Python modules.
This is unattainable. The biggest barriers to entry for documentation writing are motivation and natural language. Few people are really good with their own native language, esp. in its written form. Explaining things to others through the written word is very difficult. Python is much easier to learn!
Well, you totally missed me here: of course we can't teach people English nor good style. What I meant was that learning the DTD should be no harder than Python, so just in case we have a super-Python programmer who also won the Pulitzer but doesn't have time to read a 500 page "How to Document in Python for Idiots", he'll be able to write documentation for his modules.
... To call the XML version of the documentation the reference version, I will require an HTML conversion and one typeset version.
Which will probably mean 2python-latex, because that's the easiest way.
A DTD that's too minimal will not be strong enough for writing the documentation. A good DTD that's workable for all the documents is my personal requirement: only one DTD. More than one increases the learning curve for all authors and maintainers.
I disagree: 90% of the authors will only write library reference for their (or other people's) modules. We need, first and foremost, to cater to them. And besides, I do not believe that a single DTD which "does everything" is better than a small set of synchronised DTDs (I definitely do /not/ want to remember that emphasis is <em> in the module docs and <emph> in the tutorial DTD, but we can easily keep that from happening).
--
Moshe Zadka <mzadka@geocities.com>.
INTERNET: Learn what you know.
Share what you don't.
[Fred Drake]
A DTD that's too minimal will not be strong enough for writing the documentation. A good DTD that's workable for all the documents is my personal requirement: only one DTD. More than one increases the learning curve for all authors and maintainers.
I want to apologise in advance because I have not got the time right now to fully justify what I am about to say. Please forgive this transgression -- I will find time and post a justification as soon as I can!

DocBook is not the answer. If anything, DocBook is the question.

I am a strong believer in micro-document SGML/XML architectures. I believe a micro-document approach better suits the Python doc project. It has advantages on many fronts - authoring, production, maintenance, content re-use.

Here is what I suggest:

We need N *small* DTDs where N is the number of different *types* of information that make up the Python docs, e.g. ModuleDoc, HowToDoc and so on. Each one of these is an "information object" and parses to the DTD for that class of object.

Using a simple "collection" DTD, information objects are assembled into hierarchical structures for management and publishing purposes:-

    <collection>
      <level>
        <title>Library</title>
        <level>
          <title>String Services</title>
          <object uri="xyx"/>
          <object uri="abc"/>
          <object uri="def"/>
        </level>
      </level>
    </collection>

Bottom line: One big DTD is not the way to go in my opinion. We need N tiny DTDs - one for each class of information. We then use a simple assembly DTD such as above to gather together information objects for publishing purposes.

I cannot close without pointing out that this micro-document architecture approach is very well suited to processing with Python. I have built Python based publishing systems using it. Whilst down-translating to, say, HTML only two small documents need to be loaded into Python -- the collection file and the information object being rendered. Also, this architecture supports semantic naming of information objects which is very, very useful for cross-reference creation and management.

Also, it is a no-brainer Python script to convert from a micro-document collection to a monolithic DTD such as DocBook so that we can piggy-back on the existing DocBook downtranslates:-)

yours-in-an-awful-rush-because-I-am-supposed-to-finish-"XML-processing-with-Python"-for-Prentice-Hall-this-weekend-ly,
Sean

P.S. The Pyxie Open Source project that I will be kicking off with this book will have Python software that can be used right away to prototype a micro-document based Python Doc architecture.

<Sean uri="http://www.digitome.com/sean.html">
Developers Day co-Chair WWW9, April 2000, Amsterdam
<uri>http://www.www9.org</uri>
</Sean>
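The assembly step Sean describes, gathering information objects into one document by following a collection file, might look something like the sketch below. The `collection`, `level`, `title` and `object` elements and the `ModuleDoc` name come from his message; the `summary` element, the `name` attribute, the `OBJECTS` dictionary (standing in for one small file per information object) and the `assemble` function are all assumptions made for illustration. It only shows the gathering; mapping the result onto DocBook would be a further step.

```python
import xml.etree.ElementTree as ET

COLLECTION = """\
<collection>
  <level>
    <title>Library</title>
    <level>
      <title>String Services</title>
      <object uri="string"/>
      <object uri="re"/>
    </level>
  </level>
</collection>
"""

# Hypothetical store: in practice each entry would be its own small file.
OBJECTS = {
    "string": "<ModuleDoc name='string'><summary>Common string operations.</summary></ModuleDoc>",
    "re": "<ModuleDoc name='re'><summary>Regular expression operations.</summary></ModuleDoc>",
}


def assemble(node, load):
    """Return a copy of node with each <object uri=...> replaced by its content."""
    if node.tag == "object":
        # load() maps a uri to the XML source of one information object.
        return ET.fromstring(load(node.get("uri")))
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    for child in node:
        new_child = assemble(child, load)
        new_child.tail = child.tail  # keep the original layout whitespace
        copy.append(new_child)
    return copy


book = assemble(ET.fromstring(COLLECTION), OBJECTS.__getitem__)
print(ET.tostring(book, encoding="unicode"))
```

Because each object is parsed on its own, a renderer that needs only one module's page can load the collection plus that single object, which is the property Sean points out.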
Fred L. Drake, Jr. wrote:
My concern with DocBook is that the markup may be a bit on the "heavy" side; I don't want the document source to be so markup-heavy that I'm the only one to work on them.
I think that we should use a variant of DocBook and use some SGML minimizations supported by sgmllib (or that COULD be supported by sgmllib). We can trivially use sgmllib+sax to normalize minimized SGML to XML.
3. Create something entirely new and specific to Python.
How is this different from porting over what we have? Hasn't it evolved to be pretty Python specific? Paul Prescod
Paul Prescod writes:
I think that we should use a variant of DocBook and use some SGML minimizations supported by sgmllib (or that COULD be supported by sgmllib). We can trivially use sgmllib+sax to normalize minimized SGML
So you favor SGML over XML? That had been my original thought, but I shifted as more & better XML tools became available. I am not tied to XML, however. I said:
3. Create something entirely new and specific to Python.
Which Paul questioned:
How is this different from porting over what we have? Hasn't it evolved to be pretty Python specific?
What we have is fairly Python-specific, but there's still a lot of legacy which I'd love to get rid of. I'm not at all convinced that it's terribly *good*. This conversion effort would be an excellent time to use a better-designed structure. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives
Fred L. Drake, Jr. wrote:
Paul Prescod writes:
I think that we should use a variant of DocBook and use some SGML minimizations supported by sgmllib (or that COULD be supported by sgmllib). We can trivially use sgmllib+sax to normalize minimized SGML
So you favor SGML over XML?
I favor a well-defined subset of SGML using basically the XML features plus end-tag minimization. That can massively cut down on the annoyance factor. If we ship a simple PyML to DBXML script with Python nobody will complain that we are doing something "non-standard".
That had been my original thought, but I shifted as more & better XML tools became available.
Well in the Python universe any XML tool that uses SAX or the DOM *is* an SGML tool. Admittedly the Java world is not as open-minded as we are but that's their problem. Jade certainly has no problem with either XML or SGML. I think that trying to stick carefully to DocBook is probably too much work. We should design a Python-ic variant -- just like we do with APIs. We can use a transformation to "get to" the standard version (just as we use classes/functions for abstraction in APIs) Paul Prescod
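As a rough illustration of normalizing minimized markup to XML, the sketch below expands one specific SGML minimization, the empty end tag `</>`, into explicit end tags. Paul does not say which minimizations he has in mind, so this choice is an assumption, and the function name `expand_empty_end_tags` is made up for the example. It uses a regular expression and a stack rather than sgmllib (which is no longer in the standard library in Python 3), and it does not handle `>` inside attribute values or omitted end tags, which would need knowledge of the DTD.

```python
import re

# Matches a start tag, an end tag, or the empty end tag "</>".
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)?([^<>]*)>")


def expand_empty_end_tags(text):
    """Rewrite every '</>' as an explicit end tag for the innermost open element."""
    stack = []

    def fix(match):
        closing, name, rest = match.groups()
        if closing:
            if not stack:
                raise ValueError("end tag with no open element")
            opened = stack.pop()
            if name and name != opened:
                raise ValueError(f"expected </{opened}>, found </{name}>")
            return f"</{opened}>"
        # Start tag: remember it unless it is self-closing like <object/>.
        if name and not rest.rstrip().endswith("/"):
            stack.append(name)
        return match.group(0)

    result = TOKEN.sub(fix, text)
    if stack:
        raise ValueError(f"unclosed elements: {stack}")
    return result


print(expand_empty_end_tags('<method name="foo">Return <em>n</>.</>'))
# <method name="foo">Return <em>n</em>.</method>
```

The author types the short form; the tools downstream only ever see ordinary XML, which is the division of labour Paul is arguing for.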
[Paul writes]
I favor a well-defined subset of SGML using basically the XML features plus end-tag minimization. That can massively cut down on the annoyance factor.
...
I think that trying to stick carefully to DocBook is probably too much work. We should design a Python-ic variant -- just like we do with APIs.
This seems to be converging on agreement. Fred - what is the next step - assuming Paul's statement (or a slight refinement thereof) is accepted, how do we move forward? How do we design the DTD? Does anyone have enough experience with this stuff that they could make a first pass? Mark.
Paul Prescod writes:
I favor a well-defined subset of SGML using basically the XML features plus end-tag minimization. That can massively cut down on the annoyance factor.
If we ship a simple PyML to DBXML script with Python nobody will complain that we are doing something "non-standard".
I'd be happy to call it DocBook-with-minimization and have it really be SGML. Skip the translation to XML.
I think that trying to stick carefully to DocBook is probably too much work. We should design a Python-ic variant -- just like we do with APIs. We can use a transformation to "get to" the standard version (just as we use classes/functions for abstraction in APIs)
At this point, I'm not convinced that DocBook is terribly valuable for this. One of the goals remains to make authoring easy, and DocBook simply has too many long names for things. That's really unfortunate, given the rise in the standing of DocBook in the open source community; I generally consider that a good thing.

-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
participants (9)
- Edward Welbourne
- Fred L. Drake, Jr.
- Fredrik Lundh
- Mark Hammond
- Moshe Zadka
- Paul Prescod
- Scott Cotton
- Sean Mc Grath
- Skip Montanaro