[Twisted-Python] Lore and generating reStructuredText (Lore2Sphinx)
On Thu, Mar 21, 2013 at 9:17 AM, Kevin Horn <kevin.horn@gmail.com> wrote:
tl;dr A Lore plugin won't work for generating Sphinx source files, at least not by itself.
Itamar posted some notes from the Twisted BoF session that was held at PyCon last weekend, and one of the things in it was the following line:
- lore output plugin that generates ReST via docutils parse tree objects, then write code to run sphinx on this output
I wasn't there, so I don't know the exact context that this was referring to, but let me try to explain a little bit about why this won't work (at least not as written).
reStructuredText, as some of you may know, creates its output by first creating an intermediate representation of a document called a "node tree", which is a tree of "nodes" that represent the various elements in a document (text, paragraphs, lists, list items, etc.). reStructuredText also has a construct called a "directive", which is some markup that tells the docutils reST parser to create a bunch of these nodes.
Directives are awesome and are a big reason why reStructuredText is so much more powerful than other lightweight markup languages like Markdown, Textile, etc. They serve as an extension point and allow users to create their own markup constructs without changing the actual parser.
The key thing is that a directive is not itself a type of node. Rather, it's just a markup construct. This means that once a reStructuredText document goes through the docutils parser, the information about the directives is lost, because they have been transformed into a bunch of nodes.
For example, there's a container directive, which looks like this:

    Title
    =====

    .. container::

       I'm a content paragraph! Yay!

When processed, this creates a nodetree that looks something like this (in docutils "pseudoxml" representation):

    <document ids="title" names="title" source="test.rst" title="Title">
        <title>
            Title
        <container>
            <paragraph>
                I'm a content paragraph! Yay!

It is entirely coincidental that the container directive and the <container> node are named the same thing. Don't let this confuse you. The point is that the directive goes away and is replaced by a bunch of nodes (more specifically, the node tree is transformed in some way...I suppose a directive could remove nodes, but I don't think I've ever seen that done).
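The way a directive disappears during parsing can be sketched with a toy model. This is not docutils code: `Node`, `DIRECTIVES` and `parse_directive` are hypothetical names invented for the illustration, and the real parser is far more involved.

```python
# Toy model only: not docutils. Node, DIRECTIVES and parse_directive are
# hypothetical names made up for this illustration.
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def container_directive(content):
    # A directive is just a callable that returns nodes.
    return [Node("container", children=[Node("paragraph", content)])]


DIRECTIVES = {"container": container_directive}


def parse_directive(name, content):
    # The parser looks the directive up, calls it, and keeps only the result.
    return DIRECTIVES[name](content)


nodes = parse_directive("container", "I'm a content paragraph! Yay!")
tree = Node("document", children=[Node("title", "Title")] + nodes)

# Nothing in the tree records that a directive named "container" was used;
# the container node exists only because the callable chose to return it.
print([child.tag for child in tree.children])
```

In this sketch the directive name is used only as a lookup key and is then thrown away, which is the information loss described above.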
We can see this using another example:
Here's some markup:

    Title
    =====

    .. warning::

       I'm a content paragraph! Yay!

and here's the pseudoxml representation of the nodetree:

    <document ids="title" names="title" source="test.rst" title="Title">
        <title>
            Title
        <container>
            <paragraph>
                I'm a content paragraph! Yay!

Notice that the node trees look exactly the same. Now this is not quite true, as there are probably some attributes on the actual Python nodes that might be used to distinguish them when writing output which aren't displayed here...they certainly get rendered into HTML differently. But the point is that the directive itself is GONE and you have no real way of recreating it from the node tree.
I think this problem also happens with custom text roles, which are another extension mechanism in reST, but I haven't looked too deeply into that.
Since you really, really want to have directives in your output (in fact you have to have them if you want to use Sphinx, which makes heavy use of them), you can't really generate Sphinx-capable source files using _only_ the nodetree representation.
I suppose you might be able to do something where you try to detect where the directive _should_ go and try to insert it during the rendering step, but such a thing would be an egregious kludge, would take a lot of effort, and I can't imagine it would work very well, if at all.
Another option would be to fork the docutils parser and change it so that it could create "directive nodes" or something, but I certainly would not recommend such a course. (If you think maintaining Lore is a pain, you ain't seen nothin' yet. And one thing this project has driven home to me is that no software only needs to be maintained "for a little while".)
I'm not saying that the proposed plugin for Lore is a bad idea...I think it would be pretty cool. You'd be able to send Lore out to all of the various formats supported by docutils, and who doesn't want to write their next s5 presentation in Lore, right? :) But it won't do the job that it was being put forward for in the note Itamar mentioned.
So what about building some software that generates some other representation of the source document, and then renders that as reStructuredText? Well, this is the best idea I've come up with (or heard) and is in fact exactly what lore2sphinx-ng_ (which is not intended to be a separate thing, it's just an experimental fork of lore2sphinx) and rstgen_ do. lore2sphinx-ng creates the representation from Lore sources (which is also a tree of "nodes", though they aren't called that), and rstgen defines the nodes, and renders them into reStructuredText source.
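A rough sketch of the "source-oriented tree that renders itself" idea follows. The classes here are hypothetical and are not rstgen's actual API; they only show that a node which represents a source construct (such as a directive) can write that construct back out directly.

```python
# Sketch only: these classes are hypothetical and are not rstgen's real API.
from dataclasses import dataclass, field


@dataclass
class Title:
    text: str

    def render(self):
        return f"{self.text}\n{'=' * len(self.text)}\n"


@dataclass
class Paragraph:
    text: str

    def render(self):
        return f"{self.text}\n"


@dataclass
class Directive:
    name: str
    content: list = field(default_factory=list)

    def render(self):
        # Render the children, then indent them under the directive marker.
        body = "\n".join(child.render() for child in self.content)
        indented = "\n".join(
            f"   {line}" if line else "" for line in body.splitlines()
        )
        return f".. {self.name}::\n\n{indented}\n"


def render_document(nodes):
    # Joining with a newline leaves one blank line between block elements.
    return "\n".join(node.render() for node in nodes)


doc = [
    Title("Title"),
    Directive("warning", [Paragraph("I'm a content paragraph! Yay!")]),
]
rst = render_document(doc)
print(rst)
```

Because the `Directive` node keeps its own name, nothing has to be guessed when the reStructuredText is written out.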
The only problem is that these aren't done yet, though the work done so far looks very promising (in terms of actually being able to do the job reliably someday). If anyone has bothered to read this far and is interested in helping out, please feel free to fork the repos and lend a hand. Also feel free to contact me either on this list or directly if you have any questions. I apologize in advance for the current state of the code, which is a bit messy (especially lore2sphinx-ng, which still has a bunch of cruft from the "old"/"current" version that I haven't gotten around to removing yet).
.. _lore2sphinx-ng: https://bitbucket.org/khorn/lore2sphinx-ng
.. _rstgen: https://bitbucket.org/khorn/rstgen
-- Kevin Horn
I screwed up the example above, due to misnaming a file and running rst2pseudoxml.py on the wrong thing. It should actually look something like this:

    <document ids="title" names="title" source="test.rst" title="Title">
        <title>
            Title
        <warning>
            <paragraph>
                I'm a content paragraph! Yay!

and this:

    <document ids="title" names="title" source="test.rst" title="Title">
        <title>
            Title
        <admonition classes="admonition-hooray">
            <title>
                hooray!
            <paragraph>
                I'm a content paragraph! Yay!

But the point still holds. Directive info goes away after parsing.

-- Kevin Horn
On Sat, Mar 23, 2013 at 9:57 PM, Glyph <glyph@twistedmatrix.com> wrote:
On Mar 21, 2013, at 7:17 AM, Kevin Horn <kevin.horn@gmail.com> wrote:
Notice that the node trees look exactly the same. Now this is not quite true, as there are probably some attributes on the actual Python nodes that might be used to distinguish them when writing output which aren't displayed here...they certainly get rendered into HTML differently. But the point is that the directive itself is GONE and you have no real way of recreating it from the node tree.
The directive isn't "gone"; it turns into the attributes on the Python nodes that you're talking about. Presumably that's what's used to render it into HTML. I believe it was Doug Hellmann who indicated to Jean-Paul that this was possible.
While this is true for some built-in docutils directives, there is no guarantee that this will be the case. A directive basically says "call a Python callable according to a certain interface, and put the returned nodes here." If the directive in question uses a callable that returns nodes with attributes set in a certain way, then you have some breadcrumbs to figure out how those nodes were created, but there's nothing that says that the nodes will definitely be set up that way.

For example, you could have a directive that has a ReST list as content, and changes the items in the list into some kind of link or something. Maybe it looks something like this (not a real/valid nodetree...):

    <list>
        <link>...
        <link>...
        <link>...

How can you tell that this was created by a directive? You can't, because it could just as easily have been a list full of links to begin with. This is why rstgen has its own node definitions, as it is focused on what source constructs should be generated, rather than what the docutils output should look like.

Of course it's possible that the docutils nodes that we would actually need from the Twisted docs are all introspectible *enough* that you could maybe just build a docutils doctree and make good enough guesses to create output which included directives. It might even be easy to make those guesses. But you'd still be guessing, and would fail in the general case. Also, you'd still need to write the code to render those nodes into valid ReST, which is really the hard part of the process. Also, you'd need to "parse" everything inside the "directive node" (or whichever node you've decided represents the directive) in order to turn it into directive arguments, options and contents.

Reading exarkun's expansion of the notes Itamar posted [1]_ it looks like another idea proposed was to generate Sphinx (or maybe Sphinx-looking) output directly from Lore, which could maybe work, but I think would also be a lot of work, for a lot less benefit.

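The list-of-links argument above can be made concrete with a small toy. Nothing here is docutils: the `linkify_directive` callable and the tuple-based "nodes" are hypothetical stand-ins.

```python
# Toy model only; the "linkify" directive and tuple-based nodes are hypothetical.

def linkify_directive(items):
    # Hypothetical directive: turns each list item into a link node.
    return ("list", [("link", item) for item in items])


# Source A: a list written inside the hypothetical directive.
tree_from_directive = linkify_directive(["intro", "howto"])

# Source B: the author wrote a plain list of links by hand.
tree_from_plain_markup = ("list", [("link", "intro"), ("link", "howto")])

# Both sources yield the same tree, so the tree alone cannot tell them apart.
print(tree_from_directive == tree_from_plain_markup)
```

Two different source documents map to one tree, so any tree-to-source step has to pick one of them by guessing.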
However, if what you really want is to have a Lore plugin that generates reStructuredText, then why not have a Lore plugin that generates a rstgen tree, which rstgen will already know how to render into ReST? Other than the obvious objection that rstgen isn't done yet, this seems the best solution to me. Of course I may be biased. :)

Even if you didn't want to use rstgen itself, though, I still think you're better off creating some tree-like structure that is *not* a docutils document tree, and then have that structure render itself into ReST.

BTW, Doug Hellmann almost certainly knows more about the internals of docutils than I do, so maybe he's right and there is a way to (relatively) easily generate ReST from a docutils tree including the directives. But I don't think there is.

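As a sketch of going from Lore source to reStructuredText source without passing through a docutils tree, the following walks a small XHTML fragment with the standard library. It assumes Lore-style markup in which `<div class="note">` marks a note; `lore_to_rst` is a hypothetical helper and is not part of Lore, lore2sphinx or rstgen.

```python
# Sketch only. Assumes Lore-style XHTML where <div class="note"> marks a note;
# lore_to_rst is a hypothetical helper, not part of Lore, lore2sphinx or rstgen.
import xml.etree.ElementTree as ET

LORE_SOURCE = """
<body>
  <h2>Title</h2>
  <p>Some text.</p>
  <div class="note"><p>Remember this.</p></div>
</body>
"""


def lore_to_rst(element, indent=""):
    lines = []
    for child in element:
        if child.tag == "h2":
            text = child.text.strip()
            lines += [indent + text, indent + "=" * len(text), ""]
        elif child.tag == "p":
            lines += [indent + child.text.strip(), ""]
        elif child.tag == "div" and child.get("class") == "note":
            # The directive is chosen from the *source* construct, so no
            # guessing from a docutils node tree is needed.
            lines += [indent + ".. note::", ""]
            lines += lore_to_rst(child, indent + "   ")
    return lines


rst = "\n".join(lore_to_rst(ET.fromstring(LORE_SOURCE)))
print(rst)
```

The mapping from a Lore construct to a directive is decided while the source structure is still known, which is the property the docutils node tree does not guarantee.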
Perhaps you mean "there's no public API for constructing the node tree representation of an arbitrary directive"?
This is certainly true, but I think it doesn't go far enough in describing the issue.

.. [1] https://twistedmatrix.com/trac/wiki/Fellowship2013/Priorities

-- Kevin Horn
On Mon, Mar 25, 2013 at 6:29 PM, Glyph <glyph@twistedmatrix.com> wrote:
On Mar 25, 2013, at 9:16 AM, Kevin Horn <kevin.horn@gmail.com> wrote:
How can you tell that this was created by a directive? You can't, because it could just as easily have been a list full of links to begin with.
But, I don't care if it was created by a directive or not.
I think we're talking about two different things.
What you seem to be talking about is using Sphinx to do source-to-source Lore-to-ReST transformation. In that case, you're (sort of) right, in that information is lost when you invoke directives. If we did this, and it worked, it would just be a slightly better way to implement lore2sphinx; we'd still need to manage the transition in largely the same way.
What *I'm* talking about is just using Lore source as an input to Sphinx, and going straight to the output HTML. In order to do this, we just need to construct the right tree and actually *invoke* the directive callables at the right time. They produce whatever output they want to produce, and we hand that back to Sphinx, and it outputs some docs. With this strategy, we just switch to sphinx by switching our build process; we don't switch input formats. Then, if someone wants to use Lore they can, if they want to use ReST they can, and we can migrate on an as-needed basis; there's no need for a single big format migration for us to start using Sphinx.
Hmmm. We are indeed talking about two different things.

What you describe is probably technically possible, but I still don't think it's a very good approach. It seems to me that it would be very brittle and error prone. You'd need to:

- figure out the node output of every directive you were trying to replicate, with every type of (tedious in the best case, possibly very tricky for some directives, but maybe not too bad)
- figure out a system to "inject" those nodes at the right time (not sure how difficult this would be)
- figure out how Sphinx modifies the doctrees to do all its linking, index generation, toctree handling, etc. (I think this is the hardest part)
- figure out a way to get Sphinx to take doctrees as input (I'm guessing you'd serialize them in whatever way Sphinx does when it does its caching, stick them in the cache, and then get Sphinx to build from the cache?)

What about changing the build process to use some kind of tool that goes over the doc files, and if a source file is in Lore format, it translates it using rstgen, and if it's already a rst file, it just copies it into the Sphinx project as-is? Then just build the Sphinx project.

I think this would be a lot less hassle than trying to decipher and replicate a bunch of docutils and Sphinx internals, and would really only be a minor change to the way that the lore2sphinx command line tool already works. It just processes each Lore file and sends the output to an output directory, so you'd just need to modify it to skip the processing and only copy rst files.

With the new refactoring of lore2sphinx into lore2sphinx-ng, I think this is possible and would yield acceptable results. Of course someone still would need to finish lore2sphinx-ng and rstgen, but that's either going to have to happen anyway, or some other tool would have to be built that mucks with doctrees.

Thoughts?

-- Kevin Horn
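The mixed-format build step proposed in the message above could look roughly like this. It is only a sketch: the `.xhtml` suffix for Lore files and the `translate` callable are assumptions, and a real tool would plug lore2sphinx-ng/rstgen in where `translate` is called.

```python
# Sketch only. The ".xhtml" suffix for Lore files and the `translate` callable
# are assumptions; a real tool would plug in lore2sphinx-ng/rstgen here.
import shutil
from pathlib import Path


def prepare_sphinx_sources(src_dir, out_dir, translate):
    """Copy .rst files as-is and translate Lore files into out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file():
            continue
        target = out_dir / path.relative_to(src_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        if path.suffix == ".rst":
            shutil.copyfile(path, target)  # already reST: copy unchanged
        elif path.suffix == ".xhtml":
            # Lore source: translate the text and write it out as .rst
            rst_text = translate(path.read_text(encoding="utf-8"))
            target.with_suffix(".rst").write_text(rst_text, encoding="utf-8")
```

After a step like this, the output directory holds only reST files, so Sphinx can be run on it in the usual way while individual documents are migrated one at a time.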
On Tue, Mar 26, 2013 at 3:41 PM, Glyph <glyph@twistedmatrix.com> wrote:
On Mar 26, 2013, at 7:03 AM, Kevin Horn <kevin.horn@gmail.com> wrote:
- figure out a way to get Sphinx to take doctrees as input (I'm guessing you'd serialize them in whatever way Sphinx does when it does its caching, stick them in the cache, and then get Sphinx to build from the cache?)
This is the only part of the process I believe is actually necessary. All the other stuff you wrote assumes that this can't be made to work :). But as I understand it, this is specifically what JP asked Doug.
I don't think this (only needing to figure out the last part) is really the case:

- figure out the node output of every directive you were trying to replicate, with every type of (tedious in the best case, possibly very tricky for some directives, but maybe not too bad)

You need this to know what nodes to create in your tree. This doesn't seem too bad, until you realize that a number of the Sphinx-specific directives you *absolutely must have* (or at least the nodes they create) depend on the Sphinx build environment. So you need to either re-create the build environment, or you need to re-create all of these directives in your own code.

- figure out a system to "inject" those nodes at the right time (not sure how difficult this would be)

This one is probably not too bad, since you could probably get away with a bare minimum of just sticking your (for example) toctree nodes right after your main heading or something. And you could probably get away with something similar for index entries or whatever.

- figure out how Sphinx modifies the doctrees to do all its linking, index generation, toctree handling, etc.

Probably not too much needs to be done here directly, as I *think* that Sphinx does all this after it builds the doctrees, so if you can get the doctrees into Sphinx you're probably fine. Don't quote me on that, though. So I no longer think this is the hardest part. But I think you'd still have to have a decent understanding of how these bits work internally to generate your nodes correctly. So a learning curve, though probably no actual code to write specifically for this.

- figure out a way to get Sphinx to take doctrees as input (I'm guessing you'd serialize them in whatever way Sphinx does when it does its caching, stick them in the cache, and then get Sphinx to build from the cache?)

Then you have to do this bit. The "obvious" way to do this is to create your doctrees and then pickle them, like Sphinx does when it caches parsed documents. Then make Sphinx build its output from these "cached" files (which I don't think it will currently do, but it can probably be made to do it).

The whole point is that we want to go straight from Lore->some docutils data structure.

Why? What does this buy us? To me it seems more complicated, requires more work, depends on *internal* APIs of a separate project (actually 2 separate projects), and doesn't seem to gain very much if anything. What's the reasoning here? Keep in mind that I'm without the benefit of whatever discussion on this took place at PyCon, so maybe I'm just missing something. If it's just an incremental transition, then I think we can get that without resorting to relying on the guts of two fairly complicated systems.

If we have to emit intermediary ReST, it's almost as bad as having to do the whole source translation in the first place.

I don't see how emitting intermediary ReST, which at least has a spec (granted the spec is ugly to look at, but it's pretty complete) is any worse than emitting intermediary doctrees, which could change out from under us.

Summing up a bit: Generating ReST is a challenging problem, no doubt. But it's the _only_ challenging problem if we go the source translation route. If we go the doctrees route, I don't understand the advantage gained, and I'm concerned about dealing with the internals of docutils and Sphinx. (also it's more work, and I'm lazy :P )

-- Kevin Horn
On Tue, Mar 26, 2013 at 10:43 PM, Andrew Bennetts <andrew@bemusement.org> wrote:
Kevin Horn wrote: […]
Why? What does this buy us? To me it seems more complicated, requires […]
If it's just an incremental transition, then I think we can get that without resorting to relying on the guts of two fairly complicated systems.
Although I'm blissfully ignorant of the deeper, darker details of docutils and sphinx (and hope to remain so), I feel compelled to point out that an incremental transition is more than a “just”. You can start reaping the rewards of the new system sooner and with less risk than an all-or-nothing transition, it reduces merge conflicts for work-in-progress doc branches, etc.
So I'd say incremental transition is closer to a “must” than a “just”!
-Andrew.
Yes, that was "just" in the sense of "only", rather than in the sense of "merely". :) -- Kevin Horn