making python man pages looks hard

Dan Connolly connolly at
Fri Aug 20 09:32:23 CEST 1999

I'm having fun watching tchrist learn python. I learned perl from Tom
back in '90 or so. Anyway... I said my piece on perl vs. python back in '96,
and I don't have much to add since then:

Subject: Re: Python, Tcl and Perl, oh my! (was Re: tcl vs. perl)
    Date: 1996/06/26 
    Newsgroups: comp.lang.perl.misc, comp.lang.tcl, comp.lang.python

But, having spent most of my career translating technical
documentation from one format to another, the gripe about a
lack of man pages for python got me thinking.
I happen to be a *big fan* of the python documentation[2] as is,
but I don't see any reason why it shouldn't be available as
man pages too. Or at least... I didn't.


Then I looked at the source[3]


The source format is LaTeX. Converting from LaTeX is hard/messy:

For example, I conjecture that it is impossible to write a program that
will extract the third word from a TeX document. It would be an easy
task for 80% of the TeX documents out there -- just skip over some
formatting stuff and grab the third bunch of characters surrounded by
whitespace. But that "formatting stuff" might be a program that
generates 100 words from the hyphenation dictionary. So the simple
lexical scan of the TeX source would find a word that is not the third
word of the document when printed. 

This may seem like an obscure and unimportant problem, but I
assure you that the problem of converting TeX tables to FrameMaker
MIF is just as unsolvable. 

So while "programmable" document formats have the advantage that
features can be added on a per-document basis, they suffer the
disadvantage that these features cannot be recovered by the
machine and translated in an automated fashion. 

excerpted from Toward Closure on HTML 
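
To make the third-word conjecture concrete, here is a minimal sketch of the "simple lexical scan" the excerpt describes. The function name and the sample inputs are mine, invented for illustration. It strips comments, control sequences, and braces with regular expressions and never expands a macro, which is exactly why it goes wrong on the second input.

```python
import re

def naive_third_word(tex_source):
    """Return the third whitespace-delimited word found by a purely
    lexical scan of TeX source, or None if there are fewer than three.

    Hypothetical helper: it never expands macros, so it only guesses
    at what the printed document would contain.
    """
    # Drop comments: an unescaped % through the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", tex_source)
    # Drop control sequences such as \section or \emph, keeping arguments.
    text = re.sub(r"\\[A-Za-z]+\*?", " ", text)
    # Drop grouping braces.
    text = re.sub(r"[{}]", " ", text)
    words = text.split()
    return words[2] if len(words) >= 3 else None

# The easy case: markup only wraps text that appears in place.
simple = r"\section{Intro} The \emph{quick} brown fox"
print(naive_third_word(simple))   # quick

# The hard case: a macro body is defined early but printed later.
# TeX would print "Alpha lorem ipsum dolor beta" (third word: ipsum),
# but the scan sees the words in source order and answers "dolor".
tricky = r"\def\filler{lorem ipsum dolor} Alpha \filler beta"
print(naive_third_word(tricky))   # dolor
```

Getting the second case right means expanding \filler, and in general that means running TeX itself, since a macro body can be an arbitrary program.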

The author of one of the python doc tools seems to agree:

# Why not start from LaTeX rather than HTML?
# I could hack latex2html itself to produce Texinfo instead, or fix up
# (which already translates LaTeX to Texinfo).
#  Pros:
#   * has high-level information such as index entries, original formatting
#  Cons:
#   * those programs are complicated to read and understand
#   * those programs try to handle arbitrary LaTeX input, track catcodes,
#     and more:  I don't want to go to that effort.  HTML isn't as powerful
#     as LaTeX, so there are fewer subtleties.
#   * the result wouldn't work for arbitrary HTML documents; it would be
#     nice to eventually extend this program to HTML produced from Docbook,
#     Frame, and more.

excerpt from
# -- Convert HTML documentation to Texinfo format
# Michael Ernst <mernst at>
# Time-stamp: <1999-01-12 21:34:27 mernst>
part of [3]

The bewildering array of scripts, tools, and hacks used to
generate the HTML version of the python docs is frightening!
It suggests to me that *very few people* maintain the
python docs. That's good for consistency, but it's sort
of a cathedral[3] approach: there's a sharp line between
the "blessed" modules and Everything Else.


It would be fairly easy to convert the HTML to nroff... I think there
are tools that do that... rosettaman or something? Yes... it
seems to have an option to convert back to roff format.

The trick would be dividing up the sections. The python HTML docs
aren't self-contained like the perl man pages.
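
A rough sketch of the section-splitting step, assuming (and this is only an assumption) that each h1 or h2 heading in the HTML starts a chunk that should become its own man page. The class and function names are hypothetical, the sample HTML is made up, and the roff output is deliberately minimal: one .TH line, a NAME section, and a DESCRIPTION section with all whitespace collapsed.

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Collect (title, text) pairs, starting a new one at each h1/h2."""

    HEADINGS = {"h1", "h2"}  # assumed section boundaries

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [title, list_of_text_parts]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", []])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return                # text before the first heading is dropped
        if self._in_heading:
            self.sections[-1][0] += data
        else:
            self.sections[-1][1].append(data)

def roff_escape(text):
    """Escape backslashes and protect lines that start with . or '."""
    lines = []
    for line in text.replace("\\", "\\e").splitlines():
        if line.startswith((".", "'")):
            line = "\\&" + line
        lines.append(line)
    return "\n".join(lines)

def to_man_pages(html, manual_section="3"):
    """Return a dict mapping each heading title to a small roff page."""
    splitter = SectionSplitter()
    splitter.feed(html)
    splitter.close()
    pages = {}
    for title, parts in splitter.sections:
        title = " ".join(title.split())
        body = " ".join("".join(parts).split())
        pages[title] = "\n".join([
            f'.TH "{title.upper()}" {manual_section}',
            ".SH NAME",
            roff_escape(title),
            ".SH DESCRIPTION",
            roff_escape(body),
        ]) + "\n"
    return pages

html = ("<h1>string</h1><p>Common string operations.</p>"
        "<h1>re</h1><p>Regular expression operations.</p>")
for name, page in to_man_pages(html).items():
    print(f"--- {name} ---")
    print(page)
```

This ignores cross-references, tables, and code samples, and it does not handle titles containing double quotes. Those are where I'd expect the real work to be.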

I took a quick look at the python doc-sig, but I didn't find much
relevant info... they seem to be focussed on a javadoc
work-alike... hmm... maybe that is relevant; is the python
library reference source expected to move into docstrings?
That would be cool.
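
If the reference text did live in docstrings, producing a man page from it could be fairly mechanical. A small sketch of the idea, using only inspect.getdoc from the standard library. The function names and the sample docstring are hypothetical, and the layout (first docstring line as the NAME summary, the rest as DESCRIPTION) is my own guess at a reasonable mapping, not an established convention.

```python
import inspect

def _roff_line(line):
    """Escape one line of text for roff input."""
    line = line.replace("\\", "\\e")
    if line.startswith((".", "'")):
        line = "\\&" + line
    return line

def docstring_to_man(obj, manual_section="3"):
    """Render obj's docstring as a minimal man page (roff source)."""
    name = obj.__name__
    doc = inspect.getdoc(obj) or "(no docstring)"
    # First line is the summary; everything after it is the description.
    summary, _, rest = doc.partition("\n")
    out = [
        f'.TH "{name.upper()}" {manual_section}',
        ".SH NAME",
        f"{_roff_line(name)} \\- {_roff_line(summary)}",
    ]
    body = rest.strip()
    if body:
        out.append(".SH DESCRIPTION")
        for line in body.splitlines():
            # A blank docstring line becomes a paragraph break.
            out.append(_roff_line(line.strip()) if line.strip() else ".PP")
    return "\n".join(out) + "\n"

def third_word(text):
    """Return the third whitespace-delimited word of text.

    Returns None when text has fewer than three words.
    """
    words = text.split()
    return words[2] if len(words) >= 3 else None

print(docstring_to_man(third_word))
```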

Anyway... I had hoped to contribute more, but this looks harder
than I expected, and I'm done for the day.

parting shot: CPAN is cool, but I find it frustrating that I can't
read the documentation for a module without downloading
and unpacking the module. For example, I can browse
the list of modules,

but say I find one I'm interested in:

::OT_PPP        RdpO Control Open Transport PPP / Remote Access   CNANDOR

the only link goes to the Author contact info. Gee thanks.

"Uniform Resource Identifiers (URIs, aka URLs) are short strings that identify
resources in the web: documents, images, downloadable files, services,
electronic mailboxes, and other resources. They make resources available under
a variety of naming schemes and access methods such as HTTP, FTP, and
Internet mail addressable in the same simple way. They reduce the tedium of "log
in to this server, then issue this magic command ..." down to a single click."

Dan Connolly
