[Tutor] man pages parsing (still)

Tiago Saboga tiagosaboga at terra.com.br
Mon Sep 11 17:05:01 CEST 2006

On Monday 11 September 2006 11:15, Kent Johnson wrote:
> Tiago Saboga wrote:
> > I'm still there, trying to parse man pages (I want to gather a list of
> > all options with their help strings). I've tried to use regex on both the
> > formatted output of man and the source troff files and I discovered what
> > the doclifter man page already says: you have to rely on a number of
> > hints, and it's really not simple. So I'm now using doclifter, and it's
> > working, but it is terribly slow. Doclifter itself takes around a second to
> > parse the troff file, but my few lines of code take 25 seconds to parse
> > the resulting XML. I've pasted the code at http://pastebin.ca/166941
> > and I'd like to hear from you how I could possibly optimize it.
> How big is the XML? 25 seconds is a long time...I would look at
> cElementTree (implementation of ElementTree in C), it is pretty fast.
> http://effbot.org/zone/celementtree.htm

It's about 10k. cElementTree looks easy, but I'd rather not start over again.
Of course, if it's the only solution... 25 seconds (28, in fact, for the cp
man page) isn't really acceptable.

> In particular iterparse() might be helpful:
> http://effbot.org/zone/element-iterparse.htm

OK, I'll look into that.
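
For reference, here is a minimal sketch of the iterparse() approach Kent
points to. It is not the code from the pastebin. It assumes the doclifter
output is DocBook-style XML in which each option sits in a <varlistentry>
holding a <term> and a <listitem>; that layout, the SAMPLE data and the
collect_options name are illustrative assumptions. It uses
xml.etree.ElementTree, which provides the same ElementTree API in current
Python's standard library.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample, standing in for doclifter output (assumed layout).
SAMPLE = b"""<refentry>
<variablelist>
<varlistentry><term><option>-a</option></term>
<listitem><para>same as -dR</para></listitem></varlistentry>
<varlistentry><term><option>-f</option></term>
<listitem><para>do not prompt</para></listitem></varlistentry>
</variablelist>
</refentry>"""


def _flat_text(elem):
    """Return all text below elem with whitespace collapsed ('' if elem is None)."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def collect_options(source):
    """Return (option, help) pairs from a file name or binary file object."""
    options = []
    # "end" events fire once an element and all its children are complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "varlistentry":
            options.append((_flat_text(elem.find("term")),
                            _flat_text(elem.find("listitem"))))
            # Drop the children we no longer need to keep memory flat.
            elem.clear()
    return options


if __name__ == "__main__":
    for option, help_text in collect_options(io.BytesIO(SAMPLE)):
        print(option, "->", help_text)
```

For a 10k document the memory saving from elem.clear() hardly matters; the
point is that the file is walked once, without repeated searches over the
whole tree.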

> I would also try specifying a buffer size in the call to os.popen2(), if
> the I/O is unbuffered or the buffer is small that might be the bottleneck.

What's appropriate in that case? I really don't understand how I should 
determine a buffer size. Any pointers?
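
One way to experiment with this, sketched under assumptions: os.popen2()
exists only in Python 2, so the example uses subprocess.Popen, whose bufsize
argument follows the same convention (0 is unbuffered, 1 is line buffered, a
larger positive value asks for a buffer of roughly that many bytes, a
negative value means the system default). The run_filter helper, the 64 KiB
default and the ECHO stand-in command are hypothetical; for about 10k of XML
any buffer of a few kilobytes should already cover the whole document, so it
is worth timing the pipe and the parsing separately before tuning.

```python
import subprocess
import sys


def run_filter(cmd, data, bufsize=64 * 1024):
    """Feed data (bytes) to cmd's stdin and return everything it writes to stdout."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        bufsize=bufsize,  # size of the buffers on the pipe file objects
    )
    # communicate() writes and reads concurrently, so a full pipe cannot
    # deadlock the parent and the child.
    out, _ = proc.communicate(data)
    if proc.returncode != 0:
        raise RuntimeError("command failed with exit code %d" % proc.returncode)
    return out


# Stand-in for the real converter: a child Python that copies stdin to stdout.
ECHO = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

if __name__ == "__main__":
    print(len(run_filter(ECHO, b"x" * 10000)))
```

Calling run_filter with the doclifter command line in place of ECHO, and
timing it with different bufsize values, would show whether the pipe is the
bottleneck at all.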
