[Tutor] need help generating table of contents

Peter Otten __peter__ at web.de
Fri Aug 24 11:55:21 EDT 2018


Albert-Jan Roskam wrote:

> Hello,
> 
> I have Ghostscript files with a table of contents (toc) and I would like
> to use this info to generate a human-readable toc. The problem is: I can't
> get the (nested) hierarchy right.
> 
> import re
> 
> toc = """\
> [ /PageMode /UseOutlines
>   /Page 1
>   /View [/XYZ null null 0]
>   /DOCVIEW pdfmark
> [ /Title (Title page)
>   /Page 1
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Document information)
>   /Page 2
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Blah)
>   /Page 3
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Appendix)
>   /Page 16
>   /Count 4
>   /View [/XYZ null null 0]
>   /OUT pdfmark
>     [ /Title (Sub1)
>       /Page 17
>       /Count 4
>       /OUT pdfmark
>     [ /Title (Subsub1)
>       /Page 17
>       /OUT pdfmark
>     [ /Title (Subsub2)
>       /Page 18
>       /OUT pdfmark
>     [ /Title (Subsub3)
>       /Page 29
>       /OUT pdfmark
>     [ /Title (Subsub4)
>       /Page 37
>       /OUT pdfmark
>     [ /Title (Sub2)
>       /Page 40
>       /OUT pdfmark
>     [ /Title (Sub3)
>       /Page 49
>       /OUT pdfmark
>     [ /Title (Sub4)
>       /Page 56
>       /OUT pdfmark
> """    
> print('\r\n** Table of contents\r\n')
> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
> indent = 0
> start = True
> for title, page, count in re.findall(pattern, toc, re.DOTALL):
>     title = (indent * ' ') + title
>     count = int(count or 0)
>     print(title.ljust(79, ".") + page.zfill(2))
>     if count:
>         count -= 1
>         start = True
>     if count and start:
>         indent += 2
>         start = False
>     if not count and not start:
>         indent -= 2
>         start = True
> 
> This generates the following TOC, with subsub2 to subsub4 dedented one
> level too much:

> What is the best approach to do this?
 
The best approach is probably to use a tool or library that understands
PostScript. Your immediate problem, however, is that when there is more than
one level of indentation you only keep track of the "count" of the innermost
level, so the counts still pending for the outer levels are lost. You can
either keep a list of counts or use recursion and rely on the call stack to
remember the counts of the outer levels.

The following reshuffle of your code seems to work:

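import re

print('\r\n** Table of contents\r\n')
# Raw string, so \( and \s reach the regex engine unchanged.
pattern = r'/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

def process(triples, limit=None, indent=0):
    # triples must be one shared iterator: a nested call consumes the
    # children from it, so the caller resumes right after them.
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            # Print the int(count) direct children one level deeper.
            process(triples, limit=int(count), indent=indent+1)
        if limit is not None and limit == index:
            # All entries expected at this level have been printed.
            break

# toc is the string defined in your original code.
process(iter(re.findall(pattern, toc, re.DOTALL)))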
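
For comparison, the other option mentioned above, a list of counts, might look
roughly like the sketch below. It is untested against real Ghostscript output
and assumes, as the recursive version does, that /Count gives the number of
direct children of an entry. The names toc_lines, remaining, width and step
are made up for this illustration.

```python
import re

PATTERN = r'/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

def toc_lines(toc, width=79, step=4):
    """Return the formatted TOC lines for a pdfmark string (hypothetical helper)."""
    # remaining[i] is the number of children still expected at depth i + 1,
    # so len(remaining) is the indentation depth of the next entry.
    remaining = []
    lines = []
    for title, page, count in re.findall(PATTERN, toc, re.DOTALL):
        depth = len(remaining)
        lines.append((depth * step * ' ' + title).ljust(width, '.') + page.zfill(2))
        if remaining:
            # This entry uses up one slot of its parent.
            remaining[-1] -= 1
        if count:
            # Assumption: /Count is the number of direct children.
            remaining.append(int(count))
        # Drop every level whose children have all been seen.
        while remaining and remaining[-1] == 0:
            remaining.pop()
    return lines
```

With the toc string from the original message, print('\n'.join(toc_lines(toc)))
should give the same layout as process(). The explicit list plays the role of
the call stack in the recursive version.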




