indexing and searching pdf files

David Mertz, Ph.D. mertz at
Fri Sep 27 03:11:05 CEST 2002

Rajarshi Guha <rajarshi at> wrote previously:
|I have a load of pdf files and I would like to index them so that I can
|search them for keywords. I was thinking of using pdftotext to generate
|the textfile and then create the index from that.
|My question - is there already something like this with python?
|Another question which is slightly off topic is, does anybody know of any
|articles/pages that talk about indexing text files efficiently - index
|generation algorithms etc?

Carlo Bifulco created a program called PdfSearch to do just what you want.
Its homepage is at:

I have not tried the latest version, but I remember the project because
Carlo uses my module as part of his project.

In terms of my module, I wrote an article about full text
indexing at:

This was extended to cover XPath indexing of XML documents in:

The latter isn't your interest, but in concept it is somewhat similar to
Carlo's extension.
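
As a rough illustration of the approach described in the question (run
pdftotext first, then build an index from the text), here is a minimal
Python sketch of a word-to-files inverted index. It is not how PdfSearch
or the module mentioned above works; the names pdf_to_text, build_index,
and search are made up for this example, and it assumes the pdftotext
command is installed and on the PATH.

```python
import re
import subprocess
from collections import defaultdict

# Hypothetical tokenizer: runs of ASCII letters and digits count as words.
WORD = re.compile(r"[a-z0-9]+")


def pdf_to_text(path):
    """Run the external pdftotext tool; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_index(texts):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index


def search(index, *words):
    """Return the sorted names of documents that contain every given word."""
    if not words:
        return []
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits))


# Possible usage, assuming these PDF files exist:
#   texts = {p: pdf_to_text(p) for p in ["a.pdf", "b.pdf"]}
#   index = build_index(texts)
#   print(search(index, "python", "index"))
```

An index like this only answers "which files contain all of these words";
it stores no positions, so phrase searches and ranking would need more
than is shown here.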

mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
