indexing and searching pdf files

David Mertz, Ph.D. mertz at gnosis.cx
Thu Sep 26 21:11:05 EDT 2002


Rajarshi Guha <rajarshi at presidency.com> wrote previously:
|I have a load of pdf files and I would like to index them so that I can
|search them for keywords. I was thinking of using pdftotext to generate
|the textfile and then create the index from that.
|My question - is there already something like this with python?
|Another question which is slightly off topic is, does anybody know of any
|articles/pages that talk about indexing text files efficiently - index
|generation algorithms etc?

Carlo Bifulco created a program called PdfSearch to do just what you want.
Its homepage is at:

    http://pdfsearch.sourceforge.net/

I have not tried the latest version, but I remember the project because
Carlo uses my indexer.py module as part of his project.

As for my indexer.py module, I wrote an article about full-text
indexing at:

    http://gnosis.cx/publish/programming/charming_python_15.html

This was extended to cover XPath indexing of XML documents in:

    http://gnosis.cx/publish/programming/xml_matters_10.html

The latter isn't your interest, but in concept it is somewhat similar to
Carlo's extension.
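
For a rough idea of the approach you describe (pdftotext first, then an
index built from the text), here is a minimal sketch of an inverted
index. It is NOT the indexer.py API or Carlo's code; the function names
pdf_to_text, add_document and search are made up for illustration. It
also assumes that pdftotext is on your PATH and that it writes to stdout
when given "-" as the output file.

```python
import re
import subprocess
from collections import defaultdict

# Treat runs of ASCII letters and digits as words (a simplifying assumption).
WORD_RE = re.compile(r"[a-z0-9]+")

def pdf_to_text(pdf_path):
    """Return the text of a PDF by running the external pdftotext tool."""
    # Assumption: "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def add_document(index, name, text):
    """Record every distinct word of `text` as occurring in document `name`."""
    for word in set(WORD_RE.findall(text.lower())):
        index[word].add(name)

def search(index, query):
    """Return the names of documents that contain ALL words in `query`."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)

if __name__ == "__main__":
    index = defaultdict(set)   # word -> set of document names
    # With real files you would call:
    #     add_document(index, path, pdf_to_text(path))
    add_document(index, "a.pdf", "Indexing PDF files with Python")
    add_document(index, "b.pdf", "Searching text files")
    print(search(index, "pdf files"))
```

A dictionary that maps each word to the set of documents containing it
keeps lookups cheap, and a multi-word query is just a set intersection.
A real indexer also has to deal with storing the index on disk and with
word normalization, which this sketch ignores.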

--
mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------