indexing and searching pdf files
David Mertz, Ph.D.
mertz at gnosis.cx
Thu Sep 26 21:11:05 EDT 2002
Rajarshi Guha <rajarshi at presidency.com> wrote previously:
|I have a load of pdf files and I would like to index them so that I can
|search them for keywords. I was thinking of using pdftotext to generate
|the textfile and then create the index from that.
|My question - is there already something like this with python?
|Another question which is slightly off topic is, does anybody know of any
|articles/pages that talk about indexing text files efficiently - index
|generation algorithms etc?
Carlo Bifulco created a program called PdfSearch that does just what you want.
Its homepage is at:
http://pdfsearch.sourceforge.net/
I have not tried the latest version, but I remember the project because
Carlo uses my indexer.py module as part of his project.
As for my indexer.py module itself, I wrote an article about full text
indexing at:
http://gnosis.cx/publish/programming/charming_python_15.html
This was extended to cover XPath indexing of XML documents in:
http://gnosis.cx/publish/programming/xml_matters_10.html
The latter probably isn't what you are after, but in concept it is
somewhat similar to Carlo's extension.
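
For what it's worth, the pipeline you describe (pdftotext, then an index
built from the resulting text) can be sketched in a few lines. This is only
a rough illustration, not how indexer.py or PdfSearch work internally. The
function names (pdf_to_text, add_document, build_index, search) are made up
for the example, and it assumes the pdftotext command is installed and on
your PATH. The index is a plain in-memory word-to-filenames mapping with no
stemming, ranking, or persistence.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters or digits, lowercased.
WORD = re.compile(r"[a-z0-9]+")


def pdf_to_text(pdf_path):
    """Extract text by shelling out to pdftotext ('-' means write to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def add_document(index, name, text):
    """Record every distinct word of `text` as occurring in document `name`."""
    for word in set(WORD.findall(text.lower())):
        index[word].add(name)


def build_index(pdf_dir):
    """Index every *.pdf file directly inside `pdf_dir` (hypothetical layout)."""
    index = defaultdict(set)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        add_document(index, pdf.name, pdf_to_text(pdf))
    return index


def search(index, query):
    """Return the names of documents containing all words in `query`."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

Something like search(build_index("papers"), "index generation") would then
give the set of file names containing both words, where "papers" is just a
placeholder directory name. For a large collection you would want to store
the index on disk rather than rebuild it each time.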
--
mertz@  | The specter of free information is haunting the `Net! All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains. Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------