indexing and searching pdf files

Thu Sep 26 19:07:32 EDT 2002

>>>>> "Rajarshi" == Rajarshi Guha <rajarshi at presidency.com> writes:

    Rajarshi> Hi, I have a load of pdf files and I would like to index
    Rajarshi> them so that I can search them for keywords. I was
    Rajarshi> thinking of using pdftotext to generate the textfile and
    Rajarshi> then create the index from that.

As far as I have been able to tell, the pdftotext approach is about as
good as any.  Darrell Gallion was kind enough to forward me a Plex
parser he has been working on, which I'm sure he would send to you if
you asked: dgallion1 at yahoo.com.  I haven't had a chance to look at
it yet.
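
For what it's worth, here is a minimal sketch of the pdftotext route,
not a finished tool.  It assumes the pdftotext binary is on your PATH
and uses its "-" argument to write the text to stdout.  The function
names (pdf_to_text, build_index, search, index_directory) are made up
for illustration, and the index is just an in-memory dict mapping each
word to the set of files that contain it.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of one PDF by running `pdftotext file.pdf -`.

    Assumes the pdftotext binary is installed; "-" sends text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Build an inverted index from a dict of {document name: text}."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


def index_directory(pdf_dir):
    """Convert every *.pdf directly under pdf_dir and index the results."""
    texts = {p.name: pdf_to_text(p) for p in sorted(Path(pdf_dir).glob("*.pdf"))}
    return build_index(texts)
```

Something like search(index_directory("papers"), "some keywords") would
then list the matching files, where "papers" is a hypothetical
directory name.  For a large collection you would want to save the
index to disk rather than rebuild it on every run.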

You could also consider exposing your pdf dirs to a web server and
inducing Google to crawl them.  You could then use their advanced pdf
searching, caching, html-izing and indexing capabilities by doing an
advanced search restricted to your server.  You could use pygoogle to
do the same if you wanted to stay within the fold.
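
As a rough illustration of the restricted search, the sketch below
only builds the query URL, using Google's site: and filetype: search
operators.  It does not call pygoogle or any Google API.  The function
name and the host pdfs.example.org are hypothetical stand-ins for your
own server.

```python
from urllib.parse import urlencode


def site_pdf_query_url(keywords, host):
    """Build a Google search URL limited to PDF files on one host."""
    # site: restricts results to the host, filetype: restricts them to PDFs.
    query = f"{keywords} site:{host} filetype:pdf"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical host name; substitute the server Google has crawled.
url = site_pdf_query_url("indexing", "pdfs.example.org")
```

Opening that URL in a browser gives the same results as typing the
query into the search box by hand.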

JDH