indexing and searching pdf files

John Hunter jdhunter at ace.bsd.uchicago.edu
Thu Sep 26 19:07:32 EDT 2002


>>>>> "Rajarshi" == Rajarshi Guha <rajarshi at presidency.com> writes:

    Rajarshi> Hi, I have a load of pdf files and I would like to index
    Rajarshi> them so that I can search them for keywords. I was
    Rajarshi> thinking of using pdftotext to generate the textfile and
    Rajarshi> then create the index from that.

As far as I have been able to tell, the pdftotext approach is about as
good as any.  Darrell Gallion was kind enough to forward me a Plex
parser he had been working on, and I'm sure he would send it to you if
you asked: dgallion1 at yahoo.com.  I haven't had a chance to look at
it yet.
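
For what it's worth, here is a rough sketch of the pdftotext route.  It
assumes the pdftotext command-line tool is on your PATH (passing "-" as
the output file makes it write to stdout) and a reasonably recent
Python 3.  The function names and the simple in-memory inverted index
are my own choices for illustration, not part of any library, and a
real index would probably want stemming, stop words and persistence.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of one PDF by running pdftotext ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_index(texts):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, *keywords):
    """Return the sorted names of documents that contain every keyword."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*hits)) if hits else []


def index_directory(pdf_dir):
    """Convert every *.pdf directly under pdf_dir and index the results."""
    texts = {p.name: pdf_to_text(p) for p in Path(pdf_dir).glob("*.pdf")}
    return build_index(texts)
```

You would then call something like search(index_directory("papers"),
"fourier", "transform"), where "papers" is a made-up directory name, to
get the files that mention both words.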

You could also consider exposing your pdf dirs through a web server
and getting Google to crawl them.  You could then use its pdf
searching, caching, html-izing and indexing capabilities by doing an
advanced search restricted to your server.  You could use pygoogle to
do the same from Python if you wanted to stay within the fold.
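
As an illustration of a search restricted to your server, the sketch
below just builds an ordinary Google search URL using the site: and
filetype: operators.  The host name docs.example.org is a placeholder
and the helper name is made up; this does not use pygoogle, whose
interface I am not reproducing here.

```python
from urllib.parse import urlencode


def site_pdf_query_url(host, keywords):
    """Build a Google search URL restricted to PDF files on one host."""
    # site: limits results to the host, filetype: limits them to PDFs.
    query = f"site:{host} filetype:pdf {keywords}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# docs.example.org is a hypothetical server name.
print(site_pdf_query_url("docs.example.org", "spectral analysis"))
```

This only helps once Google has actually crawled the files, which you
cannot force.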

JDH



