indexing and searching pdf files
fperez528 at yahoo.com
Fri Sep 27 02:05:00 CEST 2002
Rajarshi Guha wrote:
> My question - is there already something like this with python?
> Another question which is slightly off topic is, does anybody know of any
> articles/pages that talk about indexing text files efficiently - index
> generation algorithms etc?
It doesn't directly discuss algorithms, but the following page is an
_excellent_ overview of the problem:
Please post here any findings you may make on the python/indexing front. This
is a problem I expect to have to deal with in a few months, so having it
solved for me ahead of time would be most pleasant :)
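As a starting point for the indexing question quoted above, here is a minimal sketch of an inverted index (a mapping from each word to the documents that contain it). It is only an illustration: the names tokenize, build_index and search are made up for this example, the tokenizer is deliberately naive, and nothing here is taken from the page mentioned above.

```python
import re
from collections import defaultdict

# Naive tokenizer: runs of ASCII letters and digits, lowercased.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(docs):
    """Map each word to the set of document names containing it.

    `docs` is a hypothetical dict of {name: text}.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A real indexer would also want stemming, stop words and an on-disk format, but the word-to-documents mapping is the core idea.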
Full-text indexing of PostScript would also be nice, with provisions for
automatic indexing of gzipped and bz2 compressed files. While we are at it,
it would be nice to index code in at least C, C++, Fortran, Mathematica and
Python and build tables of classes/functions defined in each file for the
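A rough sketch of two of these wishes, assuming modern Python and only its standard library: reading gzip and bz2 files transparently, and listing the classes and functions defined in a Python source file. The function names read_text and list_definitions are invented for this example, and only Python source is handled here; C, C++, Fortran and Mathematica would each need their own parser.

```python
import ast
import bz2
import gzip

# File suffix -> function that opens the compressed file.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def read_text(path):
    """Read a file as text, decompressing .gz and .bz2 files on the fly."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
                return fh.read()
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return fh.read()


def list_definitions(source):
    """Return (kind, name, line) tuples for classes and functions, in line order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            found.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append(("function", node.name, node.lineno))
    return sorted(found, key=lambda item: item[2])
```

Calling list_definitions(read_text(path)) on a .py, .py.gz or .py.bz2 file would then give the table of definitions for that file. Methods are reported as functions, since the sketch does not track nesting.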
Mmmhhh, what else? Ah, if a .ps/.pdf has an associated .lyx/.tex file, that
should be indexed instead, with the abstract, author, keywords, etc. fields
A search interface with a basic webserver would be enough.
That would be a start.
Not asking too much, am I?
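For the search interface with a basic web server mentioned above, a minimal sketch using only the standard library might look like this. Everything specific in it is an assumption made for illustration: the toy INDEX contents, the /?q=word request format, port 8000, and the names render_results, SearchHandler and make_server.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical toy index: word -> list of file names containing it.
INDEX = {"pdf": ["paper.pdf"], "python": ["notes.txt", "paper.pdf"]}


def render_results(index, query):
    """Build a small HTML page listing the files that match a one-word query."""
    hits = index.get(query.strip().lower(), [])
    items = "".join("<li>%s</li>" % html.escape(name) for name in hits)
    return "<html><body><h1>Results for %s</h1><ul>%s</ul></body></html>" % (
        html.escape(query), items)


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests of the form /?q=word (an assumed convention).
        params = parse_qs(urlparse(self.path).query)
        query = params.get("q", [""])[0]
        body = render_results(INDEX, query).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Create (but do not start) a server bound to localhost only."""
    return HTTPServer(("127.0.0.1", port), SearchHandler)
```

Running make_server().serve_forever() and opening http://127.0.0.1:8000/?q=python in a browser should list the matching files. Binding to 127.0.0.1 keeps the sketch off the network; a shared installation would need more care than this.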
More information about the Python-list mailing list