Any module or library for full-text indexing?

Russell Turpin noone at do.not.use
Tue May 9 14:11:37 EDT 2000


I'm looking for a Python module that does full-text 
indexing, i.e., one that extracts a set of significant 
words from a text document and searches for a candidate 
word in the list of words so extracted. The module should 
solve the following problems:

COMMON WORD MANAGEMENT. No one wants to index on common 
words such as "the," "of," and "what." Ideally, a module 
that does full-text indexing would have some tool for 
managing the set of words that are defined as "common."
Words not commonly in a dictionary, such as "Noam" and 
"Chomsky," are significant and should be indexed.
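
For illustration only, here is a minimal sketch of what such a common-word tool might look like. The class name CommonWords, the helper drop_common, and the starter word list are all hypothetical, not taken from any existing module; a real tool would load a much larger list.

```python
# Hypothetical starter list; a real tool would load a far larger one.
DEFAULT_COMMON = ("the", "of", "and", "what", "a", "an", "in", "to", "is")


class CommonWords:
    """An editable, case-insensitive set of words to skip when indexing."""

    def __init__(self, words=DEFAULT_COMMON):
        self._words = {w.lower() for w in words}

    def add(self, word):
        """Start treating word as common."""
        self._words.add(word.lower())

    def remove(self, word):
        """Stop treating word as common (no error if it was absent)."""
        self._words.discard(word.lower())

    def __contains__(self, word):
        return word.lower() in self._words


def drop_common(words, common):
    """Return the words that are not in the common set, order preserved."""
    return [w for w in words if w not in common]
```

Under this sketch, words such as "Noam" and "Chomsky" survive simply because nobody has put them on the common list; no dictionary lookup is needed for that part.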

COGNATES. The module should have some way of identifying 
variations of the same word when searching the index, 
i.e., "goose" would also match on "geese," "mouse" on
"mice," and "456" on "four-hundred fifty-six." This 
requires the module to have or make use of a language 
dictionary in some form. (I would be more than happy 
with noun cognates. Yeah, the number example is hard,
and not required.)
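
As a rough sketch of the noun case only, one could reduce each word to a canonical singular form before comparing. The table IRREGULAR_PLURALS and the functions singular and matches below are hypothetical names, and the suffix rules are a crude English-only heuristic standing in for a real language dictionary.

```python
# Hypothetical hand-maintained table; a real module would draw these
# pairs from a language dictionary.
IRREGULAR_PLURALS = {
    "geese": "goose",
    "mice": "mouse",
    "men": "man",
    "children": "child",
}


def singular(word):
    """Map a noun to a canonical singular form (rough English heuristic)."""
    w = word.lower()
    if w in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # theories -> theory
    if w.endswith(("ches", "shes", "sses", "xes")):
        return w[:-2]                # boxes -> box
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                # worms -> worm
    return w


def matches(a, b):
    """True if the two words share the same canonical singular form."""
    return singular(a) == singular(b)
```

The irregular forms are exactly the ones that no suffix rule can catch, which is why the dictionary requirement cannot be avoided entirely.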

The package does not need to implement a persistence 
mechanism, nor manage the indices and their referents. In 
other words, the core functions I am looking for are:

   extract_significant: text -> word_list
   find: word, word_list -> set of hits
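
To make the two signatures above concrete, here is a hedged sketch in which the linguistic parts are left as pluggable hooks. The parameters is_common and normalize are assumptions made for illustration; the defaults shown do nothing clever, which is exactly the gap a real module would have to fill.

```python
import re


def extract_significant(text, is_common=lambda w: False, normalize=str.lower):
    """text -> word_list: distinct normalized words that are not common."""
    words = []
    for token in re.findall(r"\w+", text):
        word = normalize(token)
        if not is_common(word) and word not in words:
            words.append(word)
    return words


def find(word, word_list, normalize=str.lower):
    """word, word_list -> set of hits, compared after normalization."""
    target = normalize(word)
    return {w for w in word_list if normalize(w) == target}
```

With the default hooks these are the trivial versions: every word is significant and only case is ignored when matching.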

These would be trivial functions if not for the 
linguistic aspects described above, and it is 
precisely these problems for which I'm hoping to find a 
solution. Of course, if the module goes further, that 
is great.

If there is no existing Python module for this, I would
be interested in any C package that could be adapted
toward this end. In this case, I would try to wrap the
C package as a Python module, and make it available for
other Python programmers. 

If there is no C package, I'll consider anything that 
can run on a Linux box.

If there is no package that does this, I'll go out
on the glacier and eat ice worms.

Thanks!

Russell


