looking for text search engine

Andrew Dalke dalke at acm.org
Sun Apr 8 20:21:52 EDT 2001


Hello,

  I'm looking for a text search engine I can use for the
biopython.org project.  I've looked around but couldn't
really find one which met my requirements, so I was hoping
people here could suggest solutions for me.

  This would be used for bioinformatics database formats
like GenBank (http://www.ncbi.nlm.nih.gov/Genbank/),
SWISS-PROT (http://www.expasy.ch/sprot/) or PDB
(http://www.rcsb.org/).

  What we have so far is a parser generator which can
identify the semantically important regions of the various
formats. These parsers would pull out things like "author", "organism",
"accession number" and "description", possibly with the help of some
Python code or with XSLT.  (See the conference proceedings
from the last Python conference for details.)
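
To make that concrete, a single extracted record might end up looking
something like this (the field names and values here are just invented
for illustration):

    # a sketch of one extracted record; everything here is made up
    record = {
        "identifier":  "100K_RAT",
        "accession":   "EXAMPLE01",
        "organism":    "Rattus norvegicus",
        "author":      "Smith, D. C.",
        "description": "example description text",
    }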

People would like to be able to search that data. Some of the searches
are easy to implement, like a search for the identifier "100K_RAT".
Others are a bit more complicated, like searching for an author named
"Smith" when the full text of the author field is "Smith, D. C."
Finally, people would like boolean and phrase search
capabilities, like searching "'hoof and mouth disease' or 'foot and
mouth disease'" in the description field. Some of the fields may allow
stemming and some may not, although stemming is not a requirement.
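
Here is a toy illustration of the kinds of matches I mean, written
against in-memory records (the real thing would work off an index, and
the field values below are invented):

    import re

    # toy records; the field values are invented for illustration
    records = [
        {"id": "100K_RAT",
         "author": "Smith, D. C.",
         "description": "compared with foot and mouth disease proteins"},
    ]

    def field_has_word(rec, field, word):
        # "Smith" should match an author field whose full text is "Smith, D. C."
        return re.search(r"\b%s\b" % re.escape(word), rec[field], re.I) is not None

    def field_has_phrase(rec, field, phrase):
        return rec[field].lower().find(phrase.lower()) != -1

    for rec in records:
        if field_has_word(rec, "author", "Smith") and \
           (field_has_phrase(rec, "description", "hoof and mouth disease") or
            field_has_phrase(rec, "description", "foot and mouth disease")):
            print(rec["id"])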

All this data is record-oriented, and I have full access to the
original data files (though they may be compressed, so I wouldn't want
the search itself to require access to the original files).

The data is not in a form which can be handled directly by an indexing
engine. For example, every record may have the word "GENE" because
that's part of the format definition, but I don't want a search for
"gene" to return all records, or ignore all records because "gene" got
put on a stop list. ("Gene V" is the name of a specific gene.)

Instead, I can convert the input into a proper form and either call an
API to say "here's the text for a new 'description' field" or convert
the text into appropriate XML.
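
For the XML route, I mean something as simple as this (a sketch only;
the element names are whatever we end up agreeing on):

    from xml.sax.saxutils import escape

    def field_to_xml(name, text):
        # wrap one extracted field in an element, escaping &, < and >
        return "<%s>%s</%s>" % (name, escape(text), name)

    print(field_to_xml("description", "Gene V protein"))
    # prints: <description>Gene V protein</description>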

I would like to be able to update the database so modified forms of a
record replace old entries. (There is a guaranteed unique key for each
record.) This is needed for GenBank, which distributes a delta file
every day but only does a full release every few months.
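
In other words, the update semantics should be no more surprising than
a dictionary: storing a record under a key it already has replaces the
old entry. Roughly (with an invented key):

    # toy illustration of the replace-by-key behavior I want
    index = {}
    index["EXAMPLE_KEY"] = "record text from the last full release"
    index["EXAMPLE_KEY"] = "revised record text from today's delta file"
    print(index["EXAMPLE_KEY"])   # only the revised form remains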

Finally, it needs to work on Linux, Solaris and IRIX, and hopefully on
MS Windows and Mac (at least OS X). A Python interface is a plus, but
I've done plenty of interoperating with command-line programs before,
and with calling C APIs from Python.

Oh, and did I mention that fast is good? Ideally, simple lookups for
some fields, like record identifier or aliases, should be blisteringly
fast, while more complex keyword searches should take a fraction of a
second. The system I have now, which only does exact-match word
searches, is built on Sleepycat's BerkeleyDB, and it does the lookup
part quite quickly.
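
The kind of exact-match lookup I mean is roughly this sketch built on
the standard bsddb module (the data file name and offset are made up):

    import bsddb

    # map record identifiers and aliases to where the record lives
    idx = bsddb.btopen("identifiers.db", "c")
    idx["100K_RAT"] = "sprot.dat:1234567"   # made-up data file and offset
    idx.sync()

    # a lookup is a single B-tree probe, which is why it is so fast
    print(idx["100K_RAT"])
    idx.close()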

I know I'm not asking for all that much, am I? :)

I tried looking around for existing software which does this. I've
used Glimpse before, but that was when I didn't need to be able
to search within given subfields. (The price of a couple thousand dollars
is okay with me, so long as I can test it out before committing.)

The only program which seems close is Zebra, or more generally a
Z39.50 system. But I know almost nothing about those systems, and it
doesn't look like all that many people use that specific program.

A completely different option would be eXist, which allows searches of
XML fields using MySQL. However, I would rather not have a database
server running, because that makes for a more complicated setup. Plus,
it is written in Java, which would impose an undue requirement on a
bioPYTHON project.

Any thoughts?

                    Andrew
                    dalke at acm.org
P.S.
  And I'll even work to integrate the searching with Mailman,
since I get annoyed at not being able to search the back archives.





