ANN: NUCULAR B3 Full text indexing (now on Win32 too)

Aaron Watters aaron.watters at gmail.com
Thu Feb 14 15:27:56 CET 2008


On Feb 14, 3:50 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> The main thing killing most of the search apps that I'm involved with
> is disk latency.  If Aaron is listening, I might suggest offering a
> config option to redundantly record the stored search fields with
> every search term in the index.

I'm not sure what you mean, but if I understand correctly, I think
Nucular already does this.  The signatures of the primary indices are:

Description: DocumentId x AttributeIndex x FullValue
  "Given a document Id find attributes and their values"
AttributeIndex: AttributeIndex x TruncatedValue x DocumentId
  "Given an attribute and a value (prefix) find document Ids"
AttributeWord: AttributeIndex x Word x DocumentId
  "Given an attribute and a word find documents containing
   that word in that attribute"
WordIndex: Word x DocumentId
  "Given a word find documents containing that word anywhere"
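
To make the four signatures concrete, here is a rough sketch of them as
plain in-memory Python dictionaries.  This is only an illustration and
not Nucular's actual code or API: the class name ToyIndex, its method
names, and the 8-character truncation length are all made up for the
example.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of the four index signatures (hypothetical, not Nucular code)."""

    PREFIX_LEN = 8  # assumed truncation length for AttributeIndex values

    def __init__(self):
        # Description: DocumentId -> {attribute: full value}
        self.description = {}
        # AttributeIndex: (attribute, truncated value) -> set of DocumentIds
        self.attribute_index = defaultdict(set)
        # AttributeWord: (attribute, word) -> set of DocumentIds
        self.attribute_word = defaultdict(set)
        # WordIndex: word -> set of DocumentIds
        self.word_index = defaultdict(set)

    def add_document(self, doc_id, attributes):
        """Record one document (a dict of attribute -> string value) in all four indices."""
        self.description[doc_id] = dict(attributes)
        for attr, value in attributes.items():
            truncated = value.lower()[:self.PREFIX_LEN]
            self.attribute_index[(attr, truncated)].add(doc_id)
            for word in value.lower().split():
                self.attribute_word[(attr, word)].add(doc_id)
                self.word_index[word].add(doc_id)

    def describe(self, doc_id):
        """Given a document Id, find attributes and their values."""
        return self.description.get(doc_id, {})

    def by_prefix(self, attr, prefix):
        """Given an attribute and a value prefix, find document Ids."""
        wanted = prefix.lower()
        key = wanted[:self.PREFIX_LEN]
        candidates = set()
        for (a, truncated), ids in self.attribute_index.items():
            if a == attr and truncated.startswith(key):
                candidates |= ids
        # Truncation can admit false positives when the prefix is longer
        # than PREFIX_LEN, so confirm against the full stored value.
        return {d for d in candidates
                if self.description[d][attr].lower().startswith(wanted)}

    def by_word_in(self, attr, word):
        """Given an attribute and a word, find documents with that word in that attribute."""
        return set(self.attribute_word.get((attr, word.lower()), set()))

    def by_word(self, word):
        """Given a word, find documents containing that word anywhere."""
        return set(self.word_index.get(word.lower(), set()))
```

Each query is answered from a single index without touching the others,
except that the prefix lookup goes back to Description to confirm the
full value.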

There are a lot of other possibilities which could be added fairly
easily (and I'd like to work out an abstraction layer to make it
even easier -- so you don't need to directly modify the
library code).

For instance you might want to make proximity searching faster by
indexing words in a document with their locations.  Currently
proximity searches that must filter thousands of documents containing
all the relevant words are noticeably slower than other queries.
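
As a sketch of what indexing words with their locations might look
like, the toy code below keeps a word -> document -> positions mapping
and answers a two-word proximity query from it.  It is an assumption
about one possible design, not something Nucular implements; the names
index_document and near are invented for the example.

```python
from collections import defaultdict

# Hypothetical positional index: word -> {doc_id: [word positions]}
positions = defaultdict(lambda: defaultdict(list))


def index_document(doc_id, text):
    """Record the position of every word of text under doc_id."""
    for pos, word in enumerate(text.lower().split()):
        positions[word][doc_id].append(pos)


def near(word_a, word_b, max_distance):
    """Return doc ids where word_a and word_b occur within max_distance words."""
    a_docs = positions.get(word_a.lower(), {})
    b_docs = positions.get(word_b.lower(), {})
    hits = set()
    # Only documents containing both words need to be examined, and only
    # their stored positions, not their full text.
    for doc_id in set(a_docs) & set(b_docs):
        if any(abs(pa - pb) <= max_distance
               for pa in a_docs[doc_id]
               for pb in b_docs[doc_id]):
            hits.add(doc_id)
    return hits
```

The query never has to re-read document text, but the index now stores
one entry per word occurrence rather than one per word per document.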

It's a hard problem: every additional index or index column
makes some queries faster, but it can make other queries slower,
and it always makes index builds and index files more expensive
(in build time and disk space).

It has also occurred to me that the underlying
index implementations and related data structures may be of
interest to Python programmers for all sorts of other purposes
too.

As for how Nucular compares to Sphinx or anything else:
I don't know, and I'm not the right person to evaluate that.
I'd encourage people to try out Nucular and see if it is
easy enough to use, fast enough, and feature rich enough
for the intended use.  If it isn't, maybe you should find
something else.  Suggestions and criticism are always welcome.

  -- Aaron Watters

===
"Visit New Jersey: It's not as bad as you think!"
  -- suggested New Jersey tourism slogan

http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=frighten+away+evil+spirits


