Binary trees storing huge amounts of data in nodes

Robert Roy rjroy at takingcontrol.com
Tue Nov 7 20:43:29 CET 2000


On Tue, 7 Nov 2000 16:22:02 +0100, "Thomas Weholt" <thomas at cintra.no>
wrote:

>Hi,
>
>I need to build a customized full-text search engine (yes, I've asked about
>this before) and I'd like to use something that's well supported in Python,
>either gdbm or Berkeley DB. I want to store a word as the key and data about
>its occurrences in the value part. My problem is that the amount of data in
>a node can be huge. (The data to be scanned is a collection of programming
>articles, source code, documents, etc.)
>
>How can I best do this? Could I use Berkeley DB to store words and
>pointers to wherever the data is stored? How should I organize the data
>for the best response time? The generated index is pretty static, i.e. data
>is appended, but not often removed or moved.
>
>I've looked at Ransacker, but it doesn't seem suited to the amount of data I
>need to scan. I'm using a PostgreSQL database to store my data, so if
>anybody knows how I could best build a full-text search engine in Python
>using PostgreSQL as the back end, that would be just great.
>
>Thanks.
>
>Best regards,
>Thomas
>
>

Full-text search engines are non-trivial. You might be better off
taking a look at some of the projects out there and seeing if you can
adapt them or use them as-is for your needs. For example, SWISH++ is a
GPL'd full-text indexer that, while primarily intended for HTML, can
index many other text (and some binary) file formats. It can run as a
search server that listens on a Unix domain socket for queries.
You could easily build a Python interface to this.
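
A rough sketch of what such a Python interface might look like. This is
not tested against SWISH++ itself: the socket path, the function names
and the exchange (send one query line, read until the server closes the
connection) are all assumptions for illustration, so check the SWISH++
documentation for the real protocol and result format.

```python
import socket


def query_search_server(socket_path, query, timeout=5.0):
    """Send one query line to a server listening on a Unix domain socket
    and return everything it writes back before closing the connection.

    ASSUMPTION: the server expects a single newline-terminated query and
    closes the connection when it has finished sending results.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8") + b"\n")

        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)

    return b"".join(chunks).decode("utf-8", errors="replace")


def result_lines(response):
    """Split a raw response into non-empty lines (hypothetical helper;
    the meaning of each line depends on the server's output format)."""
    return [line for line in response.splitlines() if line.strip()]
```

You would call it along the lines of
query_search_server("/tmp/search.socket", "binary and tree"), where both
the path and the query syntax are placeholders for whatever your server
is configured to use.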

http://www.best.com/~pjl/software/swish/

Related to the above is:
http://sunsite.berkeley.edu/SWISH-E/


Bob


