Binary trees storing huge amounts of data in nodes

Thomas Weholt thomas at cintra.no
Wed Nov 8 09:01:02 CET 2000


"Robert Roy" <rjroy at takingcontrol.com> wrote in message
news:3a0858aa.256973828 at news1.on.sympatico.ca...
> On Tue, 7 Nov 2000 16:22:02 +0100, "Thomas Weholt" <thomas at cintra.no>
> wrote:
>
> >Hi,
> >
> >I need to build a customized full-text search engine (yes, I've asked
> >about this before) and I'd like to use something that's well supported
> >in Python, either gdbm or Berkeley DB. I want to store a word as the key
> >and data about its occurrences in the value part. My problem is that the
> >amount of data in a node can be huge. (The data to be scanned is a
> >collection of programming articles, source code, documents, etc.)
> >
> >How can I best do this? Could I use Berkeley DB for storing words and
> >pointers to wherever the data is stored? How should I organize the data
> >for best response time? The generated index is pretty static, i.e. data
> >is appended, but not often removed or moved.
> >
> >I've looked at Ransacker but it doesn't seem to fit the amount of data I
> >need to scan. I'm using a PostgreSQL database to store my data, so if
> >anybody knows how I could best make a full-text search engine in Python
> >using PostgreSQL as the back-end, that would be just great.
> >
> >Thanks.
> >
> >Best regards,
> >Thomas
> >
> >
>
> Full-text search engines are non-trivial. You might be better off
> taking a look at some of the projects out there and seeing if you can
> adapt them or use them as-is for your needs. For example, SWISH++ is a
> GPL'd full-text indexer that, while primarily intended for HTML, can
> index many other text (and some binary) file formats. It can run as a
> search server where it listens on a Unix domain socket for queries.
> You could easily build a Python interface to this.
>
> http://www.best.com/~pjl/software/swish/
>
> related to the above is:
> http://sunsite.berkeley.edu/SWISH-E/
>
>
> Bob

Well, "easily"? Since I don't know C or C++, that would probably be far from
easy. I'm looking at pymifluz to see if I understand what's going on. My
problem is that I don't intend to index HTML or plain-text files on disk,
but data in a PostgreSQL database. PostgreSQL also comes with some
experimental full-text indexing source that I'm going to look at, but again
it's in C/C++ and I don't think there are any Python bindings for it.
(Hopefully the author of PyGreSQL will include bindings for it in his
excellent module for PostgreSQL ;-> )

Thanks anyway. I'll take a look at SWISH-E too.

Best regards,
Thomas Weholt
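
For illustration only, the "words plus pointers" layout asked about in the
quoted question could look roughly like the sketch below. It is not taken
from any of the projects mentioned in this thread. It assumes a modern
Python 3, whose standard dbm module stands in for gdbm or Berkeley DB, and
every name in it (WordIndex, words, postings.bin, the record layouts) is
hypothetical. The keyed store holds only small (offset, count) pointers per
word; the bulky occurrence data lives in an append-only file, which suits an
index that mostly grows.

```python
import dbm
import os
import re
import struct
import tempfile

# Hypothetical on-disk layout (an assumption, not from the thread):
#   words         dbm file: word -> packed (offset, count) pointer records
#   postings.bin  append-only file of packed (doc_id, position) records
CHUNK = struct.Struct("<QI")    # offset into postings.bin, number of postings
POSTING = struct.Struct("<II")  # document id, word position within document

WORD_RE = re.compile(r"[a-z0-9_]+")


class WordIndex:
    def __init__(self, directory):
        self.db = dbm.open(os.path.join(directory, "words"), "c")
        self.postings = open(os.path.join(directory, "postings.bin"), "a+b")

    def add_document(self, doc_id, text):
        # Group positions by word so each word gets one chunk per document.
        positions = {}
        for pos, match in enumerate(WORD_RE.finditer(text.lower())):
            positions.setdefault(match.group(), []).append(pos)
        for word, plist in positions.items():
            # Append the postings, then record where they landed.
            self.postings.seek(0, os.SEEK_END)
            offset = self.postings.tell()
            self.postings.write(
                b"".join(POSTING.pack(doc_id, p) for p in plist))
            key = word.encode("utf-8")
            old = self.db[key] if key in self.db else b""
            self.db[key] = old + CHUNK.pack(offset, len(plist))
        self.postings.flush()

    def lookup(self, word):
        # Return every (doc_id, position) pair recorded for the word.
        key = word.lower().encode("utf-8")
        if key not in self.db:
            return []
        result = []
        for offset, count in CHUNK.iter_unpack(self.db[key]):
            self.postings.seek(offset)
            data = self.postings.read(count * POSTING.size)
            result.extend(POSTING.iter_unpack(data))
        return result

    def close(self):
        self.db.close()
        self.postings.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        index = WordIndex(directory)
        index.add_document(1, "Python binds to Berkeley DB")
        index.add_document(2, "Berkeley DB stores keys; python reads them")
        print(index.lookup("python"))
        index.close()
```

With rows coming from PostgreSQL instead of files, doc_id would presumably
be the row's primary key, so a lookup yields keys that can be fetched with an
ordinary SELECT. The value stored per word still grows by 12 bytes for every
document containing it, so very common words would need further chunking or
a stop-word list.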
