HTML search engine written in Python - is there one?

Robert Roy rjroy at takingcontrol.com
Fri May 19 12:55:45 EDT 2000


A full-featured full-text indexing solution is not trivial. It all
depends on what kind of queries you want to perform. If all you want
to do are queries such as "find all files which contain the word
'dog'", that can be done quite easily, probably in under 200 lines of
code for a trivial solution using sgmllib and gdbm. However, if you
want to do phrase searching, stem searching or wild-card searching,
then it gets really complicated in a hurry.
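
To give a feel for the trivial case, here is a rough sketch of a
word-to-files index. It is only an illustration under a few
assumptions: it uses html.parser from current Python in place of
sgmllib (which is no longer in the standard library), it keeps the
index in an in-memory dict rather than gdbm, and the names
TextExtractor, build_index and search are made up for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth of script/style elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_words(html):
    """Return the set of lower-cased words found in the document text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


def build_index(docs):
    """docs maps a file name to its HTML source; returns word -> file names."""
    index = defaultdict(set)
    for name, html in docs.items():
        for word in extract_words(html):
            index[word].add(name)
    return index


def search(index, word):
    """Find all files which contain the given word (case-insensitive)."""
    return sorted(index.get(word.lower(), ()))
```

Note that this only answers single-word queries. Phrase searching
would need word positions stored in the index as well, which is where
the extra complexity starts.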

Another factor is how many files you are dealing with. Indices often
run 4-8X the size of the indexed files. And do you want to dynamically
update the index, or are you happy just re-indexing the whole works
periodically? A static index is somewhat easier to build than a fully
dynamic one.
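
One reason the dynamic case is harder: before re-indexing a changed
file you have to remove its old entries, which means remembering
which words each file contributed. The sketch below shows one
possible way to do that. The class DynamicIndex and its methods are
hypothetical names for this illustration, and a real index would keep
this data on disk rather than in memory.

```python
from collections import defaultdict


class DynamicIndex:
    """In-memory index that supports updating or removing single files."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> file names
        self.doc_words = {}               # file name -> words it contributed

    def remove(self, name):
        """Drop every entry that the named file contributed."""
        for word in self.doc_words.pop(name, ()):
            files = self.postings[word]
            files.discard(name)
            if not files:
                del self.postings[word]  # keep the index free of empty entries

    def update(self, name, words):
        """Add a new file, or replace the entries of a changed one."""
        self.remove(name)
        words = set(words)
        self.doc_words[name] = words
        for word in words:
            self.postings[word].add(name)

    def search(self, word):
        return sorted(self.postings.get(word, ()))
```

The extra file-to-words map is part of why a dynamic index costs more
space and code than one that is simply rebuilt from scratch.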

An interesting GPL'd indexing package is SWISH++; see:
http://www.best.com/~pjl/software/swish/

A good tactic might be to use this for your indexing, run the
search engine as a daemon, and build a Python interface that talks to
it via Unix domain sockets, or alternatively shell out to it and
capture and parse its output.
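
The shelling-out variant could look roughly like the sketch below.
Everything specific in it is an assumption: the command to run is
passed in by the caller, and the output format (one "<rank> <path>"
result per line, with "#" comment lines) is invented for the example,
so check the SWISH++ documentation for the real command and format.
The function names run_search and parse_results are hypothetical too.

```python
import subprocess


def parse_results(output):
    """Parse result lines of the ASSUMED form '<rank> <path>'.

    Blank lines and lines starting with '#' are skipped.
    """
    results = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rank, path = line.split(None, 1)
        results.append((int(rank), path))
    return results


def run_search(command, query):
    """Run an external search command and return parsed (rank, path) pairs.

    command is the argument list of the search program (supplied by the
    caller); the query is appended as the last argument.
    """
    completed = subprocess.run(
        list(command) + [query],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return parse_results(completed.stdout)
```

Passing the query as a separate list element, rather than building a
shell string, avoids quoting problems with user-supplied search terms.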

Good luck.
Bob

On Fri, 19 May 2000 12:41:58 +0100, "Simon Brunning"
<sbrunning at trisystems.co.uk> wrote:

>I need something that will build an index of the text content of a
>number of HTML files, and allow you to run queries on the index.
>Does anyone know of such a thing, or am I going to have to write my
>own?
>
>--
>Cheers,
>Simon Brunning
>TriSystems Ltd.
>sbrunning at trisystems.co.uk
>The opinions expressed are mine, and are not necessarily those of my
>employer. All comments provided "as is" with no warranties of any
>kind whatsoever.

More information about the Python-list mailing list