[Tutor] Re: [quicky intro to vector search engines]
Alexandre Ratti
alex@gabuzomeu.net
Sun Jul 20 07:08:03 2003
Hello Danny and All,
>Date: Tue, 1 Jul 2003 12:30:12 -0700 (PDT)
>From: Danny Yoo <dyoo@hkn.eecs.berkeley.edu>
>Subject: Re: [Tutor] writing a search engine [quicky intro to vector
>search engines]
Vector search engines looked fun; I just had to give it a try :-)
I uploaded a basic implementation to:
http://www.gabuzomeu.net/alex/py/vsse/SearchEngine.zip
Feedback is welcome.
>The general idea is to turn each document in our collection into a vector
>in N-dimensional space. And N can be potentially really large: it's the
>number of unique words in our whole document collection.
I used Numeric arrays to store this data; the individual arrays are kept in
a dictionary keyed by file name.
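Roughly, the indexing works like this (a simplified sketch; the names here
are illustrative, not the ones actually used in SearchEngine.zip):

###
from Numeric import zeros

def buildIndex(fileNames):
    # First pass: assign each unique word a dimension number.
    vocabulary = {}
    docs = {}
    for name in fileNames:
        words = open(name).read().split()
        docs[name] = words
        for w in words:
            if w not in vocabulary:
                vocabulary[w] = len(vocabulary)
    # Second pass: one Numeric array per file, keyed by file name.
    index = {}
    for name, words in docs.items():
        vector = zeros(len(vocabulary))
        for w in words:
            vector[vocabulary[w]] += 1   # store word counts, not just 0/1
        index[name] = vector
    return vocabulary, index
###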
As a test, I indexed the Linux Howto collection (text-only version). That's
304 files, about 25 MB of data. When searching the collection, the results I
get look reasonable:
>>> app.search("apache", 0.15)
Searching in 304 files...
---------------------
Apache-Overview-HOWTO.txt 60.68%
Apache-Compile-HOWTO.txt 34.38%
Apache-WebDAV-LDAP-HOWTO.txt 33.25%
WWW-HOWTO.txt 16.82%
>>> app.search("beowulf cluster", 0.1)
Searching in 304 files...
---------------------
Beowulf-HOWTO.txt 34.56%
SSI-UML-HOWTO.txt 26.92%
openMosix-HOWTO.txt 18.16%
Cluster-HOWTO.txt 12.80%
Parallel-Processing-HOWTO.txt 11.69%
CPU-Design-HOWTO.txt 10.59%
Memory usage is quite high (about 100 MB for the PythonWin process). When I
save the index instance to a file as a binary pickle, the resulting file is
quite large too (70 MB).
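For reference, the saving step boils down to a binary cPickle dump, roughly:

###
import cPickle

def saveIndex(index, fileName):
    f = open(fileName, "wb")
    cPickle.dump(index, f, 1)   # 1 selects the binary pickle format
    f.close()
###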
>With this numbering, we can now transform our documents into vectors in
>13-dimensional space.
>
>###
> >>> def transformIntoVector(words):
> ...     vector = [0] * 13
> ...     for w in words:
> ...         vector[d[w]] = 1
> ...     return vector
>###
We may have a problem when a query word does not appear in the index. For
now, I just drop it, print a message, and query the index with the remaining
query words.
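In code, that filtering step looks something like this (a sketch; the helper
name is mine, and `vocabulary` is the word-to-dimension mapping from above):

###
def filterQueryWords(words, vocabulary):
    # Keep only the query words that appear in the index.
    known = []
    for w in words:
        if w in vocabulary:
            known.append(w)
        else:
            print "Dropping unknown query word:", w
    return known
###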
>And once we do this, then we can do a search by finding the vectors in our
>document collection that are most "similar" to our query vector. One cute
>way to do a similarity check is to simply "multiply" a query vector,
>componentwise, against our document vectors.
>###
> >>> def v_multiply(v1, v2):
> ...     sum = 0
> ...     for x, y in zip(v1, v2):
> ...         sum += x * y
> ...     return sum
>###
Here, I just used the formula from the Perl article you quoted*. One
difference from your example is that I store the number of occurrences of
each word (or stem) in the document vector, instead of just a 0/1 flag.
* http://www.perl.com/lpt/a/2003/02/19/engine.html
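Concretely, that formula is the cosine of the angle between the two vectors.
A minimal sketch with Numeric (the function name is mine; I assume the
count-based vectors described above):

###
from math import sqrt
from Numeric import dot

def cosine(v1, v2):
    # Angle-based similarity from the Perl article:
    # dot(v1, v2) / (|v1| * |v2|), a value between 0.0 and 1.0.
    return dot(v1, v2) / (sqrt(dot(v1, v1)) * sqrt(dot(v2, v2)))
###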
>Since our query was the word "emergency", we matched against our first
>document, but got no hits in our second document.
>self.matrix is a 'pysparse' matrix of all my document vectors, using some
>libraries from the PySparse project,
>
> http://www.python.org/pycon/papers/pysparse.html
I haven't tried using pysparse yet.
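If memory stays a problem, I may first try a plain dictionary as a poor
man's sparse vector before reaching for pysparse; a sketch of the dot
product in that representation:

###
def sparseDot(d1, d2):
    # d1 and d2 map dimension number -> word count; only dimensions
    # present in both vectors contribute to the dot product.
    total = 0
    for dim, count in d1.items():
        if dim in d2:
            total += count * d2[dim]
    return total
###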
>(The real reason I'm working on this code is to do document clustering for
>a digital library
While looking for a pure-Python stemmer, I came across this reference:
Machine Learning: Text classification report, Ludvig Omholt, 2001/06/01:
http://ludde.net/ml/index.html
Some Python code is included:
http://ludde.net/ml/implementation.html
Cheers.
Alexandre