[Tutor] Re: [quicky intro to vector search engines]
Alexandre Ratti
alex@gabuzomeu.net
Sun Jul 20 07:08:03 2003
Hello Danny and All,
>Date: Tue, 1 Jul 2003 12:30:12 -0700 (PDT)
>From: Danny Yoo <dyoo@hkn.eecs.berkeley.edu>
>Subject: Re: [Tutor] writing a search engine [quicky intro to vector
>search engines]
Vector search engines looked fun; I just had to give it a try :-)
I uploaded a basic implementation to:
http://www.gabuzomeu.net/alex/py/vsse/SearchEngine.zip
Feedback is welcome.
>The general idea is to turn each document in our collection into a vector
>in N-dimensional space. And N can be potentially really large: it's the
>number of unique words in our whole document collection.
I used Numeric arrays to store this data; the individual arrays are kept in
a dictionary keyed by file name.
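Roughly, the indexing works like this (a simplified sketch; the names here
are illustrative, not the ones actually used in SearchEngine.zip):

###
from Numeric import zeros

def buildIndex(fileNames):
    # First pass: assign each unique word a dimension number.
    vocabulary = {}
    docs = {}
    for name in fileNames:
        words = open(name).read().split()
        docs[name] = words
        for w in words:
            if w not in vocabulary:
                vocabulary[w] = len(vocabulary)
    # Second pass: one Numeric array per file, keyed by file name.
    index = {}
    for name, words in docs.items():
        vector = zeros(len(vocabulary))
        for w in words:
            vector[vocabulary[w]] += 1   # store word counts, not just 0/1
        index[name] = vector
    return vocabulary, index
###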
As a test, I indexed the Linux Howto collection (text-only version). That's
304 files, about 25 MB of data. When searching the collection, the results I
get look reasonable:
>>> app.search("apache", 0.15)
Searching in 304 files...
---------------------
Apache-Overview-HOWTO.txt 60.68%
Apache-Compile-HOWTO.txt 34.38%
Apache-WebDAV-LDAP-HOWTO.txt 33.25%
WWW-HOWTO.txt 16.82%
>>> app.search("beowulf cluster", 0.1)
Searching in 304 files...
---------------------
Beowulf-HOWTO.txt 34.56%
SSI-UML-HOWTO.txt 26.92%
openMosix-HOWTO.txt 18.16%
Cluster-HOWTO.txt 12.80%
Parallel-Processing-HOWTO.txt 11.69%
CPU-Design-HOWTO.txt 10.59%
Memory usage is quite high (about 100 MB for the PythonWin process). When I
save the index instance to a file as a binary pickle, the resulting file is
quite large too (70 MB).
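For reference, the saving step boils down to a binary cPickle dump, roughly:

###
import cPickle

def saveIndex(index, fileName):
    f = open(fileName, "wb")
    cPickle.dump(index, f, 1)   # 1 selects the binary pickle format
    f.close()
###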
>With this numbering, we can now transform our documents into vectors in
>13-dimensional space.
>
>###
> >>> def transformIntoVector(words):
> ...     vector = [0] * 13
> ...     for w in words:
> ...         vector[d[w]] = 1
> ...     return vector
>###
We may have a problem when a query word does not appear in the index. For
now, I just drop it, print a message, and query the index with the remaining
query words.
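In code, that filtering step looks something like this (a sketch; the helper
name is mine, and `vocabulary` is the word-to-dimension mapping from above):

###
def filterQueryWords(words, vocabulary):
    # Keep only the query words that appear in the index.
    known = []
    for w in words:
        if w in vocabulary:
            known.append(w)
        else:
            print "Dropping unknown query word:", w
    return known
###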
>And once we do this, then we can do a search by finding the vectors in our
>document collection that are most "similar" to our query vector. One cute
>way to do a similarity check is to simply "multiply" a query vector,
>componentwise, against our document vectors.
>###
> >>> def v_multiply(v1, v2):
> ...     sum = 0
> ...     for x, y in zip(v1, v2):
> ...         sum += x * y
> ...     return sum
>###
Here, I just used the formula from the Perl article you quoted*. One
difference from your example is that I store the number of occurrences of
each word (or stem) in the document vector, instead of just a 0/1 flag.
* http://www.perl.com/lpt/a/2003/02/19/engine.html
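Concretely, that formula is the cosine of the angle between the two vectors.
A minimal sketch with Numeric (the function name is mine; I assume the
count-based vectors described above):

###
from math import sqrt
from Numeric import dot

def cosine(v1, v2):
    # Angle-based similarity from the Perl article:
    # dot(v1, v2) / (|v1| * |v2|), a value between 0.0 and 1.0.
    return dot(v1, v2) / (sqrt(dot(v1, v1)) * sqrt(dot(v2, v2)))
###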
>Since our query was the word "emergency", we matched against our first
>document, but got no hits in our second document.
>self.matrix is a 'pysparse' matrix of all my document vectors, using some
>libraries from the PySparse project,
>
> http://www.python.org/pycon/papers/pysparse.html
I haven't tried using pysparse yet.
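If memory stays a problem, I may first try a plain dictionary as a poor
man's sparse vector before reaching for pysparse; a sketch of the dot
product in that representation:

###
def sparseDot(d1, d2):
    # d1 and d2 map dimension number -> word count; only dimensions
    # present in both vectors contribute to the dot product.
    total = 0
    for dim, count in d1.items():
        if dim in d2:
            total += count * d2[dim]
    return total
###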
>(The real reason I'm working on this code is to do document clustering for
>a digital library
While looking for a pure-Python stemmer, I came across this reference:
Machine Learning: Text classification report, Ludvig Omholt, 2001/06/01:
http://ludde.net/ml/index.html
Some Python code is included:
http://ludde.net/ml/implementation.html
Cheers.
Alexandre