[Tutor] writing a search engine [quicky intro to vector search engines]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue Jul 1 15:31:09 2003


On Tue, 1 Jul 2003, Mathias Mamsch wrote:

> > Any suggestions on where to start?  I plan on writing one for a small
> > site of mine.  It will basically be "google-style"- simple and
> > functional.


Hi Mathias,


If you're ambitious, you can try writing a vector search engine.  *grin*


The general idea is to turn each document in our collection into a vector
in N-dimensional space.  And N can potentially be really large: it's the
number of unique words in our whole document collection.  No wimpy 3D
space here!



For example, let's say we had the following documents:

###
docs = ["This is a test of the emergency broadcast system",
        "This is neat, what kind of system is it?"]
###


Given a set of documents, we can map each unique word to a numeric
identifier.  Perhaps we can use a dictionary, like this:

###
d = { 'this' : 0,
      'is' : 1,
      'a' : 2,
      'test' : 3,
      'of' : 4,
      'the' : 5,
      'emergency' : 6,
      'broadcast' : 7,
      'system' : 8,
      'neat' : 9,
      'what' : 10,
      'kind' : 11,
      'it' : 12 }
###
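

(We don't have to build that numbering by hand, of course.  Here's a
rough sketch of building the same dictionary automatically from 'docs',
assuming we lowercase each word and strip off punctuation first:)

###
>>> d = {}
>>> for doc in docs:
...     for w in doc.split():
...         w = w.lower().strip('.,?!')    ## normalize the word
...         if w not in d:
...             d[w] = len(d)              ## hand out the next free id
...
>>> len(d)
13
###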



With this numbering, we can now transform our documents into vectors in
13-dimensional space.

###
>>> def transformIntoVector(words):
...     vector = [0] * 13
...     for w in words:
...         ## lowercase and strip punctuation, so that words like
...         ## "This" and "neat," match the keys in our dictionary
...         w = w.lower().strip('.,?!')
...         vector[d[w]] = 1
...     return vector
...
>>> doc_vectors = [transformIntoVector(doc.split()) for doc in docs]
>>> doc_vectors[0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
>>> doc_vectors[1]
[1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
###



Now we can store our library of document vectors somewhere handy on disk,
or perhaps just keep it all in memory.  A search then becomes a math
problem: we take our query and transform it into a "query vector",

###
>>> query = transformIntoVector(["emergency"])  ## I want all documents
>>>                                             ## that have "emergency".
>>> query
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
###



Once we do this, we can do a search by finding the vectors in our
document collection that are most "similar" to our query vector.  One cute
way to do a similarity check is to simply "multiply" the query vector,
componentwise, against each document vector, and add up the products.


###
>>> def v_multiply(v1, v2):
...     ## multiply componentwise and add: the dot product of v1 and v2
...     total = 0
...     for x, y in zip(v1, v2):
...         total += x * y
...     return total
...
>>> query
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
>>> doc_vectors
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
 [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]]
>>>
>>> v_multiply(query, doc_vectors[0])
1
>>> v_multiply(query, doc_vectors[1])
0
###


Shades of linear algebra!  *grin* Since our query was the word
"emergency", we matched against our first document, but got no hits in our
second document.
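

To turn this into an actual search, we can score every document against
the query and sort by score.  Here's a rough sketch that just glues the
pieces above together (the 'search' function is only for illustration):

###
>>> def search(query_text):
...     ## score each document against the query: the more query words
...     ## a document shares, the higher its score
...     qv = transformIntoVector(query_text.split())
...     scores = [(v_multiply(qv, dv), doc)
...               for (dv, doc) in zip(doc_vectors, docs)]
...     scores.sort()
...     scores.reverse()                   ## best matches first
...     return [pair for pair in scores if pair[0] > 0]
...
>>> search("emergency system")
[(2, 'This is a test of the emergency broadcast system'), (1, 'This is neat, what kind of system is it?')]
###

A real vector search engine refines the scoring: for example, it can
divide each score by the lengths of the two vectors (the "cosine"
similarity), so that long documents don't win just by having more words.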



The explanation above is oversimplified, but I hope it makes some sense.
*grin*  I've taken some of the ideas for the code from,

    http://www.perl.com/lpt/a/2003/02/19/engine.html

and written them as concept code in Python.  My code is so rough at the
moment that I really don't want to show it to anyone... *grin* But here's
a small sample of what it looks like:


###
class Engine:
    def search(self, query_text, threshold=0.5):
        ## turn the query text into a query vector
        query_vector = self.vectorizer.makeVector(query_text)
        ## one score slot for every column of the matrix (one
        ## column per document)
        result_vector = Numeric.zeros(self.matrix.shape[1],
                                      typecode='d')
        ## score all the documents at once: multiply the transposed
        ## matrix against the query vector
        self.matrix.matvec_transp(query_vector, result_vector)

        ## keep the documents whose score clears the threshold
        return map(self.mapToDocId,
                   Numeric.nonzero(Numeric.greater(result_vector,
                                                   threshold)))
###

self.matrix is a 'pysparse' matrix of all my document vectors, using some
libraries from the PySparse project,

    http://www.python.org/pycon/papers/pysparse.html

That being said, the code is UGLY.  I promise I'll clean it up when I get
the chance...  and I have to admit that my linear algebra is weak, so
there are some parts of the code that might be suboptimal.  I really want
to learn more about linear algebra, now that I see there's something I
can apply it to!


(The real reason I'm working on this code is to do document clustering for
a digital library, but I haven't yet found a fast way to do a matrix
multiply on a 10000x60000 matrix without running out of memory.  Perhaps I
should learn about eigenvectors.)
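

(One trick that helps with memory, at least, is to avoid building the
dense vectors at all: since almost every entry is zero, each document
vector can be stored as a dictionary of only its nonzero entries.
That's the idea behind using a sparse matrix; here's a tiny plain-Python
sketch of the principle, not of what pysparse actually does internally:)

###
>>> def toSparse(vector):
...     ## keep only the nonzero entries, as {index: value}
...     sparse = {}
...     for i in range(len(vector)):
...         if vector[i]:
...             sparse[i] = vector[i]
...     return sparse
...
>>> def sparse_multiply(sv1, sv2):
...     ## dot product of two sparse vectors: we only touch the entries
...     ## that are actually nonzero in the first one
...     total = 0
...     for i, value in sv1.items():
...         total += value * sv2.get(i, 0)
...     return total
...
>>> sparse_docs = [toSparse(v) for v in doc_vectors]
>>> sparse_multiply(toSparse(query), sparse_docs[0])
1
>>> sparse_multiply(toSparse(query), sparse_docs[1])
0
###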



If you're interested, you can download a tarball of it here:

    http://hkn.eecs.berkeley.edu/~dyoo/python/vector_search.tar.gz


But I am not linking it on my website until I add documentation and good
comments into the code.  *grin*


Anyway, I hope this helps!