[Tutor] writing a search engine [quicky intro to vector search
engines]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Tue Jul 1 15:31:09 2003
On Tue, 1 Jul 2003, Mathias Mamsch wrote:
> > Any suggestions on where to start? I plan on writing one for a small
> > site of mine. It will basically be "google-style"- simple and
> > functional.
Hi Mathias,
If you're ambitious, you can try writing a vector search engine. *grin*
The general idea is to turn each document in our collection into a vector
in N-dimensional space. And N can be potentially really large: it's the
number of unique words in our whole document collection. No wimpy 3d
space here!
For example, let's say we had the following documents:
###
docs = ["This is a test of the emergency broadcast system",
"This is neat, what kind of system is it?"]
###
Given a set of documents, we can map each unique word to some kind of
numeric identifier. Perhaps we can use some kind of dictionary, like
this:
###
d = { 'this' : 0,
'is' : 1,
'a' : 2,
'test' : 3,
'of' : 4,
'the' : 5,
'emergency' : 6,
'broadcast' : 7,
'system' : 8,
'neat' : 9,
'what' : 10,
'kind' : 11,
'it' : 12 }
###
With this numbering, we can now transform our documents into vectors in
13-dimensional space.
###
>>> def transformIntoVector(words):
... vector = [0] * 13
... for w in words:
... vector[d[w]] = 1
... return vector
...
>>> doc_vectors = [transformIntoVector(doc.split()) for doc in docs]
>>> doc_vectors[0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
>>> doc_vectors[1]
[1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
###
And now we store our library of document vectors somewhere handy on disk,
or perhaps just keep it in memory. A query search, then, becomes a math
problem: we take our query, transform it into a "query vector",
###
>>> query = transformIntoVector(["emergency"]) ## I want all documents
>>> ## that have "emergency".
>>> query
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
###
And once we do this, then we can do a search by finding the vectors in our
document collection that are most "similar" to our query vector. One cute
way to do a simliarity check is to simply "multiply" a query vector,
componentwise, against our document vectors.
###
>>> def v_multiply(v1, v2):
... sum = 0
... for x, y in zip(v1, v2):
... sum += x * y
... return sum
...
>>> query
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
>>> doc_vectors
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]]
>>>
>>> v_multiply(query, doc_vectors[0])
1
>>> v_multiply(query, doc_vectors[1])
0
###
Shades of linear algebra! *grin* Since our query was the word
"emergency", we matched against our first document, but got no hits in our
second document.
The explanation about is oversimplified, but I hope it makes some sense.
*grin* I've taken some of the ideas of the code from,
http://www.perl.com/lpt/a/2003/02/19/engine.html
and written them as concept code in Python. My code is so rough at the
moment that I really don't want to show it to anyone... *grin* But here's
a small sample of what it looks like:
###
class Engine:
def search(self, query_text, threshold=0.5):
query_vector = self.vectorizer.makeVector(query_text)
result_vector = Numeric.zeros(self.matrix.shape[1],
typecode='d')
self.matrix.matvec_transp(query_vector, result_vector)
return map(self.mapToDocId,
Numeric.nonzero(Numeric.greater(result_vector,
threshold)))
###
self.matrix is a 'pysparse' matrix of all my document vectors, using some
libraries from the PySparse project,
http://www.python.org/pycon/papers/pysparse.html
That being said, the code is UGLY. I promise I'll clean up the code when
I get the chance... and I have to admit that my linear algebra is weak,
so there are some parts of the code that might be suboptimal. I really
want to learn more about linear algebra now, now that I see that there's
something I can apply it with!
(The real reason I'm working on this code is to do document clustering for
a digital library, but I haven't yet found a fast way to do a matrix
multiply on a 10000x60000 matrix without running out of memory. Perhaps I
should learn about eigenvectors.)
If you're interested, you can download a tarball of it here:
http://hkn.eecs.berkeley.edu/~dyoo/python/vector_search.tar.gz
But I am not linking it on my website until I add documentation and good
comments into the code. *grin*
Anyway, I hope this helps!