[Tutor] Search theory for luddites?

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Dec 5 01:45:20 CET 2005


> My current planned approach is to have each file added read and keywords
> extracted (i.e. variable names, libraries imported etc.) and used as
> keys in a dictionary, and each key points to a list of files containing
> it (files will be stored in a zip or similar).
>
> i.e.
>
> words = {
>
> "imaplib": ["X.py"],
> "smtplib": ["X.py", "Y.py"]
>
> }
>
> And if I do a search for "smtplib" or "imaplib" it'll return X.py
> first, as it contains both, and Y.py second.
>
> Is there a more efficient approach? I don't think a relational DB really
> suits this sort of task at present, but I'm wondering about search
> times.


Hi Liam,

That sounds fine, and it's a popular approach.  What you have there is
known as an "inverted index":

    http://www.nist.gov/dads/HTML/invertedIndex.html

and it's the way that many search engines know how to link up keywords to
their respective documents.


> I've googled for search theory, and hit c2's wiki up, but I'm having no
> luck at finding applicable, understandable theory... a lot of stuff
> about trees and nodes and paths through graphs, but I can't correlate
> that to my problem.

There's a really short-and-sweet book called "Understanding Search
Engines: Mathematical Modeling and Text Retrieval" that covers some core
ideas:

    http://www.cs.utk.edu/~berry/usebook/

I'd heartily recommend that one; they do talk about the theory, but from a
practitioner's point of view, so it's quite readable.  And it is very thin
and easy to carry. *grin*





More information about the Tutor mailing list