[Tutor] Search theory for luddites?

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Dec 5 01:45:20 CET 2005

> My current planned approach is to have each file added read and keywords
> extracted (i.e. variable names, libraries imported etc.) and used as
> keys in a dictionary, and each key points to a list of files containing
> it (files will be stored in a zip or similar).
> i.e.
> words = {
> "imaplib": ["X.py"],
> "smtplib": ["X.py", "Y.py"]
> }
> And if I do a search for "smtplib" or "imaplib" it'll return X.py
> first, as it contains both, and Y.py second.
> Is there a more efficient approach? I don't think a relational DB really
> suits this sort of task at present, but I'm wondering about search
> times.

Hi Liam,

That sounds fine, and it's a popular approach.  What you have there is
known as an "inverted index": it's the way that many search engines link
up keywords to their respective documents.
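A minimal sketch of that idea in Python (the filenames and keyword lists
below are just the ones from your example, and the function names are my
own invention):

```python
from collections import defaultdict

def build_index(docs):
    """docs maps filename -> extracted keywords; returns the inverted
    index: keyword -> list of filenames containing that keyword."""
    index = defaultdict(list)
    for filename, keywords in docs.items():
        for word in set(keywords):      # list each file once per keyword
            index[word].append(filename)
    return index

def search(index, query_words):
    """Rank files by how many of the query words they contain,
    most matches first."""
    scores = defaultdict(int)
    for word in query_words:
        for filename in index.get(word, []):
            scores[filename] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {"X.py": ["imaplib", "smtplib"],
        "Y.py": ["smtplib"]}
index = build_index(docs)
print(search(index, ["smtplib", "imaplib"]))   # X.py first, then Y.py
```

Lookups against the dictionary are constant-time per keyword, so search
cost grows with the number of query words and matching files, not with
the total number of files indexed.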

> I've googled for search theory, and hit c2's wiki up, but I'm having no
> luck at finding applicable, understandable theory... a lot of stuff
> about trees and nodes and paths through graphs, but I can't correlate
> that to my problem.

There's a really short-and-sweet book called "Understanding Search
Engines: Mathematical Modeling and Text Retrieval" that covers some core
ideas in text retrieval.
I'd heartily recommend that one; they do talk about the theory, but from a
practitioner's point of view, so it's quite readable.  And it is very thin
and easy to carry. *grin*