[Tutor] Inverted Index Algorithm
Steve Nelson
sanelson at gmail.com
Sat Apr 1 09:10:10 CEST 2006
On 3/31/06, Kent Johnson <kent37 at tds.net> wrote:
> Steve Nelson wrote:
>
> Do you need help getting started with Python or with inverted indexing
> in particular?
Sorry - I should have been clearer. I'm reasonably confident in
Python, and if I get stuck with that side of things will ask for help.
I was more struggling with the Inverted Indexing bit.
I think I want to write a program that will do text searches on
documents, to learn about and compare the two approaches I've heard of
- Inverted Indexing and Signature Files. SO I think I am asking for
help at the logical / implementation level.
My understanding so far goes something like this:
Suppose I have three documents - as a trivial but fun example, we
could make them song lyrics. The simplest thing that could possibly
work would be to supply one or two words and literally scan every word
in the song for a match. This has two disadvantages - firstly it is
likely to produce false positives, and secondly it is likely to be
very inefficient.
The next step would be to introduce an index. I think again, the
simplest thing that could possibly work would be a literal index of
every word and every document in which it appears. This would save
processing time, but wouldn't be very intelligent.
This is where I think the inverted indexing comes in. As I understand
it we can now produce an index of key words, with document name and
location in document for each key word. This makes the search more
involved, and more intelligent. Finally we could have some logic that
did some set analysis to return only results that make sense.
Am I thinking along the right lines, or have I misunderstood? Could
someone perhaps come up with some code snippets that make clearer
examples?
I also need to think about signature files, but I havn't really any
idea how these work.
Thanks all!
S.
> Kent
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>
More information about the Tutor
mailing list