itertools.intersect?

David M. Wilson dw at botanicus.net
Wed Jun 10 22:43:51 EDT 2009


On Jun 11, 3:05 am, Chris Rebert <c... at rebertia.com> wrote:
> On Wed, Jun 10, 2009 at 5:53 PM, Mensanator<mensana... at aol.com> wrote:
> > On Jun 10, 5:24 pm, David Wilson <d... at botanicus.net> wrote:
> >> Hi,
>
> >> During a fun coding session yesterday, I came across a problem that I
> >> thought was already solved by itertools, but on investigation it seems
> >> it isn't.
>
> >> The problem is simple: given one or more ordered sequences, return
> >> only the objects that appear in each sequence, without reading the
> >> whole set into memory. This is basically an SQL many-many join.
>
> > Why not use SQL?
>
> Agreed. I seem to recall the last person asking for such a function
> wanted to use it to combine SQL results.
>

My original use case was a full text indexer. Here's the code:

    http://code.google.com/p/ghetto-fts/


Let me invert the question and ask: why would I want to use SQL for
this? Or in my own words: what kind of girly-code requires an RDBMS
just to join some sequences? =)

Given that Google reports 14.132 billion occurrences of "the" on the
English web, which is about right, given that some estimate the
English web at ~15 billion documents, or about 33.8 bits to uniquely
identify each document, let's assume we use a 64bit integer, that's
theoretically 111.7GiB of data loaded into SQL just for a single word.

Introducing SQL quickly results in artificially requiring a database
system, when a 15 line function would have sufficed. It also restricts
how I store my data, and prevents, say, using a columnar, variable
length, or delta encoding on my sequence of document IDs, which would
massively improve the storage footprint (say, saving 48-56 bits per
element). I'll avoid mention of the performance aspects altogether.

"What the hell are you thinking",


David


> Cheers,
> Chris
> --http://blog.rebertia.com




More information about the Python-list mailing list