David M. Wilson
dw at botanicus.net
Wed Jun 10 22:43:51 EDT 2009
On Jun 11, 3:05 am, Chris Rebert <c... at rebertia.com> wrote:
> On Wed, Jun 10, 2009 at 5:53 PM, Mensanator<mensana... at aol.com> wrote:
> > On Jun 10, 5:24 pm, David Wilson <d... at botanicus.net> wrote:
> >> Hi,
> >> During a fun coding session yesterday, I came across a problem that I
> >> thought was already solved by itertools, but on investigation it seems
> >> it isn't.
> >> The problem is simple: given one or more ordered sequences, return
> >> only the objects that appear in each sequence, without reading the
> >> whole set into memory. This is basically an SQL many-many join.
> > Why not use SQL?
> Agreed. I seem to recall the last person asking for such a function
> wanted to use it to combine SQL results.
My original use case was a full text indexer. Here's the code:
Let me invert the question and ask: why would I want to use SQL for
this? Or in my own words: what kind of girly-code requires an RDBMS
just to join some sequences? =)
Given that Google reports 14.132 billion occurrences of "the" on the
English web, which is about right, given that some estimate the
English web at ~15 billion documents, or about 33.8 bits to uniquely
identify each document, let's assume we use a 64bit integer, that's
theoretically 111.7GiB of data loaded into SQL just for a single word.
Introducing SQL quickly results in artificially requiring a database
system, when a 15 line function would have sufficed. It also restricts
how I store my data, and prevents, say, using a columnar, variable
length, or delta encoding on my sequence of document IDs, which would
massively improve the storage footprint (say, saving 48-56 bits per
element). I'll avoid mention of the performance aspects altogether.
"What the hell are you thinking",
More information about the Python-list