Writing huge Sets() to disk

Mike C. Fletcher mcfletch at rogers.com
Mon Jan 10 22:14:56 EST 2005


Martin MOKREJŠ wrote:

> Tim Peters wrote:
>
...

>>> I was really hoping I'd get an answer on how to alter the indexes
>>> for dictionaries in Python.
>>
>>
>>
>> Sorry, I don't have a guess for what that might mean.
>
>
> I'm not an expert; MySQL, for example, gives you the ability to index,
> say, the first 10 characters of a text column, which is typically a varchar.
> Just out of curiosity, I'd like to know how to do that in Python.
>
> When importing data from a flat file into a MySQL table, there's an
> option to delay indexing until the very last moment, when all keys are
> loaded (it doesn't make sense to re-create the index after each new
> row is added to the table). I believe it's exactly the same waste of cpu/io
> in this case - when I create a dictionary and fill it with data,
> I want to create the index afterward, not after every key/value pair
> is recorded.

Okay, you seem to be missing this key idea:

    A hash-table (dict) is at approximately the same level of abstraction
    as a btree index.

Your MySQL "index" is likely implemented as a btree.  A hash-table could 
just as readily be used to implement the index.  When you insert into 
either of these structures (btree or hash), you are not creating an 
"index" separate from the structure, the structure *is* an "index" of 
the type you are thinking about.  These structures are both efficient 
representations that map from a data-value to some other data-value.  
Hashes with a good hash function (such as Python's dicts) are basically 
O(1) (worst case O(n), as Tim notes), while Btrees (such as common 
database indices) are O(log(n)) (or something of that type, basically it 
grows much more slowly than n).
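To make that concrete, here is a minimal sketch (the word list is just 
made-up sample data) showing that filling a dict already *is* building 
the index; there is no separate "create the index" step to defer:

    # Filling the dict builds the hash "index" as a side effect of insertion.
    words = ["aardvark", "badger", "cobra", "dingo"]   # hypothetical data
    index = {}
    for position, word in enumerate(words):
        index[word] = position        # each insert updates the hash table
    print(index["cobra"])             # -> 2, a single, roughly O(1) probe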

>>> Once more, I expect to have between E4 or E5 to E8??? words
>>> stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
>>> Each of those 20 dictionaries should, I believe, be indexed just once.
>>> The indexing method should know that all entries in a given file are of
>>> the same size, i.e. 5 chars, 15 chars, 20 chars, etc.
>>
>>
>>
>> I think you're making this more complicated than it needs to be.
>
>
> I hope the algorithm can save some logic. For example, it can turn off
> locking support, index only part of the key, etc.

I'd tend to agree with Tim.  You're making this all far too complex in 
what appears to be the absence of any real need.  There's a maxim in 
computer programming that you should avoid, wherever possible, what is 
called "premature optimisation".  You are here trying to optimise away a 
bottleneck that doesn't exist (indexing overhead and locking support are 
basically nil for a dictionary).  It is almost a certainty that 
these are not *real* bottlenecks in your code (what with not existing), 
so your time would be better spent writing a "dumb" working version of 
the code and profiling it to see where the *real* bottlenecks are.
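As a rough sketch of that "write the dumb version and profile it" advice 
(build_tables and the sample word list here are hypothetical stand-ins 
for your real loading code), the standard cProfile/pstats modules will 
tell you where the time actually goes:

    import cProfile
    import pstats

    def build_tables(words):
        """Dumb version: one dict per word length, no clever indexing."""
        tables = {}
        for word in words:
            tables.setdefault(len(word), {})[word] = True
        return tables

    sample = ["cat", "dog", "horse", "zebra", "ant"] * 10000   # made-up data
    cProfile.run("build_tables(sample)", "loading.prof")
    pstats.Stats("loading.prof").sort_stats("cumulative").print_stats(5)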

> For example, looking up a key and its value only once in the whole for
> loop tells me I don't need an index. Yes, I'll do this 4 times for those
> 4 languages, but I still think it's faster to live without an index when
> I can sort the records. I think I like the sorted text file approach, and
> the dictionary approach without an index would be almost the same,
> especially if I manage to tell the db layer not to move the cursor
> randomly but just to walk down the pre-sorted data.

Again, you don't really have a cursor with a dictionary (and since it's 
randomly ordered, a cursor wouldn't mean anything).  A *btree* has an 
order, but not a dictionary.  You could construct a sorted list in 
memory that would approximate what I *think* you're thinking of as a 
dictionary-without-an-index.
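For instance, here's a minimal sketch of that sorted-list idea (again 
with made-up sample words), using the standard bisect module to get 
btree-ish O(log n) lookups plus in-order, cursor-like traversal:

    import bisect

    words = sorted(["zebra", "ant", "horse", "cat", "dog"])   # hypothetical data

    # "Cursor"-style walk: the sorted list has an order, unlike a dict.
    for word in words:
        print(word)

    # Btree-ish lookup: binary search the sorted list in O(log n).
    position = bisect.bisect_left(words, "horse")
    print(position < len(words) and words[position] == "horse")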

Good luck,
Mike

________________________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://www.vrplumber.com
  http://blog.vrplumber.com



