Writing huge Sets() to disk

Mon Jan 10 13:34:55 EST 2005

Robert Brewer wrote:
> Martin MOKREJŠ wrote:
> 
>>Robert Brewer wrote:
>>
>>>Martin MOKREJŠ wrote:
>>>
>>>
>>>> I have sets.Set() objects having up to 20E20 items,
>>>>each composed of up to 20 characters. Keeping
>>>>them in memory on a 1GB machine puts me quickly into swap.
>>>>I don't want to use the dictionary approach, as I don't see the point
>>>>of storing None as a value. The items in a set are unique.
>>>>
>>>> How can I write them efficiently to disk?
>>>
>>>
>>>got shelve*?
>>
>>I know about shelve, but doesn't it work like a dictionary?
>>Why should I use shelve for this? Then it's faster to use
>>bsddb directly and use string as a key and None as a value, I'd guess.
> 
> 
> If you're using Python 2.3, then a sets.Set *is* implemented with

Yes, I do.

> a dictionary, with None values. It simply has some extra methods to
> make it behave like a set. In addition, the Set class already has
> builtin methods for pickling and unpickling.

Really? Does Set() have such a method to pickle efficiently?
I haven't seen it in the docs.
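
For reference, a minimal sketch of round-tripping a set through pickle.
It assumes a modern Python, where the builtin set stands in for sets.Set;
the helper names save_set and load_set and the file name are made up for
illustration. It writes the whole set in one go, so it needs the set in
memory and does not by itself solve the swapping problem.

```python
import os
import pickle
import tempfile

def save_set(items, path):
    """Pickle the whole set to `path` using the highest binary protocol."""
    with open(path, "wb") as fh:
        pickle.dump(items, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_set(path):
    """Read back a set written by save_set."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

if __name__ == "__main__":
    # Hypothetical sample data; real items would be the 20-character strings.
    words = {"ACGT", "ACGA", "TTGA"}
    path = os.path.join(tempfile.mkdtemp(), "words.pickle")
    save_set(words, path)
    assert load_set(path) == words
```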

> 
> So it's probably faster to use bsddb directly, but why not find out
> by trying 2 lines of code that use shelve? The time-consuming part

Because I don't know how I can affect indexing using bsddb.
For example, creating an index only for, say, keysize-1 or keysize-2 chars
of a keystring.

How can I delay indexing so that the index isn't rebuilt after every addition
of a new key? I want to do it at the end of the loop adding new keys.

Even better, how can I turn off indexing completely (to save space)?
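
This does not answer the indexing questions. It is only a minimal sketch
of the key-only approach mentioned earlier (string as a key, nothing
useful as a value), assuming a modern Python where the stdlib dbm module
stands in for bsddb. The helper names add_keys and has_key are
hypothetical. Each string is stored as a key with an empty value, so
membership can be tested without loading the whole set into memory.

```python
import dbm
import os
import tempfile

def add_keys(path, keys):
    """Store each string as a key with an empty value (an on-disk 'set')."""
    with dbm.open(path, "c") as db:   # "c": create the database if missing
        for key in keys:
            db[key.encode("utf-8")] = b""

def has_key(path, key):
    """Membership test against the on-disk store."""
    with dbm.open(path, "r") as db:   # "r": open read-only
        return key.encode("utf-8") in db

if __name__ == "__main__":
    # Hypothetical sample data and file name.
    path = os.path.join(tempfile.mkdtemp(), "keys")
    add_keys(path, ["ACGT", "ACGA", "TTGA"])
    print(has_key(path, "ACGT"), has_key(path, "GGGG"))
```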

> of your quest is writing the timed test suite that will indicate
> which route will be fastest, which you'll have to do regardless.

Unfortunately, I'm hoping to first get an idea of what can be made
faster, and how, when using sets and dictionaries.
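
A rough sketch of the kind of timed test discussed above, assuming a
modern Python with the builtin set. The names time_call, build_set and
build_dict are hypothetical, and the key count is arbitrary. It only
compares building the two containers in memory; disk-backed variants
could be timed the same way.

```python
import time

def time_call(func, *args):
    """Return (elapsed_seconds, result) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result

def build_set(keys):
    """Collect the keys in a set."""
    return set(keys)

def build_dict(keys):
    """The 'dictionary approach': every key maps to None."""
    return dict.fromkeys(keys)

if __name__ == "__main__":
    # Arbitrary synthetic keys, 10 characters each.
    keys = ["key%07d" % i for i in range(200000)]
    for name, func in (("set", build_set), ("dict", build_dict)):
        elapsed, container = time_call(func, keys)
        print("%-4s %d items in %.4f s" % (name, len(container), elapsed))
```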

M.