Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 12:27:15 EST 2005

Paul McGuire wrote:
> "Martin MOKREJŠ" <mmokrejs at ribosome.natur.cuni.cz> wrote in message
> news:mailman.448.1105373471.22381.python-list at python.org...
>>  I have sets.Set() objects having up to 20E20 items,
>>each is composed of up to 20 characters. Keeping
>>them in memory on a 1GB machine puts me quickly into swap.
>>I don't want to use dictionary approach, as I don't see a sense
>>to store None as a value. The items in a set are unique.
>>  How can I write them efficiently to disk? To be more exact,
>>I have 20 sets. _set1 has 20 keys of size 1 character.
>>alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
>>'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>>for aa1 in alphabet:
>>    # l = [aa1]
>>    #_set1.add(aa1)
>>    for aa2 in alphabet:
>>        # l.append(aa2)
>>        #_set2.add(''.join(l))
>>  The reason I went for sets instead of lists is the speed,
>>availability of unique, common and other methods.
>>What would you propose as an elegant solution?
>>Actually, even those nested for loops take ages. :(
> _set1 only has 19 keys of size 1 character - 'A' is duplicated.

Ahh, you are right. That's a typo, though I still have to figure out which
character to use instead. 'Z' will do for the tests anyway.

> Assuming you replace 'A' with another character (say 'Z'), then here is what
> you get:
> _set1 - 20 elements (20**1)
> _set2 - 400 elements (20**2)
> _set3 - 8000 elements (20**3)
> ...
> _set20 - 20**20 ~ 10 ^ (1.301*20) or 1E26
> If you have no duplicates in your alphabet, then you won't need to use sets;
> every combination will be unique.  In this case, just use a generator.

As I noted in a later response, I will actually compare these ideal sets to
some real-world examples. I have no expectation of how many entries will be
common and how many unique. The only thing I know is that even in the real
world I might be in size ranges of 2E6 or perhaps 2E8.
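[One way to sidestep the memory problem for that comparison — a sketch, not
what the thread settled on: since every length-k string over the alphabet can
be generated on demand, only the real-world set needs to stay resident, and
each ideal candidate can be tested for membership as it streams past. In
modern Python this would use the builtin set and itertools.product in place of
the 2.3-era sets.Set; the helper count_common and the sample data are
hypothetical, for illustration only:]

```python
from itertools import product

# Alphabet from the thread, with 'Z' standing in for the duplicated 'A'
alphabet = 'GAVLIPSTCMZQFYWKRHDE'

def kmers(k):
    """Yield every length-k string over the alphabet, one at a time."""
    for combo in product(alphabet, repeat=k):
        yield ''.join(combo)

def count_common(real_world, k):
    """Stream the ideal k-mers past a resident real-world set; return
    (common, unique-to-ideal) counts without building the ideal set."""
    common = sum(1 for s in kmers(k) if s in real_world)
    return common, len(alphabet) ** k - common

# Hypothetical real-world sample: two valid 2-mers and one impossible entry
common, unique = count_common({'GA', 'VL', 'XX'}, 2)
# common == 2, unique == 398 (out of 20**2 == 400 generated 2-mers)
```

[Only the 2E6-2E8 real-world entries ever occupy memory; the 20**k ideal
strings exist one at a time.]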

> Here's a recursive generator approach that may save you a bunch of nested
> editing (set maxDepth as high as you dare):
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
> 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
> maxDepth = 3
> def genNextCombo(root=list(),depth=1):
>     for a in alphabet:
>         thisRoot = root + [a]
>         yield "".join( thisRoot )
>         if depth < maxDepth:
>             for c in genNextCombo(thisRoot, depth+1):
>                 yield c
> for c in genNextCombo():
>     print c  # or write to file, or whatever

Works nicely, but http://www.python.org/doc/2.3.4/ref/yield.html
doesn't really explain what happens here. :( Still, it's quite fast and
definitely saves writing all those stupid nested for loops by hand.
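[For what it's worth, what yield does is roughly this: a function containing
yield becomes a generator function; calling it returns a generator object
whose body runs only as values are requested, pausing at each yield and
resuming there on the next request. A minimal illustration, separate from
Paul's code — written in modern Python 3 syntax, whereas the 2.3 of this
thread would spell next(gen) as gen.next():]

```python
def countdown(n):
    """A generator function: its body does not run at call time;
    it runs one step per value requested."""
    while n > 0:
        yield n   # hand one value back to the caller and pause right here
        n -= 1    # execution resumes on this line at the next request

gen = countdown(3)
first = next(gen)   # first == 3; the loop has executed exactly one step
rest = list(gen)    # rest == [2, 1]; the remaining steps run lazily
```

[This is why the recursive genNextCombo never holds more than one combination
in memory at a time: each yield hands a single string up through the recursion
and suspends.]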

Thanks Paul!
