Writing huge Sets() to disk
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 12:27:15 EST 2005
Paul McGuire wrote:
> "Martin MOKREJ©" <mmokrejs at ribosome.natur.cuni.cz> wrote in message
> news:mailman.448.1105373471.22381.python-list at python.org...
>
>>Hi,
>> I have sets.Set() objects having up to 20E20 items,
>>each is composed of up to 20 characters. Keeping
>>them in memory on !GB machine put's me quickly into swap.
>>I don't want to use dictionary approach, as I don't see a sense
>>to store None as a value. The items in a set are unique.
>>
>> How can I write them efficiently to disk? To be more exact,
>>I have 20 sets. _set1 has 1E20 keys of size 1 character.
>>
>>alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
>
> 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>
>>for aa1 in alphabet:
>> # l = [aa1]
>> #_set1.add(aa1)
>> for aa2 in alphabet:
>> # l.append(aa2)
>> #_set2.add(''.join(l))
>>[cut]
>>
>> The reason I went for sets instead of lists is the speed,
>>availability of unique, common and other methods.
>>What would you propose as an elegant solution?
>>Actually, even those nested for loops take ages. :(
>>M.
>
>
> _set1 only has 19 keys of size 1 character - 'A' is duplicated.
Ahh, you are right. That's a typo, yet I have to figure out which char
I have to use. But 'Z' will do for the tests anyway.
>
> Assuming you replace 'A' with another character (say 'Z'), then here is what
> you get:
> _set1 - 20 elements (20**1)
> _set2 - 400 elements (20**2)
> _set3 - 8000 elements (20**3)
> ...
> _set20 - 20**20 ~ 10 ^ (1.301*20) or 1E26
>
> If you have no duplicates in your alphabet, then you wont need to use sets,
> every combination will be unique. In this case, just use a generator.
As I noted in later response, actually I will compare these ideal sets to some
real world examples. I have no expectation how many entries will be common
and how many unique. The only thing I know even in real world, I might be in
size ranges of 2E6 or perhaps 2E8?
> Here's a recursive generator approach that may save you a bunch of nested
> editing (set maxDepth as high as you dare):
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
> 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>
> maxDepth = 3
>
> def genNextCombo(root=list(),depth=1):
> for a in alphabet:
> thisRoot = root + [a]
> yield "".join( thisRoot )
> if depth < maxDepth:
> for c in genNextCombo(thisRoot, depth+1):
> yield c
>
> for c in genNextCombo():
> print c # or write to file, or whatever
Works nice, but http://www.python.org/doc/2.3.4/ref/yield.html
doesn't really tell what happens here. :( But is quite fast and
definitely saves those the stupid recursively hand-written for
loops.
Thanks Paul!
M.
More information about the Python-list
mailing list