Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 16:28:44 EST 2005


Tim Peters wrote:
> [Martin MOKREJŠ]
> 
>>...
>>
>>I gave up on the theoretical approach. Practically, I might need to
>>store maybe up to 1E15 keys.
> 
> 
> We should work on our multiplication skills here <wink>.  You don't
> have enough disk space to store 1E15 keys.  If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys.  How much disk space do you actually have?  I'm
> betting you have no more than one 250GB disk.

I can get about 700GB on RAID5, but that doesn't help here, of course. :(
I definitely appreciate your comments; I somehow forgot to make sure
I could actually store it. I was mainly distracted by the fact that I might
hit almost the same size anyway if only a few of the possible words are
used in reality: when looking for the words not found in such a dictionary,
I'd be close to the theoretical maximum size, on the order of 1E14 or 1E15.


> ...
> 
> [Istvan Albert]
> 
>>>On my system storing 1 million words of length 15
>>>as keys of a python dictionary is around 75MB.
> 
> 
>>Fine, that's what I wanted to hear. How do you improve the algorithm?
>>Do you delay indexing to the very latest moment or do you let your
>>computer index 999 999 times just for fun?
> 
> 
> It remains wholly unclear to me what "the algorithm" you want might

My intent is to do some analysis in protein science, but it can really be
thought of as analysing those 4 different languages.

> be.  As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utility.

Now I get your point. I understand that comm(1) is written in C, but it
still has to scan file1 once and file2 n times, where n is the number of
lines in file1, right? Without any index ... I'll consider it, and will
actually test it, thanks!
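
Actually, if both files are kept sorted, I suppose a single merge pass
over each of them should be enough. Here is a rough, untested sketch of
what I imagine that pass looks like in Python (the file names and the
one-word-per-line layout are my own assumptions):

def compare_sorted(file_a, file_b):
    """Single merge pass over two sorted word files, one word per line."""
    only_a, only_b, common = [], [], []
    fa = open(file_a)
    fb = open(file_b)
    a = fa.readline()
    b = fb.readline()
    while a and b:
        if a < b:
            only_a.append(a.strip())
            a = fa.readline()
        elif a > b:
            only_b.append(b.strip())
            b = fb.readline()
        else:
            common.append(a.strip())
            a = fa.readline()
            b = fb.readline()
    # whatever is left over in either file is unique to that file
    while a:
        only_a.append(a.strip())
        a = fa.readline()
    while b:
        only_b.append(b.strip())
        b = fb.readline()
    fa.close()
    fb.close()
    return only_a, only_b, common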


I was really hoping I'd get an answer on how to alter the indexing of
dictionaries in Python.

You convinced me not to even try to construct the theoretical dictionary,
as it would take ages just to create. Even if I managed that, I couldn't
save it (neither the theoretical dictionary nor, possibly, even the
dict(theoret) - dict(real) result).

Still, before I give up on the whole project, once more:

I'll parse some text files, isolate the individual words and add them to
either a Set(), a list, a dict, or a flat file, line by line.

Depending on the above, I'll compare them and look for the words unique
to some "language". I need to keep track of the frequencies used in every
such language, so the dict approach would be best. The number stored as
a value would be a float ^H^H^H^H^H^H Decimal() type - a very small
number.
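
To be concrete, something like this is what I have in mind - only a rough
sketch, where the file layout (whitespace-separated words) and the
normalisation into relative frequencies are my own assumptions:

def count_words(filename):
    """Count how often each word occurs in one 'language' file."""
    counts = {}
    for line in open(filename):
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def to_frequencies(counts):
    """Turn raw counts into (very small) relative frequencies."""
    total = float(sum(counts.values()))
    return dict((word, n / total) for word, n in counts.items())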

Once more, I expect to have somewhere between 1E4 or 1E5 and 1E8??? words
stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
Each of those 20 dictionaries should, I believe, be indexed just once.
The indexing method should know that all entries in a given file are of
the same size, e.g. 5 chars, 15 chars, 20 chars, etc.
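
In other words, roughly this (again only a sketch):

def split_by_length(freqs, max_len=20):
    """One dictionary per word length 1..max_len, filled in a single pass."""
    by_length = dict((n, {}) for n in range(1, max_len + 1))
    for word, value in freqs.items():
        if len(word) in by_length:      # ignore words longer than max_len
            by_length[len(word)][word] = value
    return by_length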

I already did implement the logic to walk through those 20 different
dictionaries from language a and from language b and find the words unique
to a or common to both of them. The reason I came to ask on this list
was to make sure I took the right approach. Sets seemed to be the better
solution for the simple comparison (common, unique). To keep track of
those very small frequencies, I have to have dictionaries anyway. So I ask
again: how can I improve the speed of dictionary indexing? It doesn't
matter here whether I have 1E4 keys in a dictionary or 1E8.
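
The comparison itself is essentially this (a sketch; d_a and d_b stand
for the dictionaries of one particular word length in languages a and b):

def compare_languages(d_a, d_b):
    """Words common to both languages, and words unique to language a."""
    keys_a = set(d_a)
    keys_b = set(d_b)
    return keys_a & keys_b, keys_a - keys_b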

Or I wanted to hear: go for Sets(); a set with 1E8 keys might take 1/10
of the size you need for a dictionary ... but you cannot dump them to
disk efficiently. Using shelve will cost you maybe 2x the size of the
dictionary approach and will also be slower than writing the dictionary
directly.
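
What I am weighing is essentially these two ways of getting a big
dictionary onto disk (a sketch; the file names would be whatever suits,
and the flat-file format is my own assumption):

import shelve

def dump_sorted_text(d, filename):
    """Sorted flat text file, one 'word frequency' pair per line."""
    out = open(filename, 'w')
    for word in sorted(d):
        out.write('%s %s\n' % (word, d[word]))
    out.close()

def dump_shelve(d, filename):
    """shelve: a dict-like object backed by a dbm file on disk."""
    db = shelve.open(filename)
    db.update(d)
    db.close()

The sorted flat file is also the format comm(1) wants, so those two
approaches could be combined.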

Something along those lines. I really appreciate your input!
Martin






