Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 13:48:46 EST 2005


Adam DePrince wrote:
> On Mon, 2005-01-10 at 11:11, Martin MOKREJŠ wrote:
> 
>>Hi,
>>  I have sets.Set() objects with up to 20E20 items,
>>each up to 20 characters long. Keeping
>>them in memory on a 1GB machine quickly puts me into swap.
>>I don't want to use the dictionary approach, as I see no sense
>>in storing None as a value. The items in a set are unique.
> 
> 
> Let's be realistic.  Your house is on fire and you are remodeling the
> basement.
> 
> Assuming you are on a 64 bit machine with full 64 bit addressing, your
> absolute upper limit on the size of a set is 2^64, or
> 18446744073709551616 bytes.  Your real upper limit is at least an order
> of magnitude smaller.
> 
> You are asking us how to store 20E20, or 2000000000000000000000, items
> in a Set.  That is still an order of magnitude greater than the number
> of *bits* you can address.  Your desktop might not be able to enumerate
> all of these strings in your lifetime, much less index and store them.
> 
> We might as well be discussing the number of angels that can sit on the
> head of a pin.  Any discussion of a list vs Set/dict is a
> micro-optimization dwarfed by the fact that there don't exist machines
> to hold this data.  The consideration of Set vs. dict is an even less
> important matter of syntactic sugar.
> 
> To me, it sounds like you are taking an AI class and trying to deal with
> a small search space by brute force.  First, stop banging your head
> against the wall algorithmically.  Nobody lost their job for saying NP
> != P.  Then tell us what you are trying to do; perhaps there is a better
> way, perhaps the problem is unsolvable and there is a heuristic that
> will satisfy your needs. 

Hi Adam,
  just imagine you want to compare how many words there are in English, German,
Czech and Polish dictionaries. You collect the words of each language and record
them in a dict or a Set, as you wish.

  Once you have those Sets or dicts for the 4 languages, you ask
for the words common to all of them and for those unique to Polish. I have
no estimates of the real-world numbers, but we might be in the range of
1E6 or 1E8? In any case, I believe it is huge.
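
  A minimal sketch of those two queries, assuming each lexicon fits in
memory as a set. It uses the built-in set type of current Python rather
than sets.Set, and the function names and the tiny word lists are made up
for illustration only:

```python
def common_words(*lexicons):
    """Return the words present in every one of the given sets."""
    return set.intersection(*lexicons)


def unique_to(target, *others):
    """Return the words of `target` that appear in none of `others`."""
    return target.difference(*others)


# Toy data standing in for real dictionaries (hypothetical, not real counts).
english = {"hotel", "robot", "taxi", "house"}
german = {"hotel", "robot", "taxi", "haus"}
czech = {"hotel", "robot", "taxi", "dum"}
polish = {"hotel", "robot", "taxi", "dom"}

shared = common_words(english, german, czech, polish)
only_polish = unique_to(polish, english, german, czech)
```

With 1E6 to 1E8 short strings per language this is an ordinary amount of
data, very different from the 20E20 items asked about originally.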

  My concern is actually purely scientific and not really related to the
analysis of these 4 languages, but I believe the example describes my intent
quite well.

  I wanted to be able to get a list of words NOT found in, say, Polish,
and therefore wanted to have a list of all theoretically existing words.
In principle, I can drop the idea of an ideal, theoretical lexicon,
but I still have to store those real-world dictionaries on the hard drive.
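
  One possible way to keep such dictionaries on disk is sketched below. It
assumes one word per line in a UTF-8 text file, written in sorted order, and
words that contain no newline characters. The helper names (save_words,
iter_words, missing_from) are hypothetical. Because both files are sorted the
same way, the words of one file that are absent from the other can be found
by reading both files in step, without loading either fully into memory:

```python
def save_words(path, words):
    """Write the words to `path`, sorted, one per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(words):
            f.write(word + "\n")


def iter_words(path):
    """Yield the words stored in `path` one at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def missing_from(path_a, path_b):
    """Yield the words in file A that do not occur in file B.

    Both files must have been written by save_words, so that they are
    sorted in the same order and contain no duplicates.
    """
    b_iter = iter_words(path_b)
    b_word = next(b_iter, None)
    for a_word in iter_words(path_a):
        # Skip words of B that sort before the current word of A.
        while b_word is not None and b_word < a_word:
            b_word = next(b_iter, None)
        if b_word != a_word:
            yield a_word
```

For example, after save_words("english.txt", english) and
save_words("polish.txt", polish), iterating over
missing_from("english.txt", "polish.txt") gives the English words NOT found
in the Polish file. The file names here are placeholders.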

Martin


