Writing huge Sets() to disk
mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 20:32:59 CET 2005
Tim Peters wrote:
> [Martin MOKREJŠ]
>> just imagine, you want to compare how many words are in English, German,
>>Czech, and Polish dictionaries. You collect words from every language and
>>record them in a dict or Set, as you wish.
> Call the set of all English words E; G, C, and P similarly.
>> Once you have those Sets or dicts for those 4 languages, you ask
>>for common words
> This Python expression then gives the set of words common to all 4:
> E & G & C & P
>>and for those unique to Polish.
> P - E - G - C
> is a reasonably efficient way to compute that.
Nice, is it equivalent to common / unique methods of Sets?
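For comparison, a minimal sketch of the operator forms next to the named set methods. It assumes a current Python with the built-in set type; the toy word lists are hypothetical and only there to make the example runnable.

```python
# Hypothetical toy word lists, for illustration only.
E = {"hotel", "radio", "window"}
G = {"hotel", "radio", "fenster"}
C = {"hotel", "radio", "okno"}
P = {"hotel", "radio", "okno", "zamek"}

# Words common to all four languages: operator form vs. method form.
common = E & G & C & P
assert common == E.intersection(G, C, P)

# Words found only in Polish: operator form vs. method form.
only_polish = P - E - G - C
assert only_polish == P.difference(E, G, C)

print(sorted(common))
print(sorted(only_polish))
```

The operators require both operands to be sets, while the methods also accept any iterable, which can save building an intermediate set.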
>>I have no estimates
>>of real-world numbers, but we might be in range of 1E6 or 1E8?
>>I believe in any case, huge.
> No matter how large, it's utterly tiny compared to the number of
> character strings that *aren't* words in any of these languages.
> English has a lot of words, but nobody estimates it at over 2 million
> (including scientific jargon, like names for chemical compounds):
As I've said, what I actually analyze is something other than languages.
However, it can be described with the example of words in different languages.
But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
data. Do you think sets will be appropriate?
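A back-of-the-envelope check of the raw size, assuming one byte per character and ignoring separators. The helper name raw_size_bytes is made up for this sketch. Under that assumption, 1E6 words of 15 characters come to about 15 MB, and the 1.5 GB figure corresponds to the 1E8 end of the range mentioned earlier.

```python
def raw_size_bytes(n_words, word_len, bytes_per_char=1):
    """Raw payload size, ignoring separators and per-object overhead."""
    return n_words * word_len * bytes_per_char

# Assumption: 15 characters per word, one byte per character.
for n_words in (10**6, 10**8):
    size_mb = raw_size_bytes(n_words, 15) / 1e6
    print(f"{n_words} words -> {size_mb:.0f} MB raw")
```

An in-memory set of Python strings will take noticeably more than the raw size, because every string object and every hash-table slot carries its own overhead.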
>>My concern is actually purely scientific, not really related to analysis
>>of these 4 languages, but I believe it describes my intent quite well.
>> I wanted to be able to get a list of words NOT found in, say, Polish,
>>and therefore wanted to have a list of all theoretically existing words.
>>In principle, I can drop this idea of having an ideal, theoretical lexicon.
>>But I have to store those real-world dictionaries on the hard drive anyway.
> Real-word dictionaries shouldn't be a problem. I recommend you store
> each as a plain text file, one word per line. Then, e.g., to convert
> that into a set of words, do
> f = open('EnglishWords.txt')
> set_of_English_words = set(f)
I'm aware I can't keep set_of_English_words in memory.
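If a set really does not fit in memory, one possible workaround (not something proposed in this thread) is to keep each word list as a sorted, duplicate-free text file and stream through the files in step. The sketch below assumes both inputs are sorted in the same order Python uses to compare strings; the function names and file names are hypothetical.

```python
def words(path):
    """Yield one word per line from a text file, without the trailing newline."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.rstrip("\n")
            if word:
                yield word

def sorted_difference(a, b):
    """Yield items of sorted iterable a that are absent from sorted iterable b."""
    b = iter(b)
    current = next(b, None)
    for item in a:
        # Advance b until it catches up with the current item of a.
        while current is not None and current < item:
            current = next(b, None)
        if current is None or current != item:
            yield item

def sorted_intersection(a, b):
    """Yield items present in both sorted iterables a and b."""
    b = iter(b)
    current = next(b, None)
    for item in a:
        while current is not None and current < item:
            current = next(b, None)
        if current is not None and current == item:
            yield item

# Usage with hypothetical, pre-sorted files:
# for w in sorted_difference(words("Polish.sorted.txt"), words("English.sorted.txt")):
#     print(w)
```

Only one line from each file is held in memory at a time, so the cost is a single pass over both files. Note that iterating over a file object yields lines with their trailing newline, which is why words() strips it.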