Writing huge Sets() to disk

Tim Peters tim.peters at gmail.com
Mon Jan 10 14:12:29 EST 2005


[Martin MOKREJŠ]
>  just imagine you want to compare how many words are in the English, German,
> Czech, and Polish dictionaries.  You collect the words from every language
> and record them in a dict or set, as you wish.

Call the set of all English words E; G, C, and P similarly.

>  Once you have those sets or dicts for those 4 languages, you ask
> for the common words

This Python expression then gives the set of words common to all 4:

    E & G & C & P

> and for those unique to Polish.

    P - E - G - C

is a reasonably efficient way to compute that.
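
For instance, with tiny stand-in sets (the words here are made-up
placeholders, not real dictionary data):

    >>> E = set(['dog', 'cat', 'house'])
    >>> P = set(['pies', 'kot', 'dom', 'cat'])
    >>> E & P                # words in both sets
    set(['cat'])
    >>> sorted(P - E)        # words only in P, sorted for display
    ['dom', 'kot', 'pies']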

> I have no estimates
> of real-world numbers, but we might be in the range of 1E6 or 1E8?
> In any case, I believe it's huge.

No matter how large, it's utterly tiny compared to the number of
character strings that *aren't* words in any of these languages. 
English has a lot of words, but nobody estimates it at over 2 million
(including scientific jargon, like names for chemical compounds):

    http://www.worldwidewords.org/articles/howmany.htm

> My concern is actually purely scientific, not really related to the analysis
> of these 4 languages, but I believe it describes my intent quite well.
>
>  I wanted to be able to get a list of words NOT found in, say, Polish,
> and therefore wanted to have a list of all theoretically existing words.
> In principle, I can drop this idea of having an ideal, theoretical lexicon.
> But I have to store those real-world dictionaries to the hard drive anyway.

Real-world dictionaries shouldn't be a problem.  I recommend you store
each as a plain text file, one word per line.  Then, e.g., to convert
that into a set of words, do

    f = open('EnglishWords.txt')
    set_of_English_words = set(f)
    f.close()

You'll have a trailing newline character in each word, but that
doesn't really matter:  since every file stores one word per line, the
newlines are consistent across all the sets, so intersections and
differences still match up correctly.
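
If the newlines ever do get in the way (say, when comparing against
words from some other source that lack them), a generator expression
can strip them while the set is built; a minimal sketch:

    f = open('EnglishWords.txt')
    # rstrip('\n') drops the trailing newline from each line
    set_of_English_words = set(line.rstrip('\n') for line in f)
    f.close()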

Note that if you sort the word-per-line text files first, the Unix
`comm` utility can be used to perform intersection and difference on a
pair at a time with virtually no memory burden (and regardless of file
size).
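
For example, assuming word-per-line files named EnglishWords.txt and
PolishWords.txt (hypothetical names), something along these lines:

    sort EnglishWords.txt > english.sorted
    sort PolishWords.txt > polish.sorted

    # words common to both files (intersection)
    comm -12 english.sorted polish.sorted

    # words only in the Polish file (difference)
    comm -13 english.sorted polish.sorted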


