Writing huge Sets() to disk
Tim Peters
tim.peters at gmail.com
Mon Jan 10 14:12:29 EST 2005
[Martin MOKREJŠ]
> just imagine, you want to compare how many words are in English, German,
> Czech, Polish dictionaries. You collect words from every language and record
> them in dict or Set, as you wish.
Call the set of all English words E; G, C, and P similarly.
> Once you have those Set's or dict's for those 4 languages, you ask
> for common words
This Python expression then gives the set of words common to all 4:
    E & G & C & P
> and for those unique to Polish.
    P - E - G - C
is a reasonably efficient way to compute that.
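As a small illustration of the two expressions, here is a sketch with
tiny made-up word sets. The words are placeholders chosen for the
example, not real dictionary data, and the names `common` and
`only_polish` are mine:

```python
# Tiny made-up word sets standing in for the real dictionaries.
E = {"hotel", "taxi", "robot", "word"}
G = {"hotel", "taxi", "robot", "wort"}
C = {"hotel", "taxi", "robot", "slovo"}
P = {"hotel", "taxi", "robot", "slowo", "pierogi"}

# Words present in all four sets.
common = E & G & C & P

# Words in P that appear in none of the other three sets.
only_polish = P - E - G - C

print(sorted(common))
print(sorted(only_polish))
```

Each `&` or `-` builds a new set, so the intermediate results are never
larger than the smallest operand (for `&`) or than P (for `-`).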
> I have no estimates
> of real-world numbers, but we might be in range of 1E6 or 1E8?
> I believe in any case, huge.
No matter how large, it's utterly tiny compared to the number of
character strings that *aren't* words in any of these languages.
English has a lot of words, but nobody estimates it at over 2 million
(including scientific jargon, like names for chemical compounds):
http://www.worldwidewords.org/articles/howmany.htm
> My concern is actually purely scientific, not really related to analysis
> of these 4 languages, but I believe it describes my intent quite well.
>
> I wanted to be able to get a list of words NOT found in say Polish,
> and therefore wanted to have a list of all, theoretically existing words.
> In principle, I can drop this idea of having ideal, theoretical lexicon.
> But have to store those real-world dictionaries anyway to hard drive.
Real-world dictionaries shouldn't be a problem. I recommend you store
each as a plain text file, one word per line. Then, e.g., to convert
that into a set of words, do
    f = open('EnglishWords.txt')
    set_of_English_words = set(f)
    f.close()
You'll have a trailing newline character in each word, but that
doesn't really matter.
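If you would rather not carry the newlines around, or want to write a
set back out in the same one-word-per-line format, something like the
following sketch should do. The helper names `load_words` and
`save_words` are made up for this example, and UTF-8 is only an
assumption about how the files are encoded:

```python
def load_words(path):
    """Read a one-word-per-line text file into a set, dropping line endings."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; strip only the newline, not other characters.
        return {line.rstrip("\n") for line in f if line.strip()}


def save_words(words, path):
    """Write a set of words to a text file, one word per line, sorted."""
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(words):
            f.write(word + "\n")
```

Stripping the newline also sidesteps one corner case: if the last line
of a file has no trailing newline, that word would otherwise fail to
match the same word read from a file where it does have one. Note that
`sorted` here orders by code point, which need not match the order a
locale-aware sort would produce.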
Note that if you sort the word-per-line text files first, the Unix
`comm` utility can be used to perform intersection and difference on a
pair at a time with virtually no memory burden (and regardless of file
size).
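With sorted inputs, `comm -12 file1 file2` prints the lines common to
both files and `comm -23 file1 file2` prints the lines found only in
the first. The reason this needs so little memory is that both inputs
are walked in step, holding one line from each at a time. Here is a
rough Python sketch of that idea; it is not how `comm` itself is
implemented, the function names are made up, and it assumes both inputs
are sorted under Python's own string ordering with each word appearing
once:

```python
def sorted_intersection(a, b):
    """Yield items common to two sorted iterables, one item at a time."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)   # x cannot be in b; move on in a
        else:
            y = next(b, None)   # y cannot be in a; move on in b


def sorted_difference(a, b):
    """Yield items of sorted iterable a that are absent from sorted iterable b."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None:
        if y is None or x < y:
            yield x             # b has moved past x, so x is unique to a
            x = next(a, None)
        elif x == y:
            x, y = next(a, None), next(b, None)
        else:
            y = next(b, None)


english = ["hotel", "robot", "taxi", "word"]
polish = ["hotel", "pierogi", "robot", "slowo", "taxi"]
print(list(sorted_intersection(english, polish)))
print(list(sorted_difference(polish, english)))
```

Open file objects can be passed in place of the lists, since they
iterate line by line; in that case the yielded items keep their
trailing newlines.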