How can I create customized classes that have similar properties as 'str'?

zooko zookog at gmail.com
Mon Dec 3 09:51:57 EST 2007


On Nov 24, 4:44 am, Licheng Fang <fanglich... at gmail.com> wrote:
>
> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. Say, I have a big text file, consisting of
> millions of words, and I want to count the frequency of trigrams:

I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory.  Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled.  You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.
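
A rough way to take that measurement yourself is sketched below. It is written for Python 3 on CPython and is not the code behind the 100-byte figure above; the function name and the use of `bytes` objects for keys and values are assumptions made for illustration. `sys.getsizeof` reports only each object's own size, so the dict's table and every key and value are summed separately.

```python
import sys

def approx_bytes_per_entry(n=100_000):
    """Estimate bytes per entry for a dict of 20-byte keys and 8-byte values."""
    d = {}
    for i in range(n):
        key = i.to_bytes(20, "big")    # a distinct 20-byte key
        value = i.to_bytes(8, "big")   # an 8-byte value
        d[key] = value
    total = sys.getsizeof(d)           # the (sparsely filled) hash table itself
    for k, v in d.items():
        # per-object overhead plus payload for each key and value
        total += sys.getsizeof(k) + sys.getsizeof(v)
    return total / n
```

Calling `approx_bytes_per_entry()` gives a number to compare against the 28 bytes of raw payload per entry; the exact result depends on the Python version and platform.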

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id to
the three unigrams.  Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects.  If you do this multiple times, you get multiple
references to the same integer object for the trigram id.
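
The tokenizing scheme just described could be sketched as follows. The class and function names are hypothetical, and a list stands in for the reverse dict because the ids are allocated sequentially, so the id can simply be the list index.

```python
from collections import Counter

class TrigramTokenizer:
    """Maps each distinct trigram to a sequentially allocated integer id."""

    def __init__(self):
        self._id_of = {}        # (w1, w2, w3) -> trigram id
        self._trigram_of = []   # trigram id -> (w1, w2, w3)

    def token(self, w1, w2, w3):
        key = (w1, w2, w3)
        tid = self._id_of.get(key)
        if tid is None:
            tid = len(self._trigram_of)   # next unused id
            self._id_of[key] = tid
            self._trigram_of.append(key)
        return tid

    def trigram(self, tid):
        return self._trigram_of[tid]

def count_trigrams(words, tokenizer):
    """Count trigram ids over a list of words."""
    counts = Counter()
    for i in range(len(words) - 2):
        counts[tokenizer.token(words[i], words[i + 1], words[i + 2])] += 1
    return counts
```

After `tok = TrigramTokenizer()` and `counts = count_trigrams(text.split(), tok)`, the counts are keyed by small integers, and `tok.trigram(tid)` recovers the three words when you need to print them.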

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure.  In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign each
unigram an integer, and assuming that you have fewer than two million
unigrams you can pack three unigrams into a 64-bit integer...  Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees.  (At the time, PyJudy didn't exist.)
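
The packing step mentioned above can be sketched in Python as follows. This is only an illustration of the bit arithmetic, not the C code we ended up with, and the names are hypothetical. With fewer than 2**21 (2,097,152) unigrams, each unigram id fits in 21 bits, so three of them fit in 63 bits.

```python
BITS = 21                  # bits per unigram id
MASK = (1 << BITS) - 1     # largest id that fits: 2**21 - 1

def pack_trigram(a, b, c):
    """Pack three unigram ids into one integer of at most 63 bits."""
    for u in (a, b, c):
        if not 0 <= u <= MASK:
            raise ValueError("unigram id does not fit in 21 bits: %r" % (u,))
    return (a << (2 * BITS)) | (b << BITS) | c

def unpack_trigram(packed):
    """Recover the three unigram ids from a packed integer."""
    return (packed >> (2 * BITS)) & MASK, (packed >> BITS) & MASK, packed & MASK
```

A packed value used as a plain Python int is still a full Python object, so the memory saving only appears when the values are stored in a compact container, which is what a Judy array keyed by 64-bit words provides.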

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn,
and perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn


