an intern-like memory saver

Niels Diepeveen niels at endea.demon.nl
Fri Apr 14 10:56:15 EDT 2000


sjmachin at lexicon.net wrote:
> 
> Problem:
> I have an application that works with words and can have millions of them
> in memory at one time. Apart from the main data structure, a dictionary is
> used to maintain frequencies. As the words are loaded from files, multiple
> instances of the same word don't share memory. The memory savings
> could be huge --- see how frequent "the" is in English text, or "Smith" in
> an Anglo telephone directory.
> 
> Trial solution:
> (1) Clone dictobject.c. Make it into an extension module for a type called
> "mydict". Add a method called "key_ref" with one argument:
> adict.key_ref(obj). If adict.has_key(obj) is true, this returns a reference to
> the key value inside the dictionary; else it returns "obj".
> (2) Make simple changes to the application:
> (a) Change
>    freq_dict = {}
> to
>    freq_dict = mydict.mydict()
> (b) assuming for purposes of exposition that words are stored simply in a
> list, after
>    freq_dict[w] = freq_dict.get(w, 0) + 1
> change
>    word_list.append(w)
> to
>    word_list.append(freq_dict.key_ref(w))
> 
> Results:
> Gratifying. An exercise that was running out of real memory (384 MB) and
> taking a day now takes an hour or so.
> 
> Questions:
> (1) Would this be sufficiently generally useful to make it a method of the
> standard dictionary object in Python?

It seems to make sense. I don't know if it would be widely used though.
I can think of a number of ways to do very much the same thing in
Python. I don't know whether you've tried, but I think simply using
    w = intern(w)
might be even faster than what you describe.
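
For illustration, here is a minimal sketch of two such ways. The first
keeps a side dictionary that maps each word to the first string object
seen for it, which is roughly what key_ref does without a custom type.
The second relies on interning. The function and variable names are
hypothetical, and the sketch assumes a current Python, where the builtin
intern() has moved to sys.intern().

```python
import sys


def share_with_dict(tokens):
    """Count words, keeping one shared string object per distinct word."""
    canonical = {}  # hypothetical side table: word -> first object seen for it
    freq_dict = {}
    word_list = []
    for w in tokens:
        # setdefault returns the stored object if the word is already known,
        # otherwise it stores w and returns it.
        w = canonical.setdefault(w, w)
        freq_dict[w] = freq_dict.get(w, 0) + 1
        word_list.append(w)
    return freq_dict, word_list


def share_with_intern(tokens):
    """Same idea, using the interpreter's own intern table."""
    # Assumption: Python 3, where intern() lives in the sys module.
    return [sys.intern(w) for w in tokens]
```

Either way, equal words end up as references to a single object, so a
list of millions of words costs one pointer per occurrence plus one
string per distinct word.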

> (2) As methods seem to be found by sequential search, wouldn't it be a
> good idea to move "get" a bit higher up the method_def list in
> dictobject.c? At the end, after "update" and "copy", doesn't seem like a
> good idea.
> (3) Has anyone had any success in compiling Python on WinNT 4.0 with
> gcc 2.95.2? It's just fine for making extension modules; I haven't tried
> compiling the whole Python yet.

I'm working on 1.6a2 now (for Win95; it should work for NT too). I've got
the core to compile and run, but it takes some patches. I haven't got
Tkinter working yet (problems linking the Tcl/Tk libraries; any
suggestions other than building those from source are welcome).
If you like, I can send you what I have so far.

> (4) Has anyone any better ideas for gauging Python memory usage than
> sitting watching the graphical display in WinNT's Task Manager? Might an
> instrumented or instrumentable malloc/free package (like Doug Lea's) that
> permitted implementation of a Python builtin memused() be the way to go,
> or is there a policy of using the standard malloc from the C library on each
> platform?

As far as I can tell, you can use any malloc package you like; there are
no assumptions other than the C standard.
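
As a rough illustration of the memused() idea in the question: Python
versions from 3.4 on ship a tracemalloc module that reports memory
allocated through the interpreter's own allocators. The sketch below
assumes such a version; measure_memory is a hypothetical helper name,
and the figures do not include memory that C libraries obtain directly
from malloc.

```python
import tracemalloc


def measure_memory(build):
    """Run build() and report (result, current_bytes, peak_bytes)."""
    tracemalloc.start()
    try:
        result = build()
        # get_traced_memory() returns the current and peak traced sizes in bytes.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak
```

Calling measure_memory(lambda: [str(i) for i in range(1000)]) returns
the list together with the two byte counts, which gives a reproducible
number where a process monitor only shows a moving graph.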

-- 
Niels Diepeveen
Endea automatisering
