an intern-like memory saver

Fri Apr 14 21:08:23 EDT 2000

In article <38F7318F.AE4F0EE5 at endea.demon.nl>,
  Niels Diepeveen <niels at endea.demon.nl> wrote:
>
>
> sjmachin at lexicon.net schreef:
> >
> > Problem:
> > I have an application that works with words and can have millions of
> > them in memory at one time. Apart from the main data structure, a
> > dictionary is used to maintain frequencies. As the words are loaded
> > from files, multiple instances of the same word don't share memory.
> > The memory savings could be huge --- see how frequent "the" is in
> > English text, or "Smith" in an Anglo telephone directory.
> >
> > Trial solution:
> > (1) Clone dictobject.c. Make it into an extension module for a type
> > called "mydict". Add a method called "key_ref" with one argument:
> > adict.key_ref(obj). If adict.has_key(obj) is true, this returns a
> > reference to the key value inside the dictionary; else it returns
> > "obj".
> > (2) Make simple changes to the application:
> > (a) Change
> >    freq_dict = {}
> > to
> >    freq_dict = mydict.mydict()
> > (b) assuming for purposes of exposition that words are stored simply
> > in a list, after
> >    freq_dict[w] = freq_dict.get(w, 0) + 1
> > change
> >    word_list.append(w)
> > to
> >    word_list.append(freq_dict.key_ref(w))
> >
> > Results:
> > Gratifying. An exercise that was running out of real memory (384 MB)
> > and taking a day now takes an hour or so.
> >
> > Questions:
> > (1) Would this be sufficiently generally useful to make it a method
> > of the standard dictionary object in Python?
>
> It seems to make sense. I don't know if it would be widely used though.
> I can think of a number of ways to do very much the same thing in
> Python. I don't know whether you've tried, but I think simply using
>     w = intern(w)
> might be even faster than what you describe.

I tried "intern" and read the source code for it. Speed is not a
consideration until I run out of real memory and start swapping to
disk. My objective is to reduce memory usage so that that doesn't
happen. "intern" has its *own* hidden dictionary; if I use it, it may
even *increase* my memory usage. I already have my own dictionary with
all those strings as keys, so I save the overhead part of the intern
dictionary. For what it's worth, my key_ref method works with any
objects that can be dictionary keys, not just strings. What are the
other ways you say one can do very much the same thing in Python?
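
For illustration only, here is a rough pure-Python sketch of the same idea. It is not the C extension described above: it assumes each dictionary value holds the first-seen key object alongside the count, and the names FreqTable, add and count are made up for this sketch. It uses one dictionary rather than a second interning table, but it pays for a small list per distinct word, which a key_ref method inside the dictionary itself would avoid.

```python
class FreqTable:
    """Word-frequency table that also returns the first-seen key object."""

    def __init__(self):
        # word -> [canonical_word, count]; the list is the per-word overhead
        self._entries = {}

    def add(self, word):
        """Count one occurrence of word and return the canonical object."""
        entry = self._entries.get(word)
        if entry is None:
            entry = self._entries[word] = [word, 0]
        entry[1] += 1
        return entry[0]

    def count(self, word):
        """Return how many times word has been added (0 if never)."""
        entry = self._entries.get(word)
        return entry[1] if entry else 0


freq = FreqTable()
word_list = []
# The two "Smith" strings are built separately, so they start out as
# distinct objects with equal contents.
for w in ["".join(["Smi", "th"]), "".join(["Sm", "ith"]), "Jones"]:
    word_list.append(freq.add(w))  # append the shared object, not w itself
```

After the loop, both "Smith" entries in word_list refer to the same string object, so the duplicate read from the input can be garbage-collected.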

>
> > (3) Has anyone had any success in compiling Python on WinNT 4.0 with
> > gcc 2.95.2? It's just fine for making extension modules; I haven't
> > tried compiling the whole Python yet.
>
> I'm working on 1.6a2 now (for Win95, should work for NT too). I've got
> the core to compile and run, but it takes some patches. I haven't got
> Tkinter working yet (problems linking the Tcl/Tk libraries, any
> suggestions other than building those from source welcome).
> If you like, I can send you what I have so far.

Hmmm ... thanks but no thanks, I don't have any spare time to work on
that at the moment.

>
> > (4) Has anyone any better ideas for gauging Python memory usage than
> > sitting watching the graphical display in WinNT's Task Manager?
> > Might an instrumented or instrumentable malloc/free package (like
> > Doug Lea's) that permitted implementation of a Python builtin
> > memused() be the way to go, or is there a policy of using the
> > standard malloc from the C library on each platform?
>
> As far as I can tell, you can use any malloc package you like; there
> are no assumptions other than the C standard.

I was happy enough with the idea of having an instrumented malloc in my
own Python (when I get around to compiling it with gcc), but was
wondering about the possibility of having an instrumented malloc as
part of the *standard* Python distribution.
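
As a hedged illustration of what a memused() call could look like: later Python versions ship a standard tracemalloc module that tracks allocations made through the interpreter's own allocators. The sketch below wraps it in a function named memused, borrowing the hypothetical name from the question; it is an assumption-laden stand-in, not an instrumented malloc, and it does not see memory that C code allocates by calling malloc directly.

```python
import tracemalloc


def memused():
    """Bytes currently allocated by Python code, as seen by tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current


tracemalloc.start()
before = memused()
# Build many distinct strings so the allocation is visible.
words = ["word%d" % i for i in range(100000)]
after = memused()
tracemalloc.stop()

print("approx. bytes used by the word list:", after - before)
```

Taking readings before and after a load step gives a rough per-step figure without watching an external process monitor.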

>
> --
> Niels Diepeveen
> Endea automatisering

Thanks for your interest & comments,
John Machin
