[Tutor] Memory optimization problem [intern() can save space for commonly used strings]

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sat Sep 6 17:10:27 EDT 2003



On Fri, 5 Sep 2003, Jonathan Hayward http://JonathansCorner.com wrote:

> I have been working on a search engine, which is a memory hog. I
> originally had it use a hashtable to keep a histogram of occurrences for
> words for files

Hi Jonathan,


How many unique words are you running into?  Let's use that to estimate
memory usage.  Also, can you show us how you're constructing the
dictionary?  Are you using a separate dictionary for each document in your
collection?

The process that you describe definitely shouldn't be hogging memory: the
whole point of doing histogramming and indexing is to summarize a
document, so it should take significantly less memory than the document
itself, not more.  *grin*
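
To make the estimate concrete, here is a minimal sketch of a per-document
histogram.  It is not Jonathan's code: the name word_histogram and the
sample text are made up for illustration, and splitting on whitespace is
an assumed, deliberately naive way to tokenize.

```python
def word_histogram(text):
    """Count how often each word occurs in one document's text."""
    counts = {}
    for word in text.lower().split():
        # dict.get() supplies 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = "the quick brown fox jumps over the lazy dog the end"
histogram = word_histogram(sample)

# The histogram holds one entry per unique word, not one per occurrence.
print(len(sample.split()), "words,", len(histogram), "unique")
```

Comparing the number of unique words against the total word count gives a
rough idea of how much smaller the histogram should be than the document.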



One thing you might want to check is whether calling intern() on your
keywords helps reduce memory.

intern() is a function that stores strings in a global string table, and
it's designed to reduce the memory requirements of storing keywords.
Duplicate strings get remapped to the same string --- it reduces
redundancy, and since strings are guaranteed to be immutable, doing an
intern() should be pretty safe.

###
>>> s1 = "hello"
>>> s2 = "h" + "ello"
>>> id(s1)
135629824
>>> id(s2)
135630520
###


Here, 's1' and 's2' have the same content, but they are different string
objects, since they were constructed differently.


Once we use intern() to stuff the strings into our symbol table:

###
>>> s1 = intern(s1)
>>> s2 = intern(s2)
>>> id(s1)
135629824
>>> id(s2)
135629824
###

then s1 and s2 end up being id-entical as well as equal.  Keeping a symbol
table is especially crucial in search engine applications, since you'll
end up seeing the same strings over and over.
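
As a rough sketch of how this might look in an indexer, the code below
interns every word before using it as a dictionary key.  The function
build_index and the sample documents are hypothetical.  It also assumes
Python 3, where the builtin intern() shown above lives in the sys module
as sys.intern().

```python
import sys

def build_index(documents):
    """Map each document name to a {word: count} histogram with interned keys."""
    index = {}
    for name, text in documents.items():
        histogram = {}
        for word in text.lower().split():
            # Equal words from different documents map to one shared object.
            word = sys.intern(word)
            histogram[word] = histogram.get(word, 0) + 1
        index[name] = histogram
    return index

# Hypothetical sample data, just to exercise the function.
docs = {
    "a.txt": "search engines index words",
    "b.txt": "words words everywhere",
}
index = build_index(docs)

# Pull the key object for "words" out of each document's histogram.
key_a = next(k for k in index["a.txt"] if k == "words")
key_b = next(k for k in index["b.txt"] if k == "words")
print(key_a is key_b)
```

With one dictionary per document, each distinct word is then stored once
for the whole collection rather than once per document that contains it.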

That being said, it's a very bad idea to use intern() indiscriminately:
every distinct string you intern gets its own entry, so doing so will
probably make the symbol table grow too large!



> Do you have any suggestions for what I might look for that's hogging
> memory?

Besides the string interning tip... I have no idea.  *grin* We really do
need to see more code to say anything useful.  If your codebase is large,
please feel free to link it on the web somewhere and post the link; many
of us would be happy to take a look.


Good luck to you!



