[Tutor] Memory optimization problem [intern() can save space for commonly used strings]
Jonathan Hayward http://JonathansCorner.com
jonathan.hayward at pobox.com
Tue Sep 9 23:06:11 EDT 2003
Danny Yoo wrote:
>On Fri, 5 Sep 2003, Jonathan Hayward http://JonathansCorner.com wrote:
>
>>I have been working on a search engine, which is a memory hog. I
>>originally had it use a hashtable to keep a histogram of word
>>occurrences per file.
>
>Hi Jonathan,
>
>How many unique words are you running into? Let's use that to estimate
>memory usage. Also, can you show us how you're constructing the
>dictionary? Are you using a separate dictionary for each document in your
>collection?
><snip>
Thank you. The code that's keeping the histogram is:
class histogram(ancestor):
    """Class for a (possibly weighted) histogram.

    ancestor, word_numberer, sort_by_item, and binary_search are
    defined elsewhere in the program."""
    def __init__(self):
        ancestor.__init__(self)
        self.occurrences = {}
        self.total_occurrences = 0
    def add_occurrence(self, category, number=1):
        if category in self.occurrences:
            self.occurrences[category] += number
        else:
            self.occurrences[category] = number
        self.total_occurrences += number
    def finalize(self):
        # Replace the word-keyed dictionary with a tuple of
        # (word number, count) pairs, sorted by word number so that
        # binary_search() can find entries later.
        occurrences_as_list = []
        for word in self.occurrences.keys():
            occurrences_as_list.append((word_numberer.get_number(word),
                self.occurrences[word]))
        sort_by_item(occurrences_as_list, 0)
        self.finalized_data = tuple(occurrences_as_list)
        self.occurrences = None
    def get_occurrences(self, category):
        if self.occurrences is None:
            # Finalized: category is a word number; look it up in the
            # sorted tuple, defaulting to a count of 0.
            return binary_search(self.finalized_data, category, (0, 0))[1]
        elif category in self.occurrences:
            return self.occurrences[category]
        else:
            return 0
    def get_proportion(self, category):
        if self.total_occurrences > 0:
            return float(self.get_occurrences(category)) / \
                float(self.total_occurrences)
        else:
            return 0
    def get_score(self, other_histogram):
        # Dot product of the two (finalized) histograms, normalized by
        # the product of their totals.
        if self.total_occurrences == 0 or \
                other_histogram.total_occurrences == 0:
            return 0
        numerator = 0
        for key, count in self.finalized_data:
            # count is this histogram's own count for key, so there is
            # no need to binary-search for it again.
            numerator += count * other_histogram.get_occurrences(key)
        denominator = self.total_occurrences * \
            other_histogram.total_occurrences
        return float(numerator) / float(denominator)
    def get_total_occurrences(self):
        return self.total_occurrences    # was: total_occurrences (NameError)
    def get_words(self):
        # Only valid before finalize(); afterwards occurrences is None.
        return self.occurrences.keys()
    def remove_occurrence(self, category, number=1):
        if category in self.occurrences:
            difference = min(number, self.occurrences[category])
            self.occurrences[category] -= difference
            if self.occurrences[category] == 0:
                del self.occurrences[category]
            self.total_occurrences -= difference    # was: -= min (the builtin)
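For reference, a minimal usage sketch of the class above (assuming
ancestor is a trivial base class, since it isn't shown here; the toy
sentence is illustrative only):

    h = histogram()
    for word in "the quick brown fox jumps over the lazy dog the".split():
        h.add_occurrence(word)
    print h.get_occurrences("the")    # 3
    print h.get_proportion("the")     # 0.3 (3 of 10 words)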
What the code does is read in a document, build a histogram of its
words, and then replace the histogram's dictionary with a sorted tuple
of (word number, count) pairs. A document might have 50,000 words,
4,000 of them unique.
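Concretely, finalize() trades the word-keyed dictionary for a sorted
tuple of pairs; the word numbers below are made up for illustration:

    before = {"fox": 1, "jumps": 1, "the": 3}    # self.occurrences
    # say word_numberer assigned fox -> 17, jumps -> 23, the -> 42
    after = ((17, 1), (23, 1), (42, 3))          # self.finalized_data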
What bugs me is that the collection is about 150M of data, mostly large
text files without much variation, and when I read it in, the
histograms come to 50M pickled and 200-250M in live memory--so my
optimized version takes more space than keeping the entire document
collection in memory would.
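The subject line's suggestion is to intern() the words as they are
read in, so that all the live histograms share one string object per
distinct word instead of holding thousands of duplicate copies of
"the", "and", and so on. A sketch (intern() is a builtin in Python 2;
it moved to sys.intern in Python 3):

    # intern() returns a canonical shared copy of the string, so a
    # word repeated across many documents costs one string object.
    def add_words(hist, text):
        for word in text.split():
            hist.add_occurrence(intern(word))

In this design the savings would show up while the histograms are
being built and in word_numberer's tables, since finalize() already
swaps the words for numbers; interned keys can also make the
dictionary lookups somewhat cheaper, because equal keys are recognized
by identity.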
--
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com