[Tutor] Memory optimization problem [intern() can save space for commonly used strings]

Jonathan Hayward http://JonathansCorner.com jonathan.hayward at pobox.com
Tue Sep 9 23:06:11 EDT 2003


Danny Yoo wrote:

>On Fri, 5 Sep 2003, Jonathan Hayward http://JonathansCorner.com wrote:
>
>>I have been working on a search engine, which is a memory hog. I
>>originally had it use a hashtable to keep a histogram of occurrences of
>>words in files.
>
>Hi Jonathan,
>
>How many unique words are you running into?  Let's use that to estimate
>memory usage.  Also, can you show us how you're constructing the
>dictionary?  Are you using a separate dictionary for each document in your
>collection?
><snip>
Thank you. The code that's keeping the histogram is:

class histogram(ancestor):
    """Class for a (possibly weighted) histogram.

    Relies on ancestor, word_numberer, sort_by_item, and binary_search,
    all defined elsewhere in the program."""
    def __init__(self):
        ancestor.__init__(self)
        self.occurrences = {}
        self.total_occurrences = 0
    def add_occurrence(self, category, number=1):
        # "category in self.occurrences" tests membership directly,
        # without building the key list that .keys() would.
        if category in self.occurrences:
            self.occurrences[category] += number
        else:
            self.occurrences[category] = number
        self.total_occurrences += number
    def finalize(self):
        # Trade the dictionary for a sorted tuple of
        # (word number, count) pairs to save memory.
        occurrences_as_list = []
        for word in self.occurrences:
            occurrences_as_list.append((word_numberer.get_number(word),
              self.occurrences[word]))
        sort_by_item(occurrences_as_list, 0)
        self.finalized_data = tuple(occurrences_as_list)
        self.occurrences = None
    def get_occurrences(self, category):
        if self.occurrences is None:
            # Finalized: binary-search the sorted pairs by word number.
            return binary_search(self.finalized_data, category, (0, 0))[1]
        elif category in self.occurrences:
            return self.occurrences[category]
        else:
            return 0
    def get_proportion(self, category):
        if self.total_occurrences > 0:
            return float(self.get_occurrences(category)) / \
              float(self.total_occurrences)
        else:
            return 0
    def get_score(self, other_histogram):
        # Dot product of the two count vectors, normalized by the
        # product of their totals.
        if self.total_occurrences == 0 or \
          other_histogram.total_occurrences == 0:
            return 0
        numerator = 0
        for key, count in self.finalized_data:
            numerator += count * other_histogram.get_occurrences(key)
        denominator = self.total_occurrences * \
          other_histogram.total_occurrences
        return float(numerator) / float(denominator)
    def get_total_occurrences(self):
        return self.total_occurrences
    def get_words(self):
        # Only meaningful before finalize(); afterwards
        # self.occurrences is None.
        return self.occurrences.keys()
    def remove_occurrence(self, category, number=1):
        if category in self.occurrences:
            # Clamp so we never remove more occurrences than exist,
            # and subtract that same amount from the running total.
            difference = min(number, self.occurrences[category])
            self.occurrences[category] -= difference
            if self.occurrences[category] <= 0:
                del self.occurrences[category]
            self.total_occurrences -= difference
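
Exercised in isolation, the dictionary phase behaves roughly like this
(a minimal sketch: "ancestor" is stood in for by a do-nothing class,
which would have to be declared before the histogram class, and the
sample words are made up):

class ancestor(object):
    """Do-nothing stand-in for the real base class."""
    def __init__(self):
        pass

h = histogram()
for word in "the cat sat on the mat".split():
    h.add_occurrence(word)

print(h.get_occurrences("the"))     # 2
print(h.get_proportion("the"))      # 2/6, about 0.33
print(h.get_total_occurrences())    # 6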

What it does is read in a document, build a histogram of its words, and
then replace the histogram's dictionary with a sorted tuple of
(word number, count) pairs. A document might have 50,000 words, about
4,000 of them unique.
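
To put rough numbers on the question above, here is a sketch comparing
the container costs of a 4,000-entry dictionary against the finalized
tuple (sys.getsizeof needs Python 2.6 or later, and the word strings
are invented):

import sys

unique = 4000
as_dict = dict(("word%d" % i, 1) for i in range(unique))
as_tuple = tuple(sorted((i, 1) for i in range(unique)))

print(sys.getsizeof(as_dict))    # the dict's hash table alone
print(sys.getsizeof(as_tuple))   # the tuple's pointer slots alone
# Neither figure counts the keys, counts, or inner pair tuples,
# so these are floors on the real per-document footprint.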

What bugs me is that there's 150M of data, mostly large text files
without much variation, and when I read it in, the histograms come to
50M pickled and 200-250M in live memory--so my "optimized" version
takes more space than keeping the entire document collection in memory
would.
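
On the intern() idea in the subject line: interning every token as it
is read means each distinct word exists as one shared string object, no
matter how many documents it appears in. A sketch (intern() is a
builtin in Python 2 and lives in sys in later versions; the file name
is hypothetical):

try:
    intern                      # builtin in Python 2
except NameError:
    from sys import intern      # Python 3

def tokenize(text):
    # Interning each token makes repeated words across documents
    # share one string object instead of piling up copies.
    return [intern(word) for word in text.split()]

a = intern("".join(["ap", "ple"]))
assert a is intern("apple")     # one shared object, two names

# words = tokenize(open("some_document.txt").read())  # hypothetical file

That only helps while the dictionaries hold word strings; it does
nothing for the finalized tuples, which already store word numbers
rather than the words themselves.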

-- 
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com




