interning strings

Sun Nov 7 21:01:05 EST 2004

A while ago, we faced a similar issue, trying to reduce total memory
usage and runtime of one of our Python applications which parses very
large log files (100+ MB).

One particular class is instantiated many times, and changing just that
class to use __slots__ helped quite a bit.  More details are here:

<http://mail.python.org/pipermail/python-list/2004-May/220513.html>
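
As a rough illustration of the __slots__ change (the record classes and
their fields below are made up for this sketch and are not taken from the
application; it is written for current Python, whereas in the Python 2 of
this post the class would also have to derive from object):

```python
class PlainRecord:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, timestamp, level, message):
        self.timestamp = timestamp
        self.level = level
        self.message = message


class SlottedRecord:
    """Same fields, stored in fixed slots instead of a per-instance __dict__."""

    __slots__ = ("timestamp", "level", "message")

    def __init__(self, timestamp, level, message):
        self.timestamp = timestamp
        self.level = level
        self.message = message


def rejects_new_attribute(obj):
    """Return True if the object refuses an attribute it did not declare."""
    try:
        obj.extra = 1
    except AttributeError:
        return True
    return False


plain = PlainRecord("21:01:05", "INFO", "started")
slotted = SlottedRecord("21:01:05", "INFO", "started")
```

The slotted instances have no __dict__, which is where the per-object
saving comes from; the price is that attributes outside __slots__ can no
longer be added.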

/Jean Brouwers
 ProphICy Semiconductor, Inc.

In article <418eab10$0$13356$afc38c87 at news.optusnet.com.au>, Mike
Thompson wrote:

> [snip very useful explanation]
> 
> > 
> > By the way, why would you want to mess with these implementation details?
> > Use the == operator to compare strings and be happy ever after :-)
> > 
> 
> '==' won't help me, I'm afraid.
> 
> I need to improve the speed and memory footprint of an application which 
> reads in a very large XML document.
> 
> Some elements in the incoming documents can be filtered out, so I've 
> written my own SAX handler to extract just what I want. All the same, 
> the content being read in is substantial.
> 
> So, to further reduce memory footprint, my SAX handler tries to manually 
> intern (using dicts of strings) a lot of the duplicated content and 
> attributes coming from the XML documents. Also, I use the SAX feature 
> 'feature_string_interning' to hopefully intern the strings used for 
> attribute names etc.
> 
> Which is all working fine, except that now, as a final process, I'd like 
> to understand interning a bit more.
> 
> From your explanation there seem to be no language rules, just 
> implementation accidents.  And none of those will be particularly 
> helpful in my case.
> 
> However, I still think I'm going to try using the builtin 'intern' 
> rather than my own dict cache. That may provide an advantage, even if it 
> doesn't work with unicode.
> 
> --
> Mike
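
For reference, the approaches mentioned in the quoted message can be
sketched roughly as below.  This is only a sketch under assumptions: the
InterningHandler class, the parse() helper and the sample document are
hypothetical, and the code targets current Python, where the builtin
'intern' has moved to sys.intern and str is already Unicode.

```python
import io
import sys
import xml.sax
from xml.sax.handler import ContentHandler, feature_string_interning


class InterningHandler(ContentHandler):
    """Hypothetical handler keeping one shared copy of each attribute value."""

    def __init__(self):
        super().__init__()
        self._cache = {}   # manual interning table: string -> canonical string
        self.values = []   # attribute values collected from the document

    def _intern(self, s):
        # setdefault returns the first equal string that was stored, so
        # later duplicates collapse onto that single object.
        return self._cache.setdefault(s, s)

    def startElement(self, name, attrs):
        for key in attrs.getNames():
            self.values.append(self._intern(attrs.getValue(key)))


def parse(xml_bytes):
    """Parse a document held in memory and return the filled-in handler."""
    handler = InterningHandler()
    parser = xml.sax.make_parser()
    # Ask the parser to intern element and attribute names.
    parser.setFeature(feature_string_interning, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler


handler = parse(b'<log><e host="alpha"/><e host="alpha"/></log>')

# The same idea with the interpreter's own table instead of a private dict.
# The strings are built at run time so they start out as separate objects.
a = sys.intern("".join(["al", "pha"]))
b = sys.intern("".join(["al", "pha"]))
```

Either way, equal strings end up as one object, so duplicates cost only a
reference each.  Whether sys.intern beats a private dict in speed or
memory would have to be measured on the real documents.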