Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?
Peter Otten
__peter__ at web.de
Sat Aug 7 04:16:47 EDT 2010
dmtr wrote:
> On Aug 6, 11:50 pm, Peter Otten <__pete... at web.de> wrote:
>> I don't know to what extent it still applies but switching off cyclic
>> garbage collection with
>>
>> import gc
>> gc.disable()
>
>
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
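In case it helps, the pattern I had in mind looks roughly like this (untested
here; how much it buys you depends on how many container objects you allocate
while building):

import gc

gc.disable()     # no cyclic garbage collection while the big dict is built
try:
    d = {}
    for i in xrange(1000000):
        d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
finally:
    gc.enable()  # switch collection back on once the structure is in place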
>
>
>> while building large datastructures used to speed up things
>> significantly. That's what I would try first with your real data.
>>
>> Encoding your unicode strings as UTF-8 could save some memory.
>
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too....
>
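Untested, but a thin wrapper along these lines should keep the .encode()
calls out of the calling code (Utf8Dict is just a name for the sketch, and
get()/setdefault() and friends would need the same treatment):

class Utf8Dict(dict):
    """Dict that stores unicode keys as UTF-8 encoded str objects."""
    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)
    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)
    def __contains__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__contains__(self, key)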
>> When your integers fit into two bytes, say, you can use an array.array()
>> instead of the tuple.
>
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
>
> Unicode key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
> 4.079240 seconds, 245143.698209 keys per second
>
> UTF-8 key / array.array('i'):
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
> 4.985136 seconds, 200596.331486 keys per second
>
> UTF-8 key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
> 3.572301 seconds, 279931.625282 keys per second
>
> Almost halved the memory usage. And faster too. Nice.
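On the array.array() side: if the individual values also fit into two bytes
you could try the 'h' typecode instead of 'i'. A rough sketch, assuming
everything stays within the signed 16-bit range:

import array

values = (1, 2, 3, 4, 5, 6, 7)   # must fit into -32768..32767 for 'h'
a = array.array('h', values)     # 2 bytes per item instead of 4 with 'i' on most platforms
print a.itemsize * len(a)        # 14 bytes of payload for the seven values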
> def benchmark_dict(d, N):
>     start = time.time()
>
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([ letters[random.randint(0, 255)] for i in xrange(length) ])
>         d[word] += 1
>
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (N, len(d), vm, dt, N / dt)
>
Looking at your benchmark, random.choice(letters) probably has less overhead
than letters[random.randint(...)]. You might even try to inline it as
letters[int(random.random()*256)].
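Untested, but a quick timeit comparison along these lines should tell you
whether it's worth bothering (the letters list here is just a stand-in for
your 256-entry table):

import timeit

setup = "import random; letters = [chr(c) for c in range(256)]"
for stmt in ("letters[random.randint(0, 255)]",
             "random.choice(letters)",
             "letters[int(random.random() * 256)]"):
    print stmt, timeit.timeit(stmt, setup, number=100000)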
Peter