Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?
__peter__ at web.de
Sat Aug 7 10:16:47 CEST 2010
> On Aug 6, 11:50 pm, Peter Otten <__pete... at web.de> wrote:
>> I don't know to what extent it still applies but switching off cyclic
>> garbage collection with
>> import gc
>> gc.disable()
>> while building large datastructures used to speed up things
>> significantly. That's what I would try first with your real data.
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
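A minimal sketch of the gc suggestion above, for readers who want to try it. The helper name build_table and the shape of the data are made up for illustration; only the gc module calls are standard. The try/finally restores the collector's previous state even if the build fails.

```python
import gc


def build_table(n):
    """Build a dict of n small tuples with the cyclic collector paused.

    build_table is a hypothetical helper, not code from this thread.
    """
    was_enabled = gc.isenabled()
    gc.disable()  # stop cyclic GC passes while many objects are allocated
    try:
        table = {}
        for i in range(n):
            table[str(i)] = (i, i + 1, i + 2)
        return table
    finally:
        # Restore the previous state rather than enabling unconditionally.
        if was_enabled:
            gc.enable()
```

Whether this helps depends on the data; the synthetic test quoted here showed only about a 2% speedup.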
>> Encoding your unicode strings as UTF-8 could save some memory.
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too....
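One possible shape for the "subclass dict" idea, so the .encode('utf-8') calls stay out of the calling code. This is only a sketch: the class name Utf8KeyDict is invented here, and it covers just item assignment, lookup and membership tests. The dict constructor, update() and get() bypass these overrides and would need the same treatment.

```python
# type(u"") is unicode on Python 2 and str on Python 3.
text_type = type(u"")


class Utf8KeyDict(dict):
    """Hypothetical dict that stores text keys as UTF-8 byte strings."""

    @staticmethod
    def _encode(key):
        # Encode text keys; leave byte strings and other keys untouched.
        if isinstance(key, text_type):
            return key.encode("utf-8")
        return key

    def __setitem__(self, key, value):
        dict.__setitem__(self, self._encode(key), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, self._encode(key))

    def __contains__(self, key):
        return dict.__contains__(self, self._encode(key))
```

Iterating over such a dict yields the encoded byte strings, so keys need a .decode('utf-8') on the way out if the text form is wanted.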
>> When your integers fit into two bytes, say, you can use an array.array()
>> instead of the tuple.
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
> Unicode key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4,
>>>> i+5, i+6)
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
> 4.079240 seconds, 245143.698209 keys per second
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] =
>>>> array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
> 4.985136 seconds, 200596.331486 keys per second
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1,
>>>> i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
> 3.572301 seconds, 279931.625282 keys per second
> Almost halved the memory usage. And faster too. Nice.
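For comparing the two value containers on a single row, something like the following may help. It is a sketch with a made-up helper name (pack_row) and uses the 'h' typecode (signed short) for the "fits into two bytes" case, whereas the benchmark above used 'i'. Note that sys.getsizeof() on a tuple counts only the tuple itself, not the integer objects it points to, so the numbers are not directly comparable to the VmSize figures.

```python
import array
import sys


def pack_row(values, typecode="h"):
    """Hypothetical helper: store a row of small ints in a compact array.

    Raises OverflowError if a value does not fit the chosen typecode.
    """
    return array.array(typecode, values)


row = pack_row((100, 101, 102, 103, 104, 105, 106))

# Bytes used by the raw integer payload inside the array.
payload_bytes = row.itemsize * len(row)

# Size of each container object as reported by the interpreter.
array_bytes = sys.getsizeof(row)
tuple_bytes = sys.getsizeof(tuple(row))  # excludes the int objects themselves
```

With only seven items per row, the fixed per-object overhead of each array can outweigh the smaller payload, so it is worth measuring on real data.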
> def benchmark_dict(d, N):
>     start = time.time()
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([ letters[random.randint(0, 255)] for i in
>         d[word] += 1
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' %
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (N,
>         len(d), vm, dt, N / dt)
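The VmPeak/VmSize figures in the benchmark come from the process status file under /proc. A rough sketch of that step, split so the parsing can be tried on any text; the function names vm_stats and read_own_vm_stats are invented here, and reading /proc/self/status only works on Linux.

```python
import re


def vm_stats(status_text):
    """Return a dict of VmPeak/VmSize values (in kB) found in status_text.

    status_text is expected to look like the contents of /proc/<pid>/status.
    """
    pairs = re.findall(r"^(VmPeak|VmSize):\s+(\d+)\s+kB",
                       status_text, re.MULTILINE)
    return dict((name, int(kb)) for name, kb in pairs)


def read_own_vm_stats():
    """Linux-only: parse the current process's own status file."""
    with open("/proc/self/status") as handle:
        return vm_stats(handle.read())
```

Returning integers instead of the raw matched lines makes it easier to compare runs or compute differences between them.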
Looking at your benchmark, random.choice(letters) probably has less overhead
than letters[random.randint(...)]. You might even try to inline it as