Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?
Peter Otten
__peter__ at web.de
Sat Aug 7 04:16:47 EDT 2010
dmtr wrote:
> On Aug 6, 11:50 pm, Peter Otten <__pete... at web.de> wrote:
>> I don't know to what extent it still applies but switching off cyclic
>> garbage collection with
>>
>> import gc
>> gc.disable()
>
>
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
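In case it helps, the pattern I had in mind looks roughly like this (untested
here; how much it buys you depends on how many container objects you allocate
while building):

import gc

gc.disable()     # no cyclic garbage collection while the big dict is built
try:
    d = {}
    for i in xrange(1000000):
        d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
finally:
    gc.enable()  # switch collection back on once the structure is in place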
>
>
>> while building large datastructures used to speed up things
>> significantly. That's what I would try first with your real data.
>>
>> Encoding your unicode strings as UTF-8 could save some memory.
>
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too....
>
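Untested, but a thin wrapper along these lines should keep the .encode()
calls out of the calling code (Utf8Dict is just a name for the sketch, and
get()/setdefault() and friends would need the same treatment):

class Utf8Dict(dict):
    """Dict that stores unicode keys as UTF-8 encoded str objects."""
    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)
    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)
    def __contains__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__contains__(self, key)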
>> When your integers fit into two bytes, say, you can use an array.array()
>> instead of the tuple.
>
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
>
> Unicode key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
> 4.079240 seconds, 245143.698209 keys per second
>
> UTF-8 key / array.array('i'):
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
> 4.985136 seconds, 200596.331486 keys per second
>
> UTF-8 key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
> 3.572301 seconds, 279931.625282 keys per second
>
> Almost halved the memory usage. And faster too. Nice.
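On the array.array() side: if the individual values also fit into two bytes
you could try the 'h' typecode instead of 'i'. A rough sketch, assuming
everything stays within the signed 16-bit range:

import array

values = (1, 2, 3, 4, 5, 6, 7)   # must fit into -32768..32767 for 'h'
a = array.array('h', values)     # 2 bytes per item instead of 4 with 'i' on most platforms
print a.itemsize * len(a)        # 14 bytes of payload for the seven values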
> def benchmark_dict(d, N):
>     start = time.time()
>
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([ letters[random.randint(0, 255)] for i in xrange(length) ])
>         d[word] += 1
>
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (N, len(d), vm, dt, N / dt)
>
Looking at your benchmark, random.choice(letters) probably has less overhead
than letters[random.randint(...)]. You might even try to inline it as
letters[int(random.random()*256)].
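Untested, but a quick timeit comparison along these lines should tell you
whether it's worth bothering (the letters list here is just a stand-in for
your 256-entry table):

import timeit

setup = "import random; letters = [chr(c) for c in range(256)]"
for stmt in ("letters[random.randint(0, 255)]",
             "random.choice(letters)",
             "letters[int(random.random() * 256)]"):
    print stmt, timeit.timeit(stmt, setup, number=100000)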
Peter