What is heating the memory here? hashlib?
Chris Angelico
rosuav at gmail.com
Sat Feb 13 21:01:52 EST 2016
On Sun, Feb 14, 2016 at 12:44 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>> What happens if, after hashing each file (and returning from this
>> function), you call gc.collect()? If that reduces your RAM usage, you
>> have reference cycles somewhere.
>>
> I have used gc and del. No luck.
>
> The most probable cause seems to be hashlib not correctly handling big
> buffer updates. I am working on one computer and testing on another, so
> for the second part it may be that I somehow forgot to transfer the
> change to the other computer. Unlikely, but possible.
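(For concreteness, that gc.collect() check amounts to something like the
sketch below; hash_file() is a hypothetical stand-in for whatever per-file
hashing the real script does.)

import gc
import hashlib

def hash_file(path, bufsize=4*1024*1024):
    # Hypothetical per-file hasher: read in fixed-size chunks, feed each
    # chunk to a single sha256 object, return the final digest.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

for path in ["a.bin", "b.bin"]:      # whatever files are being scanned
    print(path, hash_file(path))
    freed = gc.collect()             # returns the number of unreachable objects found
    print("gc.collect() collected", freed, "objects")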
I'd like to see the problem boiled down to just the hashlib calls.
Something like this:
import hashlib
data = b"*" * 4*1024*1024
lastdig = None
while "simulating files":
h = hashlib.sha256()
hu = h.update
for chunk in range(100):
hu(data)
dig = h.hexdigest()
if lastdig is None:
lastdig = dig
print("Digest:",dig)
else:
if lastdig != dig:
print("Digest fail!")
Running this on my system (Python 3.6 on Debian Linux) produces a
long-running process with stable memory usage, which is exactly what
I'd expect. Even using different data doesn't change that:
import hashlib
import itertools
byte = itertools.count()
data = b"*" * 4*1024*1024
while "simulating files":
h = hashlib.sha256()
hu = h.update
for chunk in range(100):
hu(data + bytes([next(byte)&255]))
dig = h.hexdigest()
print("Digest:",dig)
Somewhere between my code and yours is something that consumes all
that memory. Can you neuter the actual disk reading (replacing it with
constants, like this) and make a complete and shareable program that
leaks all that memory?
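For example, a neutered reader might look like this (fake_read_chunks() is
just a made-up name for whatever currently pulls data off the disk, so the
rest of the program's structure can stay intact):

import hashlib

CHUNK = b"*" * 4*1024*1024           # constant stand-in for real file data

def fake_read_chunks(path):
    # Pretend to read the file: yield 100 identical 4MB chunks
    # instead of touching the disk at all.
    for _ in range(100):
        yield CHUNK

def hash_one(path):
    h = hashlib.sha256()
    for chunk in fake_read_chunks(path):
        h.update(chunk)
    return h.hexdigest()

for n in range(1000):                # simulate a big batch of files
    print(hash_one("file%d" % n))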
ChrisA