What is heating the memory here? hashlib?

Chris Angelico rosuav at gmail.com
Sat Feb 13 21:01:52 EST 2016


On Sun, Feb 14, 2016 at 12:44 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>> What happens if, after hashing each file (and returning from this
>> function), you call gc.collect()? If that reduces your RAM usage, you
>> have reference cycles somewhere.
>>
> I have used gc and del. No luck.
>
> The most probable cause seems to be hashlib not correctly handling
> updates with big buffers. I am working on one computer and testing on
> another, so it is possible, though unlikely, that I somehow forgot to
> transfer the change to the other computer.
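
One quick sanity check on the cycle theory: gc.collect() returns the
number of unreachable objects it found, so a tiny sketch like this
(illustrative only, not your actual code) tells you whether cycles are
in play at all:

```python
import gc

# Force a full collection and report how much it actually reclaimed.
# A large count suggests reference cycles; a count near zero means the
# memory is held by live references (or at the C level), not by cycles.
unreachable = gc.collect()
print("Unreachable objects collected:", unreachable)
print("Objects still tracked:", len(gc.get_objects()))
```

If that number stays near zero while memory keeps climbing, gc and del
were never going to help, and the leak is elsewhere.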

I'd like to see the problem boiled down to just the hashlib calls.
Something like this:

import hashlib

data = b"*" * 4 * 1024 * 1024   # one 4 MiB chunk
lastdig = None
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):    # 100 chunks per simulated "file"
        hu(data)
    dig = h.hexdigest()
    if lastdig is None:
        lastdig = dig
        print("Digest:", dig)
    elif lastdig != dig:
        print("Digest fail!")

Running this on my system (Python 3.6 on Debian Linux) produces a
long-running process with stable memory usage, which is exactly what
I'd expect. Even using different data doesn't change that:

import hashlib
import itertools

byte = itertools.count()
data = b"*" * 4 * 1024 * 1024
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):
        hu(data + bytes([next(byte) & 255]))  # vary the final byte each chunk
    dig = h.hexdigest()
    print("Digest:", dig)

Somewhere between my code and yours is something that consumes all
that memory. Can you neuter the actual disk reading (replacing it with
constants, like this) and make a complete and shareable program that
leaks all that memory?
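
To be concrete about the shape I mean: something like the skeleton
below, where the chunk size, chunk count, and file count are all
placeholder values you would swap for whatever your real code uses, and
the constant bytes object stands in for each disk read:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024   # placeholder: your real read size
CHUNKS_PER_FILE = 10       # placeholder: your real file length in chunks
NUM_FILES = 5              # placeholder: your real file count

data = b"\xaa" * CHUNK_SIZE  # constant stand-in for one disk read

def hash_one_file():
    """Hash one simulated file chunk by chunk, as the real code would."""
    h = hashlib.sha256()
    for _ in range(CHUNKS_PER_FILE):
        h.update(data)
    return h.hexdigest()

digests = [hash_one_file() for _ in range(NUM_FILES)]
# Every simulated file is identical, so every digest must match.
assert len(set(digests)) == 1
print("Digest:", digests[0])
```

If a program of that form still balloons on your machine, we have a
reproducer worth filing; if it doesn't, the problem is in the parts you
removed.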

ChrisA
