What is eating the memory here? hashlib?
Paulo da Silva
p_s_d_a_s_i_l_v_a_ns at netcabo.pt
Mon Feb 15 03:05:46 EST 2016
At 02:21 on 14-02-2016, Steven D'Aprano wrote:
> On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:
Thanks Steven for your advice.
This is a small script to solve a specific problem.
It will be used in the future to solve other similar problems, probably
with minor changes.
When I found it eating memory, and it still did so after what I thought
was the first cause had been fixed, I suspected something less obvious.
After all, it seems there is nothing wrong with it (see my other post).
> That's your first clue that, perhaps, you should be reading in relatively
> small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
> shows that typically you should read from files in small-ish chunks, and
> that trying to read in large chunks is often counter-productive:
> The first three links all talk about optimal sizes being measured in small
> multiples of 4K, not 40MB.
I didn't know about this!
Most of my files are about 30MB or larger, so I chose 40MB to avoid
Python loops. After all, Python should be able to optimize those things.
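
For illustration, here is the chunked-hashing pattern as a minimal
sketch - the 64KB block size is only an example, not a measured
optimum:

import hashlib

def file_digest(pathname, block=64 * 1024):
    # Read fixed-size chunks so memory use stays near 'block' bytes
    # no matter how large the file is.
    h = hashlib.sha256()
    with open(pathname, 'rb') as f:
        for chunk in iter(lambda: f.read(block), b''):
            h.update(chunk)
    return h.hexdigest()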
> You can try to increase the system buffer, by changing the "open" line to:
> with open(pathname, 'rb', buffering=40*M) as f:
This is another thing. One thing is the amount of data I request per
read; another is choosing the "real" buffer size. (I didn't know about
this argument - thanks).
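
If I understood it right, something like this separates the two - the
sizes are only examples and 'somefile' is a placeholder:

M = 1024 * 1024
pathname = 'somefile'  # placeholder path
# buffering= sets the size of the file object's internal buffer;
# the argument to read() is how much data each call requests.
with open(pathname, 'rb', buffering=4 * M) as f:
    while True:
        chunk = f.read(4096)  # small requests; the buffer sits behind them
        if not chunk:
            break
        # ... feed chunk to the hash here ...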
> By the way, do you need a cryptographic checksum? sha256 is expensive to
> calculate. If all you are doing is trying to match files which could have
> the same content, you could use a cheaper hash, like md5 or even crc32.
I don't know the collision probability of each of them. The script has
sha256 and md5 as options. For the execution that failed I had chosen
sha256. I didn't check whether it takes much more time. A collision
might cause data loss. So ...
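
Just to illustrate plugging in the cheaper options (hashlib.new() and
zlib.crc32 are standard library; the checksum() wrapper itself is only
my sketch):

import hashlib
import zlib

def checksum(pathname, algo='sha256', block=64 * 1024):
    # crc32 is only 32 bits, so collisions are realistic on large sets.
    if algo == 'crc32':
        crc = 0
        with open(pathname, 'rb') as f:
            for chunk in iter(lambda: f.read(block), b''):
                crc = zlib.crc32(chunk, crc)
        return format(crc & 0xffffffff, '08x')
    # 'md5' (128 bits) or 'sha256' (256 bits) via hashlib.
    h = hashlib.new(algo)
    with open(pathname, 'rb') as f:
        for chunk in iter(lambda: f.read(block), b''):
            h.update(chunk)
    return h.hexdigest()

For accidental (non-adversarial) collisions even md5's 128 bits make a
match astronomically unlikely; crc32 is the only one of the three where
collisions are a practical concern.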