What is heating the memory here? hashlib?

Steven D'Aprano steve at pearwood.info
Sat Feb 13 21:21:17 EST 2016


On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:

> Hello all.
> 
> I'm running in a very strange (for me at least) problem.
> 
> def getHash(self):
>     bfsz=File.blksz
>     h=hashlib.sha256()
>     hu=h.update
>     with open(self.getPath(),'rb') as f:
>         f.seek(File.hdrsz)    # Skip header
>         b=f.read(bfsz)
>         while len(b)>0:
>             hu(b)
>             b=f.read(bfsz)
>         fhash=h.digest()
>     return fhash

This is a good, and tricky, question! Unfortunately, this sort of
performance issue may depend on the specific details of your system.

You can start by telling us what version of Python you are running. You've
already said you're running on Kubuntu, which makes it Linux. Is that a
32-bit or 64-bit version?


Next, let's see if we can simplify the code and make it runnable by anyone,
in the spirit of http://www.sscce.org/



import hashlib
K = 1024
M = 1024*K


def get_hash(pathname, size):
    h = hashlib.sha256()
    with open(pathname, 'rb') as f:
        f.seek(4*K)
        b = f.read(size)
        while b:
            h.update(b)
            b = f.read(size)
    return h.digest()


Does this simplified version demonstrate the same problem?

What happens if you eliminate the actual hashing?


def get_hash(pathname, size):
    with open(pathname, 'rb') as f:
        f.seek(4*K)
        b = f.read(size)
        while b:
            b = f.read(size)
    return b"1234"*16  # dummy value: same type and length as a sha256 digest


This may allow you to determine whether the problem lies in *reading* the
files or *hashing* the files.
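
You can also measure the pure hashing cost with no file I/O at all, by
feeding sha256 a buffer that is already in memory. A self-contained sketch
(the 16 MB of random data is just a stand-in for your file contents):

    import hashlib
    import os
    import time

    M = 1024 * 1024
    data = os.urandom(16 * M)    # 16 MB already in memory: no disk involved

    t0 = time.perf_counter()
    digest = hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - t0
    print("sha256 over 16 MB: %.3f s (%.1f MB/s)" % (elapsed, 16 / elapsed))

If this in-memory rate is far higher than your end-to-end rate, the
bottleneck is the reading, not the hashing.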


Be warned: if you read from the same file over and over again, Linux will
cache that file, and your tests will not reflect the behaviour when you
read thousands of different files from disk rather than from memory cache.
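
On Linux you can ask the kernel to evict a file's pages from the cache
between timing runs, so repeat reads behave more like first reads. A sketch
using os.posix_fadvise (Linux-only, Python 3.3 or later):

    import os

    def drop_file_cache(pathname):
        # Hint to the kernel that this file's cached pages are no longer
        # needed; the next read should go back to the device.
        fd = os.open(pathname, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)

Call this between runs when you benchmark the same file twice.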

What sort of media are you reading from?

- hard drive?
- flash drive or USB stick?
- solid state disk?
- something else?

They will all have different read characteristics.

What happens when you call f.read(size)? By default, Python uses the
following buffering strategy for binary files:


    * Binary files are buffered in fixed-size chunks; the size of the buffer
      is chosen using a heuristic trying to determine the underlying
      device's "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
      On many systems, the buffer will typically be 4096 or 8192 bytes long.


See help(open).
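
You can inspect both the fallback constant and the filesystem "block size"
that the heuristic consults (st_blksize is Unix-only; '.' here is just a
convenient path on the filesystem you care about):

    import io
    import os

    # The documented fallback:
    print(io.DEFAULT_BUFFER_SIZE)        # 8192 on typical builds

    # The block size the heuristic consults comes from os.stat on Unix:
    st = os.stat('.')
    print(getattr(st, 'st_blksize', 'st_blksize not available here'))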

That's your first clue that, perhaps, you should be reading in relatively
small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
shows that typically you should read from files in small-ish chunks, and
that trying to read in large chunks is often counter-productive:

https://duckduckgo.com/html/?q=file+read+buffer+size

The first three links all talk about optimal sizes being measured in small
multiples of 4K, not 40MB.

You can try increasing Python's I/O buffer by changing the "open" line to:

    with open(pathname, 'rb', buffering=40*M) as f:

and see whether that helps.
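
A rough harness for comparing chunk sizes might look like this. The 16 MB
temporary file is a stand-in: swap in one of your real files, and remember
the caching caveat above, since after the first pass the file is in the
page cache and later passes measure warm-cache reads only:

    import hashlib
    import os
    import tempfile
    import time

    K = 1024
    M = 1024 * K

    def hash_with_chunk(pathname, size):
        h = hashlib.sha256()
        with open(pathname, 'rb') as f:
            b = f.read(size)
            while b:
                h.update(b)
                b = f.read(size)
        return h.digest()

    # A throwaway 16 MB file; substitute one of your own.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(16 * M))
        path = tmp.name

    results = {}
    for size in (4 * K, 64 * K, 1 * M, 16 * M):
        t0 = time.perf_counter()
        hash_with_chunk(path, size)
        results[size] = time.perf_counter() - t0

    for size, secs in sorted(results.items()):
        print("%10d bytes: %.3f s" % (size, secs))
    os.unlink(path)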


By the way, do you need a cryptographic checksum? sha256 is expensive to
calculate. If all you are doing is trying to match files which could have
the same content, you could use a cheaper hash, like md5 or even crc32.
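
For instance, zlib.crc32 (standard library) can be updated incrementally
just like a hashlib object, by passing the running value back in. A sketch
of a crc32-based file checksum -- note crc32 is only good for spotting
*candidate* duplicates; collisions are easy, so confirm matches by
comparing actual content:

    import zlib

    def crc32_of_file(pathname, size=64*1024):
        crc = 0
        with open(pathname, 'rb') as f:
            b = f.read(size)
            while b:
                crc = zlib.crc32(b, crc)
                b = f.read(size)
        return crc & 0xFFFFFFFF

    # Incremental updates give the same answer as one shot:
    one_shot = zlib.crc32(b"123456789")
    step_wise = zlib.crc32(b"456789", zlib.crc32(b"123"))
    print(hex(one_shot), one_shot == step_wise)   # 0xcbf43926 True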



-- 
Steven



More information about the Python-list mailing list