hashlib.sha256(file_obj)
It was brought to my attention that there is an open ticket for this. Moving discussion there: https://bugs.python.org/issue45150

Aur

On Sun, Mar 13, 2022 at 6:19 PM Aur Saraf <sonoflilit@gmail.com> wrote:
Hi,
TL;DR: you can read just the code snippets and the last paragraph.
First of all, I'm assuming hashlib is used to calculate hashes of large files in production very very often, such that even small performance and usability improvements would make a huge difference. If you don't share this assumption, delete this email :-)
Today, a hash object's update() accepts only bytes-like data, and many Stack Overflow answers include code like:
    import hashlib

    BUF_SIZE = 65536  # read in 64 KiB chunks

    sha256 = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: update() wants bytes
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            sha256.update(data)
    return sha256.hexdigest()
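or the equivalent walrus-loop variant that also makes the rounds (same pattern, just tighter):

    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(BUF_SIZE):
            sha256.update(chunk)
    return sha256.hexdigest()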
This is problematic because not everybody knows this pattern, so many codebases will include code like:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
and, frankly, who can blame them.
It is also problematic for performance reasons: even if we know to do the chunked reading, hashing a large file means the GIL is acquired and released once per chunk, and a fresh buffer is allocated and thrown away for every read.
As far as I can see, hashlib already keeps a lock per hash object and safely releases the GIL in update() when given a large bytes object, so it should be safe to add an option for update()/new() to accept a file object and do the chunked reading and updating internally, with one reusable buffer and the GIL released throughout, so that this becomes
    with open(path, "rb") as f:
        return sha256(f).hexdigest()
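For illustration, the internals could look roughly like the pure-Python sketch below, which reads into a single preallocated buffer via readinto() (the helper name is invented for the example; the GIL benefit would of course only materialize once this loop lives in C):

    import hashlib

    def _update_from_file(hash_obj, fileobj, bufsize=64 * 1024):
        # One preallocated buffer reused for every read, instead of a
        # fresh bytes object per chunk.
        buf = bytearray(bufsize)
        view = memoryview(buf)
        while True:
            n = fileobj.readinto(buf)
            if not n:
                break
            hash_obj.update(view[:n])
        return hash_obj

    with open(path, "rb") as f:
        digest = _update_from_file(hashlib.sha256(), f).hexdigest()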
We can discuss whether this is the best API or whether it is preferable to have
    with open(path, "rb") as f:
        return sha256.from_file(f).hexdigest()
or
    with open(path, "rb") as f:
        return sha256().update_from_file(f).hexdigest()
but I submit that many people already try sha256(f).hexdigest() today because they are used to e.g. json and csv accepting file objects, and that passing a file object currently raises, so making both new() and update() accept file objects would be the most beginner-friendly option and would not break any existing code.
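To make the compatibility point concrete, here is a rough pure-Python wrapper (names invented for the example, reusing the _update_from_file sketch above) showing that dispatching on file-like versus bytes-like input leaves existing callers untouched:

    import hashlib

    class FileFriendlyHash:
        # Illustrative wrapper only; a real change would live in hashlib's C code.
        def __init__(self, name, data=None):
            self._h = hashlib.new(name)
            if data is not None:
                self.update(data)

        def update(self, data):
            # File-like input goes through the chunked reader; anything else
            # is handed straight to hashlib, exactly as today.
            if hasattr(data, "readinto"):
                _update_from_file(self._h, data)
            else:
                self._h.update(data)
            return self  # returning self allows the chained style above

        def hexdigest(self):
            return self._h.hexdigest()

    with open(path, "rb") as f:
        print(FileFriendlyHash("sha256", f).hexdigest())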
Knowing that there are a million tiny details that need to be... hashed out, and given that I'm willing to write the code, would the devs be receptive to something like this?
Thanks,
Aur