hashlib.sha256(file_obj)
It was brought to my attention that there is an open ticket for this. Moving discussion there: https://bugs.python.org/issue45150

Aur

On Sun, Mar 13, 2022 at 6:19 PM Aur Saraf <sonoflilit@gmail.com> wrote:
Hi,
TL;DR: you can read just the code snippets and the last paragraph.
First of all, I'm assuming hashlib is used to calculate hashes of large files in production very very often, such that even small performance and usability improvements would make a huge difference. If you don't share this assumption, delete this email :-)
Today, a hash object's update() accepts only bytes-like data, and many Stack Overflow answers include code like:
    import hashlib

    BUF_SIZE = 65536  # read in 64 KiB chunks

    sha256 = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: update() wants bytes
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            sha256.update(data)
    return sha256.hexdigest()
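or the equivalent walrus-loop variant that also makes the rounds (same pattern, just tighter):

    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(BUF_SIZE):
            sha256.update(chunk)
    return sha256.hexdigest()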
This is problematic because not everybody knows this pattern, so many codebases will include code like:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
and, frankly, who can blame them.
It is also problematic for performance reasons: even if we know to do the chunked reading, hashing a large file means the GIL is acquired and released once per chunk, and a fresh buffer is allocated and thrown away for every read.
As far as I can see, hashlib already keeps a lock per hash object and safely releases the GIL in update() when given a large bytes object, so it should be safe to add an option for update()/new() to accept a file object and do the chunked reading and updating internally, with one reusable buffer and the GIL released throughout, so that this becomes
    with open(path, "rb") as f:
        return sha256(f).hexdigest()
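For illustration, the internals could look roughly like the pure-Python sketch below, which reads into a single preallocated buffer via readinto() (the helper name is invented for the example; the GIL benefit would of course only materialize once this loop lives in C):

    import hashlib

    def _update_from_file(hash_obj, fileobj, bufsize=64 * 1024):
        # One preallocated buffer reused for every read, instead of a
        # fresh bytes object per chunk.
        buf = bytearray(bufsize)
        view = memoryview(buf)
        while True:
            n = fileobj.readinto(buf)
            if not n:
                break
            hash_obj.update(view[:n])
        return hash_obj

    with open(path, "rb") as f:
        digest = _update_from_file(hashlib.sha256(), f).hexdigest()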
We can discuss whether this is the best API or whether it is preferable to have
    with open(path, "rb") as f:
        return sha256.from_file(f).hexdigest()
or
    with open(path, "rb") as f:
        return sha256().update_from_file(f).hexdigest()
but I submit that many people already try sha256(f).hexdigest() today because they are used to e.g. json and csv accepting file objects, and that passing a file object currently raises, so making both new() and update() accept file objects would be the most beginner-friendly option and would not break any existing code.
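To make the compatibility point concrete, here is a rough pure-Python wrapper (names invented for the example, reusing the _update_from_file sketch above) showing that dispatching on file-like versus bytes-like input leaves existing callers untouched:

    import hashlib

    class FileFriendlyHash:
        # Illustrative wrapper only; a real change would live in hashlib's C code.
        def __init__(self, name, data=None):
            self._h = hashlib.new(name)
            if data is not None:
                self.update(data)

        def update(self, data):
            # File-like input goes through the chunked reader; anything else
            # is handed straight to hashlib, exactly as today.
            if hasattr(data, "readinto"):
                _update_from_file(self._h, data)
            else:
                self._h.update(data)
            return self  # returning self allows the chained style above

        def hexdigest(self):
            return self._h.hexdigest()

    with open(path, "rb") as f:
        print(FileFriendlyHash("sha256", f).hexdigest())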
Knowing that there are a million tiny details that need to be... hashed out, and given that I'm willing to write the code, would the devs be receptive to something like this?
Thanks,
Aur