Fastest standard way to sum bytes (and their squares)?

Dan Stromberg strombrg at gmail.com
Sun Aug 19 02:55:33 CEST 2007

On Sun, 12 Aug 2007 02:26:59 -0700, Erik Max Francis wrote:

> For a file hashing system (finding similar files, rather than identical 
> ones), I need to be able to efficiently and quickly sum the ordinals of 
> the bytes of a file and their squares.  Because of the nature of the 
> application, it's a requirement that I do it in Python, or only with 
> standard library modules (if such facilities exist) that might assist.
> So far the fastest way I've found is using the `sum` builtin and 
> generators::
> 
> 	ordinalSum = sum(ord(x) for x in data)
> 	ordinalSumSquared = sum(ord(x)**2 for x in data)
> 
> This is about twice as fast as an explicit loop, but since it's going to 
> be processing massive amounts of data, the faster the better.  Are there 
> any tricks I'm not thinking of, or perhaps helper functions in other 
> modules that I'm not thinking of?

I see a lot of replies tackling the CPU side of the optimization, but what
about the I/O side, which the question admittedly seems to sidestep?

You might experiment with using mmap() instead of read().  If it helps at
all, it may help a lot, because the I/O time is likely to dominate the CPU
time.
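
A minimal sketch of that experiment, assuming Python 3 (where slicing an
mmap gives a bytes object whose items are already ints, so ord() is not
needed).  The function names and the 1 MiB chunk size are illustrative
choices, not anything from the original question, and whether the mmap
version actually wins will depend on the platform and on how the file is
cached, so time both on real data:

```python
import mmap
import os


def byte_sums_mmap(path, chunk_size=1 << 20):
    """Return (sum of byte values, sum of squared byte values) via mmap."""
    if os.path.getsize(path) == 0:
        return 0, 0  # mmap refuses to map an empty file
    total = 0
    total_sq = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), chunk_size):
                # Slicing the map yields bytes; iterating bytes yields ints.
                chunk = mm[start:start + chunk_size]
                total += sum(chunk)
                total_sq += sum(b * b for b in chunk)
    return total, total_sq


def byte_sums_read(path, chunk_size=1 << 20):
    """Same result using plain read() calls, for timing comparison."""
    total = 0
    total_sq = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += sum(chunk)
            total_sq += sum(b * b for b in chunk)
    return total, total_sq
```

Both functions should return the same pair for any given file, so they can
be compared directly with timeit or time.perf_counter() on a representative
sample of the files to be hashed.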

