[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Markus Schaber m.schaber at codesys.com
Thu Feb 27 14:10:17 CET 2014


Hi,

From: Jeff Hardy [mailto:jdhardy at gmail.com]
> On Thu, Feb 27, 2014 at 11:11 AM, Markus Schaber <m.schaber at codesys.com>
> wrote:
> > Hi,
> >
> > I'm just trying to sum it up:
> >
> > 1) The current code:
> >    - High memory usage.
> >    - High load on the large object heap.
> >    - Limited by the available amount of memory (which might be considered a violation of the Python API).
> >    - High CPU usage when used incrementally (quadratic in the number of blocks added).
> >
> > 2) Optimizing with MemoryStream and lazy calculation:
> >    - High memory usage.
> >    - High load on the large object heap.
> >    - Limited by the available amount of memory (which might be considered a violation of the Python API).
> >    + Optimal CPU usage when the hash is only fetched once.
> >    ± Better than the current code, but still not optimal when the hash is incrementally fetched several times.
> >
> > 3) Optimizing with jagged arrays and lazy calculation:
> >    - High memory usage.
> >    + Improved or no impact on the large object heap (depending on the exact implementation).
> >    - Limited by the available amount of memory (which might be considered a violation of the Python API).
> >    + Optimal CPU usage when the hash is only fetched once.
> >    ± Better than the current code, but still not optimal when the hash is incrementally fetched several times.
> >
> > 4) Using the existing .NET incremental APIs:
> >    + Low, constant memory usage.
> >    + No impact on the large object heap.
> >    + Data length not limited by the amount of memory.
> >    + Optimal CPU usage when the hash is only fetched once.
> >    - Breaks when the hash is incrementally fetched several times (which is likely a violation of the Python API).
> >
> > 5) Finding or porting a different hash implementation in C#:
> >    + Low, constant memory usage.
> >    + No impact on the large object heap.
> >    + Data length not limited by the amount of memory.
> >    + Optimal CPU usage when the hash is only fetched once.
> >    + Optimal CPU usage when the hash is incrementally fetched several times.
> >
> > I have a local prototype implemented for 2), but I'm not sure whether
> > that's the best way to go...
> 
> Good analysis!
> 
> My preference would be for (4), raising an exception if .update() is called
> after .digest(), or if .copy() is called at all. As a fallback, an extra
> parameter to hashlib.new (&c) could trigger (2) for the cases where it's
> needed. I can't say for sure, but I would think calling .update() after
> .digest() would be rare, and so would .copy() (damn you Google for shutting
> down code search). At least then the common case is fast and the edge cases
> are (usually) still possible.

Do you think asking on some cPython lists could give us usable feedback on how
common it is to call copy(), or to continue feeding data after calling
digest()?
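
For reference, CPython's hashlib explicitly allows both patterns, which is why
option 4 might violate the API - a quick illustration:

```python
import hashlib

h = hashlib.sha1()
h.update(b"hello ")
first = h.digest()       # digest() does not finalize the object in CPython
h.update(b"world")       # feeding more data after digest() is allowed
clone = h.copy()         # copy() clones the internal hash state
clone.update(b"!")

assert first == hashlib.sha1(b"hello ").digest()
assert h.digest() == hashlib.sha1(b"hello world").digest()
assert clone.digest() == hashlib.sha1(b"hello world!").digest()
```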

> > Maybe we should google for purely managed implementations of the hash
> > algorithms with a sensible license...
> 
> There seem to be some for MD5 and SHA1, but not for SHA2 or RIPEMD160. They
> could be ported from the public-domain Crypto++ library, but that seems like
> a lot of work for an edge case.

Yes, that seems to be a lot of work.

On the other hand, it's the 100% solution. :-)
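
Just to illustrate the trade-off behind 2), here is a rough pure-Python sketch
of the idea - buffer everything, hash lazily on digest(). (BufferedHash is a
made-up name; this is only a sketch of the approach, not the actual prototype.)

```python
import hashlib

class BufferedHash:
    """Sketch of option 2: buffer all input, hash lazily on digest().

    Memory grows with the input (the drawback noted above), but
    update()/digest() interleaving and copy() keep working.
    """

    def __init__(self, name):
        self._name = name
        self._chunks = []  # byte chunks appended on update(), never re-copied

    def update(self, data):
        self._chunks.append(bytes(data))

    def digest(self):
        # Re-hash all buffered data on every call: cheap if digest() is
        # fetched once, quadratic if fetched after every update().
        h = hashlib.new(self._name)
        for chunk in self._chunks:
            h.update(chunk)
        return h.digest()

    def copy(self):
        clone = BufferedHash(self._name)
        clone._chunks = list(self._chunks)
        return clone
```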

Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: m.schaber at codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915
