[Ironpython-users] Hashing a directory is magnitudes slower than in cPython
Markus Schaber
m.schaber at codesys.com
Thu Feb 27 14:10:17 CET 2014
Hi,
From: Jeff Hardy [mailto:jdhardy at gmail.com]
> On Thu, Feb 27, 2014 at 11:11 AM, Markus Schaber <m.schaber at codesys.com>
> wrote:
> > Hi,
> >
> > I'm just trying to sum it up:
> >
> > 1) The current code:
> > - High memory usage.
> > - High load on the large object heap.
> > - Limited by the available amount of memory (which might be considered a
> > violation of the Python API).
> > - High CPU usage when used incrementally (quadratic in the number of
> > blocks added).
> >
> > 2) Optimizing with MemoryStream and lazy calculation:
> > - High memory usage.
> > - High load on the large object heap.
> > - Limited by the available amount of memory (which might be considered a
> > violation of the Python API).
> > + Optimal CPU usage when the hash is only fetched once.
> > ± Better than the current code, but still not optimal when the hash is
> > incrementally fetched several times.
> >
> > 3) Optimizing with jagged arrays and lazy calculation:
> > - High memory usage.
> > + Improved or no impact on the large object heap (depending on the exact
> > implementation).
> > - Limited by the available amount of memory (which might be considered a
> > violation of the Python API).
> > + Optimal CPU usage when the hash is only fetched once.
> > ± Better than the current code, but still not optimal when the hash is
> > incrementally fetched several times.
> >
> > 4) Using the existing .NET incremental APIs:
> > + Low, constant memory usage.
> > + No impact on the large object heap.
> > + Data length not limited by the amount of available memory.
> > + Optimal CPU usage when the hash is only fetched once.
> > - Breaks when the hash is incrementally fetched several times (which is
> > likely a violation of the Python API).
> >
> > 5) Finding or porting a different hash implementation in C#:
> > + Low, constant memory usage.
> > + No impact on the large object heap.
> > + Data length not limited by the amount of available memory.
> > + Optimal CPU usage when the hash is only fetched once.
> > + Optimal CPU usage when the hash is incrementally fetched several times.
> >
> > I have a local prototype implemented for 2), but I'm not sure whether
> > that's the best way to go...
>
> Good analysis!
>
> My preference would be for (4), raising an exception if .update() is called
> after .digest(), or if .copy() is called at all. As a fallback, an extra
> parameter to hashlib.new (&c) that triggers (2), for cases where it's
> needed - I can't say for sure, but I would think calling .update() after
> .digest() would be rare, and so would .copy() (damn you Google for shutting
> down code search). At least then the common case is fast and the edge cases
> are (usually) possible.
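For illustration, the buffer-and-recompute behaviour of option 2 can be sketched in plain Python (the class and its names here are hypothetical stand-ins; the actual prototype is C# code built around a MemoryStream):

```python
import hashlib

class BufferedHash:
    """Hypothetical sketch of option 2: collect all chunks, hash lazily."""

    def __init__(self, name):
        self._name = name
        self._chunks = []  # appending chunks avoids quadratic re-copying on update()

    def update(self, data):
        self._chunks.append(bytes(data))

    def digest(self):
        # Recompute over the whole buffer on every call: optimal when
        # digest() is fetched once, wasteful when fetched repeatedly.
        return hashlib.new(self._name, b"".join(self._chunks)).digest()

    def copy(self):
        # copy() is cheap here because the raw data is still available.
        clone = BufferedHash(self._name)
        clone._chunks = list(self._chunks)
        return clone
```

The trade-off matches the table above: memory grows with everything ever fed in, but update() after digest() and copy() keep working exactly as CPython specifies.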
Do you think asking on some cPython lists could give usable feedback on how
common it is to call copy() or to continue feeding data after calling
digest()?
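For reference, this is the CPython behaviour at stake: hashlib objects accept further update() calls after a digest has been taken, and copy() forks the stream state:

```python
import hashlib

h = hashlib.sha256()
h.update(b"hello")
first = h.hexdigest()

# CPython allows feeding more data after a digest has been taken;
# the object behaves as if digest() had never been called.
h.update(b" world")
assert h.hexdigest() == hashlib.sha256(b"hello world").hexdigest()

# copy() duplicates the internal state so two streams can diverge.
c = h.copy()
c.update(b"!")
assert c.hexdigest() == hashlib.sha256(b"hello world!").hexdigest()
assert h.hexdigest() == hashlib.sha256(b"hello world").hexdigest()
```

Option 4 would break exactly these two properties, which is why knowing how often real code relies on them matters.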
> > Maybe we should google for purely managed implementations of the hash codes
> > with a sensible license...
>
> There seem to be implementations for MD5 and SHA1, but not for SHA2 or
> RIPEMD160. They could be ported from the public-domain Crypto++ library,
> but that seems like a lot of work for an edge case.
Yes, that seems to be a lot of work.
On the other hand, it's the 100% solution. :-)
Best regards
Markus Schaber
CODESYS® a trademark of 3S-Smart Software Solutions GmbH
Inspiring Automation Solutions
3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50
E-Mail: m.schaber at codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com
Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915