[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Thu Feb 27 12:11:45 CET 2014

Hi,

I'm just trying to sum it up:

1) The current code:
   - High memory usage. 
   - High load on the large object heap.
   - Limited by the available amount of memory (which might be considered a violation of the Python API). 
   - High CPU usage when used incrementally (quadratic to the number of blocks added).

2) Optimizing with MemoryStream and lazy calculation:
   - High memory usage.
   - High load on the large object heap.
   - Limited by the available amount of memory (which might be considered a violation of the Python API).
   + Optimal CPU usage when the hash is only fetched once.
   ± Better than current code, but still not optimal when hash is incrementally fetched several times.

3) Optimizing with jagged arrays and lazy calculation:
   - High memory usage.
   + Improved or no impact on the large object heap (depending on the exact implementation)
   - Limited by the available amount of memory (which might be considered a violation of the Python API).
   + Optimal CPU usage when the hash is only fetched once.
   ± Better than current code, but still not optimal when hash is incrementally fetched several times.

4) Using the existing .NET incremental APIs
   + Low, constant memory usage.
   + No impact on the large object heap.
   + No limit of data length by the amount of memory.
   + Optimal CPU usage when the hash is only fetched once.
   - Breaks when hash is incrementally fetched several times (which likely is a violation of the Python API).

5) Finding or porting a different Hash implementation in C#:
   + Low, constant memory usage.
   + No impact on the large object heap.
   + No limit of data length by the amount of memory.
   + Optimal CPU usage when the hash is only fetched once.
   + Optimal CPU usage when the hash is incrementally fetched several times.

I've a local prototype implemented for 2), but I'm not sure whether that's the best way to go...

Maybe we should google for purely managed implementations of the hash codes with a sensible license...

Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH 

Inspiring Automation Solutions 
________________________________________
3S-Smart Software Solutions GmbH 
Dipl.-Inf. Markus Schaber | Product Development Core Technology 
Memminger Str. 151 | 87439 Kempten | Germany 
Tel. +49-831-54031-979 | Fax +49-831-54031-50 

E-Mail: m.schaber at codesys.com | Web: codesys.com | CODESYS store: store.codesys.com 
CODESYS forum: forum.codesys.com 

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915 
Von: Curt Hagenlocher [mailto:curt at hagenlocher.org] 
Gesendet: Dienstag, 25. Februar 2014 17:59
An: Jeff Hardy
Cc: Markus Schaber; Discussion of IronPython; Emmanuel Chomarat
Betreff: Re: [Ironpython-users] Hashing a directory is magnitudes slower than in cPython

"Basically, there's a mismatch between what .NET provides and what Python needs for perfect compatibility."

Yes. I think I remember implementing this and that's exactly the problem I ran into. I think we looked into incorporating a modified version of the BCL code directly into IronPython, but at least in those days, that was a pretty hard thing to get done. We ran into a similar issue when implementing the compression API.

You could get around the problem in the client code with an "if sys.platform == 'cli'" and then use the .NET classes directly.

On Tue, Feb 25, 2014 at 8:53 AM, Jeff Hardy <jdhardy at gmail.com> wrote:
On Tue, Feb 25, 2014 at 12:38 PM, Markus Schaber <m.schaber at codesys.com> wrote:
> Hi,
>
> A coworker just consulted me on a performance problem of IronPython vs. cPython.
>
> ... snip ...
>
> On a closer look, there's the additional (and IMHO much worse) problem that the update() method seems not to work incrementally:
>
> private void update(IList<byte> newBytes) {
>     byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
>     Array.Copy(_bytes, updatedBytes, _bytes.Length);
>     newBytes.CopyTo(updatedBytes, _bytes.Length);
>     _bytes = updatedBytes;
>     _hash = GetHasher().ComputeHash(_bytes);
> }
>
> In our use-case, this means that every file which is read leads to a reallocation and copying and recalculation of the MD5 sum of all the data which was read until now. This is suboptimal from memory and performance perspective.
>
> I'm not an expert on the .NET crypto APIs, but I guess there should be some incremental API available there which could be exploited.
http://ironpython.codeplex.com/workitem/34022

I've also CC'd Emmanuel Chomarat, who was investigating a fix for
this. Unfortunately I don't think there's an easy solution based on
how the .NET APIs are constructed. Quoting from Emmanuel's email to me
a while back:

"I am now using TransformBlock / TransformBlockFinal to compute the
current hash with a linear complexity ( whereas we had before n**2)
but I am still facing an issue.
First we need to have a copy operator, this is not possible because we
can not share the hash instance between two objects in .net, the only
way to make it consistent with what python is doing is by keeping a
copy of the full data in MEMORY in order to create a new instance with
these data when copy is called.
The second thing is that digest can be called several times in python
with some new data added/updated to the hash , in C# as soon as
TransformBlockFinal has been called once we can not anymore add more
data to the stream. Once again I have been able to use the same
previous technic but at a memory cost + computation cost if we call
serveral times digest/hexdigest.

I don't see any to escape this pb with MS api that does not expose
internal states as the underlying md5 lib in python does."

Basically, there's a mismatch between what .NET provides and what
Python needs for perfect compatibility. Keeping all data in memory is
not desirable, but neither is failing some operations. And I would
*really* prefer not to have to reimplement all of the cryptographic
hash functions Python has.

One option is to default to not buffering and failing on certain
operations, and offer a constructor flag that enables buffering to
allow the otherwise-impossible operations. Not my favourite idea, but
workable.

- Jeff
_______________________________________________
Ironpython-users mailing list
Ironpython-users at python.org
https://mail.python.org/mailman/listinfo/ironpython-users