[Python-Dev] Make str/bytes hash algorithm pluggable?

Fri Oct 4 05:24:49 CEST 2013

On Thu, Oct 3, 2013 at 11:42 AM, Christian Heimes <christian at python.org>wrote:

> Hi,
>
> some of you may have seen that I'm working on a PEP for a new hash API
> and new algorithms for hashing of bytes and str. The PEP has three major
> aspects. It introduces DJB's SipHash as secure hash algorithm, chances
> the hash API to process blocks of data instead characters and it adds an
> API to make the algorithm pluggable. A draft is already available [1].
>
> Now I got some negative feedback on the 'pluggable' aspect of the new
> PEP on Twitter [2]. I like to get feedback from you before I finalize
> the PEP.
>
> The PEP proposes a pluggable hash API for a couple of reasons. I like to
> give users of Python a chance to replace a secure hash algorithm with a
> faster hash algorithm. SipHash is about as fast as FNV for common cases
> as our implementation of FNV process only 8 to 32 bits per cycle instead
> of 32 or 64. I haven't actually benchmarked how a faster hash algorithm
> affects the a real program, though ...
>
> I also like to make it easier to replace the hash algorithm with a
> different one in case a vulnerability is found. With the new API vendors
> and embedders have an easy and clean way to use their own hash
> implementation or an optimized version that is more suitable for their
> platform, too. For example a mobile phone vendor could provide an
> optimized implementation with ARM NEON intrinsics.
>
>
> On which level should Python support a pluggable hash algorithm?
>
> 1) Compile time option: The hash code is compiled into Python's core.
> Embedders have to recompile Python with different options to replace the
> function.
>

This would be fine with me.

>
> 2) Library option: A hash algorithm can be added and one avaible hash
> algorithm can be set before Py_Initialize() is called for the first
> time. The approach gives embedders the chance the set their own
> algorithm without recompiling Python.
>

This would be more convenient. But I only want to replace the algorithm in
my code embedding CPython, not add one. So long as the performance impact
of supporting this is not usefully relevant, do that (let me supply the
algorithm to be used for each of bytes and str before Py_Initialize is
called).

>
> 3) Startup options: Like 2) plus an additional environment variable and
> command line argument to select an algorithm. With a startup option
> users can select a different algorithm themselves.
>

I can't imagine any reason I or anyone else would ever want this.

side note: In Python 2 and earlier the hash algorithm went to great lengths
to make unicode and bytes values that were the same (in at least ascii,
possibly latin-1 or utf-8 as well) hash to the same value. Is it safe to
assume that very annoying performance sapping invariant is no longer
required in Python 3 given that the whole default encoding for bytes to
unicode comparisons is gone?  (and thus the need for them to land in the
same dict hash bucket)

-gps
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20131003/9f5cfdb2/attachment.html>