[Python-checkins] peps: Update PEP 456 to reflect the changes in features/pep-456

christian.heimes python-checkins at python.org
Mon Oct 28 00:33:31 CET 2013


http://hg.python.org/peps/rev/8a787f8c1dd6
changeset:   5231:8a787f8c1dd6
user:        Christian Heimes <christian at cheimes.de>
date:        Mon Oct 28 00:33:22 2013 +0100
summary:
  Update PEP 456 to reflect the changes in features/pep-456

files:
  pep-0456.txt |  139 ++++++++++++++++++++++----------------
  1 files changed, 80 insertions(+), 59 deletions(-)


diff --git a/pep-0456.txt b/pep-0456.txt
--- a/pep-0456.txt
+++ b/pep-0456.txt
@@ -75,15 +75,13 @@
 * It MUST support hashing of unaligned memory in order to support
   hash(memoryview).
 
-* It MUST NOT return ``-1``. The value is reserved for error cases and yet
-  uncached hash values.  (Note: A special case can be added to map ``-1``
-  to ``-2``.)
-
 * It is highly RECOMMENDED that the length of the input influences the
   outcome, so that ``hash(b'\00') != hash(b'\x00\x00')``.
 
-* It MAY return ``0`` for zero length input in order to disguise the
-  randomization seed. (Note: This can be handled as special case, too.)
+The internal interface code between the hash function and the tp_hash slots
+implements special cases for zero length input and a return value of ``-1``.
+An input of length ``0`` is mapped to hash value ``0``. The output ``-1``
+is mapped to ``-2``.
 
 
 Current implementation with modified FNV
@@ -306,52 +304,63 @@
 ``_Py_HashSecret_t`` is initialized in ``Python/random.c:_PyRandom_Init()``
 exactly once at startup.
 
-hash function
--------------
 
-function prototype::
+hash function definition
+------------------------
 
-    typedef Py_hash_t (*PyHash_Func_t)(const void *, Py_ssize_t);
+Implementation::
+
+    typedef struct {
+        /* function pointer to hash function, e.g. fnv or siphash24 */
+        Py_hash_t (*const hash)(const void *, Py_ssize_t);
+        const char *name;       /* name of the hash algorithm and variant */
+        const int hash_bits;    /* internal size of hash value */
+        const int seed_bits;    /* size of seed input */
+    } PyHash_FuncDef;
+
+    PyAPI_FUNC(PyHash_FuncDef*) PyHash_GetFuncDef(void);
+
+
+autoconf
+--------
+
+A new test is added to the configure script. The test sets
+``HAVE_ALIGNED_REQUIRED``, when it detects a platform, that requires aligned
+memory access for integers. Must current platforms such as X86, X86_64 and
+modern ARM don't need aligned data.
+
+A new option ``--with-hash-algorithm`` enables the user to select a hash
+algorithm in the configure step.
 
 
 hash function selection
 -----------------------
 
-type definition::
+The value of the macro ``PY_HASH_ALGORITHM`` defines which hash algorithm is
+used internally. It may be set to any of the three values ``PY_HASH_SIPHASH24``,
+``PY_HASH_FNV`` or ``PY_HASH_EXTERNAL``. If ``PY_HASH_ALGORITHM`` is not
+defined at all, then the best available algorithm is selected. On platforms
+wich don't require aligned memory access (``HAVE_ALIGNED_REQUIRED`` not
+defined) and an unsigned 64bit integer type ``PY_UINT64_T``, SipHash24 is
+used. On strict C89 platforms without a 64 bit data type, or architectures such
+as SPARC, FNV is selected as fallback. A hash algorithm can be selected with
+an autoconf option, for example ``./configure --with-hash-algorithm=fnv``.
 
-    #define PY_HASH_SIPHASH24 0x53495024
-    #define PY_HASH_FNV 0x464E56
+The value ``PY_HASH_EXTERNAL`` allows 3rd parties to provide their own
+implementation at compile time.
 
-    #ifndef PY_HASH_ALGORITHM
-    #if defined(PY_UINT64_T) && defined(PY_UINT32_T)
-    #define PY_HASH_ALGORITHM PY_HASH_SIPHASH24
-    #else
-    #define PY_HASH_ALGORITHM PY_HASH_FNV
-    #endif /* uint64_t && uint32_t */
-    #endif /* PY_HASH_ALGORITHM */
-
-    typedef struct {
-        PyHash_Func_t hash;    /* function pointer */
-        char *name;            /* name of the hash algorithm and variant */
-        int hash_bits;         /* internal size of hash value */
-        int seed_bits;         /* size of seed input */
-    } PyHash_FuncDef;
-
-    PyAPI_DATA(PyHash_FuncDef) PyHash_Func;
 
 Implementation::
 
-    #if PY_HASH_ALGORITHM == PY_HASH_FNV
-    PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
-                                  16 * sizeof(Py_hash_t)};
+    #if PY_HASH_ALGORITHM == PY_HASH_EXTERNAL
+    extern PyHash_FuncDef PyHash_Func;
+    #elif PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
+    static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
+    #elif PY_HASH_ALGORITHM == PY_HASH_FNV
+    static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
+                                         16 * sizeof(Py_hash_t)};
     #endif
 
-    #if PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
-    PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
-    #endif
-
-TODO: select hash algorithm with autoconf variable
-
 
 Python API addition
 ===================
@@ -378,34 +387,37 @@
 Necessary modifications to C code
 =================================
 
-_Py_HashBytes (Objects/object.c)
---------------------------------
+_Py_HashBytes() (Objects/object.c)
+----------------------------------
 
 ``_Py_HashBytes`` is an internal helper function that provides the hashing
 code for bytes, memoryview and datetime classes. It currently implements FNV
-for ``unsigned char*``. The function can either be modified to use the new
-API or it could be completely removed to avoid an unnecessary level of
-indirection.
+for ``unsigned char *``.
 
+The function is moved to Python/pyhash.c and modified to use the hash function
+through PyHash_Func.hash(). The function signature is altered to take
+a ``const void *`` as first argument. ``_Py_HashBytes`` also takes care of
+special cases. It maps zero length input to ``0`` and return value of ``-1``
+to ``-2``.
 
-bytes_hash (Objects/bytesobject.c)
-----------------------------------
+bytes_hash() (Objects/bytesobject.c)
+------------------------------------
 
 ``bytes_hash`` uses ``_Py_HashBytes`` to provide the tp_hash slot function
-for bytes objects. If ``_Py_HashBytes`` is to be removed then ``bytes_hash``
-must be reimplemented.
+for bytes objects. The function will continue to use ``_Py_HashBytes``
+but withoht a type cast.
 
-
-memory_hash (Objects/memoryobject.c)
-------------------------------------
+memory_hash() (Objects/memoryobject.c)
+--------------------------------------
 
 ``memory_hash`` provides the tp_hash slot function for read-only memory
 views if the original object is hashable, too. It's the only function that
-has to support hashing of unaligned memory segments in the future.
+has to support hashing of unaligned memory segments in the future. The
+function will continue to use ``_Py_HashBytes`` but withoht a type cast.
 
 
-unicode_hash (Objects/unicodeobject.c)
---------------------------------------
+unicode_hash() (Objects/unicodeobject.c)
+----------------------------------------
 
 ``unicode_hash`` provides the tp_hash slot function for unicode. Right now it
 implements the FNV algorithm three times for ``unsigned char*``, ``Py_UCS2``
@@ -416,12 +428,12 @@
 
     if (PyUnicode_READY(u) == -1)
         return -1;
-    x = PyHash_Func.hash(PyUnicode_DATA(u),
-                         PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));
+    x = _Py_HashBytes(PyUnicode_DATA(u),
+                      PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));
 
 
-generic_hash (Modules/_datetimemodule.c)
-----------------------------------------
+generic_hash() (Modules/_datetimemodule.c)
+------------------------------------------
 
 ``generic_hash`` acts as a wrapper around ``_Py_HashBytes`` for the tp_hash
 slots of date, time and datetime types. timedelta objects are hashed by their
@@ -459,8 +471,17 @@
 strings. For very short strings the setup cost for SipHash dominates its
 speed but it is still in the same order of magnitude as the current FNV code.
 
-It's yet unknown how the new distribution of hash values affects collisions
-of common keys in dicts of Python classes.
+
+Hash value distribution
+-----------------------
+
+A good distribution of hash values is important for dict and set performance.
+Both SipHash24 and FNV take the length of the input into account, so that
+strings made up entirely of NULL bytes don't have the same hash value. The
+last bytes of the input tend to affect the least significant bits of the hash
+value, too. That attribute reduces the amount of hash collisions for strings
+with a common prefix.
+
 
 Typical length
 --------------

-- 
Repository URL: http://hg.python.org/peps


More information about the Python-checkins mailing list