[Python-checkins] peps: PEP 456: add some of the new implementation details to the PEP's text

christian.heimes python-checkins at python.org
Wed Nov 13 23:33:02 CET 2013


http://hg.python.org/peps/rev/610f7b296939
changeset:   5267:610f7b296939
user:        Christian Heimes <christian at cheimes.de>
date:        Wed Nov 13 23:32:54 2013 +0100
summary:
  PEP 456: add some of the new implementation details to the PEP's text

files:
  pep-0456.txt |  110 ++++++++++++++++++++++++++++++++------
  1 files changed, 93 insertions(+), 17 deletions(-)


diff --git a/pep-0456.txt b/pep-0456.txt
--- a/pep-0456.txt
+++ b/pep-0456.txt
@@ -222,6 +222,7 @@
 can no longer be considered as secure algorithm. It still may be an
 alternative is hash collision attacks are of no concern.
 
+
 CityHash
 --------
 
@@ -246,6 +247,12 @@
 weakness in CityHash.
 
 
+DJBX33A
+-------
+
+TODO
+
+
 HMAC, MD5, SHA-1, SHA-2
 -----------------------
 
@@ -268,6 +275,45 @@
 other prominent projects have came to the same conclusion.
 
 
+Small string optimization
+=========================
+
+Hash functions like SipHash24 have a costly initialization and finalization
+code that can dominate speed of the algorithm for very short strings. On the
+other hand Python calculates the hash value of short strings quite often. A
+simple and fast function for especially for hashing of small strings can make
+a measurably impact on performance. For example these measurements were taken
+during a run of Python's regression tests. Additional measurements of other
+code have shown a similar distribution.
+
+===== ============ =======
+bytes hash() calls portion
+===== ============ =======
+1            18709    0.2%
+2           737480    9.5%
+3           636178   17.6%
+4          1518313   36.7%
+5           643022   44.9%
+6           770478   54.6%
+7           525150   61.2%
+8           304873   65.1%
+9           297272   68.8%
+10           68191   69.7%
+11         1388484   87.2%
+12          480786   93.3%
+13           52730   93.9%
+14           65309   94.8%
+15           44245   95.3%
+16           85643   96.4%
+Total      7921678
+===== ============ =======
+
+However a fast function like DJBX33A is not as secure as SipHash24. A cutoff
+at about 5 to 7 bytes should provide a decent safety margin and speed up at
+the same time. The PEP's reference implementation provides such a cutoff with
+``Py_HASH_CUTOFF`` but disables the optimization by default.
+
+
 C API additions
 ===============
 
@@ -277,28 +323,57 @@
 -----------
 
 The ``_Py_HashSecret_t`` type of Python 2.6 to 3.3 has two members with either
-32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys.
-The typedef will be changed to an union with a guaranteed size of 128 bits on
-all architectures. On platforms with a 64-bit data type it will have two
-``uint64`` members. Because C89 compatible compilers may not have ``uint64``
-the union also has an array of 16 chars.
+32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as
+keys. The typedef will be changed to an union with a guaranteed size of 24
+bytes on all architectures. The union provides a 128 bit random key for
+SipHash24 and FNV as well as an additional value of 64 bit for the optional
+small string optimization and pyexpat seed. The additional 64 bit seed ensures
+that pyexpat or small string optimization cannot reveal bits of the SipHash24
+seed.
+
+memory layout on 64 bit systems::
+
+    cccccccc cccccccc cccccccc  uc -- unsigned char[24]
+    pppppppp ssssssss ........  fnv -- two Py_hash_t
+    k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T
+    ........ ........ ssssssss  djbx33a -- 16 bytes padding + one Py_hash_t
+    ........ ........ eeeeeeee  pyexpat XML hash salt
+
+memory layout on 32 bit systems::
+
+    cccccccc cccccccc cccccccc  uc -- unsigned char[24]
+    ppppssss ........ ........  fnv -- two Py_hash_t
+    k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T (if available)
+    ........ ........ ssss....  djbx33a -- 16 bytes padding + one Py_hash_t
+    ........ ........ eeee....  pyexpat XML hash salt
 
 new type definition::
 
     typedef union {
-        unsigned char uc16[16];
+        /* ensure 24 bytes */
+        unsigned char uc[24];
+        /* two Py_hash_t for FNV */
         struct {
             Py_hash_t prefix;
             Py_hash_t suffix;
-        } ht;
+        } fnv;
     #ifdef PY_UINT64_T
+        /* two uint64 for SipHash24 */
         struct {
             PY_UINT64_T k0;
             PY_UINT64_T k1;
-        } ui64;
+        } siphash;
     #endif
+        /* a different (!) Py_hash_t for small string optimization */
+        struct {
+            unsigned char padding[16];
+            Py_hash_t suffix;
+        } djbx33a;
+        struct {
+            unsigned char padding[16];
+            Py_hash_t hashsalt;
+        } expat;
     } _Py_HashSecret_t;
-
     PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
 
 ``_Py_HashSecret_t`` is initialized in ``Python/random.c:_PyRandom_Init()``
@@ -336,9 +411,9 @@
 hash function selection
 -----------------------
 
-The value of the macro ``PY_HASH_ALGORITHM`` defines which hash algorithm is
-used internally. It may be set to any of the three values ``PY_HASH_SIPHASH24``,
-``PY_HASH_FNV`` or ``PY_HASH_EXTERNAL``. If ``PY_HASH_ALGORITHM`` is not
+The value of the macro ``Py_HASH_ALGORITHM`` defines which hash algorithm is
+used internally. It may be set to any of the three values ``Py_HASH_SIPHASH24``,
+``Py_HASH_FNV`` or ``Py_HASH_EXTERNAL``. If ``Py_HASH_ALGORITHM`` is not
 defined at all, then the best available algorithm is selected. On platforms
 wich don't require aligned memory access (``HAVE_ALIGNED_REQUIRED`` not
 defined) and an unsigned 64bit integer type ``PY_UINT64_T``, SipHash24 is
@@ -346,17 +421,17 @@
 as SPARC, FNV is selected as fallback. A hash algorithm can be selected with
 an autoconf option, for example ``./configure --with-hash-algorithm=fnv``.
 
-The value ``PY_HASH_EXTERNAL`` allows 3rd parties to provide their own
+The value ``Py_HASH_EXTERNAL`` allows 3rd parties to provide their own
 implementation at compile time.
 
 
 Implementation::
 
-    #if PY_HASH_ALGORITHM == PY_HASH_EXTERNAL
+    #if Py_HASH_ALGORITHM == Py_HASH_EXTERNAL
     extern PyHash_FuncDef PyHash_Func;
-    #elif PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
+    #elif Py_HASH_ALGORITHM == Py_HASH_SIPHASH24
     static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
-    #elif PY_HASH_ALGORITHM == PY_HASH_FNV
+    #elif Py_HASH_ALGORITHM == Py_HASH_FNV
     static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
                                          16 * sizeof(Py_hash_t)};
     #endif
@@ -381,7 +456,8 @@
                   # new fields:
                   algorithm='siphash24',
                   hash_bits=64,
-                  seed_bits=128)
+                  seed_bits=128,
+                  cutoff=0)
 
 
 Necessary modifications to C code

-- 
Repository URL: http://hg.python.org/peps


More information about the Python-checkins mailing list