[Python-checkins] peps: document requirements
christian.heimes
python-checkins at python.org
Mon Sep 30 17:08:00 CEST 2013
http://hg.python.org/peps/rev/bd6fdd7cd47a
changeset: 5156:bd6fdd7cd47a
user: Christian Heimes <christian at cheimes.de>
date: Mon Sep 30 17:07:51 2013 +0200
summary:
document requirements
talk about AES-NI CMAC and HMAC as possible alternatives (too slow)
document necessary changes to C code
files:
pep-0456.txt | 188 ++++++++++++++++++++++++++++++++++----
1 files changed, 168 insertions(+), 20 deletions(-)
diff --git a/pep-0456.txt b/pep-0456.txt
--- a/pep-0456.txt
+++ b/pep-0456.txt
@@ -125,8 +125,33 @@
function makes it impossible to conceal the secrets.
-Hash algorithm
-==============
+Requirements for a hash function
+================================
+
+
+
+* It must be able to hash arbitrarily large blocks of memory from 1 bytes up
+ to the maximum ``ssize_t`` value.
+
+* It must produce at least 32bit values on 32bit platforms and at least 64bit
+ values on 64bit platforms. (Note: Larger outputs can be compressed with e.g.
+ ``v ^ (v >> 32)``.)
+
+* It must support hashing of unaligned memory in order to support
+ hash(memoryview).
+
+* It must not return ``-1``. It` either stands for error or missing hash value.
+ (Note: A special case can be added to map ``-1`` to ``-2``.)
+
+* It should return ``0`` for zero length input. (Note: This can be handled as
+ special case, too.)
+
+
+Examined hashing algorithms
+===========================
+
+The author of this PEP has researched several hashing algorithms that are
+considered modern, fast and state-of-the-art.
SipHash
-------
@@ -145,9 +170,12 @@
DoS attacks.
siphash24 is the recommend variant with best performance. It uses 2 rounds per
-message block and 4 finalization rounds.
-
-Marek Majkowski C implementation csiphash [csiphash]_::
+message block and 4 finalization rounds. Besides the reference implementation
+several other implementations are available. Some are single-shot functions,
+others use a Merkle–Damgård construction-like approach with init, update and
+finalize functions. Marek Majkowski C implementation csiphash [csiphash]_
+defines the prototype of the function. (Note: ``k`` is split up into two
+uint64_t)::
uint64_t siphash24(const void *src,
unsigned long src_sz,
@@ -160,9 +188,10 @@
MurmurHash [murmur]_ is a family of non-cryptographic keyed hash function
developed by Austin Appleby. Murmur3 is the latest and fast variant of
MurmurHash. The C++ reference implementation has been released into public
-domain. It features 32bit seed and 32 or 128bit output.
+domain. It features 32 or 128bit output with a 32bit seed. (Note: The out
+parameter is a buffer with either 1 or 4 bytes.)
-::
+Murmur3's function prototypes are::
void MurmurHash3_x86_32(const void *key,
int len,
@@ -179,6 +208,10 @@
uint32_t seed,
void *out);
+Aumasson, Bernstein and Boßlet have shown [sip]_ [ocert-2012-001]_ that
+Murmur3 is not resilient against hash collision attacks. Therefore Murmur3
+can no longer be considered as secure algorithm. It still may be an
+alternative is hash collision attacks are of no concern.
CityHash
--------
@@ -197,9 +230,36 @@
uint64 seed1)
+Like MurmurHash Aumasson, Bernstein and Boßlet have shown [sip]_ a similar
+weakness in CityHash.
-C API Implementation
-====================
+
+HMAC, MD5, SHA-1, SHA-2
+-----------------------
+
+These hash algorithms are too slow and have high setup and finalization costs.
+For these reasons they are not considered fit for this purpose.
+
+
+AES CMAC
+--------
+
+Modern AMD and Intel CPUs have AES-NI (AES instruction set) [aes-ni]_ to speed
+up AES encryption. CMAC with AES-NI might be a viable option but it's probably
+too slow for daily operation. (testing required)
+
+
+Conclusion
+----------
+
+SipHash provides the best combination of speed and security. Developers of
+other prominent projects have came to the same conclusion.
+
+
+C API additions
+===============
+
+All C API extension modifications are no part of the stable API.
hash secret
-----------
@@ -232,21 +292,26 @@
``_Py_HashSecret_t`` is initialized in ``Python/random.c:_PyRandom_Init()``
exactly once at startup.
+hash function
+-------------
+
+function prototype::
+
+ typedef Py_hash_t (*PyHash_func_t)(void *, Py_ssize_t);
+
hash function table
-------------------
type definition::
- typedef Py_hash_t (*PyHash_func_t)(void *, Py_ssize_t);
-
typedef struct {
PyHash_func_t hashfunc;
char *name;
unsigned int precedence;
} PyHash_FuncDef;
- PyAPI_DATA(PyHash_FuncDef) *PyHash_FuncTable;
+ PyAPI_DATA(PyHash_FuncDef *) PyHash_FuncTable;
Implementation::
@@ -264,11 +329,13 @@
hash function API
-----------------
-::
+function proto types::
- int PyHash_SetHashAlgorithm(char *name);
+ PyAPI_FUNC(int) PyHash_SetHashAlgorithm(char *name);
- PyHash_FuncDef* PyHash_GetHashAlgorithm(void);
+ PyAPI_FUNC(PyHash_FuncDef *) PyHash_GetHashAlgorithm(void);
+
+ PyAPI_DATA(PyHash_FuncDef *) _PyHash_Func;
``PyHash_SetHashAlgorithm(NULL)`` selects the hash algorithm with the highest
precedence. ``PyHash_SetHashAlgorithm("sip24")`` selects siphash24 as hash
@@ -279,11 +346,12 @@
``PyHash_GetHashAlgorithm()`` returns a pointer to current hash function
definition or `NULL`.
-(XXX use an extern variable to hold a function pointer to improve performance?)
+``_PyHash_Func`` holds the set hash function definition. It can't be modified
+or reset once a hash algorithm is set.
-Python API
-==========
+Python API addition
+===================
sys module
----------
@@ -309,13 +377,86 @@
_testcapi.get_hash(name: str, str_or_buffer) -> int
+Necessary modifications to C code
+=================================
+
+_Py_HashBytes (Objects/object.c)
+--------------------------------
+
+``_Py_HashBytes`` is an internal helper function that provides the hashing
+code for bytes, memoryview and datetime classes. It currently implements FNV
+for ``unsigned char*``. The function can either be modified to use the new
+API or it could be completely removed to avoid an unnecessary level of
+indirection.
+
+
+bytes_hash (Objects/bytesobject.c)
+----------------------------------
+
+``bytes_hash`` uses ``_Py_HashBytes`` to provide the tp_hash slot function
+for bytes objects. If ``_Py_HashBytes`` is to be removed then ``bytes_hash``
+must be reimplemented.
+
+
+memory_hash (Objects/memoryobject.c)
+------------------------------------
+
+``memory_hash`` provides the tp_hash slot function for read-only memory
+views if the original object is hashable, too. It's the only function that
+has to support hashing of unaligned memory segments in the future.
+
+
+unicode_hash (Objects/unicodeobject.c)
+--------------------------------------
+
+``bytes_hash`` provides the tp_hash slot function for unicode. Right now it
+implements the FNV algorithm three times for ``unsigned char*``, ``Py_UCS2``
+and ``Py_UCS4``. A reimplementation of the function must take care to use the
+correct length. Since the macro ``PyUnicode_GET_LENGTH`` returns the length
+of the unicode string and not its size in octets, the length must be
+multiplied with the size of the internal unicode kind::
+
+ Py_ssize_t len;
+ Py_uhash_t x;
+
+ len = PyUnicode_GET_LENGTH(self);
+ switch (PyUnicode_KIND(self)) {
+ case PyUnicode_1BYTE_KIND: {
+ const Py_UCS1 *c = PyUnicode_1BYTE_DATA(self);
+ x = _PyHash_Func->hashfunc(c, len * sizeof(Py_UCS1));
+ break;
+ }
+ case PyUnicode_2BYTE_KIND: {
+ const Py_UCS2 *s = PyUnicode_2BYTE_DATA(self);
+ x = _PyHash_Func->hashfunc(s, len * sizeof(Py_UCS2));
+ break;
+ }
+ case PyUnicode_4BYTE_KIND: {
+ const Py_UCS4 *l = PyUnicode_4BYTE_DATA(self);
+ x = _PyHash_Func->hashfunc(l, len * sizeof(Py_UCS4));
+ break;
+ }
+ }
+
+
+generic_hash (Modules/_datetimemodule.c)
+----------------------------------------
+
+``generic_hash`` acts as a wrapper around ``_Py_HashBytes`` for the tp_hash
+slots of date, time and datetime types. timedelta objects are hashed by their
+state (days, seconds, microseconds) and tzinfo objects are not hashable. The
+data members of date, time and datetime types' struct are not void* aligned.
+This can easily by fixed with memcpy()ing four to ten bytes to an aligned
+buffer.
+
+
Further things to consider
==========================
ASCII str / bytes hash collision
--------------------------------
-Since the implementation of [#pep-0393]_ bytes and ASCII text have the same
+Since the implementation of [pep-0393]_ bytes and ASCII text have the same
memory layout. Because of this the new hashing API will keep the invariant::
hash("ascii string") == hash(b"ascii string")
@@ -337,6 +478,9 @@
strings. For very short strings the setup costs for SipHash dominates its
speed but it is still in the same order of magnitude as the current FNV code.
+It's yet unknown how the new distribution of hash values affects collisions
+of common keys in dicts of Python classes.
+
Serhiy Storchaka has shown in [issue16427]_ that a modified FNV
implementation with 64bits per cycle is able to process long strings several
times faster than the current FNV implementation.
@@ -385,6 +529,8 @@
.. [ocert] http://www.nruns.com/_downloads/advisory28122011.pdf
+.. [ocert-2012-001] http://www.ocert.org/advisories/ocert-2012-001.html
+
.. [poc] https://131002.net/siphash/poc.py
.. [issue13703] http://bugs.python.org/issue13703
@@ -401,7 +547,9 @@
.. [csiphash] https://github.com/majek/csiphash/
-.. [#pep-0393] http://www.python.org/dev/peps/pep-0393/
+.. [pep-0393] http://www.python.org/dev/peps/pep-0393/
+
+.. [aes-ni] http://en.wikipedia.org/wiki/AES_instruction_set
Copyright
--
Repository URL: http://hg.python.org/peps
More information about the Python-checkins
mailing list