[Python-3000] How should the hash digest of a Unicode string be computed?

Gregory P. Smith greg at krypto.org
Mon Aug 27 00:54:07 CEST 2007


I'm in favor of not allowing unicode for hash functions.  A hash that
depends on the system default encoding will not be portable.
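
A quick sketch of the problem (plain hashlib, nothing exotic): the
same text hashed under two different encodings yields two different
digests, so any hash that implicitly picks an encoding will disagree
between systems with different defaults:

    import hashlib

    text = "caf\u00e9"  # any non-ASCII text
    # The caller encodes explicitly; each choice is portable on its
    # own, but the two digests differ from each other:
    print(hashlib.md5(text.encode("utf-8")).hexdigest())
    print(hashlib.md5(text.encode("latin-1")).hexdigest())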

Another question for hashlib: it uses PyArg_Parse to get a single 's'
out of an optional parameter [see the code], and I couldn't figure out
what the best thing to do there was.  It just needs a C string to pass
to OpenSSL to look up a hash function by name.  It's C, so I doubt
it'll ever be anything but ASCII.  How should that parameter be parsed
instead of with the old 's' string format?  PyBUF_CHARACTER actually
sounds ideal in that case, assuming it guarantees UTF-8, but I wasn't
clear that it does (is it always UTF-8, or possibly the system
"default encoding", which is useless as far as APIs expecting C
strings are concerned?).  Requiring a bytes object would also work,
but I really don't like the idea of users needing to use a specific
type for something so simple.  (I consider string constants with their
preceding b, r, u, s type characters ugly in code without a good
reason for them to be there.)
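
For anyone without the code in front of them, the parameter in
question is (as I understand it) just the algorithm name handed to
hashlib.new(); a minimal usage sketch:

    import hashlib

    # The name is a plain ASCII identifier used to look the
    # algorithm up by name (ultimately via OpenSSL):
    h = hashlib.new("sha1")
    h.update(b"some data")
    print(h.hexdigest())

Requiring bytes would turn that first argument into b"sha1", which is
exactly the kind of noise I'd like to avoid.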

test_hashlib.py passed on the x86 OS X system I was using to write the
code.  I neglected to run the full suite, or to grep for hashlib in
the other test suites and run those, so I missed the test_unicodedata
failure; sorry about the breakage.

Is it just me, or do unicode objects supporting the buffer API seem
like an odd concept, given that buffer API consumers (as opposed to
unicode consumers) shouldn't need to know anything about the encoding
of the data they receive?
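
To put it concretely, here's the distinction I'd expect once the dust
settles; a sketch of proposed behavior (using memoryview as a stand-in
for any buffer API consumer), not of what the branch does today:

    >>> memoryview(b"abc")   # bytes expose their raw bytes
    <memory at 0x...>
    >>> memoryview("abc")    # a str has no one obvious byte view
    Traceback (most recent call last):
      ...
    TypeError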

-gps

On 8/26/07, Guido van Rossum <guido at python.org> wrote:
> Change r57490 by Gregory P Smith broke a test in test_unicodedata and,
> on PPC OSX, several tests in test_hashlib.
>
> Looking into this, it's pretty clear *why* it broke: before, the 's#'
> format code was used, while Gregory's change switched this to the
> buffer API (to ensure the data won't move around). Now, when a
> (Unicode) string is passed to 's#', it uses the UTF-8 encoding. But
> the buffer API uses the raw bytes of the Unicode object, which are
> typically UTF-16 or UTF-32. (I can't quite figure out why the tests
> didn't fail on my Linux box; I'm guessing it's an endianness issue,
> but it can't be that simple. Perhaps that box happens to fall back
> on a different implementation of the checksums?)
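
That matches my portability worry above: hashing the raw internal
UTF-16 buffer bakes the platform's byte order into the digest.  A
quick sketch simulating the two byte orders explicitly:

    import hashlib

    s = "abc"
    # Little- vs. big-endian UTF-16 code units give different digests:
    print(hashlib.sha1(s.encode("utf-16-le")).hexdigest())
    print(hashlib.sha1(s.encode("utf-16-be")).hexdigest())
    # An explicit encoding such as UTF-8 is byte-order independent:
    print(hashlib.sha1(s.encode("utf-8")).hexdigest())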
>
> I checked in a fix (because I don't like broken tests :-) which
> restores the old behavior by passing PyBUF_CHARACTER to
> PyObject_GetBuffer(), which enables a special case in the buffer API
> for PyUnicode that returns the UTF-8 encoded bytes instead of the raw
> bytes. (I still find this questionable, especially since a few random
> places in bytesobject.c also use PyBUF_CHARACTER, presumably to make
> tests pass; but for the *bytes* type, requesting *characters*, even
> encoded ones, is iffy.)
>
> But I'm wondering if passing a Unicode string to the various hash
> digest functions should work at all! Hashes are defined on sequences
> of bytes, and IMO we should insist that the user pass us bytes, and
> not second-guess what to do with Unicode.
>
> Opinions?
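
+1.  Concretely, I'd expect a str argument to be rejected outright; a
sketch of the proposed behavior (not of what's currently checked in):

    >>> import hashlib
    >>> hashlib.sha1(b"bytes are fine")   # bytes: accepted
    <sha1 HASH object @ 0x...>
    >>> hashlib.sha1("unicode is not")    # str: raise, don't guess
    Traceback (most recent call last):
      ...
    TypeError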
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)