[Python-3000] How should the hash digest of a Unicode string be computed?

Guido van Rossum guido at python.org
Mon Aug 27 20:05:30 CEST 2007


On 8/27/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 8/26/07, Guido van Rossum <guido at python.org> wrote:
> > But I'm wondering if passing a Unicode string to the various hash
> > digest functions should work at all! Hashes are defined on sequences
> > of bytes, and IMO we should insist that the user pass us bytes, and
> > not second-guess what to do with Unicode.
>
> Conceptually, unicode *by itself* can't be represented as a buffer.
>
> What can be represented is a unicode string + an encoding.  The
> question is whether the hash function needs to know the encoding to
> figure out the hash.
>
> If you're hashing arbitrary bytes, then it doesn't really matter --
> there is no expectation that a recoding should have the same hash.
>
> For hashing as a shortcut to __ne__, it does matter for text.
>
> Unfortunately, for historical reasons, plenty of code grabs the string
> buffer expecting text.

Such code is broken, and this will be an error soon. I think this
handles all the other issues -- as promised, *any* operation that
mixes str and bytes (or anything else supporting the buffer API) will
fail with a TypeError unless an encoding is specified explicitly.
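
A minimal sketch of what this rule means for the hash digest functions,
assuming the behaviour of hashlib in released Python 3. The helper name
digest_of_text is hypothetical and not part of any library:

```python
import hashlib

def digest_of_text(text, encoding):
    """Hash a str by encoding it explicitly first (hypothetical helper)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()

# Passing a str directly is rejected; hashes are defined on bytes.
try:
    hashlib.sha256("caf\u00e9")
    str_rejected = False
except TypeError:
    str_rejected = True

# The same text under two encodings yields two different digests,
# which is why the caller has to choose the encoding.
utf8_digest = digest_of_text("caf\u00e9", "utf-8")
utf16_digest = digest_of_text("caf\u00e9", "utf-16-le")
```

The digest depends on the byte sequence, not on the text, so the library
cannot pick an encoding on the caller's behalf without changing the result.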

> For dict comparisons, we really ought to specify the equality (and
> therefore hash) in terms of a canonical equivalent, encoded in X.  (It
> isn't clear to me that X should be UTF-8 in particular, but the main
> thing is to pick something.)

No, dict keys can't be bytes or buffers.

> The alternative is that defensive code will need to do a (normally
> useless boilerplate) decode/canonicalize/reencode dance before
> dictionary checks and insertions.
>
> I would rather see that boilerplate done once in the unicode type (and
> again in any equivalent types, if need be), because
>    (1)  most storage types/encodings would be able to take shortcuts.
>    (2)  if people don't do the defensive coding, the bugs will be very
>         obscure.

There is no dance.
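
For reference, a small sketch of what the "dance" would look like if a caller
wanted it, assuming it means normalizing to a canonical form such as NFC
before a lookup. Python compares str objects code point by code point, so
canonically equivalent spellings are distinct dict keys unless the caller
normalizes them. The helper name canonical_key is hypothetical:

```python
import unicodedata

def canonical_key(text):
    """Normalize text to NFC so equivalent spellings match (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# str equality is code point by code point; no normalization is applied.
plain_equal = composed == decomposed

table = {canonical_key(composed): "found"}
# A raw lookup with the decomposed spelling misses; a normalized one hits.
raw_hit = decomposed in table
normalized_hit = canonical_key(decomposed) in table
```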

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)