[Python-3000] How should the hash digest of a Unicode string be computed?

Guido van Rossum guido at python.org
Mon Aug 27 07:02:16 CEST 2007


On 8/26/07, Gregory P. Smith <greg at krypto.org> wrote:
> On 8/26/07, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> > Gregory P. Smith wrote:
> > > I'm in favor of not allowing unicode for hash functions.  A hash
> > > that depends on the system default encoding will not be portable.
> > >
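To make the portability point concrete, a minimal sketch (nothing
hashlib-specific; the digests differ simply because the hashed bytes
differ):

import hashlib

s = u"caf\xe9"
# The same text hashed under two different encodings gives two
# different digests, so a hash that implicitly used the system
# default encoding would vary from machine to machine.
print(hashlib.md5(s.encode("utf-8")).hexdigest())
print(hashlib.md5(s.encode("latin-1")).hexdigest())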
> > > Another question for hashlib: it uses PyArg_Parse to get a single
> > > 's' out of an optional parameter [see the code] and I couldn't
> > > figure out what the best thing to do there was.  It just needs a C
> > > string to pass to OpenSSL to look up a hash function by name.  It's
> > > C, so I doubt it'll ever be anything but ASCII.  How should that
> > > parameter be parsed instead of with the old 's' string format?
> > > PyBUF_CHARACTER actually sounds ideal in that case, assuming it
> > > guarantees UTF-8, but I wasn't clear that it does (is it always
> > > UTF-8, or the system "default encoding", which is close to useless
> > > for APIs that expect C strings?).  Requiring a bytes object would
> > > also work, but I really don't like the idea of users needing to use
> > > a specific type for something so simple.  (I consider string
> > > constants with their preceding b, r, u, s type characters ugly in
> > > code without a good reason for them to be there.)
> > >
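For reference, at the Python level the name lookup in question is just
hashlib.new() taking a short ASCII name (a sketch; which names are
available depends on the OpenSSL build):

import hashlib

# hashlib.new() looks an algorithm up by name; the names are plain
# ASCII identifiers such as "md5" or "sha256".
h = hashlib.new("sha256")
h.update(b"some data")
print(h.hexdigest())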
> >
> > The PyBUF_CHARACTER flag was an add-on after I realized that the old
> > buffer API was being used in several places to get Unicode objects to
> > encode their data as a string (in the system default encoding, I
> > believe).
> >
> > The unicode object is the only one that I know of that actually does
> > something different when it is called with PyBUF_CHARACTER.
> >
> > > Is it just me, or do unicode objects supporting the buffer API
> > > seem like an odd concept, given that buffer API consumers (rather
> > > than unicode consumers) shouldn't need to know about encodings of
> > > the data being received?
> >
> > I think you have a point.   The buffer API does support the concept
> > of "formats" but not "encodings", so having this PyBUF_CHARACTER flag
> > looks rather like a hack.   I'd have to check, because I don't even
> > remember what a unicode object returns as its "format" when one is
> > requested (it is probably not correct).
>
> Given that UTF-8 characters are of varying width, I don't see how it
> could ever practically be correct for unicode.

Well, *practically*, the unicode object returns UTF-8 for
PyBUF_CHARACTER. That is correct (at least until I rip all this out,
which I'm in the middle of -- but no time to finish it tonight).

> > I would prefer that the notion of encoding a unicode object be
> > separated from the notion of the buffer API, but last week I couldn't
> > see a way to untangle the two.
> >
> > -Travis
>
> A thought that just occurred to me... Would a PyBUF_CANONICAL flag be
> useful instead of PyBUF_CHARACTER?  For unicode that'd mean UTF-8 (not
> just the default encoding), but I could imagine other potential uses,
> such as multi-dimensional buffers (PIL image objects?) presenting a
> defined canonical form of the data useful for either serialization or
> hashing.  Any object implementing the buffer API would define its own
> canonical form.
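Purely to illustrate the idea being floated here (no such flag exists;
the class and method names below are hypothetical), the proposal
amounts to each type answering "give me your canonical bytes":

import hashlib

class Image:
    """Hypothetical multi-dimensional buffer object (PIL-like)."""
    def __init__(self, width, height, pixels):
        self.width, self.height, self.pixels = width, height, pixels

    def canonical_bytes(self):
        # A defined, platform-independent layout -- header plus
        # row-major pixel data -- usable for serialization or hashing.
        header = ("%dx%d" % (self.width, self.height)).encode("ascii")
        return header + b":" + bytes(self.pixels)

img = Image(2, 1, [0, 255])
print(hashlib.sha1(img.canonical_bytes()).hexdigest())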

Note, the default encoding in 3.0 is fixed to UTF-8. (And it's fixed
in a much more permanent way than in 2.x -- it is really hardcoded and
there is really no way to change it.)
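You can see this from Python (in 3.0; 2.x typically reports "ascii" and
lets site.py tricks change it):

import sys

# Always "utf-8" in 3.0; there is no setdefaultencoding() hook left
# to override it.
print(sys.getdefaultencoding())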

But I'm thinking YAGNI -- the buffer API should always just return the
bytes as they already are sitting in memory, not some transformation
thereof. The current behavior of the unicode object for
PyBUF_CHARACTER violates this. (There are no other violations BTW.)
This is why I want to rip it out. I'm close...
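The end state, seen from Python code (assuming the rip-out lands as
described):

import hashlib

# bytes expose their memory via the buffer API as-is:
memoryview(b"abc")

# str does not -- there is no implicit encoding step, so both of
# these raise TypeError and force an explicit .encode():
try:
    memoryview("abc")
except TypeError:
    pass
try:
    hashlib.md5("caf\xe9")
except TypeError:
    pass

print(hashlib.md5("caf\xe9".encode("utf-8")).hexdigest())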

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

