[Python-3000] How should the hash digest of a Unicode string be computed?

Gregory P. Smith greg at krypto.org
Mon Aug 27 05:43:30 CEST 2007


On 8/26/07, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>
> Gregory P. Smith wrote:
> > I'm in favor of not allowing unicode for hash functions.  A hash that
> > depends on the system default encoding will not be portable.
> >
> > Another question for hashlib: it uses PyArg_Parse to get a single 's'
> > out of an optional parameter [see the code], and I couldn't figure out
> > what the best thing to do there was.  It just needs a C string to pass
> > to openssl to look up a hash function by name.  It's C, so I doubt it'll
> > ever be anything but ASCII.  How should that parameter be parsed
> > instead of with the old 's' string format?  PyBUF_CHARACTER actually
> > sounds ideal in that case, assuming it guarantees UTF-8, but I wasn't
> > clear that it did (is it always utf-8, or is it the system "default
> > encoding", which is possibly useless as far as APIs expecting C strings
> > are concerned?).  Requiring a bytes object would also work, but I really
> > don't like the idea of users needing to use a specific type for
> > something so simple.  (I consider string constants with their b, r, u,
> > s prefix characters ugly in code without a good reason for them to be
> > there.)
> >
>
> The PyBUF_CHARACTER flag was an add-on, after I realized that the old
> buffer API was being used in several places to get Unicode objects to
> encode their data as a string (in the default encoding of the system, I
> believe).
>
> The unicode object is the only one that I know of that actually does
> something different when it is called with PyBUF_CHARACTER.
>
> > Is it just me, or do unicode objects supporting the buffer API seem
> > like an odd concept, given that buffer API consumers (rather than
> > unicode consumers) shouldn't need to know about encodings of the data
> > being received?
>
> I think you have a point.  The buffer API does support the concept of
> "formats" but not "encodings", so having this PyBUF_CHARACTER flag looks
> rather like a hack.  I'd have to look, because I don't even remember
> what is returned as the "format" from a unicode object if it is
> requested (it is probably not correct).


Given that utf-8 characters are of varying widths, I don't see how it
could ever practically be correct for unicode.
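
For instance, a quick sketch of the problem (plain py3k; the byte
counts are just utf-8 facts, nothing buffer-API specific):

    # utf-8 is variable-width: a single code point encodes to 1-4 bytes,
    # so no fixed-size struct-style format code can describe the items.
    for ch in ('a', '\xe9', '\u20ac', '\U0001d11e'):
        print(repr(ch), len(ch.encode('utf-8')))
    # -> 1, 2, 3 and 4 bytes respectively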

> I would prefer that the notion of encoding a unicode object be separated
> from the notion of the buffer API, but last week I couldn't see another
> way to un-tease it.
>
> -Travis
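
Agreed.  In rough Python terms, the separation might look like hash
consumers rejecting unencoded unicode outright, so the caller always
picks the encoding (a sketch of the behavior with a made-up digest()
helper, not a proposal for the real hashlib API):

    import hashlib

    # a hash consumer that only accepts bytes; callers must encode
    # explicitly instead of relying on a system default encoding
    def digest(data):
        if isinstance(data, str):  # str is unicode in py3k
            raise TypeError('unicode must be encoded before hashing')
        return hashlib.sha1(data).hexdigest()

    digest('caf\xe9'.encode('utf-8'))   # caller chooses the encoding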


A thought that just occurred to me... Would a PyBUF_CANONICAL flag be useful
instead of PyBUF_CHARACTER?  For unicode that'd mean utf-8 (not just the
default encoding), but I could imagine other potential uses, such as
multi-dimensional buffers (PIL image objects?) presenting a defined
canonical form of the data useful for either serialization or hashing.
Any object implementing the buffer API would define its own canonical form.
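
Something like this is what I have in mind, sketched in Python with a
made-up canonical_bytes() method standing in for what the flag would
request through the buffer API (for unicode, the canonical bytes would
simply be the utf-8 encoding, so digests stay portable):

    import hashlib

    class Image:  # stand-in for a PIL-style two-dimensional buffer
        def __init__(self, width, height, pixels):
            self.width, self.height, self.pixels = width, height, pixels

        def canonical_bytes(self):
            # the object defines its own canonical form: a fixed header
            # plus row-major pixel data, whatever its internal layout is
            header = ('%dx%d' % (self.width, self.height)).encode('ascii')
            return header + b'\x00' + bytes(self.pixels)

    img = Image(2, 2, [0, 255, 128, 64])
    print(hashlib.sha1(img.canonical_bytes()).hexdigest())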

-gps