[Python-3000] How should the hash digest of a Unicode string be computed?
Gregory P. Smith
greg at krypto.org
Mon Aug 27 05:43:30 CEST 2007
On 8/26/07, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>
> Gregory P. Smith wrote:
> > I'm in favor of not allowing unicode for hash functions. Depending on
> > the system default encoding for a hash will not be portable.
> >
> > another question for hashlib: It uses PyArg_Parse to get a single 's'
> > out of an optional parameter [see the code] and I couldn't figure out
> > what the best thing to do there was. It just needs a C string to pass
> > to openssl to look up a hash function by name. It's C, so I doubt it'll
> > ever be anything but ASCII. How should that parameter be parsed
> > instead of the old 's' string format? PyBUF_CHARACTER actually sounds
> > ideal in that case, assuming it guarantees UTF-8, but I wasn't clear
> > that it did (is it always UTF-8, or the system "default encoding",
> > which is possibly useless as far as APIs expecting C strings are
> > concerned)?
> > Requiring a bytes object would also work but I really don't like the
> > idea of users needing to use a specific type for something so simple.
> > (I consider string constants with their preceding b, r, u, s type
> > characters ugly in code without a good reason for them to be there.)
> >
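To make the portability point above concrete, here's a quick sketch: the same
text encoded two different ways produces two different digests, which is
exactly why a hash that silently depends on the system default encoding is
not portable.

```python
import hashlib

# The caller chooses an encoding explicitly; the digest is then
# reproducible on any system, regardless of its default encoding.
text = "héllo"
digest_utf8 = hashlib.sha256(text.encode("utf-8")).hexdigest()
digest_latin1 = hashlib.sha256(text.encode("latin-1")).hexdigest()

# The byte sequences differ (b'h\xc3\xa9llo' vs b'h\xe9llo'),
# so the digests differ too.
assert digest_utf8 != digest_latin1
```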
>
> The PyBUF_CHARACTER flag was an add-on after I realized that the old
> buffer API was being used in several places to get Unicode objects to
> encode their data as a string (in the default encoding of the system, I
> believe).
>
> The unicode object is the only one that I know of that actually does
> something different when it is called with PyBUF_CHARACTER.
>
> > Is it just me or do unicode objects supporting the buffer api seem
> > like an odd concept given that buffer api consumers (rather than
> > unicode consumers) shouldn't need to know about encodings of the data
> > being received.
>
> I think you have a point. The buffer API does support the concept of
> "formats" but not "encodings" so having this PyBUF_CHARACTER flag looks
> rather like a hack. I'd have to look, because I don't even remember
> what is returned as the "format" from a unicode object if it is
> requested (it is probably not correct).
Given that UTF-8 is a variable-width encoding, I don't see how it could ever
practically be correct for unicode.
> I would prefer that the notion of encoding a unicode object is separated
> from the notion of the buffer API, but last week I couldn't see another
> way to un-tease it.
>
> -Travis
A thought that just occurred to me... Would a PyBUF_CANONICAL flag be useful
instead of PyBUF_CHARACTER? For unicode that'd mean UTF-8 (not just the
default encoding), but I could imagine other potential uses, such as
multi-dimensional buffers (PIL image objects?) presenting a defined canonical
form of the data useful for both serialization and hashing. Any object
implementing the buffer API would define its own canonical form.
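A rough Python-level sketch of what I mean by "canonical form" (the function
names here are made up for illustration, not a proposed API): each type maps
to exactly one well-defined byte representation, so hashing no longer depends
on any system default.

```python
import hashlib

def canonical_bytes(obj):
    """Hypothetical: one well-defined byte form per type."""
    if isinstance(obj, str):
        return obj.encode("utf-8")   # unicode -> UTF-8, by definition
    if isinstance(obj, bytes):
        return obj                   # raw bytes are already canonical
    raise TypeError("no canonical byte form for %r" % type(obj))

def canonical_digest(obj):
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()

# Hashing the text or its canonical UTF-8 bytes gives the same digest.
assert canonical_digest("héllo") == canonical_digest("héllo".encode("utf-8"))
```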
-gps