On 8/26/07, <b class="gmail_sendername">Travis Oliphant</b> &lt;<a href="mailto:firstname.lastname@example.org">email@example.com</a>&gt; wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Gregory P. Smith wrote:<br>> I'm in favor of not allowing unicode for hash functions. Depending on<br>> the system default encoding for a hash will not be portable.<br>><br>> Another question for hashlib: It uses PyArg_Parse to get a single 's'
<br>> out of an optional parameter [see the code] and I couldn't figure out<br>> what the best thing to do there was. It just needs a C string to pass<br>> to openssl to look up a hash function by name. It's C so I doubt it'll
<br>> ever be anything but ascii. How should that parameter be parsed<br>> instead of the old 's' string format? PyBUF_CHARACTER actually sounds<br>> ideal in that case, assuming it guarantees UTF-8, but I wasn't clear
<br>> that it did that (is it always UTF-8, or is it the system "default<br>> encoding", which is possibly useless as far as APIs expecting C strings<br>> are concerned?). Requiring a bytes object would also work but I really don't like the
<br>> idea of users needing to use a specific type for something so simple.<br>> (I consider string constants with their preceding b, r, u, s, type<br>> characters ugly in code without a good reason for them to be there)
<br>><br><br>The PyBUF_CHARACTER flag was an add-on after I realized that the old<br>buffer API was being used in several places to get Unicode objects to encode<br>their data as a string (in the default encoding of the system, I believe).
<br><br>The unicode object is the only one that I know of that actually does<br>something different when it is called with PyBUF_CHARACTER.<br></blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
> Is it just me or do unicode objects supporting the buffer api seem<br>> like an odd concept given that buffer api consumers (rather than<br>> unicode consumers) shouldn't need to know about encodings of the data
<br>> being received.<br><br>I think you have a point. The buffer API does support the concept of<br>"formats" but not "encodings" so having this PyBUF_CHARACTER flag looks<br>rather like a hack. I'd have to look, because I don't even remember
<br>what is returned as the "format" from a unicode object if it is<br>requested (it is probably not correct).</blockquote><div><br>Given that utf-8 characters have varying widths, I don't see how it could ever practically be correct for unicode.
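<br><br>A minimal sketch of the varying-width point, assuming present-day Python 3 (its <code>str.encode</code> and <code>memoryview</code> are stand-ins for illustration, not the C-level API under discussion here). It shows that a buffer "format" describes fixed-size items, while the number of bytes per character in UTF-8 varies:

```python
# Sample characters whose UTF-8 encodings take 1, 2, 3 and 4 bytes.
samples = ["a", "é", "€", "𝄞"]

# Number of UTF-8 bytes needed for each character.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# A buffer over the encoded text only knows about fixed-size items
# (single bytes here); it carries no notion of character boundaries.
text = "aé€"
view = memoryview(text.encode("utf-8"))

print(widths)
print(len(text), "characters,", len(view), "buffer items of size", view.itemsize)
```

The item count of the buffer (bytes) and the character count of the text differ, so no single per-item format string could describe "one item per character" for UTF-8 data.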
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I would prefer that the notion of encoding a unicode object is separated<br>from the notion of the buffer API, but last week I couldn't see another
<br>way to un-tease it.<br><br>-Travis</blockquote><div><br>A thought that just occurred to me... Would a PyBUF_CANONICAL flag be useful instead of PyBUF_CHARACTER? For unicode that'd mean utf-8 (not just the default encoding), but I could imagine other potential uses, such as multi-dimensional buffers (PIL image objects?) presenting a defined canonical form of the data useful for either serialization or hashing. Any object implementing the buffer API would define its own canonical form.
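<br><br>To make the idea concrete, here is a rough Python-level sketch of what a "canonical form" could mean for hashing. Everything in it is an assumption for illustration: <code>canonical_bytes</code> and <code>digest</code> are hypothetical helpers, not part of hashlib or the buffer API, and the choice of UTF-8 for text is just the convention proposed above.

```python
import hashlib


def canonical_bytes(obj):
    """Hypothetical helper: return the bytes that should be hashed for obj.

    Text is always encoded as UTF-8, never the system default encoding.
    Anything else is read through the buffer interface as raw bytes.
    """
    if isinstance(obj, str):
        return obj.encode("utf-8")
    return memoryview(obj).tobytes()


def digest(obj, name="sha256"):
    """Hypothetical helper: hash obj's canonical bytes with the named algorithm."""
    return hashlib.new(name, canonical_bytes(obj)).hexdigest()


# The same text gives the same digest everywhere, because the encoding is fixed.
print(digest("héllo"))
# Different buffer-providing types with equal contents hash identically.
print(digest(b"abc") == digest(bytearray(b"abc")))
```

With the encoding pinned down, the digest no longer depends on the platform's default encoding, which is the portability concern raised at the top of the thread.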