Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
How to handle unicode data-formats could definitely be improved.
As before, I'm doubtful what the actual needs are. For example, is it desired to support generation of ID3v2 tags with such a data format? The tag is specified here:
Perhaps I was not clear enough about what I'm trying to do. For a long time a lot of people have wanted something like Numeric in Python itself. There have been many hurdles to that goal.
After discussions at SciPy 2006 with Guido, we decided that the best way to proceed at this point was to extend the buffer protocol to allow packages to share array-like information with each other.
There are several things missing from the buffer protocol that NumPy needs in order to be able to really understand the (fixed-size) memory another package has allocated and is sharing.
The most important of these are:
1) Shape information
2) Striding information
3) Data-format information (how each element is to be interpreted).
Shape and striding information can be shared with a C-array of integers.
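To illustrate why shape and strides fit in plain arrays of integers, here is a minimal Python sketch of how a consumer could locate an element in a flat block of memory from those two pieces alone. The helper names `byte_offset` and `get_item` and the 2x3 array of 4-byte integers are assumptions made up for this example, not part of the proposal. Note that the consumer still has to be told separately that each element is a little-endian 4-byte integer, which is exactly item 3.

```python
import struct

def byte_offset(index, strides):
    """Byte offset of an element, given per-dimension strides in bytes."""
    return sum(i * s for i, s in zip(index, strides))

# A 2x3 array of 4-byte signed integers stored row by row (C order).
itemsize = struct.calcsize("<i")            # 4 bytes per element
shape = (2, 3)
strides = (shape[1] * itemsize, itemsize)   # (12, 4): bytes to step per dimension
buf = struct.pack("<6i", 0, 1, 2, 10, 11, 12)

def get_item(buf, index):
    """Read one element; the "<i" code is the data-format knowledge."""
    (value,) = struct.unpack_from("<i", buf, byte_offset(index, strides))
    return value

print(get_item(buf, (1, 2)))
```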
How is data-format information supposed to be shared?
We've come up with a very flexible way to do this in NumPy using a single Python object. This Python object supports describing the layout of any fixed-size chunk of memory (right now in units of bytes --- bit fields could be added, though).
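As a rough illustration of what "describing the layout of a fixed-size chunk of memory" means, the sketch below records the offset and type code of each field in a small record. It is not the NumPy data-type object: the standard library's `struct` format codes are swapped in as a stand-in, and the `describe` helper and the three-field record are hypothetical.

```python
import struct

# Hypothetical record: a 4-byte int, an 8-byte float and a 3-byte string.
# Each entry is (field name, struct type code).
fields = [("id", "i"), ("value", "d"), ("tag", "3s")]

def describe(fields, byteorder="<"):
    """Return (itemsize, {name: (offset, code)}) for a packed record.

    "<" means little-endian with no alignment padding between fields.
    """
    layout = {}
    offset = 0
    for name, code in fields:
        layout[name] = (offset, code)
        offset += struct.calcsize(byteorder + code)
    return offset, layout

itemsize, layout = describe(fields)

# A consumer that only has the raw bytes plus the description can
# pick out a single field without knowing who produced the memory.
record = struct.pack("<id3s", 7, 2.5, b"abc")
offset, code = layout["value"]
(value,) = struct.unpack_from("<" + code, record, offset)
```

The description itself does no encoding or decoding; it only tells the consumer where each field sits and how wide it is.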
I'm proposing to add this object to Python so that the buffer protocol has a fast and efficient way to share #3. That's really all I'm after.
It also bothers me that so many ways to describe binary data are being used out there. This is a problem that deserves being solved. And, no, ctypes hasn't solved it (we can't directly use the ctypes solution). Perhaps this PEP doesn't hit all the corners, but a data-format object *is* a useful thing to consider.
The array object in Python already has a PyArray_Descr * structure that is a watered-down version of what I'm talking about. In fact, this is what Numeric built from (or vice-versa actually). And NumPy has greatly enhanced this object for any conceivable structure.
Guido seemed to think the data-type objects were nice when he saw them at SciPy 2006, and so I'm presenting a PEP.
Without the data-format object, I don't know how to extend the buffer protocol to communicate data-format information. Do you have a better idea?
I have no trouble limiting the data-type object to the buffer protocol extension PEP, but I do think it could gain wider use.
Is it the intent of this PEP to support such data structures, and allow the user to fill in a Unicode object, and then the processing is automatic? (i.e. in ID3v1, the string gets automatically Latin-1-encoded and zero-padded, in ID3v2, it gets automatically UTF-8 encoded, and null-terminated)
No, the point of the data-format object is to communicate information about data-formats, not to encode or decode anything. Users of the data-format object can decide what they want to do with that information. We just need a standard way to communicate it through the buffer protocol.