[Numpy-discussion] Bytes vs. Unicode in Python3

Thu Dec 3 08:56:16 EST 2009

Pauli Virtanen wrote:
> Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
> [clip]
>   
>> Great! Are you storing the format string in the dtype types as well? (So
>> that no release is needed and acquisitions are cheap...)
>>     
>
> I regenerate it on each buffer acquisition. It's simple low-level C code, 
> and I suspect it will always be fast enough. Of course, we could *cache* 
> the result in the dtype. (If dtypes are immutable, which I don't remember 
> right now.)
>   
We discussed this at SciPy 09. Basically, they are not necessarily 
immutable in implementation, but anywhere they are mutable is a bug, and 
no code should depend on their mutability, so we are free to assume 
that they are immutable.

> Do you have a case in mind where the speed of format string generation 
> would be a bottleneck?
>   
Going all the way down to user code: no. Well, a contrived one: you have a 
Python list of NumPy arrays and want to sum over the first element of 
each, acquiring the buffer by PEP 3118 (which is easy through Cython). 
In that case I can see all the memory allocation that must go on for 
the format string of each element becoming a bottleneck.
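
A minimal sketch of that contrived case, written in plain Python with 
memoryview instead of Cython (the array contents and the name `arrays` 
are made up for illustration). Each pass through the loop acquires the 
buffer once, which is where the format string would be regenerated:

```python
import numpy as np

# Hypothetical input: a Python list of small 1-D NumPy arrays.
arrays = [np.arange(i, i + 4, dtype=np.float64) for i in range(5)]

total = 0.0
for a in arrays:
    view = memoryview(a)   # one PEP 3118 buffer acquisition per array
    total += view[0]       # only the first element is actually used
    view.release()         # give the buffer back explicitly

print(total)
```

The work done per element is tiny, so any per-acquisition allocation 
would make up a large share of the loop's cost.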

But mostly it's about cleanliness of implementation, like the fact that 
you don't know up-front how long the string needs to be for nested dtypes.
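
To illustrate the point about string length, here is a small sketch that 
compares the format strings exposed for a flat and a nested dtype. The 
field names are invented, and the exact strings depend on the NumPy 
version, so none are shown here:

```python
import numpy as np

# Hypothetical dtypes: one scalar, one record with a nested record inside.
flat = np.dtype(np.float64)
nested = np.dtype([("a", "<i4"),
                   ("b", [("x", "<f8"), ("y", "<f8")])])

# memoryview exposes the PEP 3118 format string produced on acquisition.
flat_fmt = memoryview(np.zeros(1, dtype=flat)).format
nested_fmt = memoryview(np.zeros(1, dtype=nested)).format

print(repr(flat_fmt), len(flat_fmt))
print(repr(nested_fmt), len(nested_fmt))
```

The nested string grows with every field and every level of nesting, so 
its length is only known after walking the whole dtype.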

Obviously, what you have done is much better than nothing, and probably 
sufficient for nearly all purposes, so I should stop complaining.
>
>> Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
>> conservative with NumPy arrays. If a contiguous buffer is requested,
>> then looping through the strides and checking that the strides are
>> monotonically decreasing/increasing could eventually save copying in
>> some cases. I think that could be worth it -- I actually have my own
>> code for IS_F_CONTIGUOUS rather than relying on the flags personally
>> because of this issue, so it does come up in practice.
>>     
>
> Are you sure?
>
> Assume monotonically increasing or decreasing strides with inner stride 
> of itemsize. Now, if the strides are not C or F-contiguous, doesn't this 
> imply that part of the data in the memory block is *not* pointed to by a 
> set of indices? [For example, strides = {itemsize, 3*itemsize}; dims = 
> {2, 2}. Now, there is unused memory between items (1,0) and (0,1).]
>
> This probably boils down to what exactly was meant in the PEP and Python 
> docs by "contiguous". I'd believe it was meant to be the same as in Numpy 
> -- that you can send the array data e.g. to Fortran as-is. If so, there 
> should not be gaps in the data, if the client explicitly requested that 
> the buffer be contiguous.
>
> Maybe you meant that the Numpy array flags (which the macros check) are 
> not always up-to-date wrt. the stride information?
>   
Yep, this is what I meant, and the rest is wrong. But now that I think 
about it, the case that bit me is

In [14]: np.arange(10)[None, None, :].flags.c_contiguous
Out[14]: False

I suppose this particular case could be fixed properly with little cost 
(if it isn't already). It is probably cleaner to just rely on the flags 
for PEP 3118, less confusion etc. Sorry for the distraction.
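
For reference, a rough sketch of a contiguity check computed from shape 
and strides rather than from the flags. The helper `is_contiguous` is 
hypothetical (it is not a NumPy function); it skips axes of length 1, 
which is what the case above needs, and it still rejects Pauli's example 
with a gap in memory:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def is_contiguous(shape, strides, itemsize, order="C"):
    """Hypothetical helper: contiguity derived from shape and strides only."""
    if 0 in shape:
        return True                      # empty arrays have no data to lay out
    axes = range(len(shape))
    if order == "C":
        axes = reversed(axes)            # C order: last axis varies fastest
    expected = itemsize
    for ax in axes:
        if shape[ax] == 1:
            continue                     # the stride of a length-1 axis is never used
        if strides[ax] != expected:
            return False
        expected *= shape[ax]
    return True


# The case from above: extra length-1 axes in front of a contiguous array.
a = np.arange(10)[None, None, :]
print(is_contiguous(a.shape, a.strides, a.itemsize, "C"))

# Pauli's example: dims = {2, 2}, strides = {itemsize, 3*itemsize}.
base = np.arange(6, dtype=np.int64)
gap = as_strided(base, shape=(2, 2), strides=(8, 24))
print(gap.tolist())                      # element 2 of base is never referenced
print(is_contiguous(gap.shape, gap.strides, gap.itemsize, "C"),
      is_contiguous(gap.shape, gap.strides, gap.itemsize, "F"))
```

Whether the flags already agree with such a check for the length-1 case 
depends on the NumPy version.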

Dag Sverre