[Numpy-discussion] Changing the size of PyArrayObject_fields (the ndarray c-struct)

Sebastian Berg sebastian at sipsolutions.net
Sat Nov 21 20:50:41 EST 2020


On Sat, 2020-11-21 at 14:42 -0700, Charles R Harris wrote:
> On Sat, Nov 21, 2020 at 12:28 PM Matti Picus <matti.picus at gmail.com>
> wrote:
> 
> > PyArrayObject_fields is the c-struct that underlies ndarray. It is
> > defined in ndarraytypes.h [0]. Since version 1.7, we have been trying
> > to hide it from the public C-API so that we can freely modify it; the
> > structure has the comment:
> > 
> > 
> >   * It has been recommended to use the inline functions defined below
> >   * (PyArray_DATA and friends) to access fields here for a number of
> >   * releases. Direct access to the members themselves is deprecated.
> >   * To ensure that your code does not use deprecated access,
> >   * #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
> >   * (or NPY_1_8_API_VERSION or higher as required).
> > 
> > 
> > 
> > In order to clean up buffer exports, Sebastian suggested (and I pushed
> > for supporting) PR 16938 [1] which would add a new field to the
> > struct.


TL;DR:  Unless we find real examples of affected users, I currently
think we should just do it. I don't see much gain in pushing it off
(but I don't feel strongly about it). I simply expect exceedingly few
affected users compared to the small gains for everyone else (including
our flexibility for future improvements).


> > As Eric pointed out on the pull request, this would change the size
> > of the struct, meaning users of the struct (i.e. those subclassing
> > it in C) would have to be very careful when interacting with
> > NumPy-generated objects, which may have changed sizes.
> > 

Access to the struct remains ABI compatible; only those relying on its
size will have issues.  The struct size mainly matters when subclassing
in C.  Just to note: Cython does not appear to be affected [0].


Adapting to this change may not be pretty, but working around it should
not be very tricky [1].
Without fixes or recompilation you will hopefully get an error, but
arbitrarily weird things could happen (most likely a crash).

I currently hope that few enough users will run into this that we can
just help everyone who reports an issue...


> > 
> > 
> I think the risk is small and this will probably work; the potential
> problem is if people have extended the essentially private structure
> in C code. That said, it might be better to put it in after 1.20.x is
> branched, so that it is out there for others to test against, and
> perhaps implement


I don't mind waiting, although I also don't really see us learning much,
since only the bigger projects tend to test against master.

Currently, I am aware of only one tiny project that will run into this
and the author of it seemed fine with having to adapt [2].

If we find more projects that are broken by this, I may change my mind.
Until then, I think we should go ahead.


For those who got this far... Aside from code cleanup, the PR also
reduces some overheads. Some quick, approximate timings:

* `memoryview(arr)` is 20% faster. This also helps typed memoryviews
  in Cython, e.g.:

     cdef myfunc(double[::1] data):
         pass

  has the same speed-up (about 15% faster without any function body).
* `arr[...]` is about 20+% faster. This is because the PR speeds up
  deletion of every NumPy array object by a small amount.

I do not know whether this actually matters for real world code, likely
not significantly... (It would be cool to have a pool of real world
benchmarks to test this type of thing).

Cheers,

Sebastian


[0] My attempt at subclassing ndarray in Cython using:

   cimport numpy as np

   cdef class myarr(np.ndarray):
       pass

failed. It achieved nothing but crashes, even with current NumPy.



[1] There are three approaches:

1. Just recompile (using the "deprecated API"):

     struct MyArraySubclass {
         char base[sizeof(PyArrayObject_fields)];
         int my_field;
     };

   And in the module init function, you should add:

     if (sizeof(PyArrayObject_fields) < PyArray_Type.tp_basicsize) {
         PyErr_SetString(PyExc_RuntimeError,
             "Module not binary compatible with NumPy, please recompile.");
         return -1;
     }

2. Do the same as 1., but add a small constant:

     sizeof(PyArrayObject_fields) + constant

   That way you can compile against an old NumPy version, yet remain
   compatible with newer versions that grow the struct. (You still need
   the check while loading the module; see the sketch after this list.)

3. You go to lengths to achieve 100% binary compatibility no matter
   what and avoid `sizeof(PyArrayObject_fields)` entirely. At runtime,
   before calling PyType_Ready() on the subclass, you compute:

     size_t offset_of_myfields = PyArray_Type.tp_basicsize + alignment_padding;
     MyArraySubclass_Type.tp_basicsize = offset_of_myfields + sizeof(my_fields);

   And to access `my_fields` you have to use a small macro such as:

     #define MY_FIELDS(obj) ((my_fields *)((char *)(obj) + offset_of_myfields))
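
To make approach 2 a bit more concrete, here is a minimal sketch of
what the padded struct could look like; the headroom constant and all
names are my own illustration, not anything NumPy provides:

     #include <Python.h>
     #include <numpy/arrayobject.h>

     /* Made-up headroom: reserve more room than the ndarray struct
      * needed at compile time, so newer NumPy versions still fit. */
     #define MY_NUMPY_STRUCT_HEADROOM 64

     typedef struct {
         char base[sizeof(PyArrayObject_fields) + MY_NUMPY_STRUCT_HEADROOM];
         int my_field;
     } MyArraySubclassObject;

and the check in the module init function (after import_array()) becomes:

     /* Refuse to load if the running NumPy's instances have outgrown
      * the room we reserved at compile time. */
     if ((size_t)PyArray_Type.tp_basicsize >
             sizeof(PyArrayObject_fields) + MY_NUMPY_STRUCT_HEADROOM) {
         PyErr_SetString(PyExc_RuntimeError,
             "Module not binary compatible with NumPy, please recompile.");
         return -1;
     }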

My personal guess is that solution 2. is sufficient for most small
projects, as it lets you compile in a forward-compatible way, but the
last one is of course the perfectly clean solution (a fuller sketch of
it follows below).


[2] https://github.com/patmiller/bignumpy
Note that the project could likely be replaced with a simpler solution
that does not require C-subclassing at all.


> fixes if needed. We do need to make this move at some point and can't
> stay fixed on an old structure forever, so the proposed change is
> something of a warning shot.
> 
> Chuck


