[Numpy-discussion] record array performance issue / bug
Charles R Harris
charlesr.harris at gmail.com
Sun Nov 22 11:52:04 EST 2015
On Sat, Nov 21, 2015 at 8:54 PM, G Jones <glenn.caltech at gmail.com> wrote:
> Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
> the following code works OK if npackets = 2, but acts bizarrely if npackets
> is large (2**12):
> import numpy as np
> npackets = 2**12
> # second dtype field reconstructed for illustration; the key is a large subarray
> PacketType = np.dtype([('timestamp', 'float64'),
>                        ('payload', 'uint8', (npackets, 1024))])
> b = np.zeros((1,), dtype=PacketType)
> b['timestamp']  # should return array([ 0.])
> Specifically, if npackets is large, e.g. 2**12 or 2**16, trying to access
> b['timestamp'] results in 100% CPU usage while memory consumption grows by
> hundreds of MB per second. When I interrupt, the traceback points to
> _get_all_field_offsets in numpy/core/_internal.pyc.
> Since it seems to work for small values of npackets, I suspect that the
> access to b['timestamp'] would eventually return if I had the memory and
> time. So I think the issue is that the algorithm doesn't scale well with
> record dtypes made up of many bytes.
> Looking on GitHub, I can see this code has been in flux recently, but I
> can't quite tell whether the issue I'm seeing is addressed by the issues
> being discussed and tackled there.
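For what it's worth, a minimal timing sketch along the lines of your scaling
theory (reusing the illustrative 'payload' layout from your snippet; the
shape is an assumption, not from the original post):

import time
import numpy as np

# Time the field access as the subarray grows; on 1.10.1 the cost should
# climb steeply with the total byte count of the record.
for npackets in (2**4, 2**6, 2**8):
    PacketType = np.dtype([('timestamp', 'float64'),
                           ('payload', 'uint8', (npackets, 1024))])
    b = np.zeros((1,), dtype=PacketType)
    start = time.time()
    b['timestamp']
    print(npackets, time.time() - start)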
This should be fixed in 1.10.2. 1.10.2rc1 is up on SourceForge if you want
to test it.
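A minimal check against the release candidate, assuming the same
illustrative 'payload' layout as above:

import numpy as np

print(np.__version__)  # expect 1.10.2rc1 or later

npackets = 2**12
PacketType = np.dtype([('timestamp', 'float64'),
                       ('payload', 'uint8', (npackets, 1024))])
b = np.zeros((1,), dtype=PacketType)
print(b['timestamp'])  # should print [ 0.] immediately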