Originally, I had planned to make the extension of the .npy file format a
dedicated follow-up pull request, but I have upgraded my current pull
request instead, since it was not as difficult to implement as I initially
thought and is probably the more straightforward solution.
What is this pull request about? It is about appending to Numpy .npy files.
Why? I see two main use cases:
1. creating .npy files larger than main memory. Once finished, they can
be loaded as memory maps
2. creating binary log files, which can be processed very efficiently
Are there not other good file formats for this? Theoretically yes, but in
practice they can be pretty complex, and with very little tweaking .npy
could support efficient appending too.
Use case 1 is already covered by the Pip/Conda package npy-append-array I
created, and getting the functionality directly into Numpy was the
original goal of this pull request. This would have been possible without
introducing a new file format version, just by adding some spare space to
the header. During the pull request discussion, it turned out that
rewriting the header after each append would be desirable to minimize data
loss in case the writing program crashes.
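The header-rewriting variant of use case 1 can be sketched with NumPy's existing public `np.lib.format` helpers. The helper name `append_rows` is hypothetical and not part of the pull request; the sketch assumes a C-order version 1.0 file and that the rewritten header still fits in the space `np.save` originally padded it to (true as long as the shape string does not outgrow the 64-byte alignment padding):

```python
import numpy as np
from numpy.lib import format as npf

def append_rows(filename, rows):
    """Append rows along axis 0, then rewrite the header in place
    so the file remains a valid .npy after every append."""
    rows = np.ascontiguousarray(rows)
    with open(filename, "rb+") as fp:
        assert npf.read_magic(fp) == (1, 0)    # sketch handles v1.0 only
        shape, fortran_order, dtype = npf.read_array_header_1_0(fp)
        assert not fortran_order and rows.dtype == dtype
        fp.seek(0, 2)                          # jump to end of file
        fp.write(rows.tobytes())               # append the raw data
        fp.seek(0)                             # rewrite magic + header
        npf.write_array_header_1_0(fp, {
            "shape": (shape[0] + len(rows),) + shape[1:],
            "fortran_order": False,
            "descr": npf.dtype_to_descr(dtype),
        })
```

Starting from `np.save(fn, np.zeros((0, 3)))`, repeated `append_rows(fn, batch)` calls grow the file while `np.load` keeps working between appends.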
Use case 2, however, would profit greatly from a new file format version,
as it would make rewriting the header unnecessary: since efficient
appending can only take place along one axis, setting shape[-1] = -1 in
case of Fortran order, or shape[0] = -1 otherwise (default), in the .npy
header on file creation could indicate that the array size is determined
by the file size: when np.load (typically with memory mapping enabled) is
called, it constructs the ndarray with the actual shape by replacing the
-1 in the constructor call. After creation, the header is never modified
again, neither on append nor when writing finishes.
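The shape-inference scheme above can be sketched in a few lines. `load_growable` and the `header_len` parameter are illustrative assumptions, not part of the actual proposal; the sketch assumes a C-order file whose grown axis is derived from the file size rather than the header:

```python
import os
import numpy as np

def load_growable(filename, dtype, fixed_shape, header_len=128):
    """Map a file whose leading axis is determined by the file size,
    mimicking what np.load could do for a shape[0] == -1 header."""
    row_bytes = np.dtype(dtype).itemsize * int(np.prod(fixed_shape, dtype=int))
    data_bytes = os.path.getsize(filename) - header_len
    n_rows = data_bytes // row_bytes   # a partially written row is ignored
    return np.memmap(filename, dtype=dtype, mode="r",
                     offset=header_len, shape=(n_rows, *fixed_shape))
```

Because the shape is recomputed on every load, a reader always sees all complete rows written so far, without the writer ever touching the header.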
Concurrent appends to a single file would not be advisable and should be
channeled through a single AppendArray instance. Concurrent reads while
writes take place, however, should work relatively smoothly: every time
np.load (ideally with mmap) is called, the resulting ndarray provides
access to all data written up to that point.
Currently, my pull request provides:
1. a definition of .npy version 4.0 that supports -1 in the shape
2. implementations for Fortran order and non-Fortran order (default),
including test cases
3. an updated np.load
4. the AppendArray class that does the actual appending
Although there is a certain hassle with introducing a new .npy version, the
changes themselves are very small. I could also implement a fallback mode
for older Numpy installations, if someone is interested.
What do you think about such a feature, would it make sense? Anyone
available for some more code review?
Best from Berlin, Michael
PS: thank you so far, I could improve my npy-append-array module as well,
and from what I have seen so far, the Numpy code readability exceeded my
already high expectations.
(sorry for the length, details/discussion below)
On the triage call, there seemed to be a preference to just skip the
deprecation and introduce `copy="never"`, `copy="if_needed"`, and
`copy="always"` (i.e. string options for the `copy` keyword argument).
Strictly speaking, this is against the typical policy (one year of
warnings/errors). But nobody could think of a reasonable chance that
anyone actually uses these strings. (For me, just "policy" would be
enough of an argument to take it slow.)
BUT: If nobody has *any* concerns at all, I think we may just end up
introducing the change right away.
The PR is: https://github.com/numpy/numpy/pull/19173
## The Feature
There is the idea to add `copy="never"` (or similar). This would modify
the existing `copy` argument to make it a 3-way decision:
* `copy="always"` or `copy=True` to force a copy
* `copy="if_needed"` or `copy=False` to prefer no-copy behavior
* `copy="never"` to error when no-copy behavior is not possible
  (this ensures that a view is returned)
This would affect the following functions:
* np.array(object, copy=...)
* arr.astype(new_dtype, copy=...)
* np.reshape(arr, new_shape, copy=...), and the method arr.reshape()
* np.meshgrid, and possibly others
`reshape` currently does not have the option and would benefit from it by
allowing `arr.reshape(-1, copy="never")`, which would guarantee a view
(or raise an error).
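To see why this matters for `reshape`: whether a view can be returned depends on memory layout, which the caller cannot always see, so today a copy can happen silently. The snippet below uses only current NumPy behavior to show both cases; a hypothetical `copy="never"` would turn the silent copy into an error:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat = a.reshape(-1)            # contiguous source: a view is possible
assert np.shares_memory(a, flat)

flat_t = a.T.reshape(-1)        # non-contiguous source: silently copies
assert not np.shares_memory(a, flat_t)
```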
## The Options
We have three options that are currently being discussed:
1. We introduce a new `np.CopyMode` or `np.<something>.Copy` Enum
   with values `np.CopyMode.NEVER`, `np.CopyMode.IF_NEEDED`, and
   `np.CopyMode.ALWAYS`:
   * Plus: no compatibility concerns.
   * Downside(?): this would be a first in NumPy, and is untypical
     API due to that.
2. We introduce `copy="never"`, `copy="if_needed"`, and `copy="always"`
   as strings (all other strings will raise a `TypeError`):
   * Problem: `copy="never"` currently means `copy=True` (the opposite),
     which means new code has to take care when it may run on
     older NumPy versions, and in theory old code could
     return the wrong thing.
   * Plus: strings are typical for options in NumPy currently.
3. Same as 2, but we take it very slow: make strings an error right now
   and only introduce the new options after two releases, as per the
   typical deprecation policy.
We discussed it briefly today in the triage call and were leaning
towards option 2. I was honestly expecting us to converge on option 3 to
avoid compatibility issues (mainly surprises with `copy="never"` on older
versions). But considering how weird it is to currently pass
`copy="never"`, the question was whether we should not just change it
with a release note.
The probability of someone currently passing exactly one of those three
(and no other) strings seems exceedingly small.
Personally, I don't have much of an opinion. But if *nobody* voices
any concern about just changing the meaning of the string inputs, I
think the current default may be to just do it.
Our next Newcomers' Hour will be held tomorrow, March 24th, at 4 pm UTC. We
have no agenda this time. Stop by to ask questions or just to say hi.
Join the meeting via Zoom: https://us02web.zoom.us/j/87192457898
NumPy Contributor Experience Lead
I would like to share the first formal draft of
NEP 50: Promotion rules for Python scalars
with everyone. The full text can be found here:
NEP 50 is an attempt to remove value-based casting/promotion. We wish
to replace it with clearer rules for the resulting dtype when mixing
NumPy arrays and Python scalars. As a brief example, the proposal
allows the following (unchanged):
>>> np.array([1, 2, 3], dtype=np.int8) + 100
array([101, 102, 103], dtype=int8)
While clearing up confusion caused by the value-inspecting behavior
that we see sometimes, such as:
>>> np.array([1, 2, 3], dtype=np.int8) + 300
array([301, 302, 303], dtype=int16)  # note the int16
Where 300 is too large to fit an ``int8``. As well as removing the
special behavior of 0-D arrays or NumPy scalars:
>>> res = np.array(1, dtype=np.int8) + 100
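The two integer examples above can be probed directly to see which rule set a given NumPy uses. This is a hedged sketch, not part of the NEP: the first case is unchanged by NEP 50, while the second differs (legacy value-based rules upcast to int16; under NEP 50, a Python int that does not fit the array dtype raises instead):

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8) + 100
assert small.dtype == np.int8          # identical under both rule sets

try:
    large = np.array([1, 2, 3], dtype=np.int8) + 300
    print("legacy promotion:", large.dtype)   # int16 under value-based rules
except OverflowError:
    print("NEP 50 promotion: 300 does not fit int8")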
This is the continuation of a long discussion (see the "Discussion"
section), including the poll I once posted:
I would be happy about any feedback, be it just editorial or fundamental
discussion. There are many alternatives, which I have tried to capture
in the NEP.
So let's discuss here, or on discuss:
For smaller edits, don't hesitate to open a NumPy PR, or propose edits
on my branch (you can use the edit button to create a PR):
An important part of moving forward will be assessing the real world
impact. To start that process, I have created a branch as a draft PR
(at this time):
It is missing some parts, but should allow preliminary testing. The
main missing part is that the integer warnings and errors are less
strict than proposed in the NEP.
It would be invaluable to get a better idea of the extent to which
existing code, especially end-user code, is affected by the proposed
changes.
Thanks in advance for any input! This is a big, complicated proposal,
but finding a way forward will hopefully clear up a source of confusion
and inconsistencies that makes both maintainers' and users' lives harder.
The next NumPy Newcomers' Hour will be held this Thursday, June 30th, at
4 pm UTC.
Ryan C. Cooper, an assistant professor-in-residence at the University of
Connecticut (Mansfield, Connecticut, USA), will share how he uses NumPy in
his Engineering classes, from individual student research to semester-long
courses. We will talk about lessons learned and key strategies to motivate
and engage new Python users.
Join us via Zoom: https://us02web.zoom.us/j/87192457898
Contributor Experience Lead | NumPy
Hi Numpy maintainers,
Would you be interested in integrating continuous fuzzing by way of
OSS-Fuzz? Fuzzing is a way to automate test-case generation and has been
heavily used for memory unsafe languages. Recently efforts have been put
into fuzzing memory safe languages and Python is one of the languages
where it would be great to use fuzzing.
In this PR: https://github.com/google/oss-fuzz/pull/7681 I did an
initial integration into OSS-Fuzz. Essentially, OSS-Fuzz is a free
service run by Google that performs continuous fuzzing of important open
source projects.
If you would like to integrate, the only thing I need is a list of
email(s) that will get access to the data produced by OSS-Fuzz, such as
bug reports, coverage reports and more stats. Notice the emails
affiliated with the project will be public in the OSS-Fuzz repo, as they
will be part of a configuration file.
There are already some important Python projects on OSS-Fuzz, and it
would be great to add Numpy to the list.
Let me know your thoughts on this and if you have any questions, as I’m
happy to clarify or go into more detail about fuzzing.
ADA Logics Ltd is registered in England. No: 11624074.
Registered office: 266 Banbury Road, Post Box 292,
OX2 7DL, Oxford, Oxfordshire, United Kingdom
A function to get the minimum and maximum values of an array simultaneously could be very useful, from both a convenience and a performance point of view. Especially when arrays get larger, the performance benefit could be significant, and even more so if the array doesn't fit in the L2/L3 cache or even in memory.
There are many cases where not just the minimum or the maximum of an array is required, but both. Think of clipping an array, getting its range, checking for outliers, normalizing, making a plot like a histogram, etc.
This function could be called aminmax(), for example, and also be callable as ndarray.minmax(). It should return a tuple (min, max) with the minimum and maximum values of the array, identical to calling (ndarray.min(), ndarray.max()).
With such a function, numpy.ptp() and the special cases of numpy.quantile(a, q=[0, 1]) and numpy.percentile(a, q=[0, 100]) could also potentially be sped up, among others.
Potentially argmin and argmax could get the same treatment, being called argminmax().
There is also a very extensive post on Stack Overflow (a bit old already) with discussion and benchmarks: https://stackoverflow.com/questions/12200580/numpy-function-for-simultaneou…
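The intended behavior can be sketched in pure NumPy. `aminmax` is the hypothetical name from the proposal, not an existing NumPy function, and the chunked loop only approximates the cache benefit a real single-pass C implementation would get:

```python
import numpy as np

def aminmax(a, chunk=1 << 20):
    """Return (a.min(), a.max()), scanning the array chunk by chunk so
    each chunk stays cache-resident for both reductions."""
    a = np.ravel(a)
    if a.size == 0:
        raise ValueError("aminmax of an empty array")
    lo, hi = a[0], a[0]
    for start in range(0, a.size, chunk):
        part = a[start:start + chunk]
        lo = min(lo, part.min())
        hi = max(hi, part.max())
    return lo, hi
```

For example, `aminmax(np.array([3, -1, 7, 0]))` returns `(-1, 7)`, matching `(a.min(), a.max())` while reading each element through the cache hierarchy only once per chunked pass.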