Originally, I had planned to make the extension of the .npy file format a
dedicated follow-up pull request, but I have upgraded my current request
instead, since it was not as difficult to implement as I initially thought
and is probably the more straightforward solution:
What is this pull request about? It is about appending to Numpy .npy files.
Why? I see two main use cases:
1. Creating .npy files larger than main memory; once finished, they can
be loaded as memory maps
2. Creating binary log files, which can be processed very efficiently
Are there not other good file formats for this? Theoretically yes, but
practically they can be pretty complex, and with very little tweaking .npy
could support efficient appending too.
Use case 1 is already covered by the Pip/Conda package npy-append-array I
created; getting the functionality directly into Numpy was the original
goal of this pull request. This would have been possible without
introducing a new file format version, just by adding some spare space in
the header. During the pull request discussion it turned out that rewriting
the header after each append would be desirable to minimize data loss in
case the writing program crashes.
Use case 2, however, would profit greatly from a new file format version,
as it would make rewriting the header unnecessary: since efficient
appending can only take place along one axis, setting shape[-1] = -1 in the
.npy header on file creation in case of Fortran order, or shape[0] = -1
otherwise (the default, C order), could indicate that the array size is
determined by the file size: when np.load (typically with memory mapping
enabled) gets called, it constructs the ndarray with the actual shape by
replacing the -1 in the constructor call. Otherwise, the header is never
modified again, neither on append nor when file writing finishes.
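The size-from-file idea can be sketched in plain Python. This is an illustrative toy, not the proposed format: it uses a raw binary file with no .npy header at all, and `row_shape` and `log.bin` are made-up names; only the leading axis is recovered from the file size, which is the mechanism the -1 shape entry would formalize.

```python
import os
import numpy as np

dtype = np.dtype(np.float64)
row_shape = (3,)              # fixed trailing dimensions
path = "log.bin"              # stands in for the .npy payload

# Appends just write raw rows; no header is ever touched.
with open(path, "wb") as f:
    np.arange(6, dtype=dtype).reshape(2, 3).tofile(f)
with open(path, "ab") as f:
    np.arange(3, dtype=dtype).reshape(1, 3).tofile(f)

# On load, resolve the -1: leading axis = payload size / row size.
row_bytes = dtype.itemsize * int(np.prod(row_shape))
n_rows = os.path.getsize(path) // row_bytes
arr = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows,) + row_shape)
print(arr.shape)  # (3, 3)
```

A reader that re-runs the last three lines while a writer is appending would see all rows fully written at that moment, which is exactly the concurrent-read behavior described below.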
Concurrent appends to a single file would not be advisable and should be
channeled through a single AppendArray instance. Concurrent reads while
writes take place, however, should work relatively smoothly: every time
np.load (ideally with mmap) is called, the resulting ndarray provides
access to all data written up to that point.
Currently, my pull request provides:
1. A definition of .npy version 4.0 that supports -1 in the shape
2. Implementations for Fortran order and non-Fortran order (default),
including test cases
3. An updated np.load
4. The AppendArray class that does the actual appending
Although introducing a new .npy version involves a certain amount of
hassle, the changes themselves are very small. I could also implement a
fallback mode for older Numpy installations, if someone is interested.
What do you think about such a feature, would it make sense? Anyone
available for some more code review?
Best from Berlin, Michael
PS: thank you so far; I was able to improve my npy-append-array module as
well, and from what I have seen, the Numpy code's readability exceeded my
already high expectations.
(sorry for the length, details/discussion below)
On the triage call, there seemed to be a preference to just skip the
deprecation and introduce `copy="never"`, `copy="if_needed"`, and
`copy="always"` (i.e. string options for the `copy` keyword argument).
Strictly speaking, this is against the typical policy (one year of
warnings/errors). But nobody could think of a realistic scenario in which
anyone actually uses these strings. (For me, "policy" alone would be
enough of an argument to take it slow.)
BUT: If nobody has *any* concerns at all, I think we may just end up
introducing the change right away.
The PR is: https://github.com/numpy/numpy/pull/19173
## The Feature
There is the idea to add `copy=never` (or similar). This would modify
the existing `copy` argument to make it a 3-way decision:
* `copy="always"` or `copy=True` to force a copy
* `copy="if_needed"` or `copy=False` to prefer no-copy behavior
* `copy="never"` to error when no-copy behavior is not possible
  (this ensures that a view is returned)
This would affect the functions:
* np.array(object, copy=...)
* arr.astype(new_dtype, copy=...)
* np.reshape(arr, new_shape, copy=...), and the method arr.reshape()
* np.meshgrid, and possibly others
`reshape` currently does not have a `copy` option and would benefit from
one: `arr.reshape(-1, copy="never")` would guarantee that a view is
returned (or an error raised).
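The "never" behavior for `reshape` can already be emulated with a small wrapper. `reshape_view` is a hypothetical helper name, not part of any API: it reshapes, then checks with `np.shares_memory` whether a copy was made and raises instead of returning one.

```python
import numpy as np

def reshape_view(arr, shape):
    """Hypothetical copy="never" reshape: raise instead of copying."""
    out = arr.reshape(shape)
    if not np.shares_memory(arr, out):
        raise ValueError("reshape would require a copy")
    return out

a = np.arange(6).reshape(2, 3)
v = reshape_view(a, -1)        # contiguous: a view is possible
assert np.shares_memory(a, v)

try:
    reshape_view(a.T, -1)      # flattening a transpose must copy
except ValueError:
    print("copy would have been required")
```

The proposed keyword would give the same guarantee without the after-the-fact check (and without first materializing the copy).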
## The Options
We have three options that are currently being discussed:
1. We introduce a new `np.CopyMode` or `np.<something>.Copy` Enum
   with values `np.CopyMode.NEVER`, `np.CopyMode.IF_NEEDED`, and
   `np.CopyMode.ALWAYS`:
   * Plus: No compatibility concerns
   * Downside(?): This would be a first in NumPy, and is untypical
     API due to that.
2. We introduce `copy="never"`, `copy="if_needed"`, and `copy="always"`
   as strings (all other strings will be a `TypeError`):
   * Problem: `copy="never"` currently means `copy=True` (the opposite),
     which means new code has to take care when it may run on
     older NumPy versions, and in theory old code could
     return the wrong thing.
   * Plus: Strings are typical for options in NumPy currently.
3. Same as 2, but we take it very slow: make strings an error right now
   and only introduce the new options after two releases, as per the
   typical deprecation policy.
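The compatibility hazard in option 2 comes from plain Python truthiness: today `copy` is treated as a boolean, and any non-empty string is truthy, so `copy="never"` currently requests a copy, the exact opposite of the intended meaning.

```python
# Any non-empty string is truthy, so code doing `if copy:` sees
# every one of the proposed string modes as copy=True today.
for mode in ("never", "if_needed", "always"):
    assert bool(mode) is True
print("all string modes are truthy")
```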
We discussed it briefly today in the triage call and were leaning
toward option 2. I was honestly expecting us to converge on option 3
to avoid compatibility issues (mainly surprises with `copy="never"` on
older versions).
But considering how weird it currently is to pass `copy="never"`, the
question was whether we should just change it, with a release note.
The probability of someone currently passing exactly one of those three
(and no other) strings seems exceedingly small.
Personally, I don't have much of an opinion. But if *nobody* voices
any concern about just changing the meaning of the string inputs, I
think the current default may be to just do it.
Just a brief heads up that the pull request moving `np.loadtxt` to C is
now merged, mainly making it much faster. There are also some other
improvements and changes:
* It now supports `quotechar='"'` to support Excel dialect CSV.
* Parsing of some numbers is stricter (e.g. support for `_` separators
  or hex floats was removed by default).
* `max_rows` now actually counts rows and not lines. A warning
is given if this makes a difference (blank lines).
* Some exceptions will change; parsing failures now (almost) always
  give an informative `ValueError`.
* `converters=callable` is now valid to provide a single converter
for all columns.
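Two of the changes above can be shown in a few lines, assuming NumPy >= 1.23 (where `quotechar` and callable `converters` were introduced); the sample data is made up:

```python
import io
import numpy as np

# quotechar: the comma inside the quotes is data, not a delimiter.
data = io.StringIO('"1.5",2.0\n"3.5",4.0\n')
arr = np.loadtxt(data, delimiter=",", quotechar='"')
print(arr[0, 0])  # 1.5

# A single callable converter now applies to every column.
pct = io.StringIO("10%,20%\n30%,40%\n")
arr2 = np.loadtxt(pct, delimiter=",", encoding="utf-8",
                  converters=lambda s: float(s.rstrip("%")) / 100)
print(arr2[1, 1])  # 0.4
```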
Please test, and let us know if there is any issue or followup you
would like to see.
We do have possible followups planned:
* Consider deprecating the `encoding="bytes"` default which exists
for Python 2 compatibility.
* Consider renaming `skip_rows` to the more precise `skip_lines`.
Moving to C unlocks possible further improvement, such as full
`csv.Dialect` support. We do not have this on the roadmap, but such
contributions are possible now.
Similarly, it should be possible to rewrite `genfromtxt` based on this
new implementation.
Numpy v1.21.2 <https://github.com/numpy/numpy/releases/tag/v1.21.2> added
support for windows/arm64 platforms, but we still don't have any systems in
place to produce binary wheels or test win/arm64 packages. I think it would
be good to start looking into this. CPython has an official buildbot worker
running for win/arm64, and official Python support for the platform will be
available from the 3.11 release.
It is not yet clear to me how the build and CI system for numpy is deployed
and how to enable support for a new platform like win/arm64.
One of the main issues in supporting win/arm64 builds is the lack of
win/arm64 VMs available in the cloud. But I see we have been producing
binary wheels for Apple M1 platforms on PyPI and conda repositories for
some time, a platform which also lacks cloud VM support. I think we could
take some lessons from the Apple M1 support and look at how a similar
strategy could be used for win/arm64.
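As a very rough sketch of one possible direction (this is a hypothetical, untested fragment; the job name, action versions, and skip pattern are all illustrative): cibuildwheel can cross-compile Windows ARM64 wheels on ordinary x86-64 Windows runners, which would sidestep the missing-VM problem for building, though not for testing.

```yaml
# Hypothetical GitHub Actions job: cross-compile win/arm64 wheels
# on a regular x86-64 Windows runner via cibuildwheel.
jobs:
  build_windows_arm64:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build wheels
        uses: pypa/cibuildwheel@v2.12.0
        env:
          CIBW_ARCHS_WINDOWS: ARM64
          # Tests cannot run on the x86-64 build host.
          CIBW_TEST_SKIP: "*-win_arm64"
```

Running the test suite would still need real ARM64 hardware, which is where the M1 experience may be most relevant.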
I would like to hear if anyone has any thoughts on this topic. Also, any
pointers to understand numpy wheel generation and CI flow for similar
platforms would be helpful as well.
In regard to the feature request https://github.com/numpy/numpy/issues/16469:
it was suggested to bring it to the mailing list. I think I can make a
strong case for why support for this naming convention would make sense,
namely that it would follow other frameworks that often work alongside
numpy, such as tensorflow. For backward compatibility, it can simply be an
alias.
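A backward-compatible alias really is this small. The sketch below binds a hypothetical `concat` name to the existing function at module scope; this is what the request amounts to, not something NumPy currently ships:

```python
import numpy as np

# Hypothetical alias: `concat` is just another name bound to the
# existing function, so nothing changes for np.concatenate users.
concat = np.concatenate

a = np.array([1, 2])
b = np.array([3, 4])
print(concat([a, b]))  # [1 2 3 4]
assert concat is np.concatenate
```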
I often convert portions of code from tf to np; it is as simple as changing
the base module from tf to np, e.g. tf.expand_dims -> np.expand_dims. This
is done either while debugging (e.g. converting tf to np, without eager
execution, to debug a portion of the code) or during prototyping (e.g.
develop in numpy, then convert to tf).
I have found myself on more than one occasion getting errors because of
this particular function, np.concatenate: the name is unnecessarily long. I
imagine there are more people who run into the same problem. Pandas uses
concat (torch, on the other extreme, uses simply cat, which I don't think
is as descriptive).
FYI, I noticed this package that claimed to be maintained by us:
https://pypi.org/project/numpy-aarch64/. That's not ours, so I tried to
contact the author (no email provided, but guessed the same username on
GitHub) and asked to remove it:
There are a very large number of packages with "numpy" in the name on PyPI,
and there's no way we can audit/police that effectively, but if it's a
rebuild that pretends like it's official then I think it's worth doing
something about. It could contain malicious code for all we know.
Our next Newcomers' Hour is tomorrow, February 24th at 4 pm UTC. We have no
agenda this time. Stop by to ask questions or just to say hi.
Join the meeting via Zoom: https://us02web.zoom.us/j/87192457898
NumPy Contributor Experience Lead