Changes to `np.loadtxt`
Hi all,

just a brief heads up that https://github.com/numpy/numpy/pull/20580 is now merged. This moves `np.loadtxt` to C, mainly making it much faster. There are also some other improvements and changes, though:

* It now supports `quotechar='"'` to support Excel dialect CSV.
* Parsing of some numbers is stricter (e.g. `_` and hex floats are no longer supported by default).
* `max_rows` now actually counts rows and not lines. A warning is given if this makes a difference (blank lines).
* Some exceptions will change; parsing failures now (almost) always give an informative `ValueError`.
* `converters=callable` is now valid to provide a single converter for all columns.

Please test, and let us know if there is any issue or followup you would like to see.

We do have possible followups planned:

* Consider deprecating the `encoding="bytes"` default, which exists for Python 2 compatibility.
* Consider renaming `skiprows` to the more precise `skiplines`.

Moving to C unlocks possible further improvements, such as full `csv.Dialect` support. We do not have this on the roadmap, but such contributions are possible now. Similarly, it should be possible to rewrite `genfromtxt` based on this work.

Cheers,

Sebastian
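For anyone who wants to try the two new options from the list above, here is a minimal sketch. The sample data, the `record_dtype` layout and the `decimal_comma` helper are made up for illustration, and it assumes a NumPy version that already includes the merged change. `encoding=None` is passed explicitly so that the converter receives `str` rather than `bytes`.

```python
import io

import numpy as np

# Hypothetical sample data: the first field is quoted and contains the delimiter.
quoted_csv = '"alpha, first",10.0\n"beta, second",2.0\n'
record_dtype = np.dtype([("label", "U16"), ("value", float)])

# With quotechar='"', a comma inside quotes is kept as part of the field.
records = np.loadtxt(
    io.StringIO(quoted_csv),
    dtype=record_dtype,
    delimiter=",",
    quotechar='"',
    encoding=None,
)


def decimal_comma(field):
    """Illustrative converter: parse numbers written with a decimal comma."""
    return float(field.replace(",", "."))


# A single callable is applied to every column, no per-column dict needed.
values = np.loadtxt(
    io.StringIO("1,5;2,5\n3,0;4,5\n"),
    delimiter=";",
    converters=decimal_comma,
    encoding=None,
)
```

Here `records["label"]` should keep the embedded commas, and `values` should be a 2x2 float array.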
Hi all,

following up on my announcement from Tue, 2022-02-08 (08:08 -0600): these changes came up in https://github.com/numpy/numpy/issues/21852, where a user's use case was to look up the line number up to which they want to read a file.

The change is that `max_rows` now:

* Represents the number of *rows* in the result.
* Gives a `UserWarning` when empty lines are skipped as a result.

Previously, `max_rows` counted the number of *lines* (except those skipped initially). The difference shows up for a file formatted like:

    1,2,3
    # comment
    2,3,4

The work-around to get the old behavior back is:

    import itertools

    lines = itertools.islice(open("file"), 0, max_rows)
    result = np.loadtxt(lines, ...)

(This is noted in the release notes and in the `UserWarning`, although the warning text could be improved.)

There are three possible "actions" I can think of:

1. We can add `max_lines` to do the `itertools` trick for the user.
2. If the change is considered too big, we could revert it.
3. We could revert and deprecate the name in favor of new ones, e.g. `nrows` and `nlines`.

As an additional point of reference, `pandas.read_csv` has `nrows`, which matches the new behavior.

I do not have a strong opinion. I lean towards the new meaning; Chuck prefers the old meaning (I think). One argument for me was that users may also be reading too little data right now, thinking `max_rows` already has the new meaning (i.e. we fix a bug for them).

Cheers,

Sebastian
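To make the difference concrete, the following sketch runs both meanings on the three-line example from this message. It reads from an in-memory string instead of a file, the variable names are only illustrative, and it assumes a NumPy version that includes the new `max_rows` behavior. Warnings are silenced because the new implementation may warn about lines that do not count as rows.

```python
import io
import itertools
import warnings

import numpy as np

# The example input: two data rows separated by a comment line.
text = "1,2,3\n# comment\n2,3,4\n"
max_rows = 2

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # New meaning: max_rows counts rows in the result, so the comment
    # line does not use up one of the two allowed rows.
    by_rows = np.loadtxt(io.StringIO(text), delimiter=",", max_rows=max_rows)

# Old meaning via the work-around: cut the input after max_rows *lines*
# first, then parse whatever data is left in those lines.
lines = itertools.islice(io.StringIO(text), 0, max_rows)
by_lines = np.loadtxt(lines, delimiter=",")
```

With the new behavior, `by_rows` should hold both data rows, while `by_lines` only holds the first one because the comment line used up the second slot.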