Add `nrows` to `genfromtxt`

There is a PR in github that adds a new keyword to the genfromtxt function, to limit the number of rows that actually get read in: https://github.com/numpy/numpy/pull/5103

It is mostly ready to go, and several devs have looked at it without complaining. Since it is an API change, I wanted to check here: if no one has any strong opposition, I will be merging it sometime tomorrow. You have been warned... ;-)

Jaime

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.

On 9/24/2014 2:52 PM, Jaime Fernández del Río wrote:
There is a PR in github that adds a new keyword to the genfromtxt function, to limit the number of rows that actually get read in: https://github.com/numpy/numpy/pull/5103
Sorry to come late to this party, but it seems to me that more versatile than an `nrows` keyword for the number of rows would be a "rows" keyword for a slice argument.
fwiw, Alan Isaac

On 9/24/14, Alan G Isaac <alan.isaac@gmail.com> wrote:
On 9/24/2014 2:52 PM, Jaime Fernández del Río wrote:
There is a PR in github that adds a new keyword to the genfromtxt function, to limit the number of rows that actually get read in: https://github.com/numpy/numpy/pull/5103
Sorry to come late to this party, but it seems to me that more versatile than an `nrows` keyword for the number of rows would be a "rows" keyword for a slice argument.
fwiw, Alan Isaac
I've continued the PR for the addition of the `nrows` (now `max_rows`) argument to `genfromtxt` here: https://github.com/numpy/numpy/pull/5253

Alan's suggestion to use a slice is interesting, but I'd like to see a more concrete proposal for the API. For example, how does it interact with `skip_header` and `skip_footer`? How would one use it to read a file in batches?

The following are a couple of use-cases for `max_rows` (originally added as comments at https://github.com/numpy/numpy/pull/5103):

(1) Read a file in batches:

Suppose the file "a.csv" contains:

    0 10
    1 11
    2 12
    3 13
    4 14
    5 15
    6 16
    7 17
    8 18
    9 19

With `max_rows`, the file can be read in batches of, say, 4:

    In [31]: f = open("a.csv", "r")

    In [32]: genfromtxt(f, dtype=None, max_rows=4)
    Out[32]:
    array([[ 0, 10],
           [ 1, 11],
           [ 2, 12],
           [ 3, 13]])

    In [33]: genfromtxt(f, dtype=None, max_rows=4)
    Out[33]:
    array([[ 4, 14],
           [ 5, 15],
           [ 6, 16],
           [ 7, 17]])

    In [34]: genfromtxt(f, dtype=None, max_rows=4)
    Out[34]:
    array([[ 8, 18],
           [ 9, 19]])

(2) Multiple arrays in a single file:

I've seen a file format of the form

    3 5
    1.0 1.5 2.1 2.5 4.8
    3.5 1.0 8.7 6.0 2.0
    4.2 0.7 4.4 5.3 2.0
    2 3
    89.1 66.3 42.1
    12.3 19.0 56.6

The file contains multiple arrays. Each array is preceded by a line containing the number of rows and columns in that array. The `max_rows` argument would make it easy to read this file with genfromtxt:

    In [7]: f = open("b.dat", "r")

    In [8]: nrows, ncols = genfromtxt(f, dtype=None, max_rows=1)

    In [9]: A = genfromtxt(f, max_rows=nrows)

    In [10]: nrows, ncols = genfromtxt(f, dtype=None, max_rows=1)

    In [11]: B = genfromtxt(f, max_rows=nrows)

    In [12]: A
    Out[12]:
    array([[ 1. ,  1.5,  2.1,  2.5,  4.8],
           [ 3.5,  1. ,  8.7,  6. ,  2. ],
           [ 4.2,  0.7,  4.4,  5.3,  2. ]])

    In [13]: B
    Out[13]:
    array([[ 89.1,  66.3,  42.1],
           [ 12.3,  19. ,  56.6]])

Warren
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
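The batch-reading pattern in use case (1) can be wrapped in a small generator. This is only a sketch, not code from the PR: `iter_batches` is a hypothetical helper, an in-memory `io.StringIO` stands in for "a.csv", and it assumes that `max_rows` behaves as described in the message above and that the data has at least two columns.

```python
import io
import warnings

import numpy as np


def iter_batches(fileobj, batch_size):
    """Yield 2-D arrays of at most `batch_size` rows until the input runs out.

    Hypothetical helper; assumes the data has two or more columns, so that
    np.atleast_2d turns a lone final row into shape (1, ncols).
    """
    while True:
        with warnings.catch_warnings():
            # genfromtxt warns when it finds no data; silence that at the end.
            warnings.simplefilter("ignore")
            batch = np.atleast_2d(np.genfromtxt(fileobj, max_rows=batch_size))
        if batch.size == 0:
            return
        yield batch


# Stand-in for the ten-line file: rows "0 10" through "9 19".
text = "".join(f"{i} {i + 10}\n" for i in range(10))
batches = list(iter_batches(io.StringIO(text), batch_size=4))
print([b.shape for b in batches])
```

The loop ends on the first empty result, so it relies on `genfromtxt` returning an empty array (with a warning) once the file object is exhausted.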

On 11/1/2014 10:31 AM, Warren Weckesser wrote:
Alan's suggestion to use a slice is interesting, but I'd like to see a more concrete proposal for the API. For example, how does it interact with `skip_header` and `skip_footer`? How would one use it to read a file in batches?
I'm probably just not understanding the question, but the initial answer I will give is, "just like the proposal for `max_rows`".
That is, skip_header and skip_footer are honored, and the remainder of the file is sliced. For the equivalent of say `max_rows=500`, one would say `slice_rows=slice(500)`.
Perhaps you could provide an example illustrating the issues this reply overlooks.
Cheers, Alan

On Sat, Nov 1, 2014 at 10:54 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:
On 11/1/2014 10:31 AM, Warren Weckesser wrote:
Alan's suggestion to use a slice is interesting, but I'd like to see a more concrete proposal for the API. For example, how does it interact with `skip_header` and `skip_footer`? How would one use it to read a file in batches?
I'm probably just not understanding the question, but the initial answer I will give is, "just like the proposal for `max_rows`".
That is, skip_header and skip_footer are honored, and the remainder of the file is sliced. For the equivalent of say `max_rows=500`, one would say `slice_rows=slice(500)`.
Perhaps you could provide an example illustrating the issues this reply overlooks.
Cheers, Alan
OK, so `slice_rows=slice(n)` should behave the same as `max_rows=n`. Here's my take on how `slice_rows` could be handled.

I intended the result of `genfromtxt(..., max_rows=n)` to produce the same array as produced by `genfromtxt(...)[:n]`. So a reasonable way to define the behavior of `slice_rows` is that `genfromtxt(..., slice_rows=arg)` returns the same array as `genfromtxt(...)[arg]`. With that specification, it is natural for `slice_rows` to accept any object that is valid for indexing, e.g. `slice_rows=[0,2,3]` or `slice_rows=10`. (But that wouldn't necessarily have to be implemented.)

The two differences between `genfromtxt(..., slice_rows=arg)` and `genfromtxt(...)[arg]` are (1) the former is more efficient--it can simply ignore the rows that won't be part of the final result; and (2) the former doesn't consume the input iterator beyond what is requested by `arg`. For example, `slice_rows=slice(2,10,2)` would consume 10 items from the input (or fewer, if there aren't 10 items in the input). Note that the actual indices for that slice are [2, 4, 6, 8]; even though index 9 is not included in the result, the corresponding item is consumed from the input iterator. (That's how I would interpret it, anyway.)

Because the input argument to `genfromtxt` can be an arbitrary iterator, the use of `slice_rows=slice(n)` is not compatible with the use of `skip_footer=m`. Handling `skip_footer=m` requires looking ahead in the iterator to see if the end of the input is within `m` items, but in general, looking ahead is not possible without consuming the items. (The `max_rows` argument has the same problem. In the current PR, a ValueError is raised if both `skip_footer` and `max_rows` are given.)

Related to this is how to handle `slice_rows=slice(-3)`. Either this is not allowed (for the same reason that `slice_rows=slice(n), skip_footer=m` is disallowed), or it results in the entire iterator being consumed (and it is explained in the docstring that this is the effect of a negative `stop` value in a slice).

Is there wider interest in such an argument to `genfromtxt`? For my use-cases, `max_rows` is sufficient. I can't recall ever needing the full generality of a slice for pulling apart a text file. Does anyone have compelling use-cases that are not handled by `max_rows`?

Warren
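The consumption rule described above can be illustrated without NumPy. The following is a pure-Python sketch of the proposed semantics only: `slice_rows` was never part of `genfromtxt`, and `data_rows` and `select_rows` are hypothetical names. It rejects negative slice values, which is one of the two options mentioned for `slice(-3)`.

```python
from itertools import count


def data_rows(lines, comment="#"):
    """Yield split data rows, skipping blank and comment-only lines."""
    for line in lines:
        fields = line.split(comment, 1)[0].split()
        if fields:
            yield fields


def select_rows(lines, rows):
    """Return the data rows picked by a non-negative slice, consuming the
    input only up to the slice's stop (hypothetical `slice_rows` behavior)."""
    start = rows.start or 0
    step = rows.step or 1
    if start < 0 or step < 0 or (rows.stop is not None and rows.stop < 0):
        raise ValueError("negative values would need the whole input")
    # The index source comes first so that zip stops *before* pulling
    # one row too many from the input.
    indices = count() if rows.stop is None else range(rows.stop)
    return [row for i, row in zip(indices, data_rows(lines))
            if i >= start and (i - start) % step == 0]


lines = iter([f"{i} {i + 10}\n" for i in range(12)])
picked = select_rows(lines, slice(2, 10, 2))   # indices 2, 4, 6, 8
leftover = next(lines)                         # first line not consumed
print(picked)
print(repr(leftover))
```

Ten data rows are consumed for `slice(2, 10, 2)`, so the next line read from the iterator is the one at index 10.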

On Sat, Nov 1, 2014 at 3:15 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Is there wider interest in such an argument to `genfromtxt`? For my use-cases, `max_rows` is sufficient. I can't recall ever needing the full generality of a slice for pulling apart a text file. Does anyone have compelling use-cases that are not handled by `max_rows`?
It is occasionally useful to be able to skip rows after the header. Maybe we should de-deprecate skip_rows and give it a meaning different from skip_header in case of names = None? For example,

    genfromtxt(fname, skip_header=3, skip_rows=1, max_rows=100)

would mean skip 3 lines, read column names from the 4-th, skip 5-th, process up to 100 more lines. This may be useful if the file contains some meta-data about the column below the header line. For example, it is common to put units of measurement below the column names.

Another application could be processing a large text file in chunks, which again can be covered nicely by skip_rows/max_rows.

I cannot think of a situation where I would need more generality such as reading every 3rd row or rows with the given numbers. Such processing is normally done after the text data is loaded into an array.

On 11/1/2014 4:41 PM, Alexander Belopolsky wrote:
I cannot think of a situation where I would need more generality such as reading every 3rd row or rows with the given numbers. Such processing is normally done after the text data is loaded into an array.
I have done this as a cheaper alternative to random selection for a quick and dirty look at large data sets. Setting `max_rows` can be very different if the data has been stored in some structured manner.

I suppose my view is something like this. We are considering adding a keyword. If we can get greater functionality at about the same cost, why not? In that case, it is not really useful to speculate about use cases. If the costs are substantially greater, then that should be stated. Cost is a good reason not to do something.

fwiw, Alan Isaac
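For the quick-look case, one way to approximate "every Nth row" without a new keyword is to thin the input with `itertools.islice` before `genfromtxt` sees it. This is a sketch under the assumption that the file has no header, comments, or blank lines, since `islice` counts physical lines rather than data rows; the sample data is made up.

```python
import io
from itertools import islice

import numpy as np

# Made-up data: nine lines of "i i*i".
text = "".join(f"{i} {i * i}\n" for i in range(9))

# Keep physical lines 0, 3, 6, ... and parse only those.
sample = np.genfromtxt(islice(io.StringIO(text), 0, None, 3))
print(sample)
```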

On 11/1/14, Alan G Isaac <alan.isaac@gmail.com> wrote:
On 11/1/2014 4:41 PM, Alexander Belopolsky wrote:
I cannot think of a situation where I would need more generality such as reading every 3rd row or rows with the given numbers. Such processing is normally done after the text data is loaded into an array.
I have done this as cheaper than random selection for a quick and dirty look at large data sets. Setting maxrows can be very different if the data has been stored in some structured manner.
I suppose my view is something like this. We are considering adding a keyword. If we can get greater functionality at about the same cost, why not? In that case, it is not really useful to speculate about use cases. If the costs are substantially greater, then that should be stated. Cost is a good reason not to do something.
`slice_rows` is a generalization of `max_rows`. It will probably take a bit more code to implement, and it will require more tests and more documentation. So the cost isn't really the same. But if it solves real problems for users, the cost may be worth it.
Warren
fwiw, Alan Isaac

On Sat, Nov 1, 2014 at 4:41 PM, Alexander Belopolsky <ndarray@mac.com> wrote:
On Sat, Nov 1, 2014 at 3:15 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Is there wider interest in such an argument to `genfromtxt`? For my use-cases, `max_rows` is sufficient. I can't recall ever needing the full generality of a slice for pulling apart a text file. Does anyone have compelling use-cases that are not handled by `max_rows`?
It is occasionally useful to be able to skip rows after the header. Maybe we should de-deprecate skip_rows and give it the meaning different from skip_header in case of names = None? For example,
genfromtxt(fname, skip_header= 3, skip_rows = 1, max_rows = 100)
would mean skip 3 lines, read column names from the 4-th, skip 5-th, process up to 100 more lines. This may be useful if the file contains some meta-data about the column below the header line. For example, it is common to put units of measurement below the column names.
Or you could just call genfromtxt() once with `max_rows=1` to skip a row. (I'm assuming that the first argument to genfromtxt is the open file object--or some other iterator--and not the filename.)
Another application could be processing a large text file in chunks, which again can be covered nicely by skip_rows/max_rows.
You don't really need `skip_rows` for this. In a previous email (and in https://github.com/numpy/numpy/pull/5103) I gave an example of using `max_rows` for handling a file that doesn't have a header. If the file has a header, you could process the file in batches using something like the following example, where the dtype determined in the first batch is used when reading the subsequent batches:

    In [12]: !cat foo.dat
    a b c
    1.0 2.0 -9.0
    3.0 4.0 -7.6
    5.0 6.0 -1.0
    7.0 8.0 -3.3
    9.0 0.0 -3.4

    In [13]: f = open("foo.dat", "r")

    In [14]: batch1 = genfromtxt(f, dtype=None, names=True, max_rows=2)

    In [15]: batch1
    Out[15]:
    array([(1.0, 2.0, -9.0), (3.0, 4.0, -7.6)],
          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

    In [16]: batch2 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

    In [17]: batch2
    Out[17]:
    array([(5.0, 6.0, -1.0), (7.0, 8.0, -3.3)],
          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

    In [18]: batch3 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

    In [19]: batch3
    Out[19]:
    array((9.0, 0.0, -3.4),
          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

Warren
I cannot think of a situation where I would need more generality such as reading every 3rd row or rows with the given numbers. Such processing is normally done after the text data is loaded into an array.

On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Or you could just call genfromtxt() once with `max_rows=1` to skip a row. (I'm assuming that the first argument to genfromtxt is the open file object--or some other iterator--and not the filename.)
That's hackish. If I have to resort to something like this, I would just call next() on the open file object or iterator. Still, the case of dtype=None, name=None is problematic. Suppose I want genfromtxt() to detect the column names from the 1-st row and data types from the 3-rd. How would you do that?

Sorry, I meant names=True, not name=None.
On Sun, Nov 2, 2014 at 2:18 PM, Alexander Belopolsky <ndarray@mac.com> wrote:
On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Or you could just call genfromtxt() once with `max_rows=1` to skip a row. (I'm assuming that the first argument to genfromtxt is the open file object--or some other iterator--and not the filename.)
That's hackish. If I have to resort to something like this, I would just call next() on the open file object or iterator.
Still, the case of dtype=None, name=None is problematic. Suppose I want genfromtxt() to detect the column names from the 1-st row and data types from the 3-rd. How would you do that?

On Sun, Nov 2, 2014 at 2:18 PM, Alexander Belopolsky <ndarray@mac.com> wrote:
On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Or you could just call genfromtxt() once with `max_rows=1` to skip a row. (I'm assuming that the first argument to genfromtxt is the open file object--or some other iterator--and not the filename.)
That's hackish. If I have to resort to something like this, I would just call next() on the open file object or iterator.
I agree, calling genfromtxt to skip a line is silly. Calling next() makes much more sense.
Still, the case of dtype=None, name=None is problematic. Suppose I want genfromtxt() to detect the column names from the 1-st row and data types from the 3-rd. How would you do that?
This may sound like a cop out, but at some point, I stop trying to make genfromtxt() handle every possible case, and instead I would write a custom header reader to handle this.
Warren
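One possible shape for such a custom header reader, for the "names on line 1, units on line 2, data from line 3" layout discussed in this exchange. It is a sketch only: `read_with_units_row` and the sample file contents are hypothetical, and it assumes whitespace-delimited columns. The names line and the units line are read by hand with `next()`, and `genfromtxt` is left to infer the column types from the remaining data.

```python
import io

import numpy as np


def read_with_units_row(fileobj):
    """Read names from line 1, set line 2 (units) aside, infer types from data."""
    names = next(fileobj).split()
    units = next(fileobj).split()
    # dtype=None asks genfromtxt to infer each column's type from the data.
    data = np.genfromtxt(fileobj, dtype=None, names=names)
    return data, dict(zip(names, units))


# Hypothetical file: column names, then a units row, then the data.
text = "time temp\ns degC\n0 20.5\n1 21.0\n2 21.5\n"
data, units = read_with_units_row(io.StringIO(text))
print(data.dtype.names, units)
```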

On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Still, the case of dtype=None, name=None is problematic. Suppose I want genfromtxt() to detect the column names from the 1-st row and data types from the 3-rd. How would you do that?
This may sound like a cop out, but at some point, I stop trying to make genfromtxt() handle every possible case, and instead I would write a custom header reader to handle this.
In the abstract, I would agree with you. It is often the case that 2-3 lines of clear Python code is better than a terse function call with half a dozen non-obvious options. Specifically, I would be against the proposed slice_rows because it is either equivalent to genfromtxt(islice(..), ..) or hard to specify.
On the other hand, skip_rows is different for two reasons:
1. It is not a new option. It is currently a deprecated alias to skip_header, so a change is expected - either removal or redefinition.
2. The intended use-case - inferring column names and type information from a file where data is separated from the column names - is hard to code explicitly. (Try it!)

On 11/2/14, Alexander Belopolsky <ndarray@mac.com> wrote:
On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
Still, the case of dtype=None, name=None is problematic. Suppose I want genfromtxt() to detect the column names from the 1-st row and data types from the 3-rd. How would you do that?
This may sound like a cop out, but at some point, I stop trying to make genfromtxt() handle every possible case, and instead I would write a custom header reader to handle this.
In the abstract, I would agree with you. It is often the case that 2-3 lines of clear Python code is better than a terse function call with half a dozen non-obvious options. Specifically, I would be against the proposed slice_rows because it is either equivalent to genfromtxt(islice(..), ..) or hard to specify.
I don't have much more to add to the API discussion at the moment, but I want to make sure one aspect is clear. (Sorry for the noise if the following is obvious.)

In an earlier email, I gave my interpretation of the semantics of `slice_rows` (and `max_rows`), which is that `genfromtxt(f, ..., slice_rows=arg)` produces the same result as `genfromtxt(f, ...)[arg]`. (The difference is that it only consumes items from the input iterator f as required by `arg`.) This isn't the same as `genfromtxt(islice(f, <slice args>), ...)`, because `genfromtxt` skips comments and blank lines. (It also skips invalid lines if the argument `invalid_raise=False` is used.) So if the input file was

-----
1 10
# A comment.
2 20

3 30
4 40
5 50
-----

then `genfromtxt(f, dtype=int, slice_rows=slice(4))` would produce `array([[1, 10], [2, 20], [3, 30], [4, 40]])`, while `genfromtxt(islice(f, 4), dtype=int)` would produce `array([[1, 10], [2, 20]])`.

That's my interpretation of how `max_rows` or `slice_rows` should work. If that is not what other folks expect, then that should also be part of the discussion.

Warren
On the other hand, skip_rows is different for two reasons:
1. It is not a new option. It is currently a deprecated alias to skip_header, so a change is expected - either removal or redefinition.
2. The intended use-case - inferring column names and type information from a file where data is separated from the column names - is hard to code explicitly. (Try it!)
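The difference between counting data rows and counting physical lines can be checked with `max_rows`, which counts data rows as described in the message above. In this sketch the sample text is an assumption: it places a comment line and a blank line among the first four lines, consistent with the result quoted above for `islice(f, 4)`.

```python
import io
from itertools import islice

import numpy as np

# Assumed layout: one comment and one blank line among the first four lines.
text = "1 10\n# A comment.\n2 20\n\n3 30\n4 40\n5 50\n"

# max_rows counts data rows; the comment and the blank line are skipped.
by_max_rows = np.genfromtxt(io.StringIO(text), dtype=int, max_rows=4)

# islice counts physical lines; only two of the first four hold data.
by_islice = np.genfromtxt(islice(io.StringIO(text), 4), dtype=int)

print(by_max_rows.tolist())
print(by_islice.tolist())
```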

On 11/1/14, Alan G Isaac <alan.isaac@gmail.com> wrote:
On 11/1/2014 3:15 PM, Warren Weckesser wrote:
I intended the result of `genfromtxt(..., max_rows=n)` to produce the same array as produced by `genfromtxt(...)[:n]`.
I find that counterintuitive. I would first honor skip_header.
Sorry for the terse explanation. I meant for `...` to indicate any other arguments, including skip_header.
Warren
Cheers, Alan

On Sat, Nov 1, 2014 at 7:31 AM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
(2) Multiple arrays in a single file:
...
The file contains multiple arrays. Each array is preceded by a line containing the number of rows and columns in that array. The `max_rows` argument would make it easy to read this file with genfromtxt:
+inf on this one -- this is a use case I've been looking for support for ages!

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
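For the multiple-arrays format from use case (2), a reader could loop over the size lines and hand each block to `genfromtxt` with `max_rows`. This is a sketch rather than anything from the PR: `read_arrays` is a hypothetical helper, the size line is parsed by hand instead of with a one-row `genfromtxt` call, and an `io.StringIO` stands in for "b.dat".

```python
import io

import numpy as np


def read_arrays(fileobj):
    """Read consecutive blocks: a 'nrows ncols' line, then that many rows."""
    arrays = []
    for header in fileobj:
        if not header.strip():
            continue  # tolerate blank lines between blocks
        nrows, ncols = (int(tok) for tok in header.split())
        # genfromtxt stops after nrows data rows, leaving the file object
        # positioned at the next size line.
        block = np.genfromtxt(fileobj, max_rows=nrows)
        arrays.append(block.reshape(nrows, ncols))
    return arrays


text = (
    "3 5\n"
    "1.0 1.5 2.1 2.5 4.8\n"
    "3.5 1.0 8.7 6.0 2.0\n"
    "4.2 0.7 4.4 5.3 2.0\n"
    "2 3\n"
    "89.1 66.3 42.1\n"
    "12.3 19.0 56.6\n"
)
A, B = read_arrays(io.StringIO(text))
print(A.shape, B.shape)
```

The explicit `reshape` keeps single-row and single-column blocks two-dimensional, which `genfromtxt` would otherwise return with fewer dimensions.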
participants (5)
- Alan G Isaac
- Alexander Belopolsky
- Chris Barker
- Jaime Fernández del Río
- Warren Weckesser