[Numpy-discussion] Add `nrows` to `genfromtxt`

Sat Nov 1 15:15:26 EDT 2014

On Sat, Nov 1, 2014 at 10:54 AM, Alan G Isaac <alan.isaac at gmail.com> wrote:

> On 11/1/2014 10:31 AM, Warren Weckesser wrote:
> > Alan's suggestion to use a slice is interesting, but I'd like to
> > see a more concrete proposal for the API.  For example, how does
> > it interact with `skip_header` and `skip_footer`?  How would one
> > use it to read a file in batches?
>
>
> I'm probably just not understanding the question, but the initial
> answer I will give is, "just like the proposal for `max_rows`".
>
> That is, skip_header and skip_footer are honored, and the remainder
> of the file is sliced. For the equivalent of say `max_rows=500`,
> one would say `slice_rows=slice(500)`.
>
> Perhaps you could provide an example illustrating the issues this
> reply overlooks.
>
> Cheers,
> Alan
>

OK, so `slice_rows=slice(n)` should behave the same as `max_rows=n`.
Here's my take on how `slice_rows` could be handled.

I intended the result of `genfromtxt(..., max_rows=n)` to produce the same
array as produced by `genfromtxt(...)[:n]`.  So a reasonable way to define
the behavior of `slice_rows` is that `genfromtxt(..., slice_rows=arg)`
returns the same array as `genfromtxt(...)[arg]`.   With that
specification, it is natural for `slice_rows` to accept any object that is
valid for indexing, e.g. `slice_rows=[0,2,3]` or `slice_rows=10`. (But that
wouldn't necessarily have to be implemented.)
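
To make the intended `max_rows` equivalence concrete, here is a small
check.  It is only a sketch: it assumes a NumPy build that includes the
`max_rows` argument from the PR, and the sample data is made up.

```python
import io

import numpy as np

# Made-up sample input: four rows, two comma-separated columns.
text = "1,2\n3,4\n5,6\n7,8\n"

# Read everything, then slice the resulting array.
full = np.genfromtxt(io.StringIO(text), delimiter=",")

# Read only the first two rows (assumes `max_rows` is available).
first = np.genfromtxt(io.StringIO(text), delimiter=",", max_rows=2)

same = np.array_equal(first, full[:2])
```

Under the proposed specification, `slice_rows=slice(2)` would be expected
to give the same array as `first`.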

The two differences between `genfromtxt(..., slice_rows=arg)` and
`genfromtxt(...)[arg]` are (1) the former is more efficient--it can simply
ignore the rows that won't be part of the final result; and (2) the former
doesn't consume the input iterator beyond what is requested by `arg`.  For
example, `slice_rows=slice(2,10,2)` would consume 10 items from the input
(or fewer, if there aren't 10 items in the input). Note that the actual
indices for that slice are [2, 4, 6, 8]; even though index 9 is not
included in the result, the corresponding item is consumed from the input
iterator.  (That's how I would interpret it, anyway.)
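
Here is a rough pure-Python sketch of that consumption rule, with no
parsing at all.  `take_rows` is a hypothetical helper, not part of NumPy
or of the PR, and it only handles slices with a non-negative `stop`.

```python
from itertools import islice


def take_rows(lines, rows):
    """Hypothetical helper: apply a non-negative slice to an iterator.

    Consumes at most rows.stop items from `lines`, even when the last
    consumed items are not part of the result.
    """
    start = rows.start or 0
    step = rows.step or 1
    if rows.stop is None or rows.stop < 0 or start < 0 or step < 1:
        raise ValueError("this sketch needs a non-negative stop")
    selected = []
    # islice(lines, stop) pulls no more than `stop` items from the iterator.
    for i, line in enumerate(islice(lines, rows.stop)):
        if i >= start and (i - start) % step == 0:
            selected.append(line)
    return selected


lines = iter(f"row {i}" for i in range(20))
picked = take_rows(lines, slice(2, 10, 2))

# Items 0 through 9 were consumed, so the iterator resumes at item 10.
leftover = next(lines)
```

Here `picked` holds rows 2, 4, 6 and 8, and the next item left in the
iterator is row 10, which is what would make batch reading predictable.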

Because the input argument to `genfromtxt` can be an arbitrary iterator,
the use of `slice_rows=slice(n)` is not compatible with the use of
`skip_footer=m`.  Handling `skip_footer=m` requires looking ahead in the
iterator to see if the end of the input is within `m` items, but in
general, looking ahead is not possible without consuming the items. (The
`max_rows` argument has the same problem.  In the current PR, a ValueError
is raised if both `skip_footer` and `max_rows` are given.)
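
The look-ahead problem can be seen in a small sketch.  `drop_footer` is a
hypothetical helper (it is not how `genfromtxt` is implemented) that
drops the last `m` items of an arbitrary iterator by buffering them.

```python
from collections import deque


def drop_footer(lines, m):
    """Hypothetical helper: yield every item of `lines` except the last m."""
    if m <= 0:
        yield from lines
        return
    pending = deque()
    for line in lines:
        pending.append(line)
        # An item is safe to emit only once m later items have been seen.
        if len(pending) > m:
            yield pending.popleft()


source = iter(range(10))
stream = drop_footer(source, 3)

# Asking for a single row already pulls m + 1 = 4 items from the source.
first = next(stream)
after_first = next(source)

all_but_footer = list(drop_footer(iter(range(10)), 3))
```

To hand back one row, the helper has to read three further items, so the
caller can no longer rely on the iterator stopping where the requested
rows end.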

Related to this is how to handle `slice_rows=slice(-3)`.   Either this is
not allowed (for the same reason that `slice_rows=slice(n), skip_footer=m`
is disallowed), or it results in the entire iterator being consumed (and it
is explained in the docstring that this is the effect of a negative `stop`
value in a slice).
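
A sketch of the second option, again with a hypothetical helper name:
resolving a negative `stop` needs the total number of items, so the whole
iterator has to be read before the slice can be applied.

```python
def take_rows_any(lines, rows):
    """Hypothetical helper: apply any slice by reading the whole iterator."""
    materialized = list(lines)  # the entire input is consumed here
    # slice.indices() turns negative or missing bounds into concrete ones.
    start, stop, step = rows.indices(len(materialized))
    return [materialized[i] for i in range(start, stop, step)]


source = iter(range(10))
picked = take_rows_any(source, slice(-3))

# Nothing is left in the iterator afterwards.
remaining = next(source, None)
```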

Is there wider interest in such an argument to `genfromtxt`?  For my
use-cases, `max_rows` is sufficient.  I can't recall ever needing the full
generality of a slice for pulling apart a text file.  Does anyone have
compelling use-cases that are not handled by `max_rows`?

Warren
