[Numpy-discussion] Add `nrows` to `genfromtxt`

Sun Nov 2 17:24:38 EST 2014

On 11/2/14, Alexander Belopolsky <ndarray at mac.com> wrote:
> On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser
> <warren.weckesser at gmail.com> wrote:
>
>>
>>> Still, the case of dtype=None, name=None is problematic.  Suppose I want
>>> genfromtxt() to detect the column names from the 1-st row and data types
>>> from the 3-rd.  How would you do that?
>>>
>>>
>>
>> This may sound like a cop out, but at some point, I stop trying to make
>> genfromtxt() handle every possible case, and instead I would write a
>> custom header reader to handle this.
>>
>
> In the abstract, I would agree with you.  It is often the case that 2-3
> lines of clear Python code is better than a terse function call with half a
> dozen non-obvious options.  Specifically, I would be against the proposed
> slice_rows because it is either equivalent to  genfromtxt(islice(..), ..)
> or hard to specify.

I don't have much more to add to the API discussion at the moment, but
I want to make sure one aspect is clear. (Sorry for the noise if the
following is obvious.)

In an earlier email, I gave my interpretation of the semantics of
`slice_rows` (and `max_rows`): `genfromtxt(f, ..., slice_rows=arg)`
produces the same result as `genfromtxt(f, ...)[arg]`, except that it
consumes only as many items from the input iterator `f` as `arg`
requires.  This isn't the same as
`genfromtxt(islice(f, <slice args>), ...)`, because `genfromtxt` skips
comments and blank lines.  (It also skips invalid lines if the
argument `invalid_raise=False` is used.)  So if the input file were

-----
 1  10
# A comment.
 2  20

 3  30
 4  40
 5  50
-----

then `genfromtxt(f, dtype=int, slice_rows=slice(4))` would produce
`array([[1, 10], [2, 20], [3, 30], [4, 40]])`, while
`genfromtxt(islice(f, 4), dtype=int)` would produce `array([[1, 10],
[2, 20]])`, because the first four lines of the file include the
comment and the blank line.
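
To make the difference concrete without relying on any proposed
keyword, here is a minimal pure-Python sketch.  The helpers
`data_lines` and `parse` are hypothetical stand-ins that I made up for
illustration (they are not part of numpy, and the real `genfromtxt`
does much more).  The only point is the order in which the filtering
and `islice` are applied.

```python
from itertools import islice

SAMPLE = """\
 1  10
# A comment.
 2  20

 3  30
 4  40
 5  50
"""


def data_lines(lines, comments="#"):
    """Lazily yield only the lines that carry data.

    Hypothetical helper: drops anything after the comment marker and
    skips lines that are then empty, roughly like genfromtxt's skipping
    of comments and blank lines.
    """
    for line in lines:
        content = line.split(comments, 1)[0].strip()
        if content:
            yield content


def parse(lines):
    """Hypothetical stand-in for the parsing step: whitespace-separated ints."""
    return [[int(token) for token in line.split()] for line in lines]


# Slice the raw lines first, then filter: analogous to
# genfromtxt(islice(f, 4), dtype=int).
raw = iter(SAMPLE.splitlines())
rows_by_line = parse(data_lines(islice(raw, 4)))

# Filter first, then slice the data rows: analogous to the proposed
# slice_rows=slice(4) semantics.
raw = iter(SAMPLE.splitlines())
rows_by_data = parse(islice(data_lines(raw), 4))

# Whatever was not consumed from the input iterator in the second case.
leftover = list(raw)

print(rows_by_line)
print(rows_by_data)
print(leftover)
```

In the second case only the final line of the input is left unread,
which is the "consumes only what `arg` requires" behavior described
above.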

That's my interpretation of how `max_rows` or `slice_rows` should
work.  If that is not what other folks expect, then that should also
be part of the discussion.

Warren

>
> On the other hand, skip_rows is different for two reasons:
>
> 1. It is not a new option.  It is currently a deprecated alias to
> skip_header, so a change is expected - either removal or redefinition.
> 2. The intended use-case - inferring column names and type information from
> a file where data is separated from the column names is hard to code
> explicitly.  (Try it!)
>