[Numpy-discussion] Add `nrows` to `genfromtxt`

Warren Weckesser warren.weckesser at gmail.com
Sun Nov 2 13:56:31 EST 2014


On Sat, Nov 1, 2014 at 4:41 PM, Alexander Belopolsky <ndarray at mac.com>
wrote:

>
> On Sat, Nov 1, 2014 at 3:15 PM, Warren Weckesser <
> warren.weckesser at gmail.com> wrote:
>
>> Is there wider interest in such an argument to `genfromtxt`?  For my
>> use-cases, `max_rows` is sufficient.  I can't recall ever needing the full
>> generality of a slice for pulling apart a text file.  Does anyone have
>> compelling use-cases that are not handled by `max_rows`?
>>
>
> It is occasionally useful to be able to skip rows after the header.  Maybe
> we should de-deprecate skip_rows and give it a meaning different from
> skip_header in case of names = None?  For example,
>
> genfromtxt(fname, skip_header=3, skip_rows=1, max_rows=100)
>
> would mean skip 3 lines, read column names from the 4th, skip the 5th, and
> process up to 100 more lines.  This may be useful if the file contains some
> meta-data about the columns below the header line.  For example, it is
> common to put units of measurement below the column names.
>


Or you could just call genfromtxt() once with `max_rows=1` to skip a row.
(I'm assuming that the first argument to genfromtxt is the open file
object--or some other iterator--and not the filename.)
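
A minimal sketch of one way to read that suggestion, assuming a NumPy
version in which `max_rows` is available.  The file contents, the column
names and the variable names below are made up for illustration.  The
first call consumes the column names plus the units row, and the second
call reads the numeric data from the same open file object:

```python
import io

import numpy as np

# Hypothetical file: 3 preamble lines, column names, a units row, then data.
text = """\
instrument log
run 42
calibrated
time  temp   pressure
s     K      Pa
0.0   300.0  101325.0
1.0   301.5  101300.0
2.0   303.0  101280.0
"""

f = io.StringIO(text)  # stands in for an open file object

# First call: skip the preamble, take the names from the next line, and
# read exactly one more row (the units), which is parsed as strings.
hdr = np.genfromtxt(f, skip_header=3, names=True, dtype=None,
                    max_rows=1, encoding="utf-8")
units = hdr.item()  # single record -> plain tuple of strings

# Second call: the file position is now just past the units row, so the
# remaining lines are pure data.  Reuse the names found above.
data = np.genfromtxt(f, names=hdr.dtype.names, dtype=None,
                     max_rows=100, encoding="utf-8")
```

Here `units` keeps the skipped row in case it is needed later, and `data`
is a structured array whose fields are named after the header line.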



>
> Another application could be processing a large text file in chunks, which
> again can be covered nicely by  skip_rows/max_rows.
>


You don't really need `skip_rows` for this.  In a previous email (and in
https://github.com/numpy/numpy/pull/5103) I gave an example of using
`max_rows` for handling a file that doesn't have a header.  If the file has
a header, you could process the file in batches using something like the
following example, where the dtype determined in the first batch is used
when reading the subsequent batches:

In [12]: !cat foo.dat
  a    b     c
1.0  2.0  -9.0
3.0  4.0  -7.6
5.0  6.0  -1.0
7.0  8.0  -3.3
9.0  0.0  -3.4

In [13]: f = open("foo.dat", "r")

In [14]: batch1 = genfromtxt(f, dtype=None, names=True, max_rows=2)

In [15]: batch1
Out[15]:
array([(1.0, 2.0, -9.0), (3.0, 4.0, -7.6)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

In [16]: batch2 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

In [17]: batch2
Out[17]:
array([(5.0, 6.0, -1.0), (7.0, 8.0, -3.3)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

In [18]: batch3 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

In [19]: batch3
Out[19]:
array((9.0, 0.0, -3.4),
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
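
The same pattern can be wrapped in a loop.  The following is only a
sketch, not part of the pull request: `read_batches` and `batch_rows` are
hypothetical names, and it assumes the file has a header line followed by
at least one data row.  `np.atleast_1d` is used because a batch with a
single row comes back as a 0-d array, as in `batch3` above:

```python
import io
import warnings

import numpy as np


def read_batches(f, batch_rows):
    """Yield structured arrays of at most `batch_rows` rows from open file `f`."""
    # First batch: take the field names from the header line and let
    # genfromtxt infer the column types.
    batch = np.atleast_1d(
        np.genfromtxt(f, dtype=None, names=True, max_rows=batch_rows))
    dtype = batch.dtype
    while batch.size > 0:
        yield batch
        if batch.size < batch_rows:
            break  # short batch: the file is exhausted
        with warnings.catch_warnings():
            # Reading from an exhausted file makes genfromtxt warn about
            # empty input; that case only signals the end of the data here.
            warnings.simplefilter("ignore")
            # Later batches reuse the dtype determined in the first batch.
            batch = np.atleast_1d(
                np.genfromtxt(f, dtype=dtype, max_rows=batch_rows))


# Same contents as foo.dat above, held in memory for the example.
text = """\
  a    b     c
1.0  2.0  -9.0
3.0  4.0  -7.6
5.0  6.0  -1.0
7.0  8.0  -3.3
9.0  0.0  -3.4
"""

batches = list(read_batches(io.StringIO(text), batch_rows=2))
sizes = [b.size for b in batches]
all_rows = np.concatenate(batches)
```

Each batch can be processed and discarded as it is produced; the
concatenation at the end is only there to show that no rows are lost.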



Warren





> I cannot think of a situation where I would need more generality such as
> reading every 3rd row or rows with the given numbers.  Such processing is
> normally done after the text data is loaded into an array.
>