[Numpy-discussion] genfromtxt documentation : review needed

Fri Oct 16 08:29:29 EDT 2009

On Thu, Oct 15, 2009 at 7:08 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
> All,
> Here's a first draft for the documentation of np.genfromtxt.
> It took me longer than I thought, but that way I uncovered and fixed some
> bugs.
> Please send me your comments/reviews/etc
> I count especially on our documentation specialist to let me know where to
> put it.
> Thx in advance
> P.
>

Great work!  I am especially glad to see the better documentation on
missing values, as I didn't fully understand how to do this.  A few
small comments and a small attached diff with a few nitpicking
grammatical changes and some of what's proposed below.

On the actual function, I am wondering whether white space shouldn't be
stripped by default, or at least when we have fixed-width columns.  I
ran into a problem recently where I was reading in a lot of strings
that were in a fixed-width format, and my 4 GB of memory was soon
consumed.  I also can't think of a case where I'd ever care about
leading or trailing white space.
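
To make the white space point concrete, here is a minimal sketch.  It
assumes a NumPy version whose genfromtxt accepts the autostrip keyword
and a Python 3 interpreter (hence io.StringIO); the sample data is made
up for illustration.

```python
from io import StringIO

import numpy as np

# Made-up sample: the first two columns carry padding around the values.
data = "abc , def ,x\nghi ,  jk ,y"

# Default behaviour: the padding inside each field is kept.
raw = np.genfromtxt(StringIO(data), delimiter=",", dtype=str)

# With autostrip=True each field is stripped before it is stored.
stripped = np.genfromtxt(StringIO(data), delimiter=",", dtype=str,
                         autostrip=True)

print(raw.tolist())
print(stripped.tolist())
```

The stripped array only needs as many characters per item as the
longest value, not the longest padded field, which is where the memory
goes with wide fixed-width columns.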

I always get confused going back and forth between zero-indexed and
one-indexed counting, which might not be a good enough reason to worry
about this, but it might be helpful to say explicitly that skip_header
is a count of lines to skip rather than a zero-based index, so
skip_header=1 skips the first line.  Neither 0 nor a negative value
raises an exception; both simply skip nothing.

# Python 2 import; on Python 3 use "from io import StringIO" instead
from StringIO import StringIO
import numpy as np

data = "junk1,junk2,junk3\n1.2,1.5,1"
d = np.genfromtxt(StringIO(data), delimiter=",", skip_header=0)

In [5]: d
Out[5]:
array([[ NaN,  NaN,  NaN],
       [ 1.2,  1.5,  1. ]])

d = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1)

In [7]: d
Out[7]: array([ 1.2,  1.5,  1. ])

d = np.genfromtxt(StringIO(data), delimiter=",", skip_header=-1)

In [9]: d
Out[9]:
array([[ NaN,  NaN,  NaN],
       [ 1.2,  1.5,  1. ]])

Also, I don't know if this is even something the io code should worry
about, but recarray field names can't start with a number if attribute
lookup is to work.  I thought I would bring it up anyway, since I ran
across this recently.

data = "1var1,var2,var3\n1.2,1.5,1"
d = np.recfromtxt(StringIO(data), dtype=float, delimiter=",", names=True)

In [36]: d
Out[36]:
rec.array((1.2, 1.5, 1.0),
      dtype=[('1var1', '<f8'), ('var2', '<f8'), ('var3', '<f8')])

In [37]: d.1var1
------------------------------------------------------------
   File "<ipython console>", line 1
     d.1var1
       ^
SyntaxError: invalid syntax

In [38]: d.var2
Out[38]: array(1.5)

In [39]: d['1var1']
Out[39]: array(1.2)
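
One possible workaround on the user side is to clean the header names
before handing them to the reader.  The helper below is purely
hypothetical (attribute_safe_name and its "f_" prefix are my own
invention, not part of NumPy) and uses only the standard library; it is
a sketch of the idea, not a proposal for how genfromtxt should do it.

```python
import keyword


def attribute_safe_name(name, prefix="f_"):
    """Return a version of `name` usable for attribute lookup.

    Hypothetical helper, not a NumPy function.
    """
    # Already a valid, non-reserved Python identifier: keep it unchanged.
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Replace anything that is not an ASCII letter, digit or underscore.
    cleaned = "".join(
        ch if (ch.isascii() and ch.isalnum()) or ch == "_" else "_"
        for ch in name
    )
    # Identifiers may not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = prefix + cleaned
    # Avoid reserved words such as "class".
    if keyword.iskeyword(cleaned):
        cleaned += "_"
    return cleaned


header = "1var1,var2,var3"
names = [attribute_safe_name(n) for n in header.split(",")]
print(names)
```

The resulting list could then be passed as names= while skipping the
header line, so that every field is reachable as an attribute.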

I didn't know about being able to specify the dtype as a dict.  That
might be handy.  Is there any way to cross-link to the dtype
documentation in rst?  I can't remember.  That might be helpful to
have.
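
For anyone else who missed it, this is roughly what the dict form
looks like.  A minimal sketch, assuming Python 3 and the standard
'names'/'formats' dtype dictionary; the field names "idx" and "val" and
the data are made up.

```python
from io import StringIO

import numpy as np

data = "1,2.5\n3,4.5"

# dtype given as a dict with 'names' and 'formats' keys (made-up fields).
spec = {"names": ("idx", "val"), "formats": ("i4", "f8")}

arr = np.genfromtxt(StringIO(data), delimiter=",", dtype=spec)

print(arr.dtype.names)
print(arr["idx"], arr["val"])
```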

I never did figure out what the loose keyword did, but I guess it's
not that important to me if I've never needed it.

Cheers,

Skipper
-------------- next part --------------
57c57
< By default, :func:`genfromtxt` assumes ``delimiter=None``, meaning that the line is splitted along white-spaces (including tabs) and that consecutive white-spaces are considered as a single white-space.
---
> By default, :func:`genfromtxt` assumes ``delimiter=None``, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
76c76
< By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or tailing white spaces.
---
> By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces.
129c129
< The values of this argument must be an integer which corresponds to the number of lines to skip at the beginning of the file, before any other action is performed.
---
> The values of this argument must be an integer which corresponds to the number of lines to skip at the beginning of the file, before any other action is performed.  Note that this is not zero-indexed so that the first line is 1.
147c147
< Acceptable values for the argument are a single integer or a sequence of integers corresponding to the indices of the columns to import.
---
> An acceptable value for the argument is a single integer or a sequence of integers corresponding to the indices of the columns to import.
195c195
< This behavior may be changed by modifying the default mapper of the :class:`~numpi.lib._iotools.StringConverter` class
---
> This behavior may be changed by modifying the default mapper of the :class:`~numpy.lib._iotools.StringConverter` class
343c343
< .. However, user-defined converters may rapidly become cumbersome to manage when
---
> .. However, user-defined converters may rapidly become cumbersome to manage.
389c389
<       Each key can be a column index or a column name, and the corresponding value should eb a single object.
---
>       Each key can be a column index or a column name, and the corresponding value should be a single object.