[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Paul Natsuo Kishimoto mail at paul.kishimoto.name
Mon Jul 16 17:00:40 EDT 2012


On Mon, 2012-07-16 at 21:14 +0100, Nathaniel Smith wrote:
> On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
> > Well, as `skip_header` is a number of lines, I don't really see anything
> > particularly magical about a `skip_header=-1`.
> 
> The logic here is:
> - if names=True, then genfromtxt expects the names to be given in the
> first line, and they may or may not be commented out
> - BUT, if skip_header=<some special value>, then any all-comment lines
> will be skipped before looking for names, i.e. the names are not
> expected to be commented out and comments are given their original
> meaning again.
> 
> I have no idea how one could derive this understanding by looking at
> skip_header=-1. "Ah, that -1 is a number of lines, everyone knows that
> skipping -1 lines is equivalent to toggling our expectation for
> whether the column names will appear inside a comment"? The API is
> pretty convoluted at this point and I'm not convinced we wouldn't be
> better off with adding a new argument like
> names_in_comment=False/True, but skip_header="comments" at least gives
> the reader a fighting chance...
> 
> -n

Another option is to use skip_header=True. The internal monologue
accompanying this is "Ah, do I want it to skip the header? Yes, true, I
do," with no thought needed on the number of lines involved. Pierre,
checking the type of the argument is trivial. Nathaniel, is this less
weird/magical? Anyone else?
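
For illustration only, here is a minimal sketch of how such a type check could work. It is not the code in the branch: `header_line` is a hypothetical helper, and the assumption is that `skip_header=True` means "skip all-comment lines before looking for names", while an integer keeps its meaning as a number of lines to skip.

```python
def header_line(lines, skip_header=0, comments="#"):
    """Return the line that would be parsed for column names.

    Hypothetical helper, not part of numpy.
    """
    # bool is a subclass of int in Python, so test for it first;
    # otherwise True would be read as "skip 1 line".
    if isinstance(skip_header, bool):
        skip_comments, n_skip = skip_header, 0
    else:
        skip_comments, n_skip = False, int(skip_header)

    for line in lines[n_skip:]:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if skip_comments and stripped.startswith(comments):
            continue  # comments keep their original meaning: skip them
        return stripped
    return None  # no candidate line found


lines = ["# a note", "A B C", "1 2 3"]
print(header_line(lines))                    # legacy: first line, commented or not
print(header_line(lines, skip_header=True))  # comment lines skipped first
```

The one subtlety is the order of the checks: `isinstance(True, int)` is true, so the bool test has to come before any integer handling.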

	I don't care what the value is and it's easy to change, but I'll await
some agreement on this point so I don't have to change it yet again in
response to more objections.

	Pierre, for a line "# A B C #1 #2 #3" the user gets six columns 'A',
'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for
using such messy input :) Also, if you look closely, the use of index()
you propose is equivalent to my current code, just more verbose.
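
A minimal sketch of that behaviour, assuming the names line has only its leading comment marker removed and is then split on whitespace. `split_names` is a hypothetical helper for illustration, not the code in the branch or in numpy.

```python
def split_names(line, comments="#"):
    """Split a header line into column names (hypothetical helper)."""
    line = line.strip()
    # Drop only the leading comment marker; later '#' characters are
    # left alone and end up inside the names.
    if line.startswith(comments):
        line = line[len(comments):]
    return line.split()


print(split_names("# A B C #1 #2 #3"))
# ['A', 'B', 'C', '#1', '#2', '#3']
```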

	Tom, in my branch I rewrote the documentation for the `names` kwarg in
an attempt to be more clear, but I agree a documentation example of the
non-legacy use would go a long way. I've also realized I neglected to
update the documentation for `skip_header`. I'll do these once there is
consensus on the value to use.

	If there were willingness to tolerate a backwards-incompatible change,
the resulting behaviour would be quite simple and intuitive overall, but
that's out of my hands. At the moment I'm just concerned with making an
intuitive behaviour *possible*.

Thanks everyone for your input,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name      +1 617 302 6105