[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Mon Jul 16 02:54:38 EDT 2012

On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:

> Hello,
> I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad.
> Now, I can understand your frustration, so what about a trade-off? Once the first 'skip_header' lines have been skipped, the first line w/ a comment should be parsed for column titles (we call it 'first_commented_line'). We split it at the comment character, say, #. If there's some non-space character before the #, we keep that part of 'first_commented_line' as titles: that should work for your case. If the first non-space character is #, then what comes after it are the titles (that's Tom's case and the current default).
> I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one).
> What y'all think?
> 

That seems like an acceptable proposal --- it is consistent with current behavior but also satisfies the use-case (without another keyword, which is a bonus).

So, 

+1 from me.

-Travis
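
For concreteness, here is a minimal pure-Python sketch of the trade-off rule as Pierre words it above. It is not genfromtxt() code: the helper name `tradeoff_names` is hypothetical, and splitting titles on whitespace is an assumption.

```python
def tradeoff_names(first_commented_line, comments='#'):
    """Hypothetical helper: pick column titles from the header line.

    Text before the comment character wins if it contains anything
    other than whitespace; otherwise the text after it is used.
    """
    before, _, after = first_commented_line.partition(comments)
    if before.strip():
        # Non-space characters precede the comment: keep them as titles.
        return before.split()
    # The comment character came first: titles are what follows it.
    return after.split()


# Example 1 (Tom's case, the current default)
print(tradeoff_names("# gender age weight"))
# ['gender', 'age', 'weight']

# Example 3 (header followed by a trailing comment)
print(tradeoff_names("gender age weight # wrong column names"))
# ['gender', 'age', 'weight']
```

Read literally, this rule would still take 'here', 'is', 'a', ... as titles for Example 2 unless the two leading comment lines are skipped with skip_header=2.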

> On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto" <mail at paul.kishimoto.name> wrote:
> On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> > On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> > <mail at paul.kishimoto.name> wrote:
> > > Hello everyone,
> > >
> > >         I am a longtime NumPy user, and I just filed my first contribution to
> > > the code as pull request to fix what I felt was a bug in the behaviour
> > > of genfromtxt() https://github.com/numpy/numpy/pull/351
> > > It turns out this alters existing behaviour that some people may depend
> > > on, so I was encouraged to raise the issue on this list to see what the
> > > consensus was.
> > >
> > > This behaviour happens in the specific situation where:
> > >       * Comments are used in the file (the default comment character is
> > >         '#', which I'll use here), AND
> > >       * The kwarg names=True is given. In this case, genfromtxt() is
> > >         supposed to read an initial row containing the names of the
> > >         columns and return an array with a structured dtype.
> > >
> > > Currently, these options work with a file like (Example #1):
> > >
> > >         # gender age weight
> > >         M   21 72.100000
> > >         F   35  58.330000
> > >         M   33  21.99
> > >
> > > …but NOT with a file like (Example #2):
> > >
> > >         # here is a general file comment
> > >         # it is spread over multiple lines
> > >         gender age weight
> > >         M   21 72.100000
> > >         F   35  58.330000
> > >         M   33  21.99
> > >
> > > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> > > thinks all of the columns are strings because 'gender', 'age' and
> > > 'weight' are not numbers.
> > >
> > >         This is because genfromtxt() (after skipping a number of lines as
> > > specified in the optional kwarg skip_header) will use the *first* line
> > > it encounters to produce column names. If that line contains a comment
> > > character, genfromtxt() discards everything *up to and including* the
> > > comment character, and tries to use the content *after* the comment
> > > character as headers (Example 3):
> > >
> > >         gender age weight # wrong column names
> > >         M   21  72.100000
> > >         F   35  58.330000
> > >         M   33  21.99
> > >
> > > …the resulting column names are 'wrong', 'column' and 'names'.
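
The rule described above can be sketched in a few lines of plain Python. This is an illustration of the described behaviour only, not the genfromtxt() source; the helper name `current_names` is hypothetical and whitespace-delimited columns are assumed.

```python
def current_names(first_line, comments='#'):
    """Hypothetical helper mimicking the behaviour described above.

    If the line contains the comment character, everything up to and
    including its first occurrence is dropped, and the rest of the
    line is split into column names.
    """
    if comments in first_line:
        # Keep only the text after the first comment character.
        first_line = first_line.split(comments, 1)[1]
    return first_line.split()


print(current_names("# gender age weight"))
# ['gender', 'age', 'weight']   (Example 1: works as intended)

print(current_names("gender age weight # wrong column names"))
# ['wrong', 'column', 'names']  (Example 3: the real header is lost)
```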
> > >
> > > My proposed change was that, if the first (or any subsequent) line
> > > contains a comment character, it should be treated as an *actual
> > > comment*, and discarded along with anything that follows it on the line.
> > >
> > >         In Example 2, the result would be that the first two lines appear empty
> > > (no text before '#'), and the third line ("gender age weight") is used
> > > for column names.
> > >
> > >         In Example 3, the result would be that "gender age weight" is used for
> > > column names while "# wrong column names" is ignored.
> > >
> > > BUT!
> > >
> > >         In Example 1, the result would be that the first line appears empty,
> > > > and "M   21  72.100000" is used for column names.
> > >
> > > In other words, this change would do away with the previous behaviour
> > > where the very first commented line was (magically?) treated not as a
> > > comment but instead as column headers. This might break some existing
> > > code. On the positive side, it would allow the user to be more liberal
> > > with the format of input files (Example 4):
> > >
> > >         # here is a general file comment
> > >         # the columns in this table are
> > >         gender age weight # here is a comment on the header line
> > >         # following this line are the data
> > >         M   21  72.100000
> > >         F   35  58.330000 # here is a comment on a data line
> > >         M   33  21.99
> > >
> > > I feel that this is a better/more flexible behaviour for genfromtxt(),
> > > but—as stated—I am interested in your thoughts.
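
The proposed semantics can be sketched in pure Python as follows. This is a rough illustration, not the code in the pull request: the helper name `proposed_names` is hypothetical, and it returns plain lists of strings rather than a structured array.

```python
def proposed_names(lines, comments='#'):
    """Hypothetical helper: treat every comment as a real comment.

    The comment character and whatever follows it are discarded on
    each line, lines left empty are skipped, and the first remaining
    line supplies the column names. Returns (names, data_rows).
    """
    rows = []
    for line in lines:
        content = line.split(comments, 1)[0].strip()
        if content:
            rows.append(content.split())
    if not rows:
        return [], []
    return rows[0], rows[1:]


example_4 = """\
# here is a general file comment
# the columns in this table are
gender age weight # here is a comment on the header line
# following this line are the data
M   21  72.100000
F   35  58.330000 # here is a comment on a data line
M   33  21.99
""".splitlines()

names, data = proposed_names(example_4)
print(names)    # ['gender', 'age', 'weight']
print(data[1])  # ['F', '35', '58.330000']

# The cost: a commented header as in Example 1 is no longer recognised.
names_1, _ = proposed_names(["# gender age weight", "M   21 72.100000"])
print(names_1)  # ['M', '21', '72.100000']
```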
> > >
> > > Cheers,
> > > --
> > > Paul Natsuo Kishimoto
> > >
> > > SM candidate, Technology & Policy Program (2012)
> > > Research assistant,  http://globalchange.mit.edu
> > > https://paul.kishimoto.name      +1 617 302 6105
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at scipy.org
> > > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> > >
> >
> > Hi Paul,
> >
> > At least in astronomy, tabular files with the column definitions in the
> > first commented line are reasonably common.  This is driven in part by
> > wide use of legacy packages like supermongo etc. that don't have
> > intelligent table readers, so users document the column names as a
> > comment line.  I think making this break might be unfortunate for
> > users in astronomy.
> >
> > Dealing with commented header definitions is annoying.  Not that it
> > matters specifically for your genfromtxt() proposal, but in the
> > asciitable reader this case is handled with a particular reader class
> > that expects the first comment line to contain the column definitions:
> >
> >  http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
> >
> > Cheers,
> > Tom
> 
> Tom,
> 
> Thanks for this information. In thinking about how people would work
> around this, I figured it would be fairly easy to discard a comment
> character that occurred as the very first character in a file, e.g.:
> 
>         from io import StringIO
>         import numpy
>         raw = StringIO(open('example.txt').read()[1:])  # drop the leading '#'
>         data = numpy.genfromtxt(raw, comments='#', names=True)
> 
> …but I realize that making this change in many places would still be an
> annoyance.
> 
>         I should perhaps also add that my view of 'proper' table formats is
> partly influenced by another plotting package, namely pgfplots for LaTeX
> (http://pgfplots.sourceforge.net/ ,
> http://pgfplots.sourceforge.net/gallery.html) which uses uncommented
> headers. To the extent NumPy users are also LaTeX users, similar
> semantics could be more friendly.
> 
> Looking forward to more input from other users,
> --
> Paul Natsuo Kishimoto
> 
> SM candidate, Technology & Policy Program (2012)
> Research assistant,  http://globalchange.mit.edu
> https://paul.kishimoto.name      +1 617 302 6105
> 
