Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Hello everyone,
I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was.
This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype.
Currently, these options work with a file like (Example #1):
# gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…but NOT with a file like (Example #2):
# here is a general file comment # it is spread over multiple lines gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers.
This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3):
gender age weight # wrong column names M 21 72.100000 F 35 58.330000 M 33 21.99
…the resulting column names are 'wrong', 'column' and 'names'.
My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line.
In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line ("gender age weight") is used for column names.
In Example 3, the result would be that "gender age weight" is used for column names while "# wrong column names" is ignored.
BUT!
In Example 1, the result would be that the first line appears empty, and "M 21 72.100000" are used for column names.
In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4):
# here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.100000 F 35 58.330000 # here is a comment on a data line M 33 21.99
I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts.
Cheers,

On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
Hello everyone,
I am a longtime NumPy user, and I just filed my first contribution to
the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was.
This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype.
Currently, these options work with a file like (Example #1):
# gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…but NOT with a file like (Example #2):
# here is a general file comment # it is spread over multiple lines gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers.
This is because genfromtxt() (after skipping a number of lines as
specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3):
gender age weight # wrong column names M 21 72.100000 F 35 58.330000 M 33 21.99
…the resulting column names are 'wrong', 'column' and 'names'.
My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line.
In Example 2, the result would be that the first two lines appear empty
(no text before '#'), and the third line ("gender age weight") is used for column names.
In Example 3, the result would be that "gender age weight" is used for
column names while "# wrong column names" is ignored.
BUT!
In Example 1, the result would be that the first line appears empty,
and "M 21 72.100000" are used for column names.
In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4):
# here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.100000 F 35 58.330000 # here is a comment on a data line M 33 21.99
I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts.
Cheers,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi Paul,
At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy.
Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions:
http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
Cheers, Tom

On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
Hello everyone,
I am a longtime NumPy user, and I just filed my first contribution to
the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was.
This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype.
Currently, these options work with a file like (Example #1):
# gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…but NOT with a file like (Example #2):
# here is a general file comment # it is spread over multiple lines gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers.
This is because genfromtxt() (after skipping a number of lines as
specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3):
gender age weight # wrong column names M 21 72.100000 F 35 58.330000 M 33 21.99
…the resulting column names are 'wrong', 'column' and 'names'.
My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line.
In Example 2, the result would be that the first two lines appear empty
(no text before '#'), and the third line ("gender age weight") is used for column names.
In Example 3, the result would be that "gender age weight" is used for
column names while "# wrong column names" is ignored.
BUT!
In Example 1, the result would be that the first line appears empty,
and "M 21 72.100000" are used for column names.
In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4):
# here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.100000 F 35 58.330000 # here is a comment on a data line M 33 21.99
I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts.
Cheers,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi Paul,
At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy.
Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions:
http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
Cheers, Tom
Tom,
Thanks for this information. In thinking about how people would work around this, I figured it would be fairly easy to discard a comment character that occurred as the very first character in a file, e.g.:
raw = StringIO(open('example.txt').read()[1:]) data = numpy.genfromtxt(raw, comment='#', names=True)
…but I realize that making this change in many places would still be an annoyance.
I should perhaps also add that my view of 'proper' table formats is partly influenced by another plotting package, namely pgfplots for LaTeX (http://pgfplots.sourceforge.net/ , http://pgfplots.sourceforge.net/gallery.html) which uses uncommented headers. To the extent NumPy users are also LaTeX users, similar semantics could be more friendly.
Looking forward to more input from other users,

Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think? On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto" mail@paul.kishimoto.name wrote:
On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
Hello everyone,
I am a longtime NumPy user, and I just filed my first
contribution to
the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was.
This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype.
Currently, these options work with a file like (Example #1):
# gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…but NOT with a file like (Example #2):
# here is a general file comment # it is spread over multiple lines gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…genfromtxt() believes the column names are 'here', 'is', 'a', etc.,
and
thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers.
This is because genfromtxt() (after skipping a number of lines
as
specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3):
gender age weight # wrong column names M 21 72.100000 F 35 58.330000 M 33 21.99
…the resulting column names are 'wrong', 'column' and 'names'.
My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the
line.
In Example 2, the result would be that the first two lines
appear empty
(no text before '#'), and the third line ("gender age weight") is used for column names.
In Example 3, the result would be that "gender age weight" is
used for
column names while "# wrong column names" is ignored.
BUT!
In Example 1, the result would be that the first line appears
empty,
and "M 21 72.100000" are used for column names.
In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4):
# here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.100000 F 35 58.330000 # here is a comment on a data line M 33 21.99
I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts.
Cheers,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi Paul,
At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy.
Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions:
http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
Cheers, Tom
Tom,
Thanks for this information. In thinking about how people would work around this, I figured it would be fairly easy to discard a comment character that occurred as the very first character in a file, e.g.:
raw = StringIO(open('example.txt').read()[1:]) data = numpy.genfromtxt(raw, comment='#', names=True)
…but I realize that making this change in many places would still be an annoyance.
I should perhaps also add that my view of 'proper' table formats is
partly influenced by another plotting package, namely pgfplots for LaTeX (http://pgfplots.sourceforge.net/ , http://pgfplots.sourceforge.net/gallery.html) which uses uncommented headers. To the extent NumPy users are also LaTeX users, similar semantics could be more friendly.
Looking forward to more input from other users,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:
Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think?
That seems like an acceptable proposal --- it is consistent with current behavior but also satisfies the use-case (without another keyword which is a bonus).
So,
+1 from me.
-Travis
On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto" mail@paul.kishimoto.name wrote: On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
Hello everyone,
I am a longtime NumPy user, and I just filed my first contribution to
the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was.
This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype.
Currently, these options work with a file like (Example #1):
# gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…but NOT with a file like (Example #2):
# here is a general file comment # it is spread over multiple lines gender age weight M 21 72.100000 F 35 58.330000 M 33 21.99
…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers.
This is because genfromtxt() (after skipping a number of lines as
specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3):
gender age weight # wrong column names M 21 72.100000 F 35 58.330000 M 33 21.99
…the resulting column names are 'wrong', 'column' and 'names'.
My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line.
In Example 2, the result would be that the first two lines appear empty
(no text before '#'), and the third line ("gender age weight") is used for column names.
In Example 3, the result would be that "gender age weight" is used for
column names while "# wrong column names" is ignored.
BUT!
In Example 1, the result would be that the first line appears empty,
and "M 21 72.100000" are used for column names.
In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4):
# here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.100000 F 35 58.330000 # here is a comment on a data line M 33 21.99
I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts.
Cheers,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi Paul,
At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy.
Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions:
http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
Cheers, Tom
Tom,
Thanks for this information. In thinking about how people would work around this, I figured it would be fairly easy to discard a comment character that occurred as the very first character in a file, e.g.:
raw = StringIO(open('example.txt').read()[1:]) data = numpy.genfromtxt(raw, comment='#', names=True)
…but I realize that making this change in many places would still be an annoyance.
I should perhaps also add that my view of 'proper' table formats is
partly influenced by another plotting package, namely pgfplots for LaTeX (http://pgfplots.sourceforge.net/ , http://pgfplots.sourceforge.net/gallery.html) which uses uncommented headers. To the extent NumPy users are also LaTeX users, similar semantics could be more friendly.
Looking forward to more input from other users,
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Hi Pierre,
On Mon, 2012-07-16 at 01:54 -0500, Travis Oliphant wrote:
On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:
Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think?
That seems like an acceptable proposal --- it is consistent with current behavior but also satisfies the use-case (without another keyword which is a bonus).
So,
+1 from me.
-Travis
Thanks for jumping in, and for offering a compromise solution. I agree that genfromtxt() has too many kwargs—it took me several minutes of reading the docs to realize why it wasn't behaving as expected!
To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest?
---
If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; THEN apply the behaviour you describe.
The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with:
1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True)
…crucially #1 avoids the regression.
Does this seem good to everyone?
---
But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is "# here is a..." with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc.
In short, the code can't resolve the ambiguity without some extra information from the user.
On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto" mail@paul.kishimoto.name wrote: On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote: > On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto > mail@paul.kishimoto.name wrote: > > Hello everyone, > > > > I am a longtime NumPy user, and I just filed my first contribution to > > the code as pull request to fix what I felt was a bug in the behaviour > > of genfromtxt() https://github.com/numpy/numpy/pull/351 > > It turns out this alters existing behaviour that some people may depend > > on, so I was encouraged to raise the issue on this list to see what the > > consensus was. > > > > This behaviour happens in the specific situation where: > > * Comments are used in the file (the default comment character is > > '#', which I'll use here), AND > > * The kwarg names=True is given. In this case, genfromtxt() is > > supposed to read an initial row containing the names of the > > columns and return an array with a structured dtype. > > > > Currently, these options work with a file like (Example #1): > > > > # gender age weight > > M 21 72.100000 > > F 35 58.330000 > > M 33 21.99 > > > > …but NOT with a file like (Example #2): > > > > # here is a general file comment > > # it is spread over multiple lines > > gender age weight > > M 21 72.100000 > > F 35 58.330000 > > M 33 21.99 > > > > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and > > thinks all of the columns are strings because 'gender', 'age' and > > 'weight' are not numbers. > > > > This is because genfromtxt() (after skipping a number of lines as > > specified in the optional kwarg skip_header) will use the *first* line > > it encounters to produce column names. If that line contains a comment > > character, genfromtxt() discards everything *up to and including* the > > comment character, and tries to use the content *after* the comment > > character as headers (Example 3): > > > > gender age weight # wrong column names > > M 21 72.100000 > > F 35 58.330000 > > M 33 21.99 > > > > …the resulting column names are 'wrong', 'column' and 'names'. > > > > My proposed change was that, if the first (or any subsequent) line > > contains a comment character, it should be treated as an *actual > > comment*, and discarded along with anything that follows it on the line. > > > > In Example 2, the result would be that the first two lines appear empty > > (no text before '#'), and the third line ("gender age weight") is used > > for column names. > > > > In Example 3, the result would be that "gender age weight" is used for > > column names while "# wrong column names" is ignored. > > > > BUT! > > > > In Example 1, the result would be that the first line appears empty, > > and "M 21 72.100000" are used for column names. > > > > In other words, this change would do away with the previous behaviour > > where the very first commented line was (magically?) treated not as a > > comment but instead as column headers. This might break some existing > > code. On the positive side, it would allow the user to be more liberal > > with the format of input files (Example 4): > > > > # here is a general file comment > > # the columns in this table are > > gender age weight # here is a comment on the header line > > # following this line are the data > > M 21 72.100000 > > F 35 58.330000 # here is a comment on a data line > > M 33 21.99 > > > > I feel that this is a better/more flexible behaviour for genfromtxt(), > > but—as stated—I am interested in your thoughts. > > > > Cheers, > > -- > > Paul Natsuo Kishimoto > > > > SM candidate, Technology & Policy Program (2012) > > Research assistant, http://globalchange.mit.edu > > https://paul.kishimoto.name +1 617 302 6105 > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > Hi Paul, > > At least in astronomy tabular files with the column definitions in the > first commented line are reasonably common. This is driven in part by > wide use of legacy packages like supermongo etc that don't have > intelligent table readers, so users document the column names as a > comment line. I think making this break might be unfortunate for > users in astronomy. > > Dealing with commented header definitions is annoying. Not that it > matters specifically for your genfromtext() proposal, but in the > asciitable reader this case is handled with a particular reader class > that expects the first comment line to contain the column definitions: > > http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader > > Cheers, > Tom
Tom, Thanks for this information. In thinking about how people would work around this, I figured it would be fairly easy to discard a comment character that occurred as the very first character in a file, e.g.: raw = StringIO(open('example.txt').read()[1:]) data = numpy.genfromtxt(raw, comment='#', names=True) …but I realize that making this change in many places would still be an annoyance. I should perhaps also add that my view of 'proper' table formats is partly influenced by another plotting package, namely pgfplots for LaTeX (http://pgfplots.sourceforge.net/ , http://pgfplots.sourceforge.net/gallery.html) which uses uncommented headers. To the extent NumPy users are also LaTeX users, similar semantics could be more friendly. Looking forward to more input from other users, -- Paul Natsuo Kishimoto SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105

To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest?
Nope, I was just referring to some hypothetical variable name. I meant that:
first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ...
(it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry)
If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True.
In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines;
I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient.
The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with:
1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True)
…crucially #1 avoids the regression.
Does this seem good to everyone?
Sounds good w/ `skip_header=-1`
But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is "# here is a..." with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc.
In that case, you could always use `skip_header=2`
In short, the code can't resolve the ambiguity without some extra information from the user.
It's always best not to let the code guess too much anyway...
Well, no regression, and you have a nice plan. I'm for it. Anybody else?

I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351
On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest?
Nope, I was just referring to some hypothetical variable name. I meant that:
first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ...
(it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry)
If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines;
I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient.
The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone?
Sounds good w/ `skip_header=-1`
But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is "# here is a..." with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc.
In that case, you could always use `skip_header=2`
In short, the code can't resolve the ambiguity without some extra information from the user.
It's always best not to let the code guess too much anyway...
Well, no regression, and you have a nice plan. I'm for it. Anybody else?
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

On Mon, Jul 16, 2012 at 8:06 PM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351
Can we please not use weird magic values like this? This isn't C. skip_header="comments" would work just as well and be far more self-explanatory...
-n

Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`. Plus, range(-1) == [], while range("comments") raises a TypeError. And then you'd have to figure why the exception was raised.

On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevlist@gmail.com wrote:
Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`.
The logic here is: - if names=True, then genfromtext expects the names to be given in the first line, and they may or may not be commented out - BUT, if skip_header=<some special value>, then any all-comment lines will be skipped before looking for names, i.e. the names are not expected to be commented out and comments are given their original meaning again.
I have no idea how one could derive this understanding by looking at skip_header=-1. "Ah, that -1 is a number of lines, everyone knows that skipping -1 lines is equivalent to toggling our expectation for whether the column names will appear inside a comment"? The API is pretty convoluted at this point and I'm not convinced we wouldn't be better off with adding a new argument like names_in_comment=False/True, but skip_header="comments" at least gives the reader a fighting chance...
-n

On Mon, 2012-07-16 at 21:14 +0100, Nathaniel Smith wrote:
On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevlist@gmail.com wrote:
Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`.
The logic here is:
- if names=True, then genfromtext expects the names to be given in the
first line, and they may or may not be commented out
- BUT, if skip_header=<some special value>, then any all-comment lines
will be skipped before looking for names, i.e. the names are not expected to be commented out and comments are given their original meaning again.
I have no idea how one could derive this understanding by looking at skip_header=-1. "Ah, that -1 is a number of lines, everyone knows that skipping -1 lines is equivalent to toggling our expectation for whether the column names will appear inside a comment"? The API is pretty convoluted at this point and I'm not convinced we wouldn't be better off with adding a new argument like names_in_comment=False/True, but skip_header="comments" at least gives the reader a fighting chance...
-n
Another option is to use skip_header=True. The internal monologue accompanying this is "Ah, do I want it to skip the header? Yes, true, I do," with no thought needed on the number of lines involved. Pierre, checking the type of the argument is trivial. Nathaniel, is this less weird/magical? Anyone else?
I don't care what the value is and it's easy to change, but I'll await some agreement on this point so I don't have to change it yet again in response to more objections.
Pierre, for a line "# A B C #1 #2 #3" the user gets six columns 'A', 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for using such messy input :) Also, if you look closely, the use of index() you propose is equivalent to my current code, just more verbose.
Tom, in my branch I rewrote the documentation for the `names` kwarg in an attempt to be more clear, but I agree a documentation example of the non-legacy use would go a long way. I've also realized I neglected to update the documentation for `skip_header`. I'll do these once there is consensus on the value to use.
If there was willingness to tolerate a backwards-incompatible change, the resulting behaviour would be quite simple and intuitive overall, but that's out of my hands. At the moment I'm just concerned with making an intuitive behaviour *possible*.
Thanks everyone for your input,

I don't really have any deep issue with `skip_header=True`, besides not really liking having an argument whose type can vary. But that's only a matter of personal taste. And yes, we could always check the type…
Pierre, for a line "# A B C #1 #2 #3" the user gets six columns 'A', 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for using such messy input :)
OK, we're on the same page.
Also, if you look closely, the use of index() you propose is equivalent to my current code, just more verbose.
I'm not convinced by line 1353: unless you change it to asbyte(comment).join(first_line.split(comments)[1:]) you gonna lose the '#', aren't you ? With the 'index' way, we just pick the first one, as intended. But it's late and multitasking isn't really working for me now.
Tom, in my branch I rewrote the documentation for the `names` kwarg in an attempt to be more clear, but I agree a documentation example of the non-legacy use would go a long way. I've also realized I neglected to update the documentation for `skip_header`. I'll do these once there is consensus on the value to use.
Good. Don't forget to add some unit-tests too.
If there was willingness to tolerate a backwards-incompatible change, the resulting behaviour would be quite simple and intuitive overall, but that's out of my hands.
See, that't the problem here: we can't have any backwards incompatibility, lest we upset users that may not even be on the list. So that's a no-no. Nevertheless, you're raising some interesting cases, and I'm sure a consensus will be found quite soon.

On Mon, Jul 16, 2012 at 10:39 PM, Pierre GM pgmdevlist@gmail.com wrote:
I don't really have any deep issue with `skip_header=True`, besides not really liking having an argument whose type can vary. But that's only a matter of personal taste. And yes, we could always check the type…
I guess I still have a small preference for skip_header="comments" over skip_header=True, since the latter is more opaque for no purpose. Also it makes me slightly antsy since skip_header is normally an integer, and True is, in fact, just an integer with a special __repr__:
In [2]: isinstance(True, int) Out[2]: True
In [3]: True + True Out[3]: 2
Not that there are likely to be people using skip_header=True as an alias for skip_header=1, but if they were it would currently work.
Pierre, for a line "# A B C #1 #2 #3" the user gets six columns 'A', 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for using such messy input :)
OK, we're on the same page.
Also, if you look closely, the use of index() you propose is equivalent to my current code, just more verbose.
I'm not convinced by line 1353: unless you change it to asbyte(comment).join(first_line.split(comments)[1:]) you gonna lose the '#', aren't you ? With the 'index' way, we just pick the first one, as intended. But it's late and multitasking isn't really working for me now.
I think you guys are looking for .split(comments, 1).
-n

On Tuesday, July 17, 2012 at 17:47 , Nathaniel Smith wrote:
I guess I still have a small preference for skip_header="comments"
over skip_header=True, since the latter is more opaque for no purpose.
Also it makes me slightly antsy since skip_header is normally an
integer, and True is, in fact, just an integer with a special
__repr__:
Nathaniel, that's basically my problem with `skip_header=True`. I still prefer my -1 to your "comments", but whatever, personal taste again.
I'm not convinced by line 1353: unless you change it to
asbyte(comment).join(first_line.split(comments)[1:])
you gonna lose the '#', aren't you ? With the 'index' way, we just pick the
first one, as intended. But it's late and multitasking isn't really working
for me now.
I think you guys are looking for .split(comments, 1).
-n
Tadah! Thanks a lot!

On Jul 17, 2012, at 10:47 AM, Nathaniel Smith wrote:
Not that there are likely to be people using skip_header=True as an alias for skip_header=1, but if they were it would currently work.
<confession> I write messy code like that all the time. </confession>
Best,
Travis

On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto mail@paul.kishimoto.name wrote:
I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351
On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest?
Nope, I was just referring to some hypothetical variable name. I meant that:
first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ...
(it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry)
If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines;
I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient.
The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone?
Sounds good w/ `skip_header=-1`
But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is "# here is a..." with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc.
In that case, you could always use `skip_header=2`
In short, the code can't resolve the ambiguity without some extra information from the user.
It's always best not to let the code guess too much anyway...
Well, no regression, and you have a nice plan. I'm for it. Anybody else?
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
I think that the proposed solution is OK, but it does make it even trickier for the average user to predict the behavior of genfromtxt() for different situations. Perhaps as part of this pull request Paul should also update the documentation to include a section describing this behavior and usage with examples 1 to 4.
- Tom

Tom, I agree that the documentation should be updated (both the doctoring and the relevant parts of the user manual), and specific unit-tests added. Paul, that's a direct nudge ;) (I'm sure you don't mind).
I was also considering the weird case
first_line = "# A B C #1 #2 #3"
How many columns in that case ? 6 ? 3 ? So, instead of using a `split`, maybe we should just check
index=first_line.index(comment)
and take `first_line[:index]` (or `first_line[index+1:]` after depending on the case).
But then again, it's a weird case.
-- Pierre GM
On Monday, July 16, 2012 at 22:00 , Tom Aldcroft wrote:
On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto <mail@paul.kishimoto.name (mailto:mail@paul.kishimoto.name)> wrote:
I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351
On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest?
Nope, I was just referring to some hypothetical variable name. I meant that:
first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ...
(it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry)
If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True.
In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines;
I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient.
The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with:
- genfromtxt(..., names=True)
- genfromtxt(..., names=True, skip_header=True)
- genfromtxt(..., names=True)
- genfromtxt(..., names=True, skip_header=True)
…crucially #1 avoids the regression.
Does this seem good to everyone?
Sounds good w/ `skip_header=-1`
But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is "# here is a..." with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc.
In that case, you could always use `skip_header=2`
In short, the code can't resolve the ambiguity without some extra information from the user.
It's always best not to let the code guess too much anyway...
Well, no regression, and you have a nice plan. I'm for it. Anybody else?
NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105
NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion
I think that the proposed solution is OK, but it does make it even trickier for the average user to predict the behavior of genfromtxt() for different situations. Perhaps as part of this pull request Paul should also update the documentation to include a section describing this behavior and usage with examples 1 to 4.
- Tom
NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (6)
-
Nathaniel Smith
-
Paul Natsuo Kishimoto
-
Pierre GM
-
Tom Aldcroft
-
Travis Oliphant
-
Travis Vaught