np.savetxt: apply patch in enhancement ticket 1079 to add headers?
Hi all, I am assuming that this is ok to request via the list... Could we discuss or could someone apply the patch in enhancement ticket 1079? http://projects.scipy.org/numpy/ticket/1079 I needed this functionality recently, and this is a quick and easy fix that may have been overlooked. There is also another enhancement request about this here: http://projects.scipy.org/numpy/ticket/1236 The only thing that I can think of that might need to be added is a test to see that the header length is the same as the number of columns, but really that might just be up to the user to supply the right headers. It might also be nice to have a header = True, that uses the field names for a structured array, but I can live without that. Cheers, Skipper
Hi, +1; we have the same problem quite frequently. Our current solution looks similar to what has been proposed in ticket 1079, and we wonder why a solution has not yet found its way into the official release of numpy. We can, however, imagine a slightly different implementation and would like to hear the community's opinion on it. If the header is given as a plain string (as envisaged in ticket 1079), the user has to take care of the correct formatting; in particular, the user has to supply the comment character(s) and the newline formatting. This might be counterintuitive, because many users will at first try to supply their header(s) without specifying those formatting characters. The result will be a file that is not readable with numpy.loadtxt, and the error might not be detected right away. As numpy.loadtxt has a default comment character ('#'), the same could be implemented for numpy.savetxt. In this case, numpy.savetxt would get two additional keywords (e.g. header and comment(character)), which bloats the interface but potentially provides more safety. Cheers, Stefan & Christian
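A minimal sketch of the safety argument above, assuming a hypothetical helper named `comment_header` (it is not part of NumPy, and the keyword names are only illustrative). The helper adds the comment character and the line ending itself, so the caller supplies only the header text:

```python
def comment_header(lines, comment_character="#", newline="\n"):
    """Prefix every header line with the comment character.

    Hypothetical helper for illustration; not a NumPy function.
    """
    if isinstance(lines, str):
        # Accept a single (possibly multi-line) string as well as a list.
        lines = lines.splitlines()
    return "".join(f"{comment_character} {line}{newline}" for line in lines)


print(comment_header(["Now comes the data", "column1 [kg] column2 [apple]"]), end="")
```

Because every header line starts with the comment character, a reader that skips comment lines never sees the header as data.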
On Tue, Jun 1, 2010 at 1:05 PM, Stefan <stefan.czesla@hs.uni-hamburg.de> wrote:
Hi,
And here I was thinking no one was listening so long ago.
+1; we have the same problem quite frequently. Our current solution looks similar to what has been proposed in ticket 1079, and we wonder why a solution has not yet found its way into the official release of numpy.
We can, however, imagine a slightly different implementation and would like to hear the community's opinion on it.
If the header is given as a plain string (as envisaged in ticket 1079), the user has to take care of the correct formatting; in particular, the user has to supply the comment character(s) and the newline formatting. This might be counterintuitive, because many users will at first try to supply their header(s) without specifying those formatting characters. The result will be a file that is not readable with numpy.loadtxt, and the error might not be detected right away.
I'm not sure I understand why I would want to specify a comment character for writing a csv file (unless of course I had some comments to add). Also note that since that patch was written, savetxt takes a user supplied newline keyword, so you can just append that to the header string.
As numpy.loadtxt has a default comment character ('#'), the same may be implemented for numpy.savetxt. In this case, numpy.savetxt would get two additional keywords (e.g. header, comment(character)), which bloats the interface, but potentially provides more safety.
FWIW, I ended up rolling my own using the most recent pre-Python 3 changes for savetxt that accepts a list of names instead of one string or if the provided array has the attribute dtype.names (non-nested rec or structured arrays) it uses those. Whatever is done I think the support for structured arrays is nice, and I think having this functionality is a no-brainer. I need it quite often. Skipper
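The name-resolution rule described above (an explicit list of names wins, otherwise fall back to `dtype.names`) could look roughly like the following. This is a sketch, not Skipper's actual code; `resolve_names` and `header_line` are hypothetical names, and a small stand-in object replaces a real structured array so the example runs without NumPy:

```python
from types import SimpleNamespace


def resolve_names(arr, names=None):
    """Return column names from an explicit list or from arr.dtype.names."""
    if names is None:
        # Structured and record arrays expose their field names as
        # dtype.names; for plain arrays dtype.names is None.
        names = getattr(getattr(arr, "dtype", None), "names", None)
    if names is None:
        return None
    return list(names)


def header_line(names, delimiter=" "):
    """Join the column names with the same delimiter used for the data."""
    return delimiter.join(names)


# Stand-in for a structured array with two fields (assumption for the demo).
fake = SimpleNamespace(dtype=SimpleNamespace(names=("mass", "count")))
print(header_line(resolve_names(fake), delimiter=","))
```

Using the data delimiter for the header line keeps the column count of the header consistent with the rows that follow.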
If the header is given as a plain string (as envisaged in ticket 1079), the user has to take care of the correct formatting; in particular, the user has to supply the comment character(s) and the newline formatting. This might be counterintuitive, because many users will at first try to supply their header(s) without specifying those formatting characters. The result will be a file that is not readable with numpy.loadtxt, and the error might not be detected right away.
I'm not sure I understand why I would want to specify a comment character for writing a csv file (unless of course I had some comments to add).
We are possibly talking about different things. In our approach of using numpy.savetxt, comments (preceding the actual data) and a header are essentially the same, as in the following example. Basically, we want to add some lines of additional information at the top of the file written with numpy.savetxt, and be able to recover the data with numpy.loadtxt (for which the 'header' would then be irrelevant, which may not be your intention, or is it?).

#Now comes the data
#column1 [kg] column2 [apple]
1 2
3 5
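To illustrate why such a file stays readable, here is a simplified stand-in for a comment-aware reader. It is not numpy.loadtxt itself; `read_rows` is a hypothetical name, and the two-row layout of the data ("1 2" and "3 5") is an assumption based on the two column names in the example:

```python
def read_rows(text, comments="#"):
    """Parse whitespace-separated numbers, ignoring comment text."""
    rows = []
    for line in text.splitlines():
        # Drop everything from the comment marker to the end of the line.
        line = line.split(comments, 1)[0].strip()
        if line:
            rows.append([float(token) for token in line.split()])
    return rows


example = (
    "#Now comes the data\n"
    "#column1 [kg] column2 [apple]\n"
    "1 2\n"
    "3 5\n"
)
print(read_rows(example))
```

The two header lines vanish during parsing, so no skiprows-style adjustment is needed to recover the data.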
Also note that since that patch was written, savetxt takes a user supplied newline keyword, so you can just append that to the header string.
True, we were not aware of this, but this does not help much for the comment/header.
As numpy.loadtxt has a default comment character ('#'), the same may be implemented for numpy.savetxt. In this case, numpy.savetxt would get two additional keywords (e.g. header, comment(character)), which bloats the interface, but potentially provides more safety.
FWIW, I ended up rolling my own using the most recent pre-Python 3 changes for savetxt that accepts a list of names instead of one string or if the provided array has the attribute dtype.names (non-nested rec or structured arrays) it uses those. Whatever is done I think the support for structured arrays is nice, and I think having this functionality is a no-brainer. I need it quite often.
Although we have not been using record arrays too often, we see their advantages and agree that it should be possible to use them as you described. We also thought about a solution using the __str__ method of the 'header object'. In this vein, an arbitrary header class (including a plain string) providing an __str__ member may be handed to numpy.savetxt, which can use it to write the header.
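The __str__ idea can be sketched as follows. `ColumnHeader` and `write_header` are hypothetical names chosen for illustration; the only thing the writer relies on is that `str(header)` yields the text to emit, so a plain string works as well:

```python
import io


class ColumnHeader:
    """Hypothetical header object that renders itself via __str__."""

    def __init__(self, names, units=None, comment_character="#"):
        self.names = list(names)
        self.units = list(units) if units is not None else [None] * len(self.names)
        if len(self.units) != len(self.names):
            raise ValueError("names and units must have the same length")
        self.comment_character = comment_character

    def __str__(self):
        cols = [n if u is None else f"{n} [{u}]"
                for n, u in zip(self.names, self.units)]
        return f"{self.comment_character}{' '.join(cols)}"


def write_header(fh, header, newline="\n"):
    """Write any object with a usable __str__ as one header line."""
    fh.write(str(header) + newline)


buf = io.StringIO()
write_header(buf, ColumnHeader(["column1", "column2"], ["kg", "apple"]))
write_header(buf, "#a plain string works too")
print(buf.getvalue(), end="")
```

The design keeps the writer ignorant of header structure; all formatting decisions live in the header class.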
Skipper
On 06/02/2010 06:21 AM, Stefan wrote:
Not that I am complaining; rather, I am trying to understand what is expected to happen. Under the patch, it is very much user beware. The header argument can be anything or nothing. There is no check of the contents, or of whether the delimiter used is the same as in the rest of the output. Further, with the newline option there is no guarantee that the lines in the header will have the same line endings as the rest of the file. So what should a user be allowed to use as a header? You could write a whole program there, or an explanation of the following output, which is very appealing. You could force a list of strings so that you print out newline.join(header); okay, not quite, because it should include the comment argument. Should savetxt be restricted to something that loadtxt can read? This is potentially problematic if you want a header line, although it could return the number of header lines. [savetxt should also be updated to allow bz2, as loadtxt handles those now; not that I have used it]
Also note that since that patch was written, savetxt takes a user supplied newline keyword, so you can just append that to the header string.
True, we were not aware of this, but this does not help much for the comment/header.
Entered ~3 months ago: http://projects.scipy.org/numpy/changeset/8180 Should this be forced to check for valid options for newlines? Otherwise, from 'np.savetxt('junk.text', [1,2,3,4,5], newline='what')' you get:

1.000000000000000000e+00what2.000000000000000000e+00what3.000000000000000000e+00what4.000000000000000000e+00what5.000000000000000000e+00what

which is not going to be read back by loadtxt.
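The effect, and one possible validation, can be sketched in plain Python. `format_rows` only imitates the "one formatted value, then newline" pattern and is not savetxt's real implementation; `check_newline` and its list of accepted line endings are assumptions made for this example:

```python
def format_rows(values, fmt="%.18e", newline="\n"):
    """Imitate writing one formatted value per row, each followed by newline."""
    return "".join(fmt % value + newline for value in values)


def check_newline(newline):
    """Reject line terminators that a line-based reader cannot split on."""
    allowed = ("\n", "\r\n", "\r")
    if newline not in allowed:
        raise ValueError(f"unsupported newline: {newline!r}")
    return newline


# Everything ends up on a single physical line when newline is arbitrary text.
print(format_rows([1, 2], newline="what"))
```

Whether such a check belongs in the library is exactly the open question raised here; an arbitrary terminator may still be useful for output that is never meant to be read back.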
It would be nice if savetxt used the dtype of the input to get a header and format by default, unless overridden by the user. Bruce
So let us briefly summarize what's on the table. It appears to us that there are basically three open issues:

(1) a csv-like header for files written by savetxt (first line contains column names)
(2) comments (introduced by a comment character, e.g. '#') at the beginning of the file (preceding the data)
(3) the role of the 'newline' option

As was noted, the patch (ticket 1079) makes it possible to write both a csv-like header (1) and comment line(s) introduced by a comment character, e.g. '#' (2). Nonetheless, this solution is quite unsatisfactory in our opinion, because it may be error-prone, as the user is in charge of the entire formatting. Despite this, we think that it should be up to the user what amount of information is to be put at the top of the file, but the format should be checked as far as possible.

Using either a string or a list/tuple of strings, as proposed by Bruce, seems to be a reasonable way to implement the desired functionality. Maybe two individual keywords ('header' and 'comment') should exist to distinguish whether the user requests case (1) or (2). As for loadtxt, the default comment character should be '#', but it may be changed by the user.

We think that savetxt should not be restricted to output that can be read by loadtxt. It should, however, be possible to add comments to the output file so that it remains readable by loadtxt (without tweaking it, e.g. with the skiprows keyword).

We agree that the newline keyword may cause inconsistencies in the file (if ticket 1079 were applied), and possibly strange behavior such as when newline='what' is specified. Yet this question does not only concern the header/comments.

Stefan & Christian
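A rough sketch of how the two keywords might be kept apart. `build_preamble` is a hypothetical function, not the patch attached to the ticket, and writing the comment lines before the column-name line when both are given is a choice made only for this example:

```python
def build_preamble(header=None, comment=None, comment_character="#",
                   delimiter=" ", newline="\n"):
    """Return the text to write before the data rows."""
    parts = []
    if comment is not None:
        # Case (2): free-form lines, each protected by the comment character.
        lines = [comment] if isinstance(comment, str) else list(comment)
        parts.extend(f"{comment_character} {line}" for line in lines)
    if header is not None:
        # Case (1): csv-like column names, joined with the data delimiter.
        names = [header] if isinstance(header, str) else list(header)
        parts.append(delimiter.join(names))
    return "".join(part + newline for part in parts)


print(build_preamble(header=["mass", "count"], comment="written by savetxt",
                     delimiter=","), end="")
```

Only the comment lines are safe for a comment-skipping reader; the unprotected column-name line is the part that would still require skipping a row when reading the file back.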
On 06/02/2010 12:14 PM, Stefan wrote:
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
I am in agreement with what you suggest, so post a patch. :-) Some of what I suggested was overthinking what can really be done while keeping the function relatively simple and easy to use. My wish list would be:
1) If a header is added, that it allows names from structured/record arrays to be used, and perhaps autogenerated names (such as var1, var2, ..., varn).
2) That the dtype of the array_like input be used for the fmt when fmt is not provided.
Bruce
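Both wish-list items can be sketched without touching savetxt. The helper names and the particular mapping from dtype kind codes ('b', 'i', 'u', 'f', 'U', as reported by a NumPy dtype's `kind` attribute) to format strings are assumptions for illustration, not an agreed design:

```python
def auto_names(ncols, prefix="var"):
    """Generate var1, var2, ..., varn for n columns."""
    return [f"{prefix}{i}" for i in range(1, ncols + 1)]


# Illustrative mapping from dtype kind codes to printf-style formats.
_FMT_BY_KIND = {
    "b": "%d",     # boolean
    "i": "%d",     # signed integer
    "u": "%d",     # unsigned integer
    "f": "%.18e",  # floating point
    "U": "%s",     # unicode string
}


def fmt_for_kind(kind, default="%.18e"):
    """Pick a format from the dtype kind, falling back to a default."""
    return _FMT_BY_KIND.get(kind, default)


print(auto_names(3), fmt_for_kind("i"), fmt_for_kind("f"))
```

With a real array, the kind code would come from `arr.dtype.kind`, and an explicit fmt argument from the user would take precedence over the inferred one.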
On Wed, Jun 2, 2010 at 1:14 PM, Stefan <stefan.czesla@hs.uni-hamburg.de> wrote:
Using either a string or a list/tuple of strings, as proposed by Bruce, seems to be a reasonable way to implement the desired functionality. Maybe two individual keywords ('header' and 'comment') should exist to distinguish whether the user requests case (1) or (2). As for loadtxt, the default comment character should be '#', but it may be changed by the user.
We think that savetxt should not be restricted to output that can be read by loadtxt. It should, however, be possible to add comments to the output file so that it remains readable by loadtxt (without tweaking it, e.g. with the skiprows keyword).
Thanks. This does clear up my confusion and I think having both a header and a comments keyword makes sense. For the form, as I said, I went with a list of strings, as I encounter this more often than one string, but in the end it's all the same to me. Glad this is getting some attention.
Dear all, as a consequence of our discussion, we developed a patch (attached to ticket 1079), which implements some of the features discussed here. We concentrated on comments and the header. Please have a look at the patch. We are looking forward to hearing your opinions and suggestions, and whether you see any problems that could prevent it from entering the official release. We agree with Bruce that the format string should be inferred from the data type of the array. Yet we believe that this point should be addressed in a different patch focusing on that topic. We also noted that there is no error checking when an array of dimension larger than 2 is handed to np.savetxt, which could be implemented easily. Stefan & Christian
On Fri, Jun 4, 2010 at 11:49 AM, Stefan <stefan.czesla@hs.uni-hamburg.de> wrote:
Link: http://projects.scipy.org/numpy/ticket/1079 One comment. Maybe you can add in the notes that the comment keyword can be used to write a header and still preserve compatibility with loadtxt. This wasn't obvious to me at first, though maybe that's just me. Other than that I think it looks like a good first effort towards making this a better function and I appreciate the attention here. Skipper
On 06/04/2010 11:03 AM, Skipper Seabold wrote:
Hi, For the sake of similarity to loadtxt keywords (because loadtxt has them and changing those is harder than adding new ones to savetxt): 1) 'comment_character' should be 'comments' 2) instead of 'comment' perhaps use 'preamble' Thanks for doing the patch so quickly! Bruce
Hi all, dear Bruce and Skipper, we very much appreciate your feedback. In response to Skipper's comment we added a paragraph to the notes section and also tried to indicate the purpose of the keywords more precisely in the parameter section. The keyword renaming suggested by Bruce led to some internal discussion here. We were also not 100% satisfied with the 'comments-comment_character' solution proposed in the first patch, and we see the conflict with loadtxt. Yet the combination 'Preamble-Comments' also appears somewhat awkward, because both seem to indicate the same thing, at least in our opinion. We appreciate Bruce's suggestion to call the keyword Preamble, because it expresses its purpose much more clearly than 'Comments' did. For the same reason, we decided to stay with 'comment_character' instead of 'Comments'. For the sake of clarity, this solution sacrifices full compatibility with np.loadtxt, but it does not create a conflict either. An adapted patch is available via ticket 1079 at: http://projects.scipy.org/numpy/ticket/1079 Christian & Stefan
participants (3)
- Bruce Southey
- Skipper Seabold
- Stefan