
On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey <bsouthey@gmail.com> wrote:
On 03/31/2011 12:02 PM, Derek Homeier wrote:
On 31 Mar 2011, at 17:03, Bruce Southey wrote:
This is an invalid ticket because the docstring clearly states, in 3 different yet critical places, that missing values are not handled here:
"Each row in the text file must have the same number of values." "genfromtxt : Load data with missing values handled as specified." " This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values."
Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail.
OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement "Converters can also be used to provide a default value for missing data:" then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way).
The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter.
The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill-formed?
Of course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt.
Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket.
Cheers, Derek
Okay, I see that 1071 got closed, which I am fine with.
I think that your following example should be a test because the two spaces should not be removed with a tab delimiter: np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t', dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))
Make a test and we'll put it in. Chuck
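For readers following the tab-delimiter point, here is a minimal pure-Python sketch of why the two spaces and the trailing empty field in the example above can get lost. It is not NumPy's actual loadtxt code; the helper names `split_strip_all` and `split_strip_newlines` are hypothetical and only contrast stripping all whitespace with stripping just the line terminators before splitting on the tab.

```python
# Hypothetical helpers for illustration only; this is NOT loadtxt's implementation.
SAMPLE = "aa\tbb\n \t \ncc\t"  # same text as in the example above


def split_strip_all(line, delimiter="\t"):
    """Strip all surrounding whitespace, then split on the delimiter."""
    return line.strip().split(delimiter)


def split_strip_newlines(line, delimiter="\t"):
    """Strip only line terminators, then split on the delimiter."""
    return line.rstrip("\r\n").split(delimiter)


lines = SAMPLE.split("\n")
rows_all = [split_strip_all(line) for line in lines]
rows_newlines = [split_strip_newlines(line) for line in lines]

print(rows_all)       # [['aa', 'bb'], [''], ['cc']]
print(rows_newlines)  # [['aa', 'bb'], [' ', ' '], ['cc', '']]
```

With a blanket strip() the tab itself counts as whitespace, so the second and third rows collapse to a single field each; stripping only '\r\n' keeps two fields per row, which is what a two-column dtype would need.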