
On 04/04/2011 11:20 AM, Charles R Harris wrote:
On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey <bsouthey@gmail.com> wrote:
On 03/31/2011 12:02 PM, Derek Homeier wrote:
> On 31 Mar 2011, at 17:03, Bruce Southey wrote:
>
>> This is an invalid ticket because the docstring clearly states that in
>> 3 different, yet critical places, that missing values are not handled
>> here:
>>
>> "Each row in the text file must have the same number of values."
>> "genfromtxt : Load data with missing values handled as specified."
>> " This function aims to be a fast reader for simply formatted
>> files. The `genfromtxt` function provides more sophisticated
>> handling of, e.g., lines with missing values."
>>
>> Really I am trying to separate the usage of loadtxt and genfromtxt to
>> avoid unnecessary duplication and confusion. Part of this is
>> historical because loadtxt was added in 2007 and genfromtxt was added
>> in 2009. So really certain features of loadtxt have been 'kept' for
>> backwards compatibility purposes yet these features can be 'abused' to
>> handle missing data. But I really consider that any missing values
>> should cause loadtxt to fail.
>>
> OK, I was not aware of the design issues of loadtxt vs. genfromtxt -
> you could probably say also for historical reasons since I have not
> used genfromtxt much so far.
> Anyway the docstring statement "Converters can also be used to
> provide a default value for missing data:"
> then appears quite misleading, or an invitation to abuse, if you will.
> This should better be removed from the documentation then, or users
> explicitly discouraged from using converters instead of genfromtxt
> (I don't see how you could completely prevent using converters in
> this way).
>
>> The patch is incorrect because it should not include a space in the
>> split() as indicated in the comment by the original reporter. Of
> The split('\r\n') alone caused test_dtype_with_object(self) to fail,
> probably because it relies on stripping the blanks. But maybe the
> test is ill-formed?
>
>> course a corrected patch alone still is not sufficient to address the
>> problem without the user providing the correct converter. Also you
>> start to run into problems with multiple delimiters (such as one space
>> versus two spaces) so you start down the path to add all the features
>> that duplicate genfromtxt.
> Given that genfromtxt provides that functionality more conveniently,
> I agree again users should be encouraged to use this instead of
> converters.
> But the actual tab-problem causes in fact an issue not related to
> missing values at all (well, depending on what you call a missing
> value).
> I am describing an example on the ticket.
>
> Cheers,
> Derek
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

Okay I see that 1071 got closed which I am fine with.
I think that your following example should be a test because the two spaces should not be removed with a tab delimiter:

    np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
               dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))
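
To illustrate why the spaces matter with a tab delimiter, here is a minimal pure-Python sketch. It is not loadtxt's actual code: the two helper functions are hypothetical, and the use of rstrip("\r\n") as the "strip only line endings" variant is an assumption based on the patch discussion above. It compares stripping all surrounding whitespace before splitting with stripping only the line ending.

```python
# Hypothetical helpers for illustration only; not taken from numpy.loadtxt.

def split_after_full_strip(line, delimiter="\t"):
    """Strip all leading/trailing whitespace, then split on the delimiter."""
    return line.strip().split(delimiter)


def split_after_newline_strip(line, delimiter="\t"):
    """Strip only trailing line-ending characters, then split."""
    return line.rstrip("\r\n").split(delimiter)


# The three rows from the example: a normal row, a row whose fields are
# single spaces, and a row with an empty trailing field.
lines = ["aa\tbb\n", " \t \n", "cc\t\n"]

full = [split_after_full_strip(line) for line in lines]
newline_only = [split_after_newline_strip(line) for line in lines]

for line, a, b in zip(lines, full, newline_only):
    print(repr(line), a, b)
```

With a full strip, the tab itself counts as whitespace, so the second row collapses to a single empty field and the third row loses its trailing field. Stripping only the line ending keeps two fields in every row.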
Make a test and we'll put it in.
Chuck
I know! Trying to write one made me realize that loadtxt is not handling string arrays correctly. So I have to check more on this as I think loadtxt is giving a 1-d array instead of a 2-d array.

I do agree with you Pierre but this is a nice corner case that Derek raised where a space does not necessarily mean a missing value when there is a tab delimiter:

    data = StringIO("aa\tbb\n \t \ncc\tdd")
    dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
    test = np.loadtxt(data, delimiter="\t", dtype=dt)
    control = np.array([['aa','bb'], [' ', ' '],['cc','dd']], dtype=dt)

So 'test' and 'control' should give the same array.

Bruce
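
On the 1-d versus 2-d question, a small sketch may help. It does not call loadtxt and is not the test that was eventually written. It only shows, assuming the same two-field dtype as above, that an array with a structured dtype built from one tuple per row is 1-d, with one record per row, and that single-space fields are stored as they are.

```python
import numpy as np

# Same two-field structured dtype as in the message above.
dt = np.dtype([("label", "S2"), ("comment", "S2")])

# One tuple per row: each tuple becomes a single record with two fields.
control = np.array([("aa", "bb"), (" ", " "), ("cc", "dd")], dtype=dt)

print(control.shape)       # one axis, one record per row
print(control["label"])    # first field of every record
print(control["comment"])  # second field of every record
```

So a 1-d result is what a structured dtype normally produces, and a control array for such a test would usually be built from tuples, not nested lists.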