[Numpy-discussion] loadtxt/savetxt tickets
Bruce Southey
bsouthey at gmail.com
Mon Apr 4 13:01:19 EDT 2011
On 04/04/2011 11:20 AM, Charles R Harris wrote:
>
>
> On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>
> On 03/31/2011 12:02 PM, Derek Homeier wrote:
> > On 31 Mar 2011, at 17:03, Bruce Southey wrote:
> >
> >> This is an invalid ticket because the docstring clearly states, in 3
> >> different yet critical places, that missing values are not handled
> >> here:
> >>
> >> "Each row in the text file must have the same number of values."
> >> "genfromtxt : Load data with missing values handled as specified."
> >> "This function aims to be a fast reader for simply formatted files. The
> >> `genfromtxt` function provides more sophisticated handling of, e.g.,
> >> lines with missing values."
> >>
> >> Really I am trying to separate the usage of loadtxt and genfromtxt to
> >> avoid unnecessary duplication and confusion. Part of this is
> >> historical because loadtxt was added in 2007 and genfromtxt was added
> >> in 2009. So really certain features of loadtxt have been 'kept' for
> >> backwards compatibility purposes yet these features can be 'abused' to
> >> handle missing data. But I really consider that any missing values
> >> should cause loadtxt to fail.
> >>
> > OK, I was not aware of the design issues of loadtxt vs. genfromtxt -
> > you could probably say also for historical reasons since I have not
> > used genfromtxt much so far.
> > Anyway the docstring statement "Converters can also be used to
> > provide a default value for missing data:"
> > then appears quite misleading, or an invitation to abuse, if you will.
> > This should better be removed from the documentation then, or users
> > explicitly discouraged from using converters instead of genfromtxt
> > (I don't see how you could completely prevent using converters in
> > this way).
> >
> >> The patch is incorrect because it should not include a space in the
> >> split() as indicated in the comment by the original reporter. Of
> > The split('\r\n') alone caused test_dtype_with_object(self) to fail,
> > probably because it relies on stripping the blanks. But maybe the test
> > is ill-formed?
> >
> >> course a corrected patch alone still is not sufficient to address the
> >> problem without the user providing the correct converter. Also you
> >> start to run into problems with multiple delimiters (such as one space
> >> versus two spaces) so you start down the path to add all the features
> >> that duplicate genfromtxt.
> > Given that genfromtxt provides that functionality more conveniently,
> > I agree again users should be encouraged to use this instead of
> > converters.
> > But the actual tab-problem causes in fact an issue not related to
> > missing values at all (well, depending on what you call a missing value).
> > I am describing an example on the ticket.
> >
> > Cheers,
> > Derek
> >
> Okay I see that 1071 got closed which I am fine with.
>
> I think that your following example should be a test because the two
> spaces should not be removed with a tab delimiter:
> np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
> dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))
>
>
> Make a test and we'll put it in.
>
> Chuck
>
>
I know!
Trying to write one made me realize that loadtxt is not handling string
arrays correctly. So I have to check more on this as I think loadtxt is
giving a 1-d array instead of a 2-d array.
I do agree with you, Pierre, but this is a nice corner case that Derek
raised, where a space does not necessarily mean a missing value when
there is a tab delimiter:
data = StringIO("aa\tbb\n \t \ncc\tdd")
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
test = np.loadtxt(data, delimiter="\t", dtype=dt)
control = np.array([['aa', 'bb'], [' ', ' '], ['cc', 'dd']], dtype=dt)
So 'test' and 'control' should give the same array.
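
A runnable sketch of that check, for anyone who wants to try it. It makes a
few assumptions that are not in the lines above: Python 3 imports, two spaces
in each field of the middle row (to match the "two spaces" mentioned earlier
in the thread), and a control array built from a list of tuples. NumPy builds
structured arrays from tuples, one per record, so with a two-field dtype both
arrays should come out 1-d with shape (3,) rather than 2-d. The function name
is made up for illustration, and whether the final comparison holds may
depend on the NumPy version in use.

```python
from io import StringIO

import numpy as np


def load_tab_separated_labels():
    """Return (test, control) for the tab-delimited corner case.

    Hypothetical helper, named here only for illustration.
    """
    # Assumption: the middle row holds two spaces per field; with a tab
    # delimiter they are data, not missing values.
    data = StringIO("aa\tbb\n  \t  \ncc\tdd")
    dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
    test = np.loadtxt(data, delimiter="\t", dtype=dt)
    # One tuple per record: a structured array of 3 records is 1-d.
    control = np.array([(b'aa', b'bb'), (b'  ', b'  '), (b'cc', b'dd')],
                       dtype=dt)
    return test, control


test, control = load_tab_separated_labels()
print(test.shape, control.shape)
print(test.tolist() == control.tolist())
```

If the whitespace-only fields were stripped on read, the middle record of
'test' would hold empty strings and the comparison would fail, which is
exactly what such a test is meant to catch.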
Bruce