[Numpy-discussion] loadtxt/savetxt tickets

Bruce Southey bsouthey at gmail.com
Mon Apr 4 13:01:19 EDT 2011


On 04/04/2011 11:20 AM, Charles R Harris wrote:
>
>
> On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>
>     On 03/31/2011 12:02 PM, Derek Homeier wrote:
>     > On 31 Mar 2011, at 17:03, Bruce Southey wrote:
>     >
>     >> This is an invalid ticket because the docstring clearly states, in
>     >> three different yet critical places, that missing values are not
>     >> handled here:
>     >>
>     >> "Each row in the text file must have the same number of values."
>     >> "genfromtxt : Load data with missing values handled as specified."
>     >> "   This function aims to be a fast reader for simply formatted
>     >>     files.  The `genfromtxt` function provides more sophisticated
>     >>     handling of, e.g., lines with missing values."
>     >>
>     >> Really I am trying to separate the usage of loadtxt and genfromtxt
>     >> to avoid unnecessary duplication and confusion. Part of this is
>     >> historical because loadtxt was added in 2007 and genfromtxt was
>     >> added in 2009. So really certain features of loadtxt have been
>     >> 'kept' for backwards compatibility purposes yet these features can
>     >> be 'abused' to handle missing data. But I really consider that any
>     >> missing values should cause loadtxt to fail.
>     >>
>     > OK, I was not aware of the design issues of loadtxt vs. genfromtxt -
>     > you could probably say also for historical reasons since I have not
>     > used genfromtxt much so far.
>     > Anyway the docstring statement "Converters can also be used to
>     >           provide a default value for missing data:"
>     > then appears quite misleading, or an invitation to abuse, if you will.
>     > This should better be removed from the documentation then, or users
>     > explicitly discouraged from using converters instead of genfromtxt
>     > (I don't see how you could completely prevent using converters in
>     > this way).
>     >
>     >> The patch is incorrect because it should not include a space in the
>     >> split() as indicated in the comment by the original reporter. Of
>     > The split('\r\n') alone caused test_dtype_with_object(self) to fail,
>     > probably because it relies on stripping the blanks. But maybe the
>     > test is ill-formed?
>     >
>     >> course a corrected patch alone still is not sufficient to address
>     >> the problem without the user providing the correct converter. Also
>     >> you start to run into problems with multiple delimiters (such as
>     >> one space versus two spaces) so you start down the path to add all
>     >> the features that duplicate genfromtxt.
>     > Given that genfromtxt provides that functionality more conveniently,
>     > I agree again users should be encouraged to use this instead of
>     > converters.
>     > But the actual tab-problem causes in fact an issue not related to
>     > missing values at all (well, depending on what you call a missing
>     > value).
>     > I am describing an example on the ticket.
>     >
>     > Cheers,
>     >                                       Derek
>     >
>     Okay I see that 1071 got closed which I am fine with.
>
>     I think that your following example should be a test because the two
>     spaces should not be removed with a tab delimiter:
>     np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
>     dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))
>
>
> Make a test and we'll put it in.
>
> Chuck
>
>
I know!
Trying to write one made me realize that loadtxt is not handling string 
arrays correctly. So I have to check more on this as I think loadtxt is 
giving a 1-d array instead of a 2-d array.
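
One thing that may be relevant here, offered as an assumption rather than a diagnosis of loadtxt: an array with a structured dtype is 1-d, with one record per row, while a plain 'S2' array built from the same values is 2-d. A minimal sketch (the names `records` and `plain` are only for illustration):

```python
import numpy as np

# Same field layout as in the test case discussed in this thread.
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])

# Structured array: each tuple is one record, so the array is 1-d.
records = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)
print(records.shape)       # one entry per row
print(records['label'])    # fields are accessed by name

# Plain string array: rows and columns are both array dimensions.
plain = np.array([['aa', 'bb'], [' ', ' '], ['cc', 'dd']], dtype='S2')
print(plain.shape)
```

So a 1-d result is what one would expect for a structured dtype; whether loadtxt preserves the single-space fields is a separate question.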

I do agree with you, Pierre, but this is a nice corner case that Derek 
raised: with a tab delimiter, a space does not necessarily mean a 
missing value:

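from io import StringIO
import numpy as np

data = StringIO("aa\tbb\n \t \ncc\tdd")
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
test = np.loadtxt(data, delimiter="\t", dtype=dt)
# Records of a structured dtype are written as tuples, one per row.
control = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)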

So 'test' and 'control' should give the same array.
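
The corner case comes down to how a single line is split. The sketch below uses plain Python string methods only; it is not loadtxt's actual implementation, and the variable names are hypothetical. It shows why stripping or splitting on generic whitespace loses the two single-space fields, while splitting on the tab alone keeps them:

```python
# The middle line of the test data: two single-space fields separated by a tab.
line = " \t "

# Splitting on the tab delimiter only keeps both single-space fields.
fields_tab = line.split("\t")
print(fields_tab)

# Splitting on any whitespace treats space and tab alike, so nothing is left.
fields_ws = line.split()
print(fields_ws)

# Stripping the line first also removes the tab, leaving one empty field.
stripped = line.strip().split("\t")
print(stripped)
```

Only the first variant yields two fields, which is what the tab-delimited test expects.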

Bruce