On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
On 26 March 2011, at 21:44, Derek Homeier wrote:
Hi Paul,
having had a look at the other tickets you dug up, here are my comments on each:
My opinions are my own, and in detail, they are:

#1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet.
Tested, see also my comments on Trac.
Great!
#1731: This seems like a rather trivial feature enhancement. I attach a possible patch.
Agreed. Haven't tested it though.
Great!
#1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support.
Looks like you got this one. Just remember to make it compatible with #1752. Should be easy.
#1562: I attach a possible patch. To my mind, this could also be the default behavior, since the function caller can simply call numpy.squeeze if needed. Changing the default behavior would probably break old code, however.
See comments on Trac as well.
Your patch is better, but there is one thing I disagree with:

```
808    if X.ndim < ndmin:
809        if ndmin == 1:
810            X.shape = (X.size, )
811        elif ndmin == 2:
812            X.shape = (X.size, 1)
```

The last line should be:

```
812            X.shape = (1, X.size)
```

If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO.
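To make the expectation concrete, here is a minimal illustration (using reshape to mimic what the patched ndmin=2 would do; the file contents are made up):

```python
import io
import numpy as np

# A one-row data file: the question is whether ndmin=2 should return
# shape (1, 3) (one row) or (3, 1) (one column).
one_row = io.StringIO("1.0 2.0 3.0\n")

a = np.loadtxt(one_row)  # a plain load squeezes this to shape (3,)
print(a.shape)           # -> (3,)

print(a.reshape(1, a.size).shape)  # X.shape = (1, X.size) -> (1, 3), one row
print(a.reshape(a.size, 1).shape)  # X.shape = (X.size, 1) -> (3, 1), one column
```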
#1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure about this.
There were some issues with Python 3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D arrays, but to splitting into fields for 1D recarrays. My question was, what's going to happen when you get to 2D recarrays? Currently this is not an issue, since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns.
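To spell out the distinction (a rough, hand-rolled illustration of what the #1458 behaviour amounts to, not the patch itself):

```python
import io
import numpy as np

txt = "1 2.5\n3 4.5\n"

# Regular 2D array: unpack=True is simply a transpose.
x, y = np.loadtxt(io.StringIO(txt), unpack=True)
# ...equivalent to: x, y = np.loadtxt(io.StringIO(txt)).T

# 1D structured array: the proposed 'unpack' would split by field,
# i.e. roughly this hand-rolled equivalent:
data = np.loadtxt(io.StringIO(txt),
                  dtype={'names': ('a', 'b'), 'formats': ('i4', 'f8')})
a, b = (data[name] for name in data.dtype.names)
print(a, b)  # -> [1 3] [2.5 4.5]
```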
OK, I don't know recarrays, as I said.
#1445: Adding this functionality could break old code, as some old data files may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function.

#1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here.
Agree about #1445, and about the 'usecols' bit - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even much shorter than 'usecols=range(20)'...); I don't think that justifies an extra argument. But the 'datablocks' idea provides something new that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if one could instead specify the separator lines through a parameter, e.g. "blocksep=['None', 'blank', 'invalid']" - not sure if that would make it more useful...
What about writing a separate function, e.g. loadblocktxt, and having it separate the file into chunks and call loadtxt for each chunk? Just a thought. Another possibility would be to write a function that lets you load a set of text files in a directory and returns a dict of datasets, one per file. One could write a similar save function, too. They would just need to call loadtxt/savetxt on a per-file basis.
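Something along those lines might look like this (a rough, hypothetical sketch - the name loadblocktxt and the blank-line splitting rule are placeholders, and it ignores loadtxt corner cases like comments inside blocks):

```python
import numpy as np

def loadblocktxt(fname, **kwargs):
    """Hypothetical helper: split a text file on blank lines and run
    np.loadtxt on each block, returning a list of arrays."""
    with open(fname) as f:
        content = f.read()
    # Keep only non-empty chunks; loadtxt accepts any iterable of lines.
    blocks = [b for b in content.split('\n\n') if b.strip()]
    return [np.loadtxt(b.splitlines(), **kwargs) for b in blocks]
```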
#1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket.
In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or '&'. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs, depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with "delimiter='\t'" - and in that case a converter à la {2: lambda s: float(s or 'NaN')} works for fields in the middle of the line, but not at the end - that clearly warrants improvement. I've prepared a patch that works for Python 3 as well.
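To make that concrete, a minimal example of the converter workaround (assuming a numpy where explicit-delimiter handling behaves; the end-of-line case is what needs the patch):

```python
import io
import numpy as np

# Tab-delimited data with an empty (missing) field in the middle:
txt = "1\t2.0\t3.0\n4\t\t6.0\n"

data = np.loadtxt(io.StringIO(txt), delimiter='\t',
                  converters={1: lambda s: float(s or 'NaN')})
print(data)
# [[ 1.   2.   3. ]
#  [ 4.   nan  6. ]]
# A missing field at the *end* of a line ("7\t8.0\t") is the case
# that currently fails and that the patch addresses.
```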
Great!
#1163, #1565: These tickets seem to share the same root problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will.
+1, though I am a bit concerned that potentially raising a ValueError for every element could impede performance. I'd probably still enclose it in an 'if issubclass(typ, np.uint64) or issubclass(typ, np.int64):' just like in npio.patch. I also thought one might switch to int(float128(x)) in that case, but at least for the given examples, float128 cannot convert with more accuracy than float64 (even on PowerPC ;-). There were some dissenting opinions, though, that trying to read a float into an int should generally throw an exception...
And Chuck just beat me...
I am sure someone has been using this functionality to convert floats to ints, and changing it will break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance. Even so, using numpy.int64 would be better, because it would give a sensible error message.
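For the record, the guarded converter under discussion would look roughly like this (a hypothetical sketch of the pattern, not the actual patch):

```python
import numpy as np

def _getconv_sketch(typ):
    """Pick a converter for dtype `typ`; only the 64-bit integer types
    pay for the exception-driven fallback through float."""
    if issubclass(typ, (np.int64, np.uint64)):
        def conv(x):
            try:
                return typ(x)          # fast path: "42"
            except ValueError:
                return typ(float(x))   # fallback: "42.0" -> 42
        return conv
    return typ

conv = _getconv_sketch(np.int64)
print(conv("42"), conv("42.0"))  # -> 42 42
```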
On 26 Mar 2011, at 21:25, Charles R Harris wrote:
I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt. Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also, new keywords should be added to the end, not put in the middle of existing keywords.
I haven't reviewed the patches, just tried to get them organized. Also, I have Derek as the author on all of them; that can be changed if it is decided the credit should go elsewhere ;) Thanks for the work you all have been doing on these tickets.
Thanks, I'll have a look at the new ticket and try to get that organized!
With some luck, all the loadtxt tickets should be closed in short time :-)
Can someone bring me up to date on which tickets have been dealt with? Also, how do they divide into bug fixes and enhancements?

Chuck