[Numpy-discussion] loadtxt/savetxt tickets

Sat Mar 26 16:44:14 EDT 2011

Hi Paul,

having had a look at the other tickets you dug up,

> My opinions are my own, and in detail, they are:
> 1752:
>    I attach a possible patch. FWIW, I agree with the request. The  
> patch is written to be compatible with the fix in ticket #1562, but  
> I did not test that yet.

Tested, see also my comments on Trac.

> 1731:
>    This seems like a rather trivial feature enhancement. I attach a  
> possible patch.

Agreed. Haven't tested it though.

> 1616:
>    The suggested patch seems reasonable to me, but I do not have a  
> full list of what objects loadtxt supports today as opposed to what  
> this patch will support.

> 1562:
>    I attach a possible patch. This could also be the default  
> behavior to my mind, since the function caller can simply call  
> numpy.squeeze if needed. Changing default behavior would probably  
> break old code, however.

See comments on Trac as well.

> 1458:
>    The fix suggested in the ticket seems reasonable, but I have  
> never used record arrays, so I am not sure  of this.

There were some issues with Python3, and I also had some general  
reservations
as noted on Trac - basically, it makes 'unpack' equivalent to  
transposing for 2D-arrays,
but to splitting into fields for 1D-recarrays. My question was, what's  
going to happen
when you get to 2D-recarrays? Currently this is not an issue since  
loadtxt can only
read 2D regular or 1D structured arrays. But this might change if the  
data block
functionality (see below) were to be implemented - data could then be  
returned as
3D arrays or 2D structured arrays... Still, it would probably make  
most sense (or at
least give the widest functionality) to have 'unpack=True' always  
return a list or iterator
over columns.

> 1445:
>    Adding this functionality could break old code, as some old  
> datafiles may have empty lines which are now simply ignored. I do  
> not think the feature is a good idea. It could rather be implemented  
> as a separate function.
> 1107:
>    I do not see the need for this enhancement. In my eyes, the  
> usecols kwarg does this and more. Perhaps I am misunderstanding  
> something here.

Agree about #1445, and the bit about 'usecols' - 'numcols' would just  
provide a
shorter call to e.g. read the first 20 columns of a file (well, not  
even that much
over 'usecols=range(20)'...), don't think that justifies an extra  
argument.
But the 'datablocks' provides something new, that a number of people  
seem
to miss from e.g. gnuplot (including me, actually ;-). And it would  
also satisfy the
request from #1445 without breaking backwards compatibility.
I've been wondering if could instead specify the separator lines  
through the
parameter, e.g. "blocksep=['None', 'blank','invalid']", not sure if  
that would make
it more useful...

> 1071:
> 	It is not clear to me whether loadtxt is supposed to support  
> missing values in the fashion indicated in the ticket.

In principle it should at least allow you to, by the use of converters  
as described there.
The problem is, the default delimiter is described as 'any  
whitespace', which in the
present implementation obviously includes any number of blanks or  
tabs. These
are therefore treated differently from delimiters like ',' or '&'. I'd  
reckon there are
too many people actually relying on this behaviour to silently change it
(e.g. I know plenty of tables with columns separated by either one or  
several
tabs depending on the length of the previous entry). But the tab is  
apparently also
treated differently if explicitly specified with "delimiter='\t'" -  
and in that case using
a converter à la {2: lambda s: float(s or 'Nan')} is working for  
fields in the middle of
the line, but not at the end - clearly warrants improvement. I've  
prepared a patch
working for Python3 as well.

> 1163:
> 1565:
>    These tickets seem to have the same origin of the problem. I  
> attach one possible patch. The previously suggested patches that  
> I've seen will not correctly convert floats to ints, which I believe  
> my patch will.

+1, though I am a bit concerned that prompting to raise a ValueError  
for every
element could impede performance. I'd probably still enclose it into an
if issubclass(typ, np.uint64) or issubclass(typ, np.int64):
just like in npio.patch. I also thought one might switch to  
int(float128(x)) in that
case, but at least for the given examples float128 cannot convert with  
more
accuracy than float64 (even on PowerPC ;-).
There were some dissenting opinions that trying to read a float into  
an int should
generally throw an exception though...

And Chuck just beat me...

On 26 Mar 2011, at 21:25, Charles R Harris wrote:

> I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt 
> . Please pull from there to continue work on loadtxt/savetxt so as  
> to avoid conflicts in the patches. One of the numpy tests is  
> failing, I assume from patch conflicts, and more tests for the  
> tickets are needed in any case. Also, new keywords should be added  
> to the end, not put in the middle of existing keywords.
>
> I haven't reviewed the patches, just tried to get them organized.  
> Also, I have Derek as the author on all of them, that can be changed  
> if it is decided the credit should go elsewhere ;) Thanks for the  
> work you all have been doing on these tickets.

Thanks, I'll have a look at the new ticket and try to get that  
organized!

Cheers,
						Derek