On 26 March 2011, at 21:44, Derek Homeier wrote:
> Hi Paul,
>
> having had a look at the other tickets you dug up,
>
>> My opinions are my own, and in detail, they are:
>> 1752:
>> I attach a possible patch. FWIW, I agree with the request. The
>> patch is written to be compatible with the fix in ticket #1562, but
>> I did not test that yet.
>
> Tested, see also my comments on Trac.
Great!
>> 1731:
>> This seems like a rather trivial feature enhancement. I attach a
>> possible patch.
>
> Agreed. Haven't tested it though.
Great!
>> 1616:
>> The suggested patch seems reasonable to me, but I do not have a
>> full list of what objects loadtxt supports today as opposed to what
>> this patch will support.
Looks like you got this one. Just remember to make it compatible with #1752. Should be easy.
>> 1562:
>> I attach a possible patch. This could also be the default
>> behavior to my mind, since the function caller can simply call
>> numpy.squeeze if needed. Changing default behavior would probably
>> break old code, however.
>
> See comments on Trac as well.
Your patch is better, but there is one thing I disagree with.
808 if X.ndim < ndmin:
809     if ndmin == 1:
810         X.shape = (X.size, )
811     elif ndmin == 2:
812         X.shape = (X.size, 1)
The last line should be:
812         X.shape = (1, X.size)
If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO.
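To make the expectation concrete, here is a minimal standalone sketch of that reshape logic (apply_ndmin is a hypothetical helper for illustration, not the actual loadtxt code):

import numpy as np

def apply_ndmin(X, ndmin=0):
    # Pad the result up to the requested number of dimensions.
    if X.ndim < ndmin:
        if ndmin == 1:
            X.shape = (X.size,)
        elif ndmin == 2:
            X.shape = (1, X.size)  # one-row file -> one-row array
    return X

# A single-row file parses to a 1D array; ndmin=2 should make it 1 x N:
row = apply_ndmin(np.array([1., 2., 3.]), ndmin=2)
assert row.shape == (1, 3)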
>> 1458:
>> The fix suggested in the ticket seems reasonable, but I have
>> never used record arrays, so I am not sure of this.
>
> There were some issues with Python3, and I also had some general
> reservations as noted on Trac - basically, it makes 'unpack'
> equivalent to transposing for 2D-arrays, but to splitting into fields
> for 1D-recarrays. My question was, what's going to happen when you
> get to 2D-recarrays? Currently this is not an issue since loadtxt can
> only read 2D regular or 1D structured arrays. But this might change
> if the data block functionality (see below) were to be implemented -
> data could then be returned as 3D arrays or 2D structured arrays...
> Still, it would probably make most sense (or at least give the widest
> functionality) to have 'unpack=True' always return a list or iterator
> over columns.
OK, I don't know recarrays, as I said.
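If I follow your description, though, the two behaviours amount to something like this in plain numpy (my reading of it, not the loadtxt internals):

import numpy as np

# For a regular 2D array, unpack=True is just a transpose:
data2d = np.array([[1., 2.], [3., 4.]])
x, y = data2d.T  # iterate over columns

# For a 1D structured array, it would instead split into fields:
rec = np.array([(1, 2.0), (3, 4.0)], dtype=[('a', int), ('b', float)])
a, b = (rec[name] for name in rec.dtype.names)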
>> 1445:
>> Adding this functionality could break old code, as some old
>> datafiles may have empty lines which are now simply ignored. I do
>> not think the feature is a good idea. It could rather be implemented
>> as a separate function.
>> 1107:
>> I do not see the need for this enhancement. In my eyes, the
>> usecols kwarg does this and more. Perhaps I am misunderstanding
>> something here.
>
> Agree about #1445, and the bit about 'usecols' - 'numcols' would just
> provide a shorter call to e.g. read the first 20 columns of a file
> (well, not even that much over 'usecols=range(20)'...) - I don't
> think that justifies an extra argument.
> But 'datablocks' provides something new that a number of people seem
> to miss from e.g. gnuplot (including me, actually ;-). And it would
> also satisfy the request from #1445 without breaking backwards
> compatibility.
> I've been wondering if one could instead specify the separator lines
> through a parameter, e.g. "blocksep=['None', 'blank', 'invalid']" -
> not sure if that would make it more useful...
What about writing a separate function, e.g. loadblocktxt, that separates the chunks and calls loadtxt on each one? Just a thought. Another possibility would be a function that loads a set of text files in a directory and returns a dict of datasets, one per file. One could write a similar save function, too. They would just need to call loadtxt/savetxt on a per-file basis.
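A minimal sketch of the first idea (loadblocktxt is just a suggested name, not an existing numpy function):

import numpy as np
from io import StringIO

def loadblocktxt(fname, **kwargs):
    # Split the file into blank-line-separated blocks and hand each
    # block to np.loadtxt, returning a list of arrays.
    with open(fname) as f:
        blocks = f.read().split('\n\n')
    return [np.loadtxt(StringIO(b), **kwargs)
            for b in blocks if b.strip()]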
>> 1071:
>> It is not clear to me whether loadtxt is supposed to support
>> missing values in the fashion indicated in the ticket.
>
> In principle it should at least allow you to, by the use of
> converters as described there. The problem is, the default delimiter
> is described as 'any whitespace', which in the present implementation
> obviously includes any number of blanks or tabs. These are therefore
> treated differently from delimiters like ',' or '&'. I'd reckon there
> are too many people actually relying on this behaviour to silently
> change it (e.g. I know plenty of tables with columns separated by
> either one or several tabs depending on the length of the previous
> entry). But the tab is apparently also treated differently if
> explicitly specified with "delimiter='\t'" - and in that case using a
> converter à la {2: lambda s: float(s or 'Nan')} is working for fields
> in the middle of the line, but not at the end - clearly warrants
> improvement. I've prepared a patch working for Python3 as well.
Great!
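For reference, the converter workaround you describe would be used like this (illustrative only; as you note, before your patch it only worked for fields in the middle of a line, not at the end):

import numpy as np
from io import StringIO

# Empty tab-separated fields are mapped to NaN by the converter.
data = StringIO("1.0\t\t3.0\n4.0\t5.0\t6.0\n")
arr = np.loadtxt(data, delimiter='\t',
                 converters={1: lambda s: float(s or 'NaN')})
# arr[0, 1] is nan; the other entries parse normally.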
>> 1163:
>> 1565:
>> These tickets seem to have the same origin of the problem. I
>> attach one possible patch. The previously suggested patches that
>> I've seen will not correctly convert floats to ints, which I believe
>> my patch will.
>
> +1, though I am a bit concerned that prompting to raise a ValueError
> for every element could impede performance. I'd probably still
> enclose it in an
> if issubclass(typ, np.uint64) or issubclass(typ, np.int64):
> just like in npio.patch. I also thought one might switch to
> int(float128(x)) in that case, but at least for the given examples
> float128 cannot convert with more accuracy than float64 (even on
> PowerPC ;-).
> There were some dissenting opinions that trying to read a float into
> an int should generally throw an exception though...
>
> And Chuck just beat me...
I am sure someone has been using this functionality to convert floats to ints. Changing it would break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance. Even so, using numpy.int64 would be better, because it would give a sensible error message.
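For concreteness, the kind of fallback under discussion looks roughly like this (a sketch of the assumed logic, not the exact patch):

def str_to_int(s):
    # Try a direct integer parse first; fall back through float so
    # that strings like '2.0' or '1e3' still convert.
    try:
        return int(s)
    except ValueError:
        # The float round trip silently truncates '1.5' to 1 and
        # loses precision for integers above 2**53.
        return int(float(s))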
> On 26 Mar 2011, at 21:25, Charles R Harris wrote:
>
>> I put all these patches together at
>> https://github.com/charris/numpy/tree/loadtxt-savetxt .
>> Please pull from there to continue work on loadtxt/savetxt so as to
>> avoid conflicts in the patches. One of the numpy tests is failing, I
>> assume from patch conflicts, and more tests for the tickets are
>> needed in any case. Also, new keywords should be added to the end,
>> not put in the middle of existing keywords.
>>
>> I haven't reviewed the patches, just tried to get them organized.
>> Also, I have Derek as the author on all of them; that can be changed
>> if it is decided the credit should go elsewhere ;) Thanks for the
>> work you all have been doing on these tickets.
>
> Thanks, I'll have a look at the new ticket and try to get that
> organized!
With some luck, all the loadtxt tickets should be closed in short order :-)