[Numpy-discussion] deprecate fromstring() for text reading?

Benjamin Root ben.v.root at gmail.com
Tue Nov 3 09:59:59 EST 2015


Correct, there were entries that would sometimes take up their entire
width. The delimited text readers could not read this particular dataset.
The dataset I am referring to is the processed ISD data:
https://www.ncdc.noaa.gov/isd
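
To make the case concrete, here is a minimal sketch of reading fields that
fill their entire width with np.genfromtxt(), passing a tuple of field widths
as `delimiter`. The widths and sample rows are made up for illustration and
are not the actual ISD layout.

```python
import io
import numpy as np

# Two hypothetical fixed-width records: a 3-character field followed by a
# 5-character field. In the second row both fields fill their whole width,
# so there is no whitespace to split on.
text = "  1 2.50\n12345.67\n"

# A sequence of integers as `delimiter` is interpreted as field widths.
data = np.genfromtxt(io.StringIO(text), delimiter=(3, 5))
print(data)
```

A whitespace- or comma-delimited reader would see the second row as a single
token, which is why this kind of file needs a width-based split.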

As for fromstring() not being able to help there, I didn't mean to imply
that it would. I was more aiming to point out a situation where NumPy's
text file reader was significantly better than the Pandas version, so we
would want to make sure that we properly benchmark any significant changes
to NumPy's text reading code. Who knows where else NumPy beats Pandas?
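
A benchmark for this could look roughly like the sketch below. The helper
names (make_fixed_width, time_reader), the field widths, and the row count are
all assumptions for illustration; the data is synthetic, not ISD. The same
timing harness could be pointed at pandas' read_fwf() for a comparison.

```python
import io
import timeit
import numpy as np

def make_fixed_width(n_rows, widths=(4, 4, 4)):
    """Build synthetic fixed-width text with no separators between fields."""
    rng = np.random.default_rng(0)
    # Values up to 9999, so some entries fill the whole 4-character field.
    rows = rng.integers(0, 10_000, size=(n_rows, len(widths))).tolist()
    lines = ("".join(f"{v:>{w}d}" for v, w in zip(row, widths)) for row in rows)
    return "\n".join(lines) + "\n"

def time_reader(text, widths, repeat=3):
    """Return the best wall-clock time of `repeat` genfromtxt runs."""
    def read():
        return np.genfromtxt(io.StringIO(text), delimiter=widths)
    return min(timeit.repeat(read, number=1, repeat=repeat))

text = make_fixed_width(1000)
print(f"genfromtxt: {time_reader(text, (4, 4, 4)):.4f} s")
```

Taking the minimum of several runs reduces noise from other processes; a
serious comparison would also vary file size and field types.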

Ben

On Mon, Nov 2, 2015 at 6:44 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Tue, Oct 27, 2015 at 7:30 AM, Benjamin Root <ben.v.root at gmail.com>
> wrote:
>
>> FWIW, when I needed a fast Fixed Width reader
>>
>
> Was there potentially no whitespace between fields in that case? In which
> case, it really is a different use case than delimited text -- if it's at
> all common, a version written in C would be nice and fast, and not hard to
> do.
>
> But fromstring never would have helped you with that anyway :-)
>
> -CHB
>
>
>
>> for a very large dataset last year, I found that np.genfromtxt() was
>> faster than pandas' read_fwf(). IIRC, pandas' text reading code fell back
>> to pure Python for fixed width scenarios.
>>
>> On Fri, Oct 23, 2015 at 8:22 PM, Chris Barker - NOAA Federal <
>> chris.barker at noaa.gov> wrote:
>>
>>> Grabbing the pandas csv reader would be great, and I hope it happens
>>> sooner rather than later, though alas, I haven't the spare cycles for it
>>> either.
>>>
>>> In the meantime though, can we put in a deprecation warning when using
>>> fromstring() on text files? It's really pretty broken.
>>>
>>> -Chris
>>>
>>> On Oct 23, 2015, at 4:02 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>>>
>>>
>>>
>>> On Oct 23, 2015, at 6:49 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>
>>> On Oct 23, 2015 3:30 PM, "Jeff Reback" <jeffreback at gmail.com> wrote:
>>> >
>>> > On Oct 23, 2015, at 6:13 PM, Charles R Harris <
>>> charlesr.harris at gmail.com> wrote:
>>> >
>>> >>
>>> >>
>>> >> On Thu, Oct 22, 2015 at 5:47 PM, Chris Barker - NOAA Federal <
>>> chris.barker at noaa.gov> wrote:
>>> >>>
>>> >>>
>>> >>>> I think it would be good to keep the usage to read binary data at
>>> least.
>>> >>>
>>> >>>
>>> >>> Agreed -- it's only the text file reading I'm proposing to
>>> deprecate. It was kind of weird to cram it in there in the first place.
>>> >>>
>>> >>> Oh, fromfile() has the same issues.
>>> >>>
>>> >>> Chris
>>> >>>
>>> >>>
>>> >>>> Or is there a good alternative to `np.fromstring(<bytes>,
>>> dtype=...)`?  -- Marten
>>> >>>>
>>> >>>> On Thu, Oct 22, 2015 at 1:03 PM, Chris Barker <
>>> chris.barker at noaa.gov> wrote:
>>> >>>>>
>>> >>>>> There was just a question about a bug/issue with scipy.fromstring
>>> (which is numpy.fromstring) when used to read integers from a text file.
>>> >>>>>
>>> >>>>>
>>> https://mail.scipy.org/pipermail/scipy-user/2015-October/036746.html
>>> >>>>>
>>> >>>>> fromstring() is buggy and inflexible for reading text files --
>>> and it is a very, very ugly mess of code. I dug into it a while back, and
>>> gave up -- just too much of a mess!
>>> >>>>>
>>> >>>>> So we really should completely re-implement it, or deprecate it. I
>>> doubt anyone is going to do a big refactor, so that means deprecating it.
>>> >>>>>
>>> >>>>> Also -- if we do want a fast "read numbers from text files" function
>>> (which would be nice, actually), it really should get a new name anyway.
>>> >>>>>
>>> >>>>> (and the hopefully coming new dtype system would make it easier to
>>> write cleanly)
>>> >>>>>
>>> >>>>> I'm not sure what deprecating something means, though -- have it
>>> raise a deprecation warning in the next version?
>>> >>>>>
>>> >>
>>> >> There was discussion at SciPy 2015 of separating out the text reading
>>> abilities of Pandas so that numpy could include it. We should contact Jeff
>>> Reback and see about moving that forward.
>>> >
>>> >
>>> > IIRC Thomas Caswell was interested in doing this :)
>>>
>>> When he was in Berkeley a few weeks ago he assured me that every night
>>> since SciPy he has dutifully been feeling guilty about not having done it
>>> yet. I think this week his paltry excuse is that he's "on his honeymoon" or
>>> something.
>>>
>>> ...which is to say that if someone has some spare cycles to take this
>>> over then I think that might be a nice wedding present for him :-).
>>>
>>> (The basic idea is to take the text reading backend behind
>>> pandas.read_csv and extract it into a standalone package that pandas could
>>> depend on, and that could also be used by other packages like numpy (among
>>> others -- I think Dato's SFrame package has a fork of this code as well?))
>>>
>>> -n
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>> I can certainly provide guidance on how/what to extract but don't have
>>> spare cycles myself for this :(
>>>
>>
>>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>
>
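
As a rough sketch of the proposal quoted above (keep binary reading, warn on
text reading), something along these lines could work. The function name
fromstring_sketch and the warning text are hypothetical; this is not NumPy's
implementation. np.frombuffer() is used here as one possible answer to the
quoted question about an alternative for binary data.

```python
import warnings
import numpy as np

def fromstring_sketch(data, dtype=float, sep=""):
    """Hypothetical wrapper illustrating the proposal; not NumPy's code."""
    if sep == "":
        # Binary mode: reinterpret the raw bytes without copying.
        # For a bytes input the result is a read-only array.
        return np.frombuffer(data, dtype=dtype)
    # Text mode (expects a str): warn that this path is deprecated,
    # then fall back to a simple split-and-convert parse.
    warnings.warn(
        "text-mode parsing is deprecated in this sketch; "
        "use a dedicated text reader instead",
        DeprecationWarning,
        stacklevel=2,
    )
    fields = [f for f in data.split(sep) if f.strip()]
    return np.array([float(f) for f in fields], dtype=dtype)

raw = np.array([1, 2, 3], dtype=np.int32).tobytes()
print(fromstring_sketch(raw, dtype=np.int32))
```

Callers who only ever pass bytes see no change in behaviour, while text-mode
callers get a warning pointing them elsewhere before anything is removed.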