deprecate fromstring() for text reading?
There was just a question about a bug/issue with scipy.fromstring (which is numpy.fromstring) when used to read integers from a text file.
https://mail.scipy.org/pipermail/scipy-user/2015-October/036746.html
fromstring() is buggy and inflexible for reading text files, and it is a very, very ugly mess of code. I dug into it a while back and gave up: just too much of a mess!
So we really should completely reimplement it, or deprecate it. I doubt anyone is going to do a big refactor, so that means deprecating it.
Also, if we do want a fast read-numbers-from-text-files function (which would be nice, actually), it really should get a new name anyway.
(And the hopefully-coming new dtype system would make it easier to write cleanly.)
I'm not sure what deprecating something means in practice, though: have it raise a deprecation warning in the next version?
CHB
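For readers unfamiliar with the mechanics: in Python, deprecating a usage normally means the function keeps working but emits a DeprecationWarning for that usage. The following is a minimal sketch only, not NumPy's code. `parse_numbers` is a hypothetical stand-in for fromstring(), and the rule "warn only when sep is non-empty" is an assumption based on the proposal to deprecate text reading while keeping binary reading.

```python
import warnings


def parse_numbers(data, sep=""):
    """Hypothetical stand-in for fromstring(); not NumPy's implementation."""
    if sep != "":
        # Text mode: still works, but tells the caller it is on the way out.
        warnings.warn(
            "parsing text with sep != '' is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        return [float(tok) for tok in data.split(sep) if tok.strip()]
    # Binary mode: untouched by the deprecation (raw byte values for bytes input).
    return list(data)


# Record warnings so the example shows what a caller would see.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = parse_numbers("1 2 3", sep=" ")
```

Here `caught` holds the single DeprecationWarning triggered by the text-mode call, while a call without `sep` would trigger none.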
I think it would be good to keep the usage to read binary data at least. Or is there a good alternative to `np.fromstring(<bytes>, dtype=...)`?
-- Marten
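For raw bytes, np.frombuffer() is the usual alternative: it interprets an existing buffer as a 1-D array without any text parsing. A minimal sketch, with sample values made up for illustration:

```python
import numpy as np

# Four little-endian 32-bit integers packed into a bytes object.
raw = np.array([1, 2, 3, 4], dtype="<i4").tobytes()

# frombuffer() reinterprets the bytes in place; no text parsing is involved.
arr = np.frombuffer(raw, dtype="<i4")

# The result is a read-only view of `raw`; copy it if it must be modified.
writable = arr.copy()
writable[0] = 10
```

The read-only result is the main practical difference to keep in mind when swapping this in for a call that used to return a fresh array.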
On Thu, Oct 22, 2015 at 1:03 PM, Chris Barker chris.barker@noaa.gov wrote:

Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
> I think it would be good to keep the usage to read binary data at least.
Agreed, it's only the text file reading I'm proposing to deprecate. It was kind of weird to cram it in there in the first place.
Oh, and fromfile() has the same issues.
Chris
There was discussion at SciPy 2015 of separating out the text-reading abilities of Pandas so that NumPy could include them. We should contact Jeff Reback and see about moving that forward.
Chuck
IIRC Thomas Caswell was interested in doing this :)
Jeff
When he was in Berkeley a few weeks ago he assured me that every night since SciPy he has dutifully been feeling guilty about not having done it yet. I think this week his paltry excuse is that he's "on his honeymoon" or something.
...which is to say that if someone has some spare cycles to take this over then I think that might be a nice wedding present for him :).
(The basic idea is to take the text-reading backend behind pandas.read_csv and extract it into a standalone package that pandas could depend on, and that could also be used by other packages like numpy. Among others, I think Dato's SFrame package has a fork of this code as well?)
-n
I can certainly provide guidance on how/what to extract but don't have spare cycles myself for this :(
Grabbing the pandas CSV reader would be great, and I hope it happens sooner rather than later, though alas, I haven't the spare cycles for it either.
In the meantime though, can we add a DeprecationWarning when fromstring() is used on text files? It's really pretty broken.
Chris
On Oct 23, 2015, at 4:02 PM, Jeff Reback jeffreback@gmail.com wrote:
> I can certainly provide guidance on how/what to extract but don't have spare cycles myself for this :(
FWIW, when I needed a fast fixed-width reader for a very large dataset last year, I found that np.genfromtxt() was faster than pandas' read_fwf(). IIRC, pandas' text reading code fell back to pure Python for fixed-width scenarios.
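For reference, genfromtxt() treats an integer (or a sequence of integers) passed as `delimiter` as field widths rather than as a separator, which is what makes it usable for fixed-width data. A small sketch with made-up data in the style of the NumPy user guide, not the dataset mentioned above:

```python
import io

import numpy as np

# Three columns, each exactly 3 characters wide; note "890123" has no gap.
text = "  1  2  3\n  4  5 67\n890123  4"

# An integer delimiter means "every field is this many characters wide".
arr = np.genfromtxt(io.StringIO(text), delimiter=3)
```

With no dtype given, the values come back as floats in a 3x3 array; passing a tuple of widths instead of a single integer handles columns of unequal width.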
On Tue, Oct 27, 2015 at 7:30 AM, Benjamin Root ben.v.root@gmail.com wrote:
> FWIW, when I needed a fast Fixed Width reader [...]
Was there potentially no whitespace between fields in that case? If so, it really is a different use case than delimited text. If it's at all common, a version written in C would be nice and fast, and not hard to do.
But fromstring never would have helped you with that anyway :)
CHB
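To make the distinction concrete: when a value fills its whole column, adjacent fields run together, so a whitespace or delimiter split miscounts them, while slicing by known column widths does not. A minimal pure-Python sketch; the record and the widths are invented for illustration:

```python
# One fixed-width record: three fields, each 3 characters wide.
record = "890123  4"
widths = (3, 3, 3)

# Delimiter-style parsing sees only two tokens because "890" and "123" touch.
by_whitespace = record.split()

# Width-based parsing cuts at known offsets, so every field is recovered.
fields = []
start = 0
for width in widths:
    fields.append(int(record[start:start + width]))
    start += width
```

This is also why a fixed-width reader needs the column layout up front and cannot infer it from separators.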
Correct, there were entries that would sometimes take up their entire width. The delimited text readers could not read this particular dataset. The dataset I am referring to is the processed ISD data: https://www.ncdc.noaa.gov/isd
As for fromstring() not being able to help there, I didn't mean to imply that it would. I was more aiming to point out a situation where NumPy's text file reader was significantly better than the Pandas version, so we would want to make sure that we properly benchmark any significant changes to NumPy's text reading code. Who knows where else NumPy beats Pandas?
Ben
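A benchmark along those lines does not need much machinery. The sketch below uses only the standard library's timeit. The two reader functions are hypothetical placeholders for whichever real readers are being compared (for example np.genfromtxt and pandas' read_fwf), and the input is synthetic.

```python
import timeit


def read_by_split(text):
    """Placeholder reader: whitespace-delimited parsing."""
    return [[float(tok) for tok in line.split()] for line in text.splitlines()]


def read_by_width(text, width=8):
    """Placeholder reader: fixed-width parsing by slicing each line."""
    return [
        [float(line[i:i + width]) for i in range(0, len(line), width)]
        for line in text.splitlines()
    ]


# Synthetic input: 1000 rows of three right-aligned 8-character fields.
rows = ["".join(f"{v:8.2f}" for v in (i, i + 0.5, i + 1.0)) for i in range(1000)]
text = "\n".join(rows)

# Check that both readers agree before comparing their speed.
assert read_by_split(text) == read_by_width(text)

# Best of several repeats is less sensitive to background noise than the mean.
timings = {
    name: min(timeit.repeat(lambda f=func: f(text), number=5, repeat=3))
    for name, func in (("split", read_by_split), ("width", read_by_width))
}
```

Checking that the readers return identical results first matters as much as the timing itself, since a faster reader that parses differently is not a fair comparison.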
> I was more aiming to point out a situation where the NumPy's text file reader was significantly better than the Pandas version, so we would want to make sure that we properly benchmark any significant changes to NumPy's text reading code. Who knows where else NumPy beats Pandas?
Indeed. For this example, I think a fixed-width reader really is a different animal, and it's probably a good idea to have a high-performance one in NumPy. Among other things, you wouldn't want it to try to auto-determine data types or anything like that.
I think what's on the table now is to bring in a new delimited reader, i.e. CSV in its various flavors.
CHB
On 3 Nov 2015, at 6:03 pm, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
> I think what's on the table now is to bring in a new delimited reader, i.e. CSV in its various flavors.
To add my own handful of change or at least another data point, I had been looking into both the pandas and the Astropy fast readers as a fast loadtxt/genfromtxt replacement; at the time I found the Astropy cparser source somewhat easier to dig into, although looking now Pandas' parser.pyx seems clear enough as well. Some comparison of the two can be found at http://astropy.readthedocs.org/en/stable/io/ascii/fast_ascii_io.html#speedg...
Unfortunately the Astropy fast reader currently does not support fixed-width format either, and adding this functionality would require modifications to the tokenizer C code; I'm not sure how extensive.
Cheers, Derek
participants (8)
Benjamin Root
Charles R Harris
Chris Barker
Chris Barker - NOAA Federal
Derek Homeier
Jeff Reback
Marten van Kerkwijk
Nathaniel Smith