Possible roadmap addendum: building better text file readers
Dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's initial list.

All of the existing flat file reading solutions I have seen are not suitable for many applications, and they compare very unfavorably to tools present in other languages, like R. Here are some of the main issues I see:

- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loath to return to the operating system. Any solution using the CSV module (like pandas's parsers -- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor performance. Also, for the actual parsing, anything regular expression based (like the loadtable effort over the summer, all apologies to those who worked on it) is doomed to failure. I do think a tool with a high degree of compatibility and intelligence for parsing unruly small files makes sense, but it's not appropriate for large, well-behaved files.

- Need to "factorize": as soon as there is an enum dtype in NumPy, we will want to enable the file parsers for structured arrays and DataFrame to be able to "factorize" / convert to enum certain columns (for example, all string columns) during the parsing process, and not afterward. This is very important for enabling fast groupby on large datasets and reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement using a C hash table implementation like khash.h.

To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?) or maybe Cython plus C utilities.

If anyone wants to get involved in this particular problem right now, let me know!

best,
Wes
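To make the "factorize" idea above concrete, here is a minimal pure-Python sketch. It is only an illustration, not the C implementation Wes proposes: a dict stands in for a C hash table such as khash.h, and the function name `factorize` and its return layout are assumptions made for this example.

```python
from array import array

def factorize(values):
    """Map each value to a small integer code in first-seen order.

    Returns (codes, uniques), where codes[i] indexes into uniques.
    A dict stands in here for a C hash table (e.g. khash.h).
    """
    table = {}            # value -> integer code
    uniques = []          # code -> value
    codes = array("q")    # compact 64-bit integers, one per row
    for value in values:
        code = table.get(value)
        if code is None:
            # First time this value is seen: assign the next code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques

column = ["NY", "CA", "NY", "TX", "CA", "NY"]
codes, uniques = factorize(column)
print(list(codes))   # [0, 1, 0, 2, 1, 0]
print(uniques)       # ['NY', 'CA', 'TX']
```

With a million rows and only 10 unique values, the column reduces to one integer per row plus 10 stored strings, which is the memory saving the message describes.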
Hi,

On 23.02.2012 20:32, Wes McKinney wrote: [clip]
To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.
If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading. -- Pauli Virtanen
This is actually on my short-list as well --- it just didn't make it to the list.

In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.

Integration with np.loadtxt is a high priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But we do need to make it faster, with less memory overhead, for simple cases like the ones Wes describes.

-Travis
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant <travis@continuum.io> wrote:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
-Travis
Yeah, what I envision is just an infrastructural "parsing engine" to replace the pure Python guts of np.loadtxt, np.genfromtxt, and the csv module + Cython guts of pandas.read_{csv, table, excel}. It needs to be somewhat adaptable to some of the domain specific decisions of structured arrays vs. DataFrames-- like I use Python objects for strings, but one consequence of this is that I can "intern" strings (only one PyObject per unique string value occurring) where structured arrays cannot, so you get much better performance and memory usage that way. That's soon to change, though, I gather, at which point I'll almost definitely (!) move to pointer arrays instead of dtype=object arrays. - Wes
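The string "interning" mentioned above (one object per unique string value) can be sketched in a few lines of Python. This is an illustrative sketch only; the helper name `intern_column` is hypothetical and is not part of pandas or NumPy.

```python
def intern_column(values):
    """Return a list in which equal strings share a single object."""
    pool = {}
    # setdefault returns the first object stored for an equal key,
    # so later duplicates are replaced by that first object.
    return [pool.setdefault(v, v) for v in values]

# Build equal strings at run time, the way a parser would produce them.
raw = ["".join(["gro", "up", str(i % 3)]) for i in range(9)]
interned = intern_column(raw)

print(len({id(s) for s in interned}))  # 3
```

After interning, the nine entries refer to only three string objects, so the cost of the column is nine pointers plus three strings rather than nine separate strings.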
On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant <travis@continuum.io> wrote:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
-Travis
I have a "proof of concept" CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend. Warren
On Thu, Feb 23, 2012 at 3:19 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
I have a "proof of concept" CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend.
Warren
Sweet, between this, the Continuum folks, and me and my guys, I think we can come up with something good that suits all our needs. We should set up some realistic performance test cases that we can monitor via vbench (wesm/vbench) while we work on the project. - W
On Thursday 23 February 2012 21:24:28, Wes McKinney wrote:
Sweet, between this, the Continuum folks, and me and my guys, I think we can come up with something good that suits all our needs. We should set up some realistic performance test cases that we can monitor via vbench (wesm/vbench) while we work on the project.
That would indeed be great. Reading large files is a real pain whatever Python method is used. BTW, could you tell us what you mean by large files? cheers, Éric.
An AZERTY keyboard is worth two ---------------------------------------------------------- Éric Depagne eric@depagne.org
On Thu, Feb 23, 2012 at 3:31 PM, Éric Depagne <eric@depagne.org> wrote:
On Thursday 23 February 2012 21:24:28, Wes McKinney wrote:
That would indeed be great. Reading large files is a real pain whatever the python method used.
BTW, could you tell us what you mean by large files?
cheers, Éric.
Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures. - Wes
Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures.
As a benchmark, recfile can read an uncached file with 350,000 lines and 32 columns in about 5 seconds. File size ~220M -e -- Erin Scott Sheldon Brookhaven National Laboratory
On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures.
As a benchmark, recfile can read an uncached file with 350,000 lines and 32 columns in about 5 seconds. File size ~220M
-e -- Erin Scott Sheldon Brookhaven National Laboratory
That's pretty good. It's almost certainly faster than pandas's csv-module+Cython approach (though I haven't run your code to get a read on how much difference my hardware makes), but that's not shocking at all:

In [1]: df = DataFrame(np.random.randn(350000, 32))

In [2]: df.to_csv('/home/wesm/tmp/foo.csv')

In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I have to think the difference comes from skipping the process of creating 11.2 million Python string objects and then individually converting each of them to float.

Note for reference (I'm skipping the first row, which has the column labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90 MB. If you're a data scientist working in Python, this is _not good_.

-W
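The figures quoted above can be checked with a little arithmetic. This sketch assumes the array is stored as 8-byte float64 values and takes the 500 MB figure from the message as given.

```python
rows, cols = 350_000, 32
n_values = rows * cols                 # fields the parser must convert
array_bytes = n_values * 8             # float64 takes 8 bytes per value

print(n_values)                        # 11200000
print(array_bytes / 1e6)               # 89.6 (MB)
print(round(array_bytes / 2**20, 1))   # 85.4 (MiB)
print(round(500e6 / array_bytes, 1))   # 5.6, reported 500 MB vs. ideal size
```

So the 11.2 million intermediate string objects correspond to one per field, and the reported peak is between five and six times the size of the finished array.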
On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
In this last case for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_.
But why, oh why, are people storing big data in CSV? G
On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
In this last case for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_.
But why, oh why, are people storing big data in CSV?
Because everyone can read it. It's not so much "storage" as "transmission". -- Robert Kern
On Thu, Feb 23, 2012 at 3:14 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
In this last case for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_.
But why, oh why, are people storing big data in CSV?
Because everyone can read it. It's not so much "storage" as "transmission".
Because their labmate/officemate/advisor is using Excel... Ben Root
On 23/02/2012 22:38, Benjamin Root wrote:
labmate/officemate/advisor is using Excel...

... or an industrial partner with its Windows-based software that can export (when it works) some very nice field data from a proprietary Honeywell data logger.

CSV data is better than no data! (And better than XLS data!)

About the *big* data aspect of Gael's question, this reminds me of a software project saying [1] that I would distort in the following way:

  Q: How does a CSV data file get to be a million lines long?
  A: One line at a time!

And my experience with some time series measurements was really about this: small changes in the data rate, a slightly longer acquisition period, and that's it!

Pierre

(I shamefully confess I spent several hours writing *ad-hoc* Python scripts full of regexps and generators just to fix various tiny details of those CSV files... but in the end it worked!)

[1] I just quickly googled "one day at a time" for a reference and ended up on http://en.wikipedia.org/wiki/The_Mythical_Man-Month
For convenience, here's a link to the mailing list thread on this topic from a couple months ago: http://thread.gmane.org/gmane.comp.python.numeric.general/47094 . Drew
But why, oh why, are people storing big data in CSV?
Well, that's what scientists do :-)
Éric.
An AZERTY keyboard is worth two ---------------------------------------------------------- Éric Depagne eric@depagne.org
Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
That's pretty good. That's faster than pandas's csv-module+Cython approach almost certainly (but I haven't run your code to get a read on how much my hardware makes a difference), but that's not shocking at all:
In [1]: df = DataFrame(np.random.randn(350000, 32))
In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s
I have to think the difference comes from skipping the process of creating 11.2 million Python string objects and then individually converting each of them to float.
Note for reference (I'm skipping the first row, which has the column labels from above):
In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s
In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s
In this last case for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_.
It might be good to compare on recarrays, which are a bit more complex. Can you try one of these .dat files? http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/ The dtype is [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'), ('err', 'f8'), ('scinv', 'f8', 27)] -- Erin Scott Sheldon Brookhaven National Laboratory
On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
It might be good to compare on recarrays, which are a bit more complex. Can you try one of these .dat files?
http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/
The dtype is
[('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'), ('err', 'f8'), ('scinv', 'f8', 27)]
-- Erin Scott Sheldon Brookhaven National Laboratory
Forgot this one, which is also widely used:

In [28]: %time recs = matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv', skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

OK, with one of those .dat files and the dtype, I get:

In [18]: %time arr = np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat', dtype=dtype, skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. I don't produce structured arrays, though:

In [26]: %time arr = read_table('/home/wesm/Downloads/scat-05-000.dat', header=None, sep=' ')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes
On Thu, Feb 23, 2012 at 2:19 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
I have a "proof of concept" CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend.
Warren
The text reader that I've been working on is now on github: https://github.com/WarrenWeckesser/textreader

Currently it makes two passes through the file. The first pass just counts the number of rows. It then allocates the array and reads the file again to parse the data and fill in the array. Eventually the first pass will be optional, and you'll be able to specify how many rows to read (and then continue reading another block if you haven't read the entire file).

You currently have to give the dtype as a structured array. That would be nice to fix. Actually, there are quite a few "must have" features that it doesn't have yet.

One issue that this code handles is newlines embedded in quoted fields. Excel can generate and read files like this:

1.0,2.0,"foo
bar"

That is one "row" with three fields. The third field contains "foo\nbar".

I haven't pushed it to the extreme, but the "big" example (in the examples/ directory) is a 1 gig text file with 2 million rows and 50 fields in each row. This is read in less than 30 seconds (but that's with a solid state drive).

Quoting the README file: "This is experimental, unreleased software. Use at your own risk." There are some hard-coded buffer sizes (that eventually should be dynamic), and the error checking is not complete, so mistakes or unanticipated cases can result in seg. faults.

Warren
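The two-pass strategy and the embedded-newline case described above can be sketched with the Python standard library. This is not the textreader code: the function `read_two_pass` and its `open_source` argument are hypothetical names for this example, `csv.reader` stands in for the C tokenizer, and `array('d')` stands in for a preallocated NumPy array.

```python
import csv
import io
from array import array

def read_two_pass(open_source, n_cols):
    """Read the first n_cols numeric columns in two passes.

    open_source: zero-argument callable returning a fresh text stream
    (a hypothetical interface, so that the data can be read twice).
    """
    # Pass 1: count records. csv.reader keeps a quoted field that
    # contains a newline inside one record, so this counts rows,
    # not physical lines.
    with open_source() as stream:
        n_rows = sum(1 for _ in csv.reader(stream))

    # Allocate the full result once, now that the size is known.
    data = array("d", [0.0]) * (n_rows * n_cols)

    # Pass 2: parse each field straight into its slot (row-major).
    with open_source() as stream:
        for i, record in enumerate(csv.reader(stream)):
            for j in range(n_cols):
                data[i * n_cols + j] = float(record[j])
    return n_rows, data

text = '1.0,2.0,"foo\nbar"\n3.0,4.0,"baz"\n'
n_rows, data = read_two_pass(lambda: io.StringIO(text, newline=""), n_cols=2)
print(n_rows)       # 2
print(list(data))   # [1.0, 2.0, 3.0, 4.0]
```

The sample text has three physical lines but only two rows, which is why the first pass has to tokenize quotes rather than simply count newline characters.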
On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
I haven't pushed it to the extreme, but the "big" example (in the examples/ directory) is a 1 gig text file with 2 million rows and 50 fields in each row. This is read in less than 30 seconds (but that's with a solid state drive).
Obviously this was just a quick test, but FYI, a solid state drive shouldn't really make any difference here -- this is a pure sequential read, and for those, SSDs are if anything actually slower than traditional spinning-platter drives.

For this kind of benchmarking, you'd really rather be measuring the CPU time, or reading byte streams that are already in memory. If you can process more MB/s than the drive can provide, then your code is effectively perfectly fast. Looking at this number has a few advantages:

- You get more repeatable measurements (no disk buffers and stuff messing with you)
- If your code can go faster than your drive, then the drive won't make your benchmark look bad
- There are probably users out there that have faster drives than you (e.g., I just measured ~340 megabytes/s off our lab's main RAID array), so it's nice to be able to measure optimizations even after they stop mattering on your equipment.

Cheers,
-- Nathaniel
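One way to follow the advice above is to time a parser against data that is already in memory and report throughput in MB/s. The sketch below is an assumption-laden illustration: `parse_floats` is a toy stand-in for the reader under test, and `throughput_mb_per_s` is a hypothetical helper, not part of any library mentioned in this thread.

```python
import io
import time

def parse_floats(stream):
    """Toy parser under test: sum every comma-separated float."""
    total = 0.0
    for line in stream:
        for field in line.split(","):
            total += float(field)
    return total

def throughput_mb_per_s(parser, payload, repeats=3):
    """Best-of-N CPU-time throughput for parsing an in-memory payload."""
    best = float("inf")
    for _ in range(repeats):
        stream = io.StringIO(payload)      # no disk involved
        start = time.process_time()        # CPU time, not wall-clock time
        parser(stream)
        best = min(best, time.process_time() - start)
    n_bytes = len(payload.encode("ascii"))
    # Guard against a timer too coarse to register the run.
    return n_bytes / 1e6 / best if best > 0 else float("inf")

payload = "\n".join(",".join(["1.5"] * 32) for _ in range(20_000)) + "\n"
print(f"{throughput_mb_per_s(parse_floats, payload):.1f} MB/s")
```

The resulting number can be compared directly with a drive's sequential read rate (such as the ~340 MB/s quoted above) to see whether the parser or the disk is the bottleneck.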
On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
I haven't pushed it to the extreme, but the "big" example (in the examples/ directory) is a 1 gig text file with 2 million rows and 50 fields in each row. This is read in less than 30 seconds (but that's with a solid state drive).
Obviously this was just a quick test, but FYI, a solid state drive shouldn't really make any difference here -- this is a pure sequential read, and for those, SSDs are if anything actually slower than traditional spinning-platter drives.
Good point.
For this kind of benchmarking, you'd really rather be measuring the CPU time, or reading byte streams that are already in memory. If you can process more MB/s than the drive can provide, then your code is effectively perfectly fast. Looking at this number has a few advantages: - You get more repeatable measurements (no disk buffers and stuff messing with you) - If your code can go faster than your drive, then the drive won't make your benchmark look bad - There are probably users out there that have faster drives than you (e.g., I just measured ~340 megabytes/s off our lab's main RAID array), so it's nice to be able to measure optimizations even after they stop mattering on your equipment.
For anyone benchmarking software like this, be sure to clear the disk cache before each run. In Linux:

$ sync
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

In Mac OS X:

$ purge

I'm not sure what the equivalent is in Windows.

Warren
On Feb 26, 2012, at 1:16 PM, Warren Weckesser wrote:
For anyone benchmarking software like this, be sure to clear the disk cache before each run. In linux:
$ sync
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
It is also a good idea to run a test with the disk cache enabled, just to see better how things can be improved in your code. The disk subsystem is pretty slow, and during development you can get much better feedback by looking at load times from memory, not from disk (also, the tests run much faster, so you can save a lot of development time).
In Mac OSX:
$ purge
Now that I switched to a Mac, this is good to know. Thanks! -- Francesc Alted
On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
For anyone benchmarking software like this, be sure to clear the disk cache before each run. In linux:
Err, my argument was that you should do exactly the opposite, and just worry about hot-cache times (or time reading a big in-memory buffer, to avoid having to think about the OS's caching strategies).

Clearing the disk cache is very important for getting meaningful, repeatable benchmarks in code where you know that the cache will usually be cold and where hitting the disk will have unpredictable effects (i.e., pretty much anything doing random access, like databases, which have complicated locality patterns, you may or may not trigger readahead, etc.). But here we're talking about pure sequential reads, where the disk just goes however fast it goes, and your code can either keep up or not.

One minor point where the OS interface could matter: it's good to set up your code so it can use mmap() instead of read(), since this can reduce overhead. read() has to copy the data from the disk into OS memory, and then from OS memory into your process's memory; mmap() skips the second step.

-- Nathaniel
On Sun, Feb 26, 2012 at 1:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
For anyone benchmarking software like this, be sure to clear the disk cache before each run. In linux:
Err, my argument was that you should do exactly the opposite, and just worry about hot-cache times (or time reading a big in-memory buffer, to avoid having to think about the OS's caching strategies).
Right, I got that. Sorry if the placement of the notes about how to clear the cache seemed to imply otherwise.
Clearing the disk cache is very important for getting meaningful, repeatable benchmarks in code where you know that the cache will usually be cold and where hitting the disk will have unpredictable effects (i.e., pretty much anything doing random access, like databases, which have complicated locality patterns, you may or may not trigger readahead, etc.). But here we're talking about pure sequential reads, where the disk just goes however fast it goes, and your code can either keep up or not.
One minor point where the OS interface could matter: it's good to set up your code so it can use mmap() instead of read(), since this can reduce overhead. read() has to copy the data from the disk into OS memory, and then from OS memory into your process's memory; mmap() skips the second step.
Thanks for the tip. Do you happen to have any sample code that demonstrates this? I'd like to explore this more. Warren
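A rough sketch of the hot-cache measurement Nathaniel describes above: time only the CPU spent parsing a byte stream that is already in memory, and report MB/s. The names parse_csv_floats and benchmark are made up for illustration, and the toy parser only stands in for whatever reader is actually under test.

```python
import io
import time


def parse_csv_floats(stream):
    """Toy parser standing in for the reader under test (hypothetical)."""
    rows = []
    for line in stream:
        # Decode each line and convert every comma-separated token to float.
        rows.append([float(tok) for tok in line.decode("ascii").split(",")])
    return rows


def benchmark(parse, payload, repeats=5):
    """Return the best observed throughput in MB/s of CPU time.

    The payload is served from memory via io.BytesIO, so neither the disk
    nor the OS cache takes part in the measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        stream = io.BytesIO(payload)
        start = time.process_time()  # CPU time, not wall-clock time
        parse(stream)
        best = min(best, time.process_time() - start)
    megabytes = len(payload) / 1e6
    # Guard against a timer too coarse to resolve a very short run.
    return megabytes / best if best > 0 else float("inf")


if __name__ == "__main__":
    payload = b"1.5,2.5,3.5\n" * 200000
    print("%.1f MB/s of CPU time" % benchmark(parse_csv_floats, payload))
```

If the number printed is well above what the drive can deliver, the parser is not the bottleneck for sequential reads.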
On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
Right, I got that. Sorry if the placement of the notes about how to clear the cache seemed to imply otherwise.
OK, cool, np.
Clearing the disk cache is very important for getting meaningful, repeatable benchmarks in code where you know that the cache will usually be cold and where hitting the disk will have unpredictable effects (i.e., pretty much anything doing random access, like databases, which have complicated locality patterns, you may or may not trigger readahead, etc.). But here we're talking about pure sequential reads, where the disk just goes however fast it goes, and your code can either keep up or not.
One minor point where the OS interface could matter: it's good to set up your code so it can use mmap() instead of read(), since this can reduce overhead. read() has to copy the data from the disk into OS memory, and then from OS memory into your process's memory; mmap() skips the second step.
Thanks for the tip. Do you happen to have any sample code that demonstrates this? I'd like to explore this more.
No, I've never actually run into a situation where I needed it myself, but I learned the trick from Tridge so I tend to believe it :-). mmap() is actually a pretty simple interface -- the only thing I'd watch out for is that you want to mmap() the file in pieces (so as to avoid VM exhaustion on 32-bit systems), but you want to use pretty big pieces (because each call to mmap()/munmap() has overhead). So you might want to use chunks in the 32-128 MiB range. Or since I guess you're probably developing on a 64-bit system you can just be lazy and mmap the whole file for initial testing. git uses mmap, but I'm not sure it's very useful example code. Also it's not going to do magic. Your code has to be fairly quick before avoiding a single memcpy() will be noticeable. HTH, -- Nathaniel
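For anyone who wants something concrete to start from, here is a minimal sketch of the chunked approach described above, using Python's standard mmap module rather than C. The helper names iter_mmap_chunks and count_newlines are invented for this example, and the 64 MiB default is just a value inside the 32-128 MiB range suggested above.

```python
import mmap
import os


def iter_mmap_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield successive read-only mmap windows over the file at `path`.

    Each window is closed as soon as the caller asks for the next one, so
    only one piece of the file is mapped at a time.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    # Offsets passed to mmap must be multiples of the allocation
    # granularity, so round the chunk size down to such a multiple.
    chunk_size = max(gran, chunk_size - chunk_size % gran)
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < file_size:
            length = min(chunk_size, file_size - offset)
            window = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                yield window
            finally:
                window.close()
            offset += length


def count_newlines(path, chunk_size=64 * 1024 * 1024):
    """Count newline bytes without copying the file into Python objects."""
    total = 0
    for window in iter_mmap_chunks(path, chunk_size):
        pos = window.find(b"\n")
        while pos != -1:
            total += 1
            pos = window.find(b"\n", pos + 1)
    return total
```

Counting newlines is safe across chunk boundaries because it looks at single bytes. A real parser would also have to handle a line that straddles two windows, which this sketch does not attempt.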
On Sun, Feb 26, 2012 at 3:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
Right, I got that. Sorry if the placement of the notes about how to clear the cache seemed to imply otherwise.
OK, cool, np.
Clearing the disk cache is very important for getting meaningful, repeatable benchmarks in code where you know that the cache will usually be cold and where hitting the disk will have unpredictable effects (i.e., pretty much anything doing random access, like databases, which have complicated locality patterns, you may or may not trigger readahead, etc.). But here we're talking about pure sequential reads, where the disk just goes however fast it goes, and your code can either keep up or not.
One minor point where the OS interface could matter: it's good to set up your code so it can use mmap() instead of read(), since this can reduce overhead. read() has to copy the data from the disk into OS memory, and then from OS memory into your process's memory; mmap() skips the second step.
Thanks for the tip. Do you happen to have any sample code that demonstrates this? I'd like to explore this more.
No, I've never actually run into a situation where I needed it myself, but I learned the trick from Tridge so I tend to believe it :-). mmap() is actually a pretty simple interface -- the only thing I'd watch out for is that you want to mmap() the file in pieces (so as to avoid VM exhaustion on 32-bit systems), but you want to use pretty big pieces (because each call to mmap()/munmap() has overhead). So you might want to use chunks in the 32-128 MiB range. Or since I guess you're probably developing on a 64-bit system you can just be lazy and mmap the whole file for initial testing. git uses mmap, but I'm not sure it's very useful example code.
Also it's not going to do magic. Your code has to be fairly quick before avoiding a single memcpy() will be noticeable.
HTH,
Yes, thanks! I'm working on a mmap version now. I'm very curious to see just how much of an improvement it can give. Warren
Excerpts from Warren Weckesser's message of Sun Feb 26 16:22:35 -0500 2012:
Yes, thanks! I'm working on a mmap version now. I'm very curious to see just how much of an improvement it can give.
FYI, memmap is generally an incomplete solution for numpy arrays; it only understands rows, not columns and rows. If you memmap a rec array on disk and try to load one full column, it still loads the whole file beforehand. This was why I essentially wrote my own memmap like interface with recfile, the code I'm converting. It allows working with columns and rows without loading large chunks of memory. BTW, I think we will definitely benefit from merging some of our codes. When I get my stuff fully converted we should discuss. -e -- Erin Scott Sheldon Brookhaven National Laboratory
Excerpts from Erin Sheldon's message of Sun Feb 26 17:35:00 -0500 2012:
Excerpts from Warren Weckesser's message of Sun Feb 26 16:22:35 -0500 2012:
Yes, thanks! I'm working on a mmap version now. I'm very curious to see just how much of an improvement it can give.
FYI, memmap is generally an incomplete solution for numpy arrays; it only understands rows, not columns and rows. If you memmap a rec array on disk and try to load one full column, it still loads the whole file beforehand.
I read your message out of context. I was referring to interfaces to binary files, but I forgot you're only working on the text interface. Sorry for the noise, -e -- Erin Scott Sheldon Brookhaven National Laboratory
Erin Sheldon writes: [...]
This was why I essentially wrote my own memmap like interface with recfile, the code I'm converting. It allows working with columns and rows without loading large chunks of memory. [...]
This sounds like at any point in time you only have one part of the array mapped into the application. My question is then, why would you manually implement the buffering? The OS should already take care of that by unmapping pages when it's short on physical memory, and faulting pages in when you access them. This reminds me of some previous discussion about making the ndarray API more friendly to code that wants to manage the underlying storage, from mmap'ing it to handling compressed storage. Are there any news on that front? Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Hi, mmap can give a speedup in some cases but a slowdown in others, so care must be taken when using it. For example, the speed difference between read and mmap is not the same when the file is local as when it is on NFS. On NFS, you need to read bigger chunks to make it worthwhile.
Another example is an SMP computer. If, for example, you have an 8-core computer but only enough RAM for 1 or 2 copies of your dataset, using mmap is a bad idea. If you read the file by chunks normally, the OS will keep the file in its cache in RAM. So if you launch 8 jobs, they will all use the system cache to share the data. If you use mmap, I think this bypasses the OS cache. So you will always read the file. On NFS with a cluster of computers, this can put a high load on the file server. So having a way to specify whether or not to use mmap would be great, as you can't always guess the right thing to do. (Unless I'm wrong and this doesn't bypass the OS cache.)
Anyway, it is great to see people working on this problem; these were just a few comments I had in mind when I read this thread.
Frédéric
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Frédéric Bastien writes:
Hi, mmap can give a speedup in some cases but a slowdown in others, so care must be taken when using it. For example, the speed difference between read and mmap is not the same when the file is local as when it is on NFS. On NFS, you need to read bigger chunks to make it worthwhile.
Another example is an SMP computer. If, for example, you have an 8-core computer but only enough RAM for 1 or 2 copies of your dataset, using mmap is a bad idea. If you read the file by chunks normally, the OS will keep the file in its cache in RAM. So if you launch 8 jobs, they will all use the system cache to share the data. If you use mmap, I think this bypasses the OS cache. So you will always read the file.
Not according to mmap(2): MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called. My understanding is that all processes will use exactly the same physical memory, and swapping that memory will use the file itself.
On NFS with a cluster of computers, this can put a high load on the file server. So having a way to specify whether or not to use mmap would be great, as you can't always guess the right thing to do. (Unless I'm wrong and this doesn't bypass the OS cache.)
Anyway, it is great to see people working on this problem; these were just a few comments I had in mind when I read this thread.
Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
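Whichever way the caching question turns out, the switch Frédéric asks for is cheap to provide. The following is only a sketch under assumptions: open_buffer and its use_mmap flag are hypothetical names, not part of any existing reader. Both branches hand back a bytes-like object that supports slicing and find(), so the parsing code does not need to know which one it received.

```python
import mmap
from contextlib import contextmanager


@contextmanager
def open_buffer(path, use_mmap=True):
    """Yield the file contents as a bytes-like object.

    use_mmap=True  -> read-only memory map of the whole file
    use_mmap=False -> ordinary read() into a bytes object

    Note: mapping an empty file raises ValueError, so callers that may see
    empty files should pass use_mmap=False for those.
    """
    with open(path, "rb") as f:
        if use_mmap:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                yield mapped
            finally:
                mapped.close()
        else:
            yield f.read()
```

A caller on NFS or on a memory-constrained multi-core box could then pick the mode per run and measure the difference instead of guessing.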
On Feb 26, 2012, at 1:49 PM, Nathaniel Smith wrote:
On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
For this kind of benchmarking, you'd really rather be measuring the CPU time, or reading byte streams that are already in memory. If you can process more MB/s than the drive can provide, then your code is effectively perfectly fast. Looking at this number has a few advantages: - You get more repeatable measurements (no disk buffers and stuff messing with you) - If your code can go faster than your drive, then the drive won't make your benchmark look bad - There are probably users out there that have faster drives than you (e.g., I just measured ~340 megabytes/s off our lab's main RAID array), so it's nice to be able to measure optimizations even after they stop mattering on your equipment.
For anyone benchmarking software like this, be sure to clear the disk cache before each run. In linux:
Err, my argument was that you should do exactly the opposite, and just worry about hot-cache times (or time reading a big in-memory buffer, to avoid having to think about the OS's caching strategies).
Clearing the disk cache is very important for getting meaningful, repeatable benchmarks in code where you know that the cache will usually be cold and where hitting the disk will have unpredictable effects (i.e., pretty much anything doing random access, like databases, which have complicated locality patterns, you may or may not trigger readahead, etc.). But here we're talking about pure sequential reads, where the disk just goes however fast it goes, and your code can either keep up or not.
Exactly.
One minor point where the OS interface could matter: it's good to set up your code so it can use mmap() instead of read(), since this can reduce overhead. read() has to copy the data from the disk into OS memory, and then from OS memory into your process's memory; mmap() skips the second step.
Cool. Nice trick! -- Francesc Alted
As others on this list, I've also been a bit confused by the prolific numpy interfaces for reading text. Would it be an idea to create some sort of object-oriented solution for this purpose?
    reader = np.FileReader('my_file.txt')
    reader.loadtxt() # for backwards compat.; np.loadtxt could instantiate a reader and call this function if one wants to keep the interface
    reader.very_general_and_typically_slow_reading(missing_data=True)
    reader.my_files_look_like_this_plz_be_fast(fmt='%20.8e', separator=',', ncol=2)
    reader.csv_read() # same as above, but with sensible defaults
    reader.lazy_read() # returns a generator/iterator, so you can slice out a small part of a huge array, for instance, even when working with text (yes, inefficient)
    reader.convert_line_by_line(myfunc) # line-by-line call myfunc, letting the user somehow convert easily to his/her format of choice: netcdf, hdf5, ... Not fast, but convenient
Another option is to create a hierarchy of readers implemented as classes. Not sure if the benefits outweigh the disadvantages.
Just a crazy idea - it would at least gather all the file reading interfaces into one place (or one object hierarchy) so folks know where to look. The whole numpy namespace is a bit cluttered, imho, and for newbies it would be beneficial to use submodules to a greater extent than today - but that's a more long-term discussion.
Paul
On 23 Feb 2012, at 21:08, Travis Oliphant wrote:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
-Travis
On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
Hi,
23.02.2012 20:32, Wes McKinney wrote: [clip]
To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.
If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading.
-- Pauli Virtanen
Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
I'm willing to adapt my code if it is wanted, but at the same time I don't want to step on this person's toes. Should I proceed? -e -- Erin Scott Sheldon Brookhaven National Laboratory
On Fri, Feb 24, 2012 at 9:07 AM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
I'm willing to adapt my code if it is wanted, but at the same time I don't want to step on this person's toes. Should I proceed?
-e -- Erin Scott Sheldon Brookhaven National Laboratory
That may work-- I haven't taken a look at the code but it is probably a good starting point. We could create a new repo on the pydata GitHub org (http://github.com/pydata) and use that as our point of collaboration. I will hopefully be able to put some serious energy into this this spring. - Wes
Excerpts from Wes McKinney's message of Sat Feb 25 15:49:37 -0500 2012:
That may work-- I haven't taken a look at the code but it is probably a good starting point. We could create a new repo on the pydata GitHub org (http://github.com/pydata) and use that as our point of collaboration. I will hopefully be able to put some serious energy into this this spring.
First I want to make sure that we are not duplicating effort of the person Travis mentioned. Logistically, I think it is probably easier to just fork numpy into my github account and then work it directly into the code base, and ask for a pull request when things are ready. I expect I could have something with all the required features ready in a week or so. It is mainly just porting the code from C++ to C, and writing the interfaces by hand instead of with swig; I've got plenty of experience with that, so it should be straightforward. -e -- Erin Scott Sheldon Brookhaven National Laboratory
Erin Sheldon <erin.sheldon <at> gmail.com> writes:
Excerpts from Wes McKinney's message of Sat Feb 25 15:49:37 -0500 2012:
That may work-- I haven't taken a look at the code but it is probably a good starting point. We could create a new repo on the pydata GitHub org (http://github.com/pydata) and use that as our point of collaboration. I will hopefully be able to put some serious energy into this this spring.
First I want to make sure that we are not duplicating effort of the person Travis mentioned.
Logistically, I think it is probably easier to just fork numpy into my github account and then work it directly into the code base, and ask for a pull request when things are ready.
I expect I could have something with all the required features ready in a week or so. It is mainly just porting the code from C++ to C, and writing the interfaces by hand instead of with swig; I've got plenty of experience with that, so it should be straightforward.
-e
Hi Erin, I'm the one Travis mentioned earlier about working on this. I was planning on diving into it this week, but it sounds like you may have some code already that fits the requirements? If so, I would be available to help you with porting/testing your code with numpy, or I can take what you have and build on it in my numpy fork on github. -Jay Bourque Continuum IO
Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
Hi Erin,
I'm the one Travis mentioned earlier about working on this. I was planning on diving into it this week, but it sounds like you may have some code already that fits the requirements? If so, I would be available to help you with porting/testing your code with numpy, or I can take what you have and build on it in my numpy fork on github.
Hi Jay, all -
What I've got is a solution for writing and reading structured arrays to and from files, both text and binary. It is written in C and python. It allows reading arbitrary subsets of the data efficiently without reading in the whole file. It defines a class Recfile that exposes an array-like interface for reading, e.g. x=rf[columns][rows].
Limitations: Because it was designed with arrays in mind, it doesn't handle variable-width string fields. Also, it doesn't deal with quoted strings, as those are not necessary for writing or reading arrays with fixed-length strings. It doesn't deal with missing data. This is where Wes' tokenizing-oriented code might be useful. So there is a fair amount of functionality to be added for edge cases, but it provides a framework. I think some of this can be written into the C code; the rest will have to be done at the python level.
I've forked numpy on my github account, and should have the code added in a few days. I'll send mail when it is ready. Help will be greatly appreciated getting this to work with loadtxt, adding functionality from Wes' and others' code, and testing.
Also, because it works on binary files too, I think it might be worth it to make numpy.fromfile a python function, and to use a Recfile object when reading subsets of the data. For example numpy.fromfile(f, rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile object to read the column and row subsets. We could rename the C fromfile to something appropriate, and call it when the whole file is being read (recfile uses it internally when reading ranges).
thanks, -e -- Erin Scott Sheldon Brookhaven National Laboratory
On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
What I've got is a solution for writing and reading structured arrays to and from files, both in text files and binary files. It is written in C and python. It allows reading arbitrary subsets of the data efficiently without reading in the whole file. It defines a class Recfile that exposes an array like interface for reading, e.g. x=rf[columns][rows].
What format do you use for binary data? Something tiled? I don't understand how you can read in a single column of a standard text or mmap-style binary file any more efficiently than by reading the whole file. -- Nathaniel
Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
What I've got is a solution for writing and reading structured arrays to and from files, both in text files and binary files. It is written in C and python. It allows reading arbitrary subsets of the data efficiently without reading in the whole file. It defines a class Recfile that exposes an array like interface for reading, e.g. x=rf[columns][rows].
What format do you use for binary data? Something tiled? I don't understand how you can read in a single column of a standard text or mmap-style binary file any more efficiently than by reading the whole file.
For binary, I just seek to the appropriate bytes on disk and read them, no mmap. The user must have input an accurate dtype describing rows in the file of course. This saves a lot of memory and time on big files if you just need small subsets. For ascii, the approach is similar except care must be taken when skipping over unread fields and rows. For writing binary, I just tofile() so the bytes correspond directly between array and file. For ascii, I use the appropriate formats for each type. Does this answer your question? -e -- Erin Scott Sheldon Brookhaven National Laboratory
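To make the seek-and-read idea for binary files concrete, here is a small sketch. It is not the recfile implementation (that is C); read_column is a made-up helper, and it assumes the file holds nothing but packed fixed-size records matching the dtype, with scalar fields only.

```python
import numpy as np


def read_column(path, dtype, column, rows):
    """Read one scalar field for the given row indices from a binary file
    of fixed-size records, touching only the bytes that are needed."""
    dtype = np.dtype(dtype)
    # dtype.fields maps a field name to (field dtype, byte offset[, title]).
    field_dtype, field_offset = dtype.fields[column][:2]
    out = np.empty(len(rows), dtype=field_dtype)
    with open(path, "rb") as f:
        for i, row in enumerate(rows):
            # A record starts at row * itemsize; the field sits at a fixed
            # byte offset inside the record.
            f.seek(row * dtype.itemsize + field_offset)
            raw = f.read(field_dtype.itemsize)
            out[i] = np.frombuffer(raw, dtype=field_dtype)[0]
    return out
```

For example, with dtype=[('id','i4'),('x','f8'),('y','f8')], read_column(fname, dtype, 'x', [1, 4, 7]) reads 24 bytes in total, however large the file is. Doing one seek per row is slow from Python; the point is only to show where the bytes live.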
Hi All -
I've added the relevant code to my numpy fork here:
    https://github.com/esheldon/numpy
The python module and C file are at
    /numpy/lib/recfile.py
    /numpy/lib/src/_recfile.c
Access from python is numpy.recfile
See below for the doc string for the main class, Recfile. Some example usage is shown. As listed in the limitations section below, quoted strings are not yet supported for text files. This can be addressed by optionally using some smarter code when reading strings from these types of files. I'd greatly appreciate some help with that aspect.
There is a test suite in numpy.recfile.test()
    A class for reading and writing structured arrays to and from files.
    Both binary and text files are supported. Any subset of the data can be
    read without loading the whole file. See the limitations section below
    for caveats.
    parameters
    ----------
    fobj: file or string
        A string or file object.
    mode: string
        Mode for opening when fobj is a string
    dtype:
        A numpy dtype or descriptor describing each line of the file. The
        dtype must contain fields. This is a required parameter; it is a
        keyword only for clarity.
        Note for text files the dtype will be converted to native byte
        ordering. Any data written to the file must also be in the native
        byte ordering.
    nrows: int, optional
        Number of rows in the file. If not entered, the rows will be
        counted from the file itself. This is a simple calculation for
        binary files, but can be slow for text files.
    delim: string, optional
        The delimiter for text files. If None or "" the file is assumed to
        be binary. Should be a single character.
    skipheader: int, optional
        Skip this many lines in the header.
    offset: int, optional
        Move to this offset in the file. Reads will all be relative to this
        location. If not sent, it is taken from the current position in the
        input file object or 0 if a filename was entered.
    string_newlines: bool, optional
        If true, strings in text files may contain newlines. This is only
        relevant for text files when the nrows= keyword is not sent,
        because the number of lines must be counted. In this case the full
        text reading code is used to count rows instead of a simple newline
        count. Because the text is fully processed twice, this can double
        the time to read files.
    padnull: bool
        If True, nulls in strings are replaced with spaces when writing
        text
    ignorenull: bool
        If True, nulls in strings are not written when writing text. This
        results in string fields that are not fixed width, so cannot be
        read back in using recfile
    limitations
    -----------
    Currently, only fixed width string fields are supported. String fields
    can contain any characters, including newlines, but for text files
    quoted strings are not currently supported: the quotes will be part of
    the result. For binary files, structured sub-arrays and complex can be
    written and read, but this is not supported yet for text files.
    examples
    ---------
    # read from binary file
    dtype=[('id','i4'),('x','f8'),('y','f8'),('arr','f4',(2,2))]
    rec=numpy.recfile.Recfile(fname,dtype=dtype)
    # read all data using either slice or method notation
    data=rec[:]
    data=rec.read()
    # read row slices
    data=rec[8:55:3]
    # read subset of columns and possibly rows
    # can use either slice or method notation
    data=rec['x'][:]
    data=rec['id','x'][:]
    data=rec[col_list][row_list]
    data=rec.read(columns=col_list, rows=row_list)
    # for text files, just send the delimiter string
    # all the above calls will also work
    rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',')
    # save time for text files by sending row count
    rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',',nrows=10000)
    # write some data
    rec=numpy.recfile.Recfile(fname,mode='w',dtype=dtype,delim=',')
    rec.write(data)
    # append some data
    rec.write(more_data)
    # print metadata about the file
    print(rec)
    Recfile nrows: 345472 ncols: 6 mode: 'w'
      id   <i4
      x    <f8
      y    <f8
      arr  <f4 array[2,2]
-e -- Erin Scott Sheldon Brookhaven National Laboratory
I have a few features that I believe would make text file reading easier for many people. In some countries (most?) the decimal separator in real numbers is not a point but a comma. I think it would be very useful if the decimal separator could be specified with a keyword argument (decimal = '.' for example) on the text reading function. There are workarounds, such as replacing commas with dots beforehand or changing the locale (which is usually a messy solution), but they are always very annoying.
I often use rpy to call R's functions read.table or scan to read text files. I have been meaning to write "improved" functions to read text files but lately I find it much simpler to use rpy.
Another thing that is very useful is the ability to read a predetermined number of lines from the file. As of right now loadtxt and genfromtxt both read the entire file AFAICT.
Paulo
On 2/27/2012 10:10 AM, Paulo Jabardo wrote:
I have a few features that I believe would make text file easier for many people. In some countries (most?) the decimal separator in real numbers is not a point but a comma. I think it would be very useful that the decimal separator be specified with a keyword argument (decimal = '.' for example) on the text reading function.
Down that path lies madness. For a fast reader, just document input format to use "international notation" (i.e., the decimal point) and give the user the responsibility to ensure the data are in the right format. The format translation utilities should be separate, and calling them should be optional. fwiw, Alan Isaac
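A separate, optional translation step of the kind suggested here could look roughly like the sketch below. It is an illustration only, using the standard library's csv module; the function name translate_decimal and its parameters are assumptions and do not correspond to any existing NumPy or pandas API. It rewrites the decimal mark only in the columns declared numeric, leaving string fields untouched so that a blanket search-and-replace cannot corrupt them, and it can stop after a given number of rows.

```python
import csv
import io
from itertools import islice


def translate_decimal(lines, numeric_cols, delimiter=";", decimal=",", max_rows=None):
    """Yield rows with the decimal mark normalised to '.' in numeric columns only."""
    reader = csv.reader(lines, delimiter=delimiter)
    if max_rows is not None:
        reader = islice(reader, max_rows)  # stop early instead of reading it all
    for row in reader:
        yield [
            field.replace(decimal, ".") if i in numeric_cols else field
            for i, field in enumerate(row)
        ]


# Semicolon-delimited data with decimal commas and a quoted string field
# that itself contains a comma.
raw = io.StringIO('"Silva, Ana";1,5;2,25\n"Costa, Rui";3,0;4,75\n"Lima, Eva";5,5;6,0\n')
rows = [
    (name, float(a), float(b))
    for name, a, b in translate_decimal(raw, numeric_cols={1, 2}, max_rows=2)
]
```

Because the translation is a generator wrapped around the line source, a fast reader that only understands the decimal point can consume its output without knowing the step exists.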
I don't know what the best solution is, but this certainly isn't madness. First of all, '.' isn't international notation; it is used in some countries. In most of Europe (and Latin America) the comma is used. Anyone in a country that uses the comma as a decimal separator will stumble upon text files with commas as decimal separators very often. Usually a simple search and replace is sufficient, but if the data has string fields, one might mess up the data. Is this the most important feature? Of course not, but it helps a lot. As a matter of fact, one of the reasons I started to use R years ago was the flexibility of the function read.table: I don't have to worry about tabular data in text files, I know I can read them (most of the time...). Now, I use rpy to call read.table. As for speed, right now read.table is faster than loadtxt. Of course numpy shouldn't simply reproduce every feature found in R (or matlab, scilab, etc.), but reading data from external sources is a very important step in any data analysis (and often a difficult step). So while this feature is not a top priority, it is important for anyone who has to deal with external data written by other programs that use the "correct" locale, and it is certainly not on the path to madness. I have been thinking for a while about writing/porting a read.table equivalent, but unfortunately I haven't had much time in the past few months, and because of that I have kind of stopped my transition from R to python for a while. Paulo ________________________________ From: Alan G Isaac <alan.isaac@gmail.com> To: Discussion of Numerical Python <numpy-discussion@scipy.org> Sent: Monday, February 27, 2012 12:53 Subject: Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers On 2/27/2012 10:10 AM, Paulo Jabardo wrote:
I have a few features that I believe would make text file easier for many people. In some countries (most?) the decimal separator in real numbers is not a point but a comma. I think it would be very useful that the decimal separator be specified with a keyword argument (decimal = '.' for example) on the text reading function.
Down that path lies madness. For a fast reader, just document input format to use "international notation" (i.e., the decimal point) and give the user the responsibility to ensure the data are in the right format. The format translation utilities should be separate, and calling them should be optional. fwiw, Alan Isaac
On 2/27/2012 1:00 PM, Paulo Jabardo wrote:
First of all '.' isn't international notation
That is in fact a standard designation. http://en.wikipedia.org/wiki/Decimal_mark#Influence_of_calculators_and_compu... Alan Isaac
27.02.2012 19:07, Alan G Isaac wrote:
On 2/27/2012 1:00 PM, Paulo Jabardo wrote:
First of all '.' isn't international notation
That is in fact a standard designation. http://en.wikipedia.org/wiki/Decimal_mark#Influence_of_calculators_and_compu...
ISO specifies comma to be used in international standards (ISO/IEC Directives, part 2 / 6.6.8.1): http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download Not that it necessarily is important for this discussion.
On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
ISO specifies comma to be used in international standards (ISO/IEC Directives, part 2 / 6.6.8.1):
http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
I do not think you are right. I think that is a presentational requirement: rules of presentation for documents that are intended to become international standards. Note as well the requirement of spacing to separate digits. Clearly this cannot be a data storage specification.
Naturally, the important thing is to agree on a standard data representation. Which one it is is less important, especially if conversion tools will be supplied.
But it really is past time for the scientific community to insist on one international standard, and the decimal point has privilege of place because of computing language conventions. (Being the standard in the two largest economies in the world is a different kind of argument in favor of this choice.) Alan Isaac
Hi, On Mon, Feb 27, 2012 at 2:43 PM, Alan G Isaac <alan.isaac@gmail.com> wrote:
On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
ISO specifies comma to be used in international standards (ISO/IEC Directives, part 2 / 6.6.8.1):
http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
I do not think you are right. I think that is a presentational requirement: rules of presentation for documents that are intended to become international standards. Note as well the requirement of spacing to separate digits. Clearly this cannot be a data storage specification.
Naturally, the important thing is to agree on a standard data representation. Which one it is is less important, especially if conversion tools will be supplied.
But it really is past time for the scientific community to insist on one international standard, and the decimal point has privilege of place because of computing language conventions. (Being the standard in the two largest economies in the world is a different kind of argument in favor of this choice.)
Maybe we can just agree it is an important option to have rather than an unimportant one, Best, Matthew
On 2/27/2012 2:47 PM, Matthew Brett wrote:
Maybe we can just agree it is an important option to have rather than an unimportant one,
It depends on what you mean by "option". If you mean there should be conversion tools from other formats to a specified supported format, then I agree. If you mean that the core reader should be cluttered with attempts to handle various and ill-specified formats, so that we end up with the kind of mess that leads people to expect their "CSV file" to be correctly parsed when they are using a non-comma delimiter, then I disagree. Cheers, Alan Isaac
Hi, 27.02.2012 20:43, Alan G Isaac wrote:
On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
ISO specifies comma to be used in international standards (ISO/IEC Directives, part 2 / 6.6.8.1):
http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
I do not think you are right. I think that is a presentational requirement: rules of presentation for documents that are intended to become international standards.
Yes, it's a requirement for the standard texts themselves, but not what the standard texts specify. Which is why I didn't think it was so relevant (but the wikipedia link just prompted an immediate [citation needed]). I agree that using something other than '.' does not make much sense. -- Pauli Virtanen
Hi, On Mon, Feb 27, 2012 at 2:58 PM, Pauli Virtanen <pav@iki.fi> wrote:
Hi,
27.02.2012 20:43, Alan G Isaac wrote:
On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
ISO specifies comma to be used in international standards (ISO/IEC Directives, part 2 / 6.6.8.1):
http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
I do not think you are right. I think that is a presentational requirement: rules of presentation for documents that are intended to become international standards.
Yes, it's an requirement for the standard texts themselves, but not what the standard texts specify. Which is why I didn't think it was so relevant (but the wikipedia link just prompted an immediate [citation needed]). I agree that using something else than '.' does not make much sense.
I suppose anyone out there who is from a country that uses commas for decimals in CSV files, and does not want to have to convert them before reading them, will be keen to volunteer to help with the coding. I am certainly glad it is not my own case, Best, Matthew
The architecture of this system should separate the iteration across the I/O from the transformation *on* the data. It should also make it possible to plug in different transformations at a low level --- some thought should go into the API of the low-level transformation. Being able to memory-map text files would also be a bonus (but this would require some kind of index to allow seeking through the file). I have some ideas in this direction, but don't have the time to write them up just yet. -Travis On Feb 27, 2012, at 2:44 PM, Matthew Brett wrote:
I suppose anyone out there who is from a country that uses commas for decimals in CSV files, and does not want to have to convert them before reading them, will be keen to volunteer to help with the coding. I am certainly glad it is not my own case, Best, Matthew
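The separation Travis describes, with iteration over the I/O kept apart from pluggable transformations on the data, might be sketched as two small layers. This is only an illustration in pure Python, not a proposed API; the names iter_fields, transform, and comma_float are hypothetical, and a real implementation would presumably live in C as discussed elsewhere in the thread.

```python
import io


def iter_fields(fobj, delimiter=","):
    """I/O layer: yield raw string fields per line, knowing nothing about types."""
    for line in fobj:
        line = line.rstrip("\n")
        if line:
            yield line.split(delimiter)


def transform(rows, converters):
    """Transformation layer: apply one pluggable converter per column."""
    for fields in rows:
        yield tuple(conv(field) for conv, field in zip(converters, fields))


def comma_float(text):
    """Example plug-in: parse a float written with a decimal comma."""
    return float(text.replace(",", "."))


source = io.StringIO("1\t2,5\tabc\n2\t3,75\tdef\n")
records = list(transform(iter_fields(source, delimiter="\t"), [int, comma_float, str]))
```

With this split, a locale-specific converter such as comma_float is just another plug-in, and the iteration layer stays unaware of it.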
I will just let Jay know that he should coordinate with you. It would be helpful for him to have someone to collaborate with on this. I'm looking forward to seeing your code. Definitely don't hold back on our account. We will adapt to whatever you can offer. Best regards, -Travis On Feb 24, 2012, at 8:07 AM, Erin Sheldon wrote:
Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
This is actually on my short-list as well --- it just didn't make it to the list.
In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes.
I'm willing to adapt my code if it is wanted, but at the same time I don't want to step on this person's toes. Should I proceed?
-e -- Erin Scott Sheldon Brookhaven National Laboratory
Wes -
I designed the recfile package to fill this need. It might be a start.
Some features:
- the ability to efficiently read any subset of the data without loading the whole file.
- reads directly into a recarray, so no overheads.
- object oriented interface, mimicking recarray slicing.
- also supports writing
Currently it is fixed-width fields only. It is C++, but it wouldn't be hard to convert it to C if that is a requirement. Also, it works for binary or ascii.
http://code.google.com/p/recfile/
the trunk is pretty far past the most recent release.
Erin Scott Sheldon
Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
dear all,
I haven't read all 180 e-mails, but I didn't see this on Travis's initial list.
All of the existing flat file reading solutions I have seen are not suitable for many applications, and they compare very unfavorably to tools present in other languages, like R. Here are some of the main issues I see:
- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loathe to return to the operating system. Any solution using the CSV module (like pandas's parsers-- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory.
- Performance: post-processing of Python objects results in poor performance. Also, for the actual parsing, anything regular expression based (like the loadtable effort over the summer, all apologies to those who worked on it), is doomed to failure. I think having a tool with a high degree of compatibility and intelligence for parsing unruly small files does make sense though, but it's not appropriate for large, well-behaved files.
- Need to "factorize": as soon as there is an enum dtype in NumPy, we will want to enable the file parsers for structured arrays and DataFrame to be able to "factorize" / convert to enum certain columns (for example, all string columns) during the parsing process, and not afterward. This is very important for enabling fast groupby on large datasets and reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement using a C hash table implementation like khash.h
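The "factorize" step in the last bullet can be illustrated with a short sketch. It uses a Python dict where the message suggests a C hash table such as khash.h, so it shows the logic only and not the performance; the function name factorize_column and its return shape are assumptions made for this example.

```python
import io
from array import array


def factorize_column(lines, col, delimiter=","):
    """Map each distinct string in one column to a small integer code while parsing."""
    table = {}          # value -> code; stands in for a C hash table
    levels = []         # code -> value
    codes = array("i")  # packed C ints, one per row
    for line in lines:
        value = line.rstrip("\n").split(delimiter)[col]
        code = table.get(value)
        if code is None:
            code = table[value] = len(levels)  # first time this value is seen
            levels.append(value)
        codes.append(code)
    return codes, levels


data = io.StringIO("1,NY\n2,CA\n3,NY\n4,TX\n5,CA\n")
codes, levels = factorize_column(data, col=1)
```

For a column with a million rows and ten unique values, only ten strings are kept and the rest of the column is stored as integer codes, which is what makes a later groupby cheap.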
To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.
It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?) or maybe Cython plus C utilities.
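One reason for pushing the work to a low level is the per-object overhead mentioned in the memory-usage bullet. The rough sketch below compares a list of boxed Python floats with a packed buffer from the standard library's array module; it is not a benchmark of np.genfromtxt or pandas.read_csv, and the exact byte counts depend on the interpreter and platform.

```python
import sys
from array import array

n = 100_000
values = [i * 0.5 for i in range(n)]

# Boxed: the list holds pointers, and each float is a separate Python object.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Packed: one contiguous buffer of C doubles, 8 bytes per value.
packed = array("d", values)
packed_bytes = sys.getsizeof(packed)

ratio = boxed_bytes / packed_bytes
```

On a typical 64-bit CPython build the boxed representation is several times larger, and that is before counting any intermediate tuples or strings created during parsing.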
If anyone wants to get involved in this particular problem right now, let me know!
best, Wes -- Erin Scott Sheldon Brookhaven National Laboratory
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Wes -
I designed the recfile package to fill this need. It might be a start.
Some features:
- the ability to efficiently read any subset of the data without loading the whole file.
- reads directly into a recarray, so no overheads.
- object oriented interface, mimicking recarray slicing.
- also supports writing
Currently it is fixed-width fields only. It is C++, but it wouldn't be hard to convert it to C if that is a requirement. Also, it works for binary or ascii.
http://code.google.com/p/recfile/
the trunk is pretty far past the most recent release.
Erin Scott Sheldon
Can you relicense as BSD-compatible?
Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012:
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
I designed the recfile package to fill this need. It might be a start.
Can you relicense as BSD-compatible?
If required, that would be fine with me. -e
-- Erin Scott Sheldon Brookhaven National Laboratory
On 23/02/2012 20:32, Wes McKinney wrote:
If anyone wants to get involved in this particular problem right now, let me know!
Hi Wes,
I'm completely outside the implementation issues you described, but I have some million-line CSV files, so I do experience "some slowdown" when loading those. I'll be very glad to use any upgraded loadfromtxt/genfromtxt/anyfunction once it's out! Best, Pierre (and this reminds me, shamefully, that I still haven't taken the time to give your pandas a serious try...)
participants (21)
- Alan G Isaac
- Benjamin Root
- Drew Frank
- Erin Sheldon
- Francesc Alted
- Frédéric Bastien
- Gael Varoquaux
- Jay Bourque
- Lluís
- Matthew Brett
- Nathaniel Smith
- Paul Anton Letnes
- Pauli Virtanen
- Paulo Jabardo
- Pierre Haessig
- Robert Kern
- Travis Oliphant
- Travis Oliphant
- Warren Weckesser
- Wes McKinney
- Éric Depagne