Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
[Re-adding the list to the To: field, after it got dropped accidentally] On Tue, Feb 28, 2012 at 12:28 AM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Mon Feb 27 17:33:52 -0500 2012:
On Mon, Feb 27, 2012 at 6:02 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
What I've got is a solution for writing and reading structured arrays to and from files, both text files and binary files. It is written in C and Python. It allows reading arbitrary subsets of the data efficiently without reading in the whole file. It defines a class Recfile that exposes an array-like interface for reading, e.g. x=rf[columns][rows].
What format do you use for binary data? Something tiled? I don't understand how you can read in a single column of a standard text or mmap-style binary file any more efficiently than by reading the whole file.
For binary, I just seek to the appropriate bytes on disk and read them, no mmap. Of course, the user must have supplied an accurate dtype describing the rows in the file. This saves a lot of memory and time on big files if you just need small subsets.
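For illustration, here is a minimal pure-Python sketch of the seek-per-field idea -- this is not Erin's actual C implementation, and the file name, dtype, and usage below are made up:

    import numpy as np

    def read_column(fname, dtype, field, rows):
        """Read one field from selected records by seeking; no mmap involved."""
        dtype = np.dtype(dtype)
        field_dtype, offset = dtype.fields[field][:2]  # field dtype and its byte offset within a record
        out = np.empty(len(rows), dtype=field_dtype)
        with open(fname, 'rb') as f:
            for i, row in enumerate(rows):
                f.seek(row * dtype.itemsize + offset)
                out[i] = np.frombuffer(f.read(field_dtype.itemsize), dtype=field_dtype)[0]
        return out

    # Hypothetical usage: records of (float64 x, float64 y, int32 flag).
    # dt = np.dtype([('x', 'f8'), ('y', 'f8'), ('flag', 'i4')])
    # x = read_column('data.bin', dt, 'x', range(0, 1000000, 100))

Each request reads only a handful of bytes, but as discussed below, the OS still pulls in whole pages, so the I/O saving only shows up once records are larger than a page; the memory saving is there regardless.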
Have you quantified the time savings? I'd expect this to either be the same speed or slower than reading the entire file.
Nathaniel -
Yes I've verified it, but as you point out below there are pathological cases. See below.
The reason is that usually the OS cannot read just a few bytes from a middle of a file -- if it is reading at all, it will read at least a page (4K on linux). If your rows are less than 4K in size, then reading a little bit out of each row means that you will be loading the entire file from disk regardless. You avoid any parsing overhead for the skipped columns, but for binary files that should be zero. (Even if you're doing endian conversion or something it should still be trivial compared to disk speed.)
I'll say up front, the speed gains for binary data are often huge over even numpy.memmap because memmap is not column aware. My code doesn't have that limitation.
Hi Erin,

I don't doubt your observations, but... there must be something more going on! In a modern VM implementation, what happens when you request to read from an arbitrary offset in the file is:

1) The OS works out which disk page (or set of pages, for a longer read) contains the given offset
2) It reads those pages from the disk, and loads them into some OS-owned buffers (the "page cache")
3) It copies the relevant bytes out of the page cache into the buffer passed to read()

And when you mmap() and then attempt to access some memory at an arbitrary offset within the mmap region, what happens is:

1) The processor notices that it doesn't actually know how the given memory address maps to real memory (a TLB miss), so it asks the OS
2) The OS notices that this is a memory-mapped region, and works out which disk page maps to the given memory address
3) It reads that page from disk, and loads it into some OS-owned buffers (the "page cache")
4) It tells the processor

That is, reading at a bunch of fixed offsets inside a large memory-mapped array (which is what numpy does when you request a single column of a recarray) should end up issuing *exactly the same read commands* as code that explicitly seeks to those addresses and reads them.

But, I realized I've never actually tested this myself, so I wrote a little test (attached). It reads a bunch of uint32's at equally-spaced offsets from a large file, using either mmap, explicit seeks, or the naive read-everything approach. I'm finding it very hard to get precise results, because I don't have a spare drive and anything that touches the disk really disrupts the timing here (and apparently Ubuntu no longer has a real single-user mode :-(), but here are some examples on a 200,000,000 byte file with different simulated row sizes:

1024 byte rows:
Mode: MMAP. Checksum: bdd205e9. Time: 3.44 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.34 s
Mode: READALL. Checksum: bdd205e9. Time: 3.53 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.39 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.30 s
Mode: READALL. Checksum: bdd205e9. Time: 3.17 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.16 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.41 s
Mode: READALL. Checksum: bdd205e9. Time: 3.43 s

65536 byte rows (64 KiB):
Mode: MMAP. Checksum: da4f9d8d. Time: 3.25 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.27 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.34 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.36 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.44 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.18 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.19 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s

1048576 byte rows (1 MiB):
Mode: MMAP. Checksum: 22963df9. Time: 1.57 s
Mode: SEEK. Checksum: 22963df9. Time: 1.44 s
Mode: READALL. Checksum: 22963df9. Time: 3.13 s
Mode: MMAP. Checksum: 22963df9. Time: 1.59 s
Mode: SEEK. Checksum: 22963df9. Time: 1.43 s
Mode: READALL. Checksum: 22963df9. Time: 3.16 s
Mode: MMAP. Checksum: 22963df9. Time: 1.55 s
Mode: SEEK. Checksum: 22963df9. Time: 1.66 s
Mode: READALL. Checksum: 22963df9. Time: 3.15 s

And for comparison:

In [32]: a = np.memmap("src/bigfile", np.uint32, "r")
In [33]: time hex(np.sum(a[::1048576//4][:-1], dtype=a.dtype))
CPU times: user 0.00 s, sys: 0.01 s, total: 0.01 s
Wall time: 1.54 s
Out[34]: '0x22963df9L'

(Ubuntu Linux 2.6.38-13, traditional spinning-disk media)

So, in this test:

For small rows: seeking is irrelevant, reading everything is just as fast. (And the cutoff for "small" is not very small...
I tried 512KiB too and it looked like 32KiB).

For large rows: seeking is faster than reading everything. But mmap, explicit seeks, and np.memmap all act the same.

I guess it's possible the difference you're seeing could just mean that, like, Windows has a terrible VM subsystem, but that would be weird.
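Since the attached test script isn't preserved here, the following is a rough sketch of what such a benchmark might look like -- a reconstruction under assumptions (placeholder file name and row size; Linux/Unix only), not Nathaniel's exact code:

    import mmap
    import time
    import numpy as np

    FNAME = "bigfile"        # placeholder: a pre-existing ~200,000,000 byte file
    ROWSIZE = 1024           # simulated record size in bytes
    ITEM = np.dtype(np.uint32).itemsize

    def run(mode):
        offsets = range(0, 200 * 1000 * 1000 - ROWSIZE, ROWSIZE)
        start = time.time()
        checksum = 0
        with open(FNAME, 'rb') as f:
            if mode == "MMAP":
                buf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
                for off in offsets:
                    checksum += int(np.frombuffer(buf, np.uint32, count=1, offset=off)[0])
            elif mode == "SEEK":
                for off in offsets:
                    f.seek(off)
                    checksum += int(np.frombuffer(f.read(ITEM), np.uint32)[0])
            else:  # READALL: naive read-everything baseline, then sample the same offsets
                data = np.fromfile(f, np.uint32)
                checksum = int(data[np.array(offsets) // ITEM].sum(dtype=np.uint64))
        print("Mode: %s. Checksum: %x. Time: %.2f s"
              % (mode, checksum & 0xffffffff, time.time() - start))

    for mode in ("MMAP", "SEEK", "READALL"):
        run(mode)

The interesting comparison is how the three modes track each other for small rows and only diverge once the row size passes the readahead/page-cache scale, as in the numbers above.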
In the ascii case the speed gains are smaller, for the reasons you point out: you have to read through the data even to skip rows and fields. There it is really about memory.
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
I wrote this code to deal with a typical use case of mine where small subsets of binary or ascii data are being read from a huge file. It solves that problem.
Cool.
If your rows are greater than 4K in size, then seeking will allow you to avoid loading some parts of the file from disk... but it also might defeat the OS's readahead heuristics, which means that instead of doing a streaming read, you might be doing a disk seek for every row. On an SSD, this is fine, and probably a nice win. On a traditional spinning disk, losing readahead will cause a huge slowdown; you should only win if each row is like... a megabyte apiece. Seeks are much, much slower than continuous reads. So I'm surprised if this actually helps! But that's just theory, so I am curious to see the actual numbers.
Re: memory savings -- it's definitely a win to avoid allocating the whole array if you just want to read one column, but you can accomplish that without any particular cleverness in the low-level file reading code. You just read the first N rows into a fixed-size temp buffer, copy out the relevant column into the output array, repeat.
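For concreteness, a sketch of that chunked approach (the field name and chunk size are arbitrary); only a chunk-sized temp buffer plus the output column ever get allocated:

    import os
    import numpy as np

    def read_field(fname, dtype, field, chunk=10000):
        """Read one column of a binary record file using a fixed-size temp buffer."""
        dtype = np.dtype(dtype)
        nrows = os.path.getsize(fname) // dtype.itemsize
        out = np.empty(nrows, dtype=dtype.fields[field][0])
        with open(fname, 'rb') as f:
            start = 0
            while start < nrows:
                n = min(chunk, nrows - start)
                buf = np.fromfile(f, dtype=dtype, count=n)  # sequential read of the next n records
                out[start:start + n] = buf[field]           # copy out just the wanted column
                start += n
        return out

The file is still read strictly sequentially, so the OS readahead heuristics are not defeated; the win is purely in memory.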
Certainly this could also be done that way, and would be equally good for some cases.
I've already got all the C code to do this stuff, so it's not much work for me to hack it into my numpy fork. If it turns out there are cases that are faster using another method, we should root them out during testing and add the appropriate logic.
Cool. I'm just a little concerned that, since we seem to have like... 5 different implementations of this stuff all being worked on at the same time, we need to get some consensus on which features actually matter, so they can be melded together into the Single Best File Reader Evar. An interface where indexing and file-reading are combined is significantly more complicated than one where the core file-reading inner-loop can ignore indexing. So far I'm not sure why this complexity would be worthwhile, so that's what I'm trying to understand. Cheers, -- Nathaniel
Also, for some crazy ascii files we may want to revert to pure python anyway, but I think these should be special cases that can be flagged at runtime through keyword arguments to the python functions.
BTW, did you mean to go off-list?
cheers,
-e -- Erin Scott Sheldon Brookhaven National Laboratory
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an example that will end up pulling the entire file into memory:

    mm = numpy.memmap(fname, dtype=dtype)
    rows = numpy.arange(mm.size)
    x = mm['x'][rows]

I just tested this on a 3G binary file and I'm sitting at 3G memory usage. I believe this is because numpy.memmap only understands rows. I don't fully understand the reason for that, but I suspect it is related to the fact that the ndarray really only has a concept of itemsize, and the fields are really just a reinterpretation of those bytes. It may be that one could tweak the ndarray code to get around this. But I would appreciate enlightenment on this subject.

This fact was the original motivator for writing my code; the text reading ability came later.
Cool. I'm just a little concerned that, since we seem to have like... 5 different implementations of this stuff all being worked on at the same time, we need to get some consensus on which features actually matter, so they can be melded together into the Single Best File Reader Evar. An interface where indexing and file-reading are combined is significantly more complicated than one where the core file-reading inner-loop can ignore indexing. So far I'm not sure why this complexity would be worthwhile, so that's what I'm trying to understand.
I think I've addressed the reason why the low level C code was written. And I think a unified, high level interface to binary and text files, which the Recfile class provides, is worthwhile. Can you please say more about "...one where the core file-reading inner-loop can ignore indexing"? I didn't catch the meaning. -e
Cheers, -- Nathaniel
Also, for some crazy ascii files we may want to revert to pure python anyway, but I think these should be special cases that can be flagged at runtime through keyword arguments to the python functions.
BTW, did you mean to go off-list?
cheers,
-e -- Erin Scott Sheldon Brookhaven National Laboratory
-- Erin Scott Sheldon Brookhaven National Laboratory
Excerpts from Erin Sheldon's message of Wed Feb 29 10:11:51 -0500 2012:
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an
That should have been: "...read *all* the rows". -e -- Erin Scott Sheldon Brookhaven National Laboratory
On Wed, Feb 29, 2012 at 15:11, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an example that will end up pulling the entire file into memory
mm = numpy.memmap(fname, dtype=dtype)
rows = numpy.arange(mm.size)
x = mm['x'][rows]
I just tested this on a 3G binary file and I'm sitting at 3G memory usage. I believe this is because numpy.memmap only understands rows. I don't fully understand the reason for that, but I suspect it is related to the fact that the ndarray really only has a concept of itemsize, and the fields are really just a reinterpretation of those bytes. It may be that one could tweak the ndarray code to get around this. But I would appreciate enlightenment on this subject.
Each record (I would avoid the word "row" in this context) is contiguous in memory whether that memory is mapped to disk or not. Additionally, the way that virtual memory (i.e. mapped memory) works is that when you request the data at a given virtual address, the OS will go look up the page it resides in (typically 4-8k in size) and pull the whole page into main memory. Since you are requesting most of the records, you are probably pulling all of the file into main memory. Memory mapping works best when you pull out contiguous chunks at a time rather than pulling out stripes.

numpy structured arrays do not rearrange your data to put all of the 'x' data contiguous with each other. You can arrange that yourself, if you like (use a structured scalar with a dtype such that each field is an array of the appropriate length and dtype). Then pulling out all of the 'x' field values will only touch a smaller fraction of the file.

--
Robert Kern
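To make that rearrangement concrete, a small sketch -- the field names, record count, and file name are made up, and the file would have to be written in this column-oriented layout in the first place:

    import numpy as np

    nrec = 1000000  # hypothetical number of records

    # Record-oriented layout: fields interleaved record by record.
    rec_dtype = np.dtype([('x', 'f8'), ('y', 'f8'), ('flag', 'i4')])

    # Column-oriented layout: a single structured "scalar" whose fields are arrays,
    # so all the 'x' values sit contiguously on disk, then all the 'y', and so on.
    col_dtype = np.dtype([('x', 'f8', nrec), ('y', 'f8', nrec), ('flag', 'i4', nrec)])

    mm = np.memmap('data_columns.bin', dtype=col_dtype, mode='r')  # shape (1,)
    x = mm['x'][0]   # touches only the contiguous 'x' stretch of the file

With this layout a single-column read maps only the pages holding that field, at the cost of giving up convenient record-at-a-time access.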
On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an example that will end up pulling the entire file into memory
mm = numpy.memmap(fname, dtype=dtype)
rows = numpy.arange(mm.size)
x = mm['x'][rows]
I just tested this on a 3G binary file and I'm sitting at 3G memory usage. I believe this is because numpy.memmap only understands rows. I don't fully understand the reason for that, but I suspect it is related to the fact that the ndarray really only has a concept of itemsize, and the fields are really just a reinterpretation of those bytes. It may be that one could tweak the ndarray code to get around this. But I would appreciate enlightenment on this subject.
Ahh, that makes sense. But, the tool you are using to measure memory usage is misleading you -- you haven't mentioned what platform you're on, but AFAICT none of them have very good tools for describing memory usage when mmap is in use. (There isn't a very good way to handle it.)

What's happening is this: numpy read out just that column from the mmap'ed memory region. The OS saw this and decided to read the entire file, for reasons discussed previously. Then, since it had read the entire file, it decided to keep it around in memory for now, just in case some program wanted it again in the near future.

Now, if you instead fetched just those bytes from the file using seek+read or whatever, the OS would treat that request in the exact same way: it'd still read the entire file, and it would still keep the whole thing around in memory. On Linux, you could test this by dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much memory is listed as "free" in top, and then using your code to read the same file -- you'll see that the 'free' memory drops by 3 gigabytes, and the 'buffers' or 'cached' numbers will grow by 3 gigabytes.

[Note: if you try this experiment, make sure that you don't have the same file opened with np.memmap -- for some reason Linux seems to ignore the request to drop_caches for files that are mmap'ed.]

The difference between mmap and reading is that in the former case, this cache memory will be "counted against" your process's resident set size. The same memory is used either way -- it's just that it gets reported differently by your tool. And in fact, this memory is not really "used" at all, in the way we usually mean that term -- it's just a cache that the OS keeps, and it will immediately throw it away if there's a better use for that memory. The only reason it's loading the whole 3 gigabytes into memory in the first place is that you have >3 gigabytes of memory to spare.

You might even be able to tell the OS that you *won't* be reading that file again, so there's no point in keeping it all cached -- on Unix this is done via the madvise() or posix_fadvise() syscalls. (No guarantee the OS will actually listen, though.)
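To see the cache accounting directly, here is a small Linux-only sketch (the file name is a placeholder; os.posix_fadvise needs Python 3.3+, and the kernel is free to ignore the hint):

    import os

    def cached_kb():
        """Current size of the kernel page cache in kB, read from /proc/meminfo."""
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('Cached:'):
                    return int(line.split()[1])

    fname = 'bigfile'   # hypothetical large test file
    before = cached_kb()

    with open(fname, 'rb') as f:
        while f.read(64 * 1024 * 1024):    # plain sequential read(), no mmap
            pass
        print('page cache grew by %d MB' % ((cached_kb() - before) // 1024))
        # Hint that we won't need this file again; the kernel may drop it from the cache.
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)

    print('page cache is now %d MB above where it started' % ((cached_kb() - before) // 1024))

The point being that the cache grows just the same with plain read() as with mmap; only the per-process accounting differs.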
This fact was the original motivator for writing my code; the text reading ability came later.
Cool. I'm just a little concerned that, since we seem to have like... 5 different implementations of this stuff all being worked on at the same time, we need to get some consensus on which features actually matter, so they can be melded together into the Single Best File Reader Evar. An interface where indexing and file-reading are combined is significantly more complicated than one where the core file-reading inner-loop can ignore indexing. So far I'm not sure why this complexity would be worthwhile, so that's what I'm trying to understand.
I think I've addressed the reason why the low level C code was written. And I think a unified, high level interface to binary and text files, which the Recfile class provides, is worthwhile.
Can you please say more about "...one where the core file-reading inner-loop can ignore indexing"? I didn't catch the meaning.
Sure, sorry. What I mean is just, it's easier to write code that only knows how to do a dumb sequential read, and doesn't know how to seek to particular places and pick out just the fields that are being requested. And it's easier to maintain, and optimize, and document, and add features, and so forth. (And we can still have a high-level interface on top of it, if that's useful.) So I'm trying to understand if there's really a compelling advantage that we get by building seeking smarts into our low-level C code, that we can't get otherwise. -- Nathaniel
Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an example that will end up pulling the entire file into memory
mm = numpy.memmap(fname, dtype=dtype)
rows = numpy.arange(mm.size)
x = mm['x'][rows]
I just tested this on a 3G binary file and I'm sitting at 3G memory usage. I believe this is because numpy.memmap only understands rows. I don't fully understand the reason for that, but I suspect it is related to the fact that the ndarray really only has a concept of itemsize, and the fields are really just a reinterpretation of those bytes. It may be that one could tweak the ndarray code to get around this. But I would appreciate enlightenment on this subject.
Ahh, that makes sense. But, the tool you are using to measure memory usage is misleading you -- you haven't mentioned what platform you're on, but AFAICT none of them have very good tools for describing memory usage when mmap is in use. (There isn't a very good way to handle it.)
What's happening is this: numpy read out just that column from the mmap'ed memory region. The OS saw this and decided to read the entire file, for reasons discussed previously. Then, since it had read the entire file, it decided to keep it around in memory for now, just in case some program wanted it again in the near future.
Now, if you instead fetched just those bytes from the file using seek+read or whatever, the OS would treat that request in the exact same way: it'd still read the entire file, and it would still keep the whole thing around in memory. On Linux, you could test this by dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much memory is listed as "free" in top, and then using your code to read the same file -- you'll see that the 'free' memory drops by 3 gigabytes, and the 'buffers' or 'cached' numbers will grow by 3 gigabytes.
[Note: if you try this experiment, make sure that you don't have the same file opened with np.memmap -- for some reason Linux seems to ignore the request to drop_caches for files that are mmap'ed.]
The difference between mmap and reading is that in the former case, this cache memory will be "counted against" your process's resident set size. The same memory is used either way -- it's just that it gets reported differently by your tool. And in fact, this memory is not really "used" at all, in the way we usually mean that term -- it's just a cache that the OS keeps, and it will immediately throw it away if there's a better use for that memory. The only reason it's loading the whole 3 gigabytes into memory in the first place is that you have >3 gigabytes of memory to spare.
You might even be able to tell the OS that you *won't* be reading that file again, so there's no point in keeping it all cached -- on Unix this is done via the madvise() or posix_fadvise() syscalls. (No guarantee the OS will actually listen, though.)
This is interesting, and on my machine I think I've verified that what you say is actually true.

This all makes theoretical sense, but goes against some experiments I and my colleagues have done. For example, a colleague of mine was able to read a couple of large files in using my code but not using memmap. The combined files were greater than memory size. With memmap the code started swapping. This was on 32-bit OSX. But as I said, I just tested this on my linux box and it works fine with numpy.memmap. I don't have an OSX box to test this.

So if what you say holds up on non-linux systems, it is in fact an indicator that the section of my code dealing with binary could be dropped; that bit was trivial anyway.

-e -- Erin Scott Sheldon Brookhaven National Laboratory
On Wed, Feb 29, 2012 at 7:57 PM, Erin Sheldon <erin.sheldon@gmail.com>wrote:
Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
Even for binary, there are pathological cases, e.g. 1) reading a random subset of nearly all rows. 2) reading a single column when rows are small. In case 2 you will only go this route in the first place if you need to save memory. The user should be aware of these issues.
FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to grab a single column and read a large fraction of the rows. Here is an example that will end up pulling the entire file into memory
mm = numpy.memmap(fname, dtype=dtype)
rows = numpy.arange(mm.size)
x = mm['x'][rows]
I just tested this on a 3G binary file and I'm sitting at 3G memory usage. I believe this is because numpy.memmap only understands rows. I don't fully understand the reason for that, but I suspect it is related to the fact that the ndarray really only has a concept of itemsize, and the fields are really just a reinterpretation of those bytes. It may be that one could tweak the ndarray code to get around this. But I would appreciate enlightenment on this subject.
Ahh, that makes sense. But, the tool you are using to measure memory usage is misleading you -- you haven't mentioned what platform you're on, but AFAICT none of them have very good tools for describing memory usage when mmap is in use. (There isn't a very good way to handle it.)
What's happening is this: numpy read out just that column from the mmap'ed memory region. The OS saw this and decided to read the entire file, for reasons discussed previously. Then, since it had read the entire file, it decided to keep it around in memory for now, just in case some program wanted it again in the near future.
Now, if you instead fetched just those bytes from the file using seek+read or whatever, the OS would treat that request in the exact same way: it'd still read the entire file, and it would still keep the whole thing around in memory. On Linux, you could test this by dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much memory is listed as "free" in top, and then using your code to read the same file -- you'll see that the 'free' memory drops by 3 gigabytes, and the 'buffers' or 'cached' numbers will grow by 3 gigabytes.
[Note: if you try this experiment, make sure that you don't have the same file opened with np.memmap -- for some reason Linux seems to ignore the request to drop_caches for files that are mmap'ed.]
The difference between mmap and reading is that in the former case, this cache memory will be "counted against" your process's resident set size. The same memory is used either way -- it's just that it gets reported differently by your tool. And in fact, this memory is not really "used" at all, in the way we usually mean that term -- it's just a cache that the OS keeps, and it will immediately throw it away if there's a better use for that memory. The only reason it's loading the whole 3 gigabytes into memory in the first place is that you have >3 gigabytes of memory to spare.
You might even be able to tell the OS that you *won't* be reading that file again, so there's no point in keeping it all cached -- on Unix this is done via the madvise() or posix_fadvise() syscalls. (No guarantee the OS will actually listen, though.)
This is interesting, and on my machine I think I've verified that what you say is actually true.
This all makes theoretical sense, but goes against some experiments I and my colleagues have done. For example, a colleague of mine was able to read a couple of large files in using my code but not using memmap. The combined files were greater than memory size. With memmap the code started swapping. This was on 32-bit OSX. But as I said, I just tested this on my linux box and it works fine with numpy.memmap. I don't have an OSX box to test this.
I've seen this on OS X too. Here's another example on Linux: http://thread.gmane.org/gmane.comp.python.numeric.general/43965. Using tcmalloc was reported by a couple of people to solve that particular issue. Ralf
In an effort to build a consensus of what numpy's New and Improved text file readers should look like, I've put together a short list of the main points discussed in this thread so far:

1. Loading text files using loadtxt/genfromtxt need a significant performance boost (I think at least an order of magnitude increase in performance is very doable based on what I've seen with Erin's recfile code)
2. Improved memory usage. Memory used for reading in a text file shouldn't be more than the file itself, and less if only reading a subset of file.
3. Keep existing interfaces for reading text files (loadtxt, genfromtxt, etc). No new ones.
4. Underlying code should keep IO iteration and transformation of data separate (awaiting more thoughts from Travis on this).
5. Be able to plug in different transformations of data at low level (also awaiting more thoughts from Travis).
6. Memory mapping of text files?
7. Eventually reduce memory usage even more by using same object for duplicate values in array (depends on implementing enum dtype?)

Anything else?

-Jay Bourque
continuum.io
On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <jayvius@gmail.com> wrote:
1. Loading text files using loadtxt/genfromtxt need a significant performance boost (I think at least an order of magnitude increase in performance is very doable based on what I've seen with Erin's recfile code)
2. Improved memory usage. Memory used for reading in a text file shouldn’t be more than the file itself, and less if only reading a subset of file.
3. Keep existing interfaces for reading text files (loadtxt, genfromtxt, etc). No new ones.
4. Underlying code should keep IO iteration and transformation of data separate (awaiting more thoughts from Travis on this).
5. Be able to plug in different transformations of data at low level (also awaiting more thoughts from Travis).
6. memory mapping of text files?
7. Eventually reduce memory usage even more by using same object for duplicate values in array (depends on implementing enum dtype?)
Anything else?
Yes -- I'd like to see the solution be able to do high-performance reads of a portion of a file -- not always the whole thing. I seem to have a number of custom text files that I need to read that are laid out in chunks: a bit of a header, then a block of numbers, another header, another block. I'm happy to read and parse the header sections with pure python, but would love a way to read the blocks of numbers into a numpy array fast.

This will probably come out of the box with any of the proposed solutions, as long as they start at the current position of a passed-in file object, can be told how much to read, and then leave the file pointer in the correct position.

Great to see this moving forward.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
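For what it's worth, that calling pattern can already be approximated with plain numpy at pure-python speed, which is roughly what a fast reader would slot into -- the header format and file name here are hypothetical:

    from itertools import islice
    import numpy as np

    def read_block(f, nrows, delimiter=None, dtype=np.float64):
        """Parse the next nrows text lines at f's current position into an array,
        leaving the file positioned just after them."""
        return np.loadtxt(islice(f, nrows), dtype=dtype, delimiter=delimiter)

    # Hypothetical layout: a header line holding a row count, then that many
    # rows of comma-separated numbers, repeated until the end of the file.
    blocks = []
    with open('chunked.txt') as f:
        for header in f:
            if not header.strip():
                continue
            blocks.append(read_block(f, int(header), delimiter=','))

The header parsing stays in Python and the numeric block goes to the reader, which then leaves the file positioned at the next header.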
On Tue, Mar 6, 2012 at 4:45 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <jayvius@gmail.com> wrote:
1. Loading text files using loadtxt/genfromtxt need a significant performance boost (I think at least an order of magnitude increase in performance is very doable based on what I've seen with Erin's recfile code)
2. Improved memory usage. Memory used for reading in a text file shouldn’t be more than the file itself, and less if only reading a subset of file.
3. Keep existing interfaces for reading text files (loadtxt, genfromtxt, etc). No new ones.
4. Underlying code should keep IO iteration and transformation of data separate (awaiting more thoughts from Travis on this).
5. Be able to plug in different transformations of data at low level (also awaiting more thoughts from Travis).
6. memory mapping of text files?
7. Eventually reduce memory usage even more by using same object for duplicate values in array (depends on implementing enum dtype?)
Anything else?
Yes -- I'd like to see the solution be able to do high-performance reads of a portion of a file -- not always the whole thing. I seem to have a number of custom text files that I need to read that are laid out in chunks: a bit of a header, then a block of numbers, another header, another block. I'm happy to read and parse the header sections with pure python, but would love a way to read the blocks of numbers into a numpy array fast. This will probably come out of the box with any of the proposed solutions, as long as they start at the current position of a passed-in file object, can be told how much to read, and then leave the file pointer in the correct position.
If you are setup with Cython to build extension modules, and you don't mind testing an unreleased and experimental reader, you can try the text reader that I'm working on: https://github.com/WarrenWeckesser/textreader

You can read a file like this, where the first line gives the number of rows of the following array, and that pattern repeats:

    5
    1.0, 2.0, 3.0
    4.0, 5.0, 6.0
    7.0, 8.0, 9.0
    10.0, 11.0, 12.0
    13.0, 14.0, 15.0
    3
    1.0, 1.5, 2.0, 2.5
    3.0, 3.5, 4.0, 4.5
    5.0, 5.5, 6.0, 6.5
    1
    1.0D2, 1.25D-1, 6.25D-2, 99

with code like this:

    import numpy as np
    from textreader import readrows

    filename = 'data/multi.dat'
    f = open(filename, 'r')
    line = f.readline()
    while len(line) > 0:
        nrows = int(line)
        a = readrows(f, np.float32, numrows=nrows, sci='D', delimiter=',')
        print "a:"
        print a
        print
        line = f.readline()

Warren
Warren et al: On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
If you are setup with Cython to build extension modules,
I am
and you don't mind testing an unreleased and experimental reader,
and I don't.
you can try the text reader that I'm working on: https://github.com/WarrenWeckesser/textreader
It just took me a while to get around to it!

First of all: this is pretty much exactly what I've been looking for for years, and never got around to writing myself - thanks!

My comments/suggestions:

1) A docstring for the textreader module would be nice.

2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to parse an ISO datetime string timezone specifier, but short of that, I think the default should be None or UTC -- time zones are too ugly to presume anything!

3) It breaks with the old MacOS-style line endings: \r only. Maybe no need to support that, but it turns out one of my old test files still had them!

4) When I try to read more rows than are in the file, I get:

    File "textreader.pyx", line 247, in textreader.readrows (python/textreader.c:3469)
    ValueError: negative dimensions are not allowed

Good to get an error, but it's not very informative!

5) For reading float64 values I get something slightly different from textreader than from Python's float():

    input:            "678.901"
    float("678.901"): 678.90099999999995
    textreader      : 678.90100000000007

As close as the number of figures available allows, but curious...

6) Performance issue: in my case, I'm reading a big file that's in chunks -- each one has a header indicating how many rows follow, then the rows, so I parse it out bit by bit. For smallish files, it's much faster than pure python, and almost as fast as some old C code of mine that is far less flexible. But for large files it's much slower -- indeed slower than a pure python version for my use case.

I did a simplified test -- with 10,000 rows:

    total number of rows: 10000
    pure python took: 1.410408 seconds
    pure python chunks took: 1.613094 seconds
    textreader all at once took: 0.067098 seconds
    textreader in chunks took : 0.131802 seconds

but with 1,000,000 rows:

    total number of rows: 1000000
    total number of chunks: 1000
    pure python took: 30.712564 seconds
    pure python chunks took: 31.313225 seconds
    textreader all at once took: 1.314924 seconds
    textreader in chunks took : 9.684819 seconds

then it gets even worse with the chunk size smaller:

    total number of rows: 1000000
    total number of chunks: 10000
    pure python took: 30.032246 seconds
    pure python chunks took: 42.010589 seconds
    textreader all at once took: 1.318613 seconds
    textreader in chunks took : 87.743729 seconds

My code, which is C that essentially runs fscanf over the file, has essentially no performance hit from doing it in chunks -- so I think something is wrong here. Sorry, I haven't dug into the code to try to figure out what yet -- does it rewind the file each time maybe?

Enclosed is my test code.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
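Since the enclosed test code isn't preserved in the archive, here is a rough sketch of the kind of comparison described -- not Chris's actual script; the file name and sizes are placeholders, and the readrows call simply follows Warren's example above:

    import time
    import numpy as np
    from textreader import readrows   # Warren's experimental reader

    FNAME = 'big.csv'   # hypothetical: NROWS rows of comma-separated floats, no headers
    NROWS = 1000000
    CHUNK = 1000        # rows per call in the chunked variant

    def all_at_once():
        with open(FNAME) as f:
            return readrows(f, np.float64, numrows=NROWS, delimiter=',')

    def in_chunks():
        pieces = []
        with open(FNAME) as f:
            for _ in range(NROWS // CHUNK):
                pieces.append(readrows(f, np.float64, numrows=CHUNK, delimiter=','))
        return np.concatenate(pieces)

    for func in (all_at_once, in_chunks):
        t0 = time.time()
        func()
        print('%-14s %.3f seconds' % (func.__name__, time.time() - t0))

If the reader scans or rewinds the file on every call, the chunked variant would slow down roughly as the square of the number of chunks, which matches the pattern in the timings above.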
On Tue, Mar 20, 2012 at 5:59 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Warren et al:
On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
If you are setup with Cython to build extension modules,
I am
and you don't mind testing an unreleased and experimental reader,
and I don't.
you can try the text reader that I'm working on: https://github.com/WarrenWeckesser/textreader
It just took me a while to get around to it!
First of all: this is pretty much exactly what I've been looking for for years, and never got around to writing myself - thanks!
My comments/suggestions:
1) a docstring for the textreader module would be nice.
2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to parse an ISO datetime string timezone specifier, but short of that, I think the default should be None or UTC -- time zones are too ugly to presume anything!
3) it breaks with the old MacOS style line endings: \r only. Maybe no need to support that, but it turns out one of my old test files still had them!
4) When I try to read more rows than are in the file, I get:

    File "textreader.pyx", line 247, in textreader.readrows (python/textreader.c:3469)
    ValueError: negative dimensions are not allowed
good to get an error, but it's not very informative!
5) For reading float64 values I get something slightly different from textreader than from Python's float():

    input:            "678.901"
    float("678.901"): 678.90099999999995
    textreader      : 678.90100000000007
as close as the number of figures available, but curious...
6) Performance issue: in my case, I'm reading a big file that's in chunks -- each one has a header indicating how many rows follow, then the rows, so I parse it out bit by bit. For smallish files, it's much faster than pure python, and almost as fast as some old C code of mine that is far less flexible.
But for large files, -- it's much slower -- indeed slower than a pure python version for my use case.
I did a simplified test -- with 10,000 rows:
total number of rows: 10000
pure python took: 1.410408 seconds
pure python chunks took: 1.613094 seconds
textreader all at once took: 0.067098 seconds
textreader in chunks took : 0.131802 seconds
but with 1,000,000 rows:
total number of rows: 1000000
total number of chunks: 1000
pure python took: 30.712564 seconds
pure python chunks took: 31.313225 seconds
textreader all at once took: 1.314924 seconds
textreader in chunks took : 9.684819 seconds
then it gets even worse with the chunk size smaller:
total number of rows: 1000000
total number of chunks: 10000
pure python took: 30.032246 seconds
pure python chunks took: 42.010589 seconds
textreader all at once took: 1.318613 seconds
textreader in chunks took : 87.743729 seconds
my code, which is C that essentially runs fscanf over the file, has essentially no performance hit from doing it in chunks -- so I think something is wrong here.
Sorry, I haven't dug into the code to try to figure out what yet -- does it rewind the file each time maybe?
Enclosed is my test code.
-Chris
Chris, Thanks! The feedback is great. I won't have time to get back to this for another week or so, but then I'll look into the issues you reported. Warren
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
participants (7):
- Chris Barker
- Erin Sheldon
- Jay Bourque
- Nathaniel Smith
- Ralf Gommers
- Robert Kern
- Warren Weckesser