[Numpy-discussion] record data previous to Numpy use

Derek Homeier derek at astro.physik.uni-goettingen.de
Fri Jul 7 06:04:57 EDT 2017


On 7 Jul 2017, at 1:59 am, Chris Barker <Chris.Barker at noaa.gov> wrote:
> 
> On Thu, Jul 6, 2017 at 10:55 AM,  <paul.carrico at free.fr> wrote:
> > It's just a reflection, but for huge files one solution might be to first split/write/build the array in a dedicated file (2x O(n) iterations: one to identify the block sizes, another to read and write the data), and then to load it into memory and work with numpy -
> 
> 
> I may have your use case confused, but if you have a huge file with multiple "blocks" in it, there shouldn't be any problem with loading it in one go -- start at the top of the file and load one block at a time (accumulating in a list) -- then you only have the memory overhead issues for one block at a time, should be no problem.
> 
> > at this stage the dimensions are known, and some packages (pandas or astropy, as suggested) will be faster and better suited.
> 
> pandas at least is designed to read variations of CSV files, not sure you could use the optimized part to read an array out of part of an open file from a particular point or not.
> 
The fragmented structure indeed would probably be the biggest challenge, although astropy,
while it cannot read from an open file handle, at least should be able to directly parse a block
of input lines, e.g. collected with readline() in a list. Guess pandas could do the same.
Alternatively the line positions of the blocks could be directly passed to the data_start and
data_end keywords, but that would require opening and at least partially reading the file
multiple times. In fact, if the blocks are relatively small, the overhead may be too large to
make it worth using the faster parsers - if you look at the timing notebooks I had linked to
earlier, it takes at least ~100 input lines before they show any speed gains over genfromtxt,
and ~1000 to see roughly linear scaling. In that case writing your own customised reader
could be the best option after all.
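
As a rough illustration of the single-pass approach, here is a minimal sketch of such a customised reader. It is not the code from this thread: the file layout (blocks of whitespace-separated numbers, separated by blank lines), the sample data and the names iter_blocks, parse_block and read_blocks are all assumptions made up for the example.

```python
import io


def iter_blocks(handle):
    """Yield each block of an open text file as a list of its lines.

    Assumption: blocks are separated by one or more blank lines.
    """
    block = []
    for line in handle:
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # Last block, in case the file does not end with a blank line.
        yield block


def parse_block(lines):
    """Plain-Python parser: one row of whitespace-separated floats per line."""
    return [[float(field) for field in line.split()] for line in lines]


def read_blocks(handle, parser=parse_block):
    """Read the file in one pass, handing one block at a time to the parser."""
    return [parser(lines) for lines in iter_blocks(handle)]


# Hypothetical sample data standing in for the real file.
sample = io.StringIO("1 2 3\n4 5 6\n\n7 8\n9 10\n11 12\n")
blocks = read_blocks(sample)
```

Since np.genfromtxt also accepts a list of strings, passing parser=np.genfromtxt should return one array per block instead of nested lists, and the list of lines could be handed to astropy's ascii.read in the same way if the blocks are large enough for the faster parsers to pay off.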

Cheers,
					Derek

