[Numpy-discussion] record data previous to Numpy use

Thu Jul 6 05:51:55 EDT 2017

On Thu, Jul 6, 2017 at 1:49 AM, <paul.carrico at free.fr> wrote:
>
> Dear All
>
> First of all, thanks for the answers and the information (I’ll dig into
> it); let me try to add some comments on what I want to do:
>
> My ASCII file mainly contains data (float and int) in a single column
> (that is not always the case, but I can easily manage it; I also saw that I
> can use the ‘split’ instruction if necessary)
> Comments/text indicate the beginning of a block, immediately followed by
> the number of sub-blocks
> So I need to read/record all the values in order to build a matrix before
> working on it (using Numpy & vectorization)
>
> The columns 2 and 3 have been added for further treatments
> The ‘0’ values will be specifically treated afterward
>
>
> Numpy won’t be a problem I guess (I did some basic tests and I’m quite
> confident about how to proceed), but I’m really blocked on recording the
> data … I am trying to find a way to efficiently read and record data in a
> matrix:
>
> avoiding dynamic memory allocation (here using ‘append’ in the Python
> sense, not np),

Although you can avoid some list appending in your case (because the blocks
self-describe their length), I would caution you against prematurely
avoiding it. It's often the most natural way to write the code in Python,
so go ahead and write it that way first. If, once it works correctly, it
turns out to be too slow or memory intensive, then you can puzzle over how to
preallocate the numpy arrays. But quite often, it's fine. In this
case, the reading and handling of the text data itself is probably the
bottleneck, not appending to the lists. As I said, Python lists are
cleverly implemented to make appending fast. Accumulating numbers in a list
then converting to an array afterwards is a well-accepted numpy idiom.
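
To make the idiom concrete, here is a minimal sketch of it. The function
name read_column and the one-value-per-line input are assumptions made for
illustration only; they are not taken from the actual data file.

```python
import io

import numpy as np


def read_column(lines):
    """Collect one float per non-empty line in a list, then convert once."""
    values = []  # plain Python list; append is amortized O(1)
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            values.append(float(line))
    # A single conversion at the end, instead of growing an array in the loop.
    return np.array(values, dtype=float)


# Stand-in for an open file: any iterable of text lines works.
sample = io.StringIO("1.5\n2\n\n3.25\n")
arr = read_column(sample)
```

Passing an open file object instead of the StringIO reads the file line by
line, so only the parsed numbers are kept in memory.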

> dealing with huge ASCII files: the latest file I got contains more than 60
> million lines
>
> Please find attached an extract of the input format
> (‘example_of_input’), and the matrix I’m trying to create and manage with
> Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks.
Each block will itself be a list of numpy arrays, which are the sub-blocks
themselves. I didn't bother adding the first three columns to the
sub-blocks or trying to assemble them all into a uniform-width matrix by
padding with trailing 0s. Since you say that the trailing 0s are going to
be "specially treated afterwards", I suspect that you can more easily work
with the lists of arrays instead. I assume floating-point data rather than
trying to figure out whether int or float from the data. The code can
handle multiple data values on one line (not especially well-tested, but it
ought to work), but it assumes that the number of sub-blocks, index of the
sub-block, and sub-block size are each on its own line. The code gets a
little more complicated if that's not the case.
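
For readers without access to the attachment, the approach described above
might look roughly like the sketch below. This is not the attached
read_blocks.py. The function name read_blocks_sketch and the exact file
layout are assumptions: one text line opens each block, the next line gives
the number of sub-blocks, and each sub-block starts with its index and its
size on separate lines, followed by the data values.

```python
import io

import numpy as np


def read_blocks_sketch(lines):
    """Return a list of blocks; each block is a list of 1-D float arrays."""
    # Strip whitespace and skip blank lines lazily, so the text of a huge
    # file is never held in memory all at once.
    it = (ln for ln in (raw.strip() for raw in lines) if ln)
    blocks = []
    for _header in it:  # assumed: a text line opens every block
        n_sub = int(next(it))  # assumed: sub-block count on its own line
        sub_blocks = []
        for _ in range(n_sub):
            _index = int(next(it))  # sub-block index, read but not used here
            size = int(next(it))  # number of values in this sub-block
            values = []
            while len(values) < size:
                # A data line may hold one value or several.
                values.extend(float(tok) for tok in next(it).split())
            if len(values) != size:
                raise ValueError("data line ran past the declared sub-block size")
            sub_blocks.append(np.array(values, dtype=float))
        blocks.append(sub_blocks)
    return blocks


# Hypothetical input in the assumed layout, not the real 'example_of_input'.
SAMPLE = """\
first block
2
1
3
1.0 2.0
3.0
2
2
4.0
5.0
second block
1
1
1
7.5
"""
blocks = read_blocks_sketch(io.StringIO(SAMPLE))
```

All values are parsed as floats, and the sub-blocks are left as separate
arrays of differing lengths rather than padded into one matrix. A truncated
file ends the parse with a StopIteration from next(); a real reader would
probably want to turn that into a clearer error.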

--
Robert Kern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: read_blocks.py
Type: text/x-python-script
Size: 3093 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170706/44debae3/attachment.bin>