[Numpy-discussion] record data previous to Numpy use

Chris Barker chris.barker at noaa.gov
Thu Jul 6 12:33:59 EDT 2017


OK, you have two performance "issues":

1) Memory use: if you need to read a file to build a numpy array, and don't
know how big it is when you start, you need to accumulate the values
first, and then make an array out of them. And numpy arrays are fixed size,
so they cannot efficiently accumulate values.

The usual way to handle this is to read the data into a list with .append()
or the like, and then make an array from it. This is quite fast -- lists
are fast and efficient at appending. However, you are then storing
(at least) a pointer and a Python float object for each value, which is a
lot more memory than a single float value in a numpy array, and when you
make the array from it, you have the full list and all its
Python floats AND the array in memory at once.

Frankly, computers have a lot of memory these days, so this is a non-issue
in most cases.
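In code, the "usual way" is just this (a minimal sketch; the whitespace-separated, all-float parsing is an assumption about the file format):

```python
import numpy as np

def read_values(lines):
    """Accumulate parsed numbers in a Python list, then convert once."""
    values = []
    for line in lines:
        # split() with no argument handles any run of whitespace
        values.extend(float(tok) for tok in line.split())
    # a single conversion at the end -- the list and the array
    # coexist in memory only at this point
    return np.array(values, dtype=np.float64)

arr = read_values(["12 47 2", "46 3 51"])
```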

Nonetheless, a while back I wrote an extendable numpy array object to
address just this issue. You can find the code on GitHub here:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

I have not tested it with recent numpy versions, but I expect it still works
fine. It's also Python 2, but wouldn't take much to port.

In practice, it uses less memory than the "build a list, then make it into
an array" approach, but isn't any faster unless you add (.extend) a bunch of
values at once rather than one at a time. (If you do it one at a time, the
Python-float-to-numpy-float conversion and the function call overhead take
just as long.)

But it will generally be as fast or faster than using a list, and use
less memory, so it's a fine basis for a big ASCII file reader.
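The core idea can be sketched in a few lines -- this is a simplified stand-in for illustration, not the code from that repo: keep an oversized buffer, double it when it fills, and slice out the valid part at the end.

```python
import numpy as np

class GrowableArray:
    """Append into a preallocated numpy buffer, doubling on overflow."""

    def __init__(self, dtype=np.float64, capacity=16):
        self._buf = np.empty(capacity, dtype=dtype)
        self._size = 0

    def extend(self, values):
        values = np.asarray(values, dtype=self._buf.dtype)
        needed = self._size + len(values)
        cap = len(self._buf)
        while cap < needed:          # double until the new values fit
            cap *= 2
        if cap != len(self._buf):
            new_buf = np.empty(cap, dtype=self._buf.dtype)
            new_buf[:self._size] = self._buf[:self._size]
            self._buf = new_buf
        self._buf[self._size:needed] = values
        self._size = needed

    def to_array(self):
        # copy, so the result doesn't alias the internal buffer
        return self._buf[:self._size].copy()

acc = GrowableArray(capacity=4)
acc.extend([1.0, 2.0, 3.0])
acc.extend([4.0, 5.0, 6.0])   # triggers one reallocation
arr = acc.to_array()
```

Because reallocation is amortized over doublings, per-value cost stays O(1); extending in chunks keeps the Python-level overhead down, as noted above.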

However, it sounds like even though your files may be huge, they hold a
number of separate arrays, so each one may not be large enough to bother
with any of this.

2) Parsing and converting overhead: for the most part, Python/numpy text
file reading code reads the text into a Python string, converts it to Python
number objects, then puts them in a list or converts them to native numbers
in an array. This whole process is a bit slow (though reading files is slow
anyway, so usually not worth worrying about, which is why the built-in file
reading methods do this). To improve on it, you need code that reads
the file and parses it in C, and puts the values straight into a numpy array
without passing through Python. This is what the pandas (and, I assume,
astropy) text file readers do.
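For instance, with pandas the whole parse happens in its C engine (a sketch assuming a plain single-column file of integers like yours; `io.StringIO` stands in for a real file path):

```python
import io
import pandas as pd

# read_csv's default C engine parses and converts values in C,
# placing them directly into a numpy-backed column.
fake_file = io.StringIO("12\n47\n2\n46\n3\n51\n")
values = pd.read_csv(fake_file, header=None, dtype="int64")[0].to_numpy()
```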

But if you don't want those dependencies, there is the fromfile()
function in numpy -- it is not very robust, but if your files are
well-formed, it is quite fast. So your code would look something like:

with open(the_filename) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        # work with line to figure out the next block
        if ready_to_read_a_block:
            # sep=' ' specifies that you are reading text, not binary!
            arr = np.fromfile(infile, dtype=np.int32, count=num_values, sep=' ')
            arr.shape = the_shape_it_should_be


But Robert is right -- get it to work with the "usual" methods first -- i.e.
put the numbers in a list, then make an array out of it -- and then worry
about making it faster.
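For what it's worth, the "usual" method applied to the block format you describe (sub-bloc number, then a value count, then the values, one per line) might look like this sketch. The fixed row width of 29 and the two zero columns are read off your example matrix, so treat both as assumptions:

```python
import numpy as np

def read_blocks(lines, row_width=29):
    """Parse sub-blocs (id, count, then `count` values, one per line)
    into a zero-padded 2-D integer array."""
    tokens = (int(tok) for line in lines for tok in line.split())
    n_blocs = next(tokens)
    rows = []
    for _ in range(n_blocs):
        bloc_id = next(tokens)
        n_values = next(tokens)
        # row layout guessed from the example: id, two spare columns,
        # value count, the values, then zero padding
        row = [bloc_id, 0, 0, n_values]
        row += [next(tokens) for _ in range(n_values)]
        row += [0] * (row_width - len(row))
        rows.append(row)
    return np.array(rows, dtype=np.int64)

# two sub-blocs, mimicking the first and last ones in the sample file
sample = ["2", "1", "6", "12", "47", "2", "46", "3", "51",
          "42", "2", "115", "109"]
mat = read_blocks(sample)
```

Once that produces the right matrix, the list-building inner loop is the part worth replacing with fromfile() or a pandas reader if it proves too slow.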

-CHB


On Thu, Jul 6, 2017 at 1:49 AM, <paul.carrico at free.fr> wrote:

> Dear All
>
>
> First of all, thanks for the answers and the information (I’ll dig into
> it), and let me try to add comments on what I want to do:
>
>    1. My ASCII file mainly contains data (floats and ints) in a single column
>    2. (it is not always the case, but I can easily manage it – I also
>    saw I can use the ‘split’ instruction if necessary)
>    3. Comments/text indicate the beginning of a bloc, immediately
>    followed by the number of sub-blocs
>    4. So I need to read/record all the values in order to build a matrix
>    before working on it (using Numpy & vectorization)
>       - Columns 2 and 3 have been added for further treatment
>       - The ‘0’ values will be treated specifically afterward
>
>
> Numpy won’t be a problem I guess (I did some basic tests and I’m quite
> confident about how to proceed), but I’m really blocked on data recording … I’m
> trying to find a way to efficiently read and record data in a matrix:
>
>    - avoiding dynamic memory allocation (here using ‘append’ in the Python
>    sense, not np),
>    - dealing with a huge ASCII file: the latest file I got contains more
>    than *60 million lines*
>
>
> Please find attached an extract of the input format
> (‘example_of_input’), and the matrix I’m trying to create and manage with
> Numpy
>
>
> Thanks again for your time
>
> Paul
>
>
> #######################################
>
> ##BEGIN *-> line number x in the original file*
>
> 42   *-> indicates the number of sub-blocs*
>
> 1     *-> number of the 1st sub-bloc*
>
> 6     *-> gives how many values belong to the sub-bloc*
>
> 12
>
> 47
>
> 2
>
> 46
>
> 3
>
> 51
>
> ….
>
> 13  * -> another type of sub-bloc with 25 values*
>
> 25
>
> 15
>
> 88
>
> 21
>
> 42
>
> 22
>
> 76
>
> 19
>
> 89
>
> 0
>
> 18
>
> 80
>
> 23
>
> 38
>
> 24
>
> 73
>
> 20
>
> 81
>
> 0
>
> 90
>
> 0
>
> 41
>
> 0
>
> 39
>
> 0
>
> 77
>
>>
> 42 *-> another type of sub-bloc with 2 values*
>
> 2
>
> 115
>
> 109
>
>
>  #######################################
>
> *The matrix result*
>
> 1 0 0 6 12 47 2 46 3 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 2 0 0 6 3 50 11 70 12 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 3 0 0 8 11 50 3 49 4 54 5 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 4 0 0 8 12 70 11 66 9 65 10 68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 5 0 0 8 2 47 12 68 10 44 1 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 6 0 0 8 5 56 6 58 7 61 11 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 7 0 0 8 11 61 7 60 8 63 9 66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> 8 0 0 19 12 47 2 46 3 51 0 13 97 14 92 15 96 0 72 0 48 0 52 0 0 0 0 0 0
>
> 9 0 0 19 13 97 14 92 15 96 0 16 86 17 82 18 85 0 95 0 91 0 90 0 0 0 0 0 0
>
> 10 0 0 19 3 50 11 70 12 51 0 15 89 19 94 13 96 0 52 0 71 0 72 0 0 0 0 0 0
>
> 11 0 0 19 15 89 19 94 13 96 0 18 81 20 84 16 85 0 90 0 77 0 95 0 0 0 0 0 0
>
> 12 0 0 25 3 49 4 54 5 57 11 50 0 15 88 21 42 22 76 19 89 0 52 0 53 0 55 0
> 71
>
> 13 0 0 25 15 88 21 42 22 76 19 89 0 18 80 23 38 24 73 20 81 0 90 0 41 0 39
> 0 77
>
> 14 0 0 25 11 66 9 65 10 68 12 70 0 19 78 25 99 26 98 13 94 0 71 0 67 0 69
> 0 72
>
> ….
>
>
> #######################################
>
> *An example of the code I started to write*
>
> # -*- coding: utf-8 -*-
>
> import time, sys, os, re
>
> import itertools
>
> import numpy as np
>
>
> PATH = str(os.path.abspath(''))
>
>
> input_file_name ='/example_of_input.txt'
>
>
>
>
> ## check if the file exists, then if it's empty or not
>
> if (os.path.isfile(PATH + input_file_name)):
>
>     if (os.stat(PATH + input_file_name).st_size > 0):
>
>
>
>         ## go through the file in order to find specific sentences
>
>         ## specific blocks will be defined afterward
>
>         Block_position = []; j=0;
>
>         with open(PATH + input_file_name, "r") as data:
>
>             for line in data:
>
>                 if '##BEGIN' in line:
>
>                     Block_position.append(j)
>
>                 j=j+1
>
>
>
>
>
>         ## just to tests to get all the values
>
> #        i = 0
>
> #        data = np.zeros( (505), dtype=np.int )
>
> #        with open(PATH + input_file_name, "r") as f:
>
> #            for i in range (0,505):
>
> #                data[i] = int(f.read(Block_position[0]+1+i))
>
> #                print ("i = ", i)
>
>
>
>
>
> #           for line in itertools.islice(f,Block_position[0],516):
>
> #               data[i]=f.read(0+i)
>
> #               i=i+1
>
>
>
>
>
>
>     else:
>
>         print("The file %s is empty : post-processing cannot be performed !!!\n" % input_file_name)
>
>
>
>
> else:
>
>     print("Error : the file %s does not exist: post-processing stops !!!\n" % input_file_name)
>
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov