fast numpy.fromfile skipping data chunks
Hi everybody, I hope this has not been discussed before; I couldn't find a solution elsewhere. I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge and would make me run out of memory, I need to read the data skipping some records (the data are recorded at high frequency, so basically I want to subsample them). At the moment I have come up with the code below, which is then compiled with Cython. Despite the significant performance increase over the pure Python version, the function is still much slower than numpy.fromfile, and it only reads one kind of data (in this case uint32); otherwise I do not know how to define the array type in advance. I have basically no experience with Cython or C, so I am a bit stuck. How can I make this more efficient and possibly more generic? Thanks

import numpy as np
# For cython!
cimport numpy as np
from libc.stdint cimport uint32_t

def cffskip32(fid, int count=1, int skip=0):
    # Read `count` uint32 values from fid, seeking `skip` bytes between reads.
    cdef int k = 0
    cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)
    if skip >= 0:
        while k < count:
            try:
                data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
                fid.seek(skip, 1)
                k += 1
            except ValueError:
                # End of file (short read): return what we have so far.
                data = data[:k]
                break
    return data
On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge and would make me run out of memory, I need to read the data skipping some records (basically subsampling). [clip]
If your data is stored as fixed-format binary (as it seems it is), then the easiest way is probably

# Exploit the operating system's virtual memory manager to get a
# "virtual copy" of the entire file in memory
# (this does not actually use any memory until accessed):
virtual_arr = np.memmap(path, np.uint32, "r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()

(Your data is probably large enough that this will only work if you're using a 64-bit system, because of address space limitations; but if you have data that's too large to fit into memory, then I assume you're using a 64-bit system anyway...)

-n
Hi, I would suggest that you look at pytables [1]. It uses a different file format, but it seems to do exactly what you want, and it gives you an object with an interface very similar to numpy.ndarray (though with fewer functions). You would just ask for the slices/indices that you want, and it returns a numpy.ndarray.

HTH, Frédéric

[1] http://www.pytables.org/moin
I see that pytables deals with hdf5 data. It would be very nice if the data were in such a standard format, but that is not the case, and that can't be changed.
This solution does not work for me, since I have an offset before the data that is not a multiple of the datatype (it's a header containing various stuff). I'll have a look at pytables.
On Wed, Mar 13, 2013 at 2:18 PM, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
This solution does not work for me since I have an offset before the data that is not a multiple of the datatype (it's a header containing various stuff).
np.memmap takes an offset= argument. -n
Indeed, but that offset "should be a multiple of the byte-size of dtype", as the help says. That seems silly.
On top of that, there is another issue: the available data itself may not be a multiple of the dtype size, since there can be write errors at the end of the file, and I don't know how to deal with that.
Indeed, but that offset "should be a multiple of the byte-size of dtype", as the help says.

My mistake, sorry: even though the help says so, it seems that this is not the case in the actual code. Still, the problem with the size of the available data (which is not necessarily a multiple of the dtype byte-size) remains. ac
On Wed, Mar 13, 2013 at 2:46 PM, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
Indeed, but that offset "should be a multiple of the byte-size of dtype", as the help says.

My mistake, sorry: even though the help says so, it seems that this is not the case in the actual code. Still, the problem with the size of the available data (which is not necessarily a multiple of the dtype byte-size) remains.
Worst case you can always work around such issues with an extra layer of view manipulation:

# Create a raw view onto the contents of the file:
file_bytes = np.memmap(path, dtype=np.uint8, ...)
# Cut out any arbitrary number of bytes from the beginning and end:
data_bytes = file_bytes[...some slice expression...]
# Switch to viewing the bytes as the proper data type:
data = data_bytes.view(dtype=np.uint32)
# proceed as before

-n
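[A minimal sketch of the workaround described above; the file name, header size and subsampling step are hypothetical placeholders, not values from the thread, and would need adjusting to the actual file layout.]

import numpy as np

path = "data.bin"   # hypothetical file name
header_size = 37    # assumed header length in bytes (not a multiple of 4)
step = 20           # keep every 20th record
itemsize = np.dtype(np.uint32).itemsize

# Raw byte-level view of the whole file (nothing is read until accessed):
file_bytes = np.memmap(path, dtype=np.uint8, mode="r")

# Drop the header, and trim any truncated record at the end so that the
# remaining byte count is an exact multiple of the item size:
n_items = (file_bytes.size - header_size) // itemsize
data_bytes = file_bytes[header_size:header_size + n_items * itemsize]

# Reinterpret the bytes as uint32 and subsample with a strided view;
# copy() pulls only the selected records into ordinary memory:
data = data_bytes.view(np.uint32)[::step].copy()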
Ok, this seems to be working (well, as soon as I get the right offset and things like that, but that's a different story). The problem is that it does not go any faster than my initial function compiled with Cython, and it is still a lot slower than fromfile. Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?
On 13 Mar 2013 15:16, "Andrea Cimatoribus" <Andrea.Cimatoribus@nioz.nl> wrote:
Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?
Oh, in that case you're probably IO bound, not CPU bound, so Cython etc. can't help. Traditional spinning-disk hard drives can read quite quickly, but take a long time to find the right place to read from and start reading. Your OS has heuristics in it to detect sequential reads and automatically start the setup for the next read while you're processing the previous read, so you don't see the seek overhead. If your reads are widely separated enough, these heuristics will get confused and you'll drop back to doing a new disk seek on every call to read(), which is deadly. (And would explain what you're seeing.) If this is what's going on, your best bet is to just write a python loop that uses fromfile() to read some largeish (megabytes?) chunk, subsample those and throw away the rest, and repeat. -n
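[A rough sketch of the chunked-read loop suggested here; the chunk size and subsampling step are arbitrary placeholders, and the function assumes the file object has already been positioned at the start of the data, e.g. after seeking past any header.]

import numpy as np

def subsample_file(fid, dtype=np.uint32, step=20, chunk_items=1000000):
    # Read large sequential chunks (fast, keeps the OS read-ahead happy),
    # keep every `step`-th item, and discard the rest.
    out = []
    offset = 0  # index of the first wanted item within the current chunk
    while True:
        chunk = np.fromfile(fid, dtype=dtype, count=chunk_items)
        if chunk.size == 0:
            break
        # copy() so the large chunk can be freed immediately:
        out.append(chunk[offset::step].copy())
        # Carry the subsampling phase over to the next chunk:
        offset = (offset - chunk.size) % step
        if chunk.size < chunk_items:
            break
    return np.concatenate(out) if out else np.empty(0, dtype=dtype)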
Thanks a lot for the feedback, I'll try to modify my function to overcome this issue. Since I'm in the process of buying new hardware too, a slightly OT question (but definitely related): does an SSD provide a substantial improvement in these cases?
On 13 March 2013 16:54, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
Since I'm in the process of buying new hardware too, a slightly OT question (but definitely related): does an SSD provide a substantial improvement in these cases?
It should help. Nevertheless, when talking about performance it is difficult to predict, mainly because in a computer there are many things going on and many layers involved. I have a couple of computers equipped with SSDs; if you want, send me some benchmarks and I can run them to see whether I get any speedup.
On Wed, Mar 13, 2013 at 9:54 AM, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
Does an SSD provide a substantial improvement in these cases?
It should. Seek time on an SSD is quite low, and readout is fast. Skipping over items will probably not be as fast as a sequential read, but I expect it will be substantially faster than a spinning disk. Nathaniel's loop idea will probably be faster as well. The sequential read rate of a modern SSD is about 500 MB/sec, so you can probably just divide that into your file size to get an estimate of the time needed. <snip> Chuck
Thanks for all the feedback (on the SSD too). As for the "biggus" library for working on larger-than-memory arrays: this is really interesting, but unfortunately I don't have time to test it at the moment; I will try to have a look at it in the future. I hope to see something like that implemented in numpy soon, though.
Dear Numpy/Scipy experts, Attached is a script which I made to test numpy.correlate (which is called by plt.xcorr) to see how the cross-correlation is calculated. From this it appears that if I call plt.xcorr(x, y), y is slid back in time compared to x, i.e. if y is a process that causes a delayed response in x after 5 timesteps, then there should be a high correlation at lag 5. However, in the attached plot the response is seen only on the -ve side of the lags. Can anyone advise me on how to see which way exactly the two series are slid back or forth, and understand the cause-result relation better? (I understand that merely by correlation one cannot assume a cause-result relation, but it is important to know which series is older in time at a given lag.) with best regards, Sudheer
Hi Sudheer, On 14/03/2013 10:18, Sudheer Joseph wrote:
If y is a process that causes a delayed response in x after 5 timesteps, then there should be a high correlation at lag 5. However, in the attached plot the response is seen only on the -ve side of the lags. [clip]

You indeed pointed out a lack of documentation in the matplotlib.xcorr function, because the definition of covariance can be ambiguous.
The way I would try to get an interpretation of the xcorr function (and its friends) is to go back to the theoretical definition of cross-correlation, which is a normalized version of the covariance. In your example you've created a time series X(k) and a lagged one: Y(k) = X(k-5). Now, the covariance function of X and Y is commonly defined as Cov_{X,Y}(h) = E(X(k+h) * Y(k)), where E is the expectation (assuming that X and Y are centered for the sake of clarity). If I plug in the definition of Y, I get Cov(h) = E(X(k+h) * X(k-5)). This naturally yields the fact that the covariance is indeed maximal at h=-5 and not h=+5. Note that this reasoning yields the opposite result with a different definition of the covariance, i.e. Cov_{X,Y}(h) = E(X(k) * Y(k+h)) (and that's what I did first!).

Therefore, I think there should be a definition of cross-correlation in the matplotlib xcorr docstring. In R's acf doc, there is this mention: "The lag k value returned by ccf(x, y) estimates the correlation between x[t+k] and y[t]." (see http://stat.ethz.ch/R-manual/R-devel/library/stats/html/acf.html) Now I believe this discussion really belongs on the matplotlib ML; I'll open an issue on github (I just spotted a mistake in the definition of the normalization anyway).

Coming back to numpy: there's a strange thing, the definition of numpy.correlate seems to give the other convention, "z[k] = sum_n a[n] * conj(v[n+k])" (http://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html), although its usage proves otherwise. What did I miss? best, Pierre
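[A quick numerical check of this reasoning; an editor's sketch using a circular shift rather than the original attached script. With y lagging x by 5 steps, the peak of np.correlate(x, y, 'full') lands at lag -5, matching the behaviour described above.]

import numpy as np

N = 200
x = np.random.randn(N)
y = np.roll(x, 5)  # y[k] = x[k-5]: y lags x by 5 steps (circular shift)

# Full cross-correlation; with the convention c[h] = sum_k x[k+h] * y[k],
# the lag h runs from -(N-1) to +(N-1).
c = np.correlate(x - x.mean(), y - y.mean(), mode='full')
lags = np.arange(-(N - 1), N)

print(lags[np.argmax(c)])  # -5: the peak is on the negative side, as observed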
On Mon, Mar 18, 2013 at 1:00 PM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Therefore, I think there should be a definition of cross-correlation in the matplotlib xcorr docstring. In R's acf doc, there is this mention: "The lag k value returned by ccf(x, y) estimates the correlation between x[t+k] and y[t]." [clip]
You might be interested in the statsmodels implementation, which should be similar to the R functionality.

http://nbviewer.ipython.org/urls/raw.github.com/jseabold/tutorial/master/tsa...
http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools.acf.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.tsaplots.plot_acf.html

Skipper
On Mon, Mar 18, 2013 at 1:10 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
You might be interested in the statsmodels implementation which should be similar to the R functionality.
http://nbviewer.ipython.org/urls/raw.github.com/jseabold/tutorial/master/tsa... http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools... http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.tsap...
We don't have any cross-correlation xcorr, AFAIR, but I guess it should work the same way. Josef
Thank you all for the responses. acf does not accept two variables, so these may not work for me:

http://nbviewer.ipython.org/urls/raw.github.com/jseabold/tutorial/master/tsa...
http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools...
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.tsap...

Sudheer
Thank you Pierre. It appears that numpy.correlate uses the frequency-domain method for getting the ccf. I would like to know how serious the issue with the normalization is, or what exactly it is. I have computed cross-correlations using the function and am interpreting the results based on it, so it will be helpful if you could tell me whether there is a significant bug in the function. with best regards, Sudheer
Hi, On 19/03/2013 08:12, Sudheer Joseph wrote:
It appears that numpy.correlate uses the frequency-domain method for getting the ccf. [clip]

np.correlate works in the time domain. I started a discussion about a month ago about the way it's implemented: http://mail.scipy.org/pipermail/numpy-discussion/2013-February/065562.html Unfortunately I haven't found time to dig deeper into the matter, which requires working in the C code of numpy, which I'm not familiar with.
Concerning the normalization of mpl.xcorr, I think that what is computed is just fine. It's just the way this normalization is described in the docstring that I find weird. https://github.com/matplotlib/matplotlib/issues/1835 best, Pierre
On Thu, Mar 14, 2013 at 1:48 AM, Andrea Cimatoribus <Andrea.Cimatoribus@nioz.nl> wrote:
Thanks for all the feedback (on the SSD too). As for the "biggus" library for working on larger-than-memory arrays: this is really interesting, but unfortunately I don't have time to test it at the moment; I will try to have a look at it in the future. I hope to see something like that implemented in numpy soon, though.
You may also want to look at carray: https://github.com/FrancescAlted/carray I've never used it, but it stores the contents of the array in a compressed form in memory, so if your data compresses well it could be a slick solution. -Chris -- Christopher Barker, Ph.D. Oceanographer, NOAA/NOS/OR&R, Chris.Barker@noaa.gov
On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
Hi everybody, I hope this has not been discussed before; I couldn't find a solution elsewhere. I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge and would make me run out of memory, I need to read the data skipping some records (the data are recorded at high frequency, so basically I want to subsample them). [clip]
You can do a fid.seek(offset) prior to np.fromfile() and then it will read from offset. See the docstrings for `file.seek()` on how to use it. -- Francesc Alted
On 3/13/13 3:53 PM, Francesc Alted wrote:
Oops, you were already using file.seek(). Disregard, please. -- Francesc Alted
Since the files are huge, and would make me run out of memory, I need to read data skipping some records
Is it possible to describe what you're doing with the data once you have subsampled it? And if there were a way to work with the full-resolution data, would that be desirable? I ask because I've been dabbling with a pure-Python library for handling larger-than-memory datasets (https://github.com/SciTools/biggus), and it uses similar chunking techniques as mentioned in the other replies to process data at the full streaming I/O rate. It's still in the early stages of development, so the design can be fluid; maybe it's worth seeing if there's enough in common with your needs to warrant adding your use case. Richard
participants (12):
- Andrea Cimatoribus
- Charles R Harris
- Chris Barker - NOAA Federal
- Daπid
- Francesc Alted
- Frédéric Bastien
- josef.pktd@gmail.com
- Nathaniel Smith
- Pierre Haessig
- Richard Hattersley
- Skipper Seabold
- Sudheer Joseph