Optimize removing nan-values of dataset
Hi,

I am trying to remove NaN values from an array of shape (40, 6). A NaN value at point data[x] should be replaced by the mean of data[x-1] and data[x+1] if both values at x-1 and x+1 are not NaN. The function nan_to_mean (see below) works, but I wonder whether I could optimize the code.

I thought about something like:

1. Find all NaN values in the array: nans = np.isnan(dataarray)
2. Check that the values before and after each NaN index are not NaN
3. Calculate the mean

With my original dataset of shape (63856, 6) the script takes 139.343 seconds to run, and some datasets are even bigger. I attached the example_dataset.txt and the example.py script.

Thanks for any help,
Tom

    def nan_to_mean(arr):
        for cnt, value in enumerate(arr):
            # Check if first value is nan, if so continue
            if cnt == 0 and np.isnan(value):
                continue
            # Check if last value is nan:
            # If x-1 value is nan dont do anything!
            # If x-1 is float, last value will be value of x-1
            elif cnt == (len(arr)-1):
                if np.isnan(value) and not np.isnan(arr[cnt-1]):
                    arr[cnt] = arr[cnt-1]
            # If the first values of file are nan ignore them all
            elif np.isnan(value) and np.isnan(arr[cnt-1]):
                continue
            # Found nan value and x-1 value is of type float
            elif np.isnan(value) and not np.isnan(arr[cnt-1]):
                # Check if x+1 value is not nan
                if not np.isnan(arr[cnt+1]):
                    arr[cnt] = '%.1f' % np.mean((arr[cnt-1], arr[cnt+1]))
                # If x+1 value is nan, go to next value
                else:
                    for N in xrange(2, 30):
                        if cnt+N == (len(arr)):
                            break
                        elif not np.isnan(arr[cnt+N]):
                            arr[cnt] = '%.1f' % np.mean((arr[cnt-1], arr[cnt+N]))
        return arr
Hi Thomas,

Your array is Nx6. Do you want the NaN values replaced by the mean of the two adjacent elements by row or by column?
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
* On 13/08/2013 23:32, David Reed wrote:
Hi Thomas,
Your array is Nx6. Do you want the NaN values replaced by the mean of the two adjacent elements by row or by column?
Hi David,

I want it to be replaced by column.

I also found numpy.interp, but this function also replaces the NaN values at the beginning and end of the array, which should be left untouched.

As an example:

    y = np.array([np.nan, np.nan, 1, 2, 3, np.nan, np.nan, 4, np.nan, 5, np.nan, np.nan])
    nans = np.isnan(y)

Only the values y[5:7] and y[8] should be replaced. Is it possible to set nans[0:2] and nans[-2:] to False with something like nans.startswith or nans.endswith?
Yeah, interp is what you want. What you do with the end values is up to you, but it could be done like this:

    # Indices of the valid (non-NaN) entries
    ind = np.where(np.logical_not(np.isnan(y)))[0]
    y1 = np.interp(range(len(y)), ind, y[ind])
    # Trim to the span from the first to the last valid value (inclusive)
    y1 = y1[ind[0]:ind[-1] + 1]
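If the leading and trailing NaNs should stay in place rather than be trimmed off, one possible variant is to interpolate only between the first and the last valid index. The sketch below is an illustration, not code from the thread: the helper name interp_interior_nans is hypothetical, and it assumes a 1-D float column. Note that linear interpolation weights by distance, so for a run of several NaNs the result is not the plain mean of the two nearest valid values; for a single isolated NaN the two coincide.

```python
import numpy as np

def interp_interior_nans(column):
    """Linearly interpolate NaNs lying between valid values; keep edge NaNs.

    Hypothetical helper for illustration; expects a 1-D sequence of floats.
    """
    y = np.array(column, dtype=float)        # np.array copies, input stays untouched
    valid = np.flatnonzero(~np.isnan(y))     # indices of the non-NaN entries
    if valid.size < 2:
        return y                             # nothing to interpolate between
    first, last = valid[0], valid[-1]
    x = np.arange(first, last + 1)           # only the span between valid ends
    y[first:last + 1] = np.interp(x, valid, y[valid])
    return y

y = np.array([np.nan, np.nan, 1, 2, 3, np.nan, np.nan, 4, np.nan, 5, np.nan, np.nan])
filled = interp_interior_nans(y)
print(filled)
```

For an (N, 6) array processed by column, the same helper could be applied with np.apply_along_axis(interp_interior_nans, 0, data), at the cost of one Python-level call per column.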
The example data and method you've provided don't do what you describe. For example, your example data contains several 2x2 blocks of NaNs. According to your description, these should not be replaced (as each of them has a neighbor that is also a NaN). Your example method, however, replaces them. In fact, it replaces any NaN values that are not in the first or last row or contiguous with NaNs in the first or last row.

Here's a replacement method that does do what you've described:

    import numpy as np

    def nan_to_mean(data):
        inner = data[1:-1]    # view on the interior rows, so writes modify data
        mask = np.isnan(inner)
        # A NaN neighbor makes the mean NaN, so such entries stay NaN
        inner[mask] = ((data[:-2] + data[2:]) / 2)[mask]
        return data

~Brett
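To make the difference described above concrete, here is a small sketch that spells out the three steps from the original post with explicit masks and runs them on a made-up array containing one isolated NaN and one 2x2 NaN block. The helper name fill_isolated_nans and the sample data are assumptions for illustration only; unlike the in-place method above, this version returns a copy.

```python
import numpy as np

def fill_isolated_nans(data):
    """Fill NaNs whose neighbours in the previous and next row are both valid.

    Hypothetical helper; works along axis 0, i.e. by column for an (N, 6) array.
    """
    data = np.asarray(data, dtype=float)
    out = data.copy()
    nans = np.isnan(data)                            # 1. find all NaNs
    fillable = nans[1:-1] & ~nans[:-2] & ~nans[2:]   # 2. both neighbours valid
    means = (data[:-2] + data[2:]) / 2.0             # 3. neighbour means
    out[1:-1][fillable] = means[fillable]            # out[1:-1] is a view of out
    return out

# Made-up sample data, not the example_dataset.txt from the thread
data = np.array([
    [1.0, 10.0, 7.0],
    [np.nan, 20.0, 7.0],     # isolated NaN in column 0
    [3.0, np.nan, np.nan],   # 2x2 NaN block in columns 1 and 2
    [4.0, np.nan, np.nan],
    [5.0, 50.0, 7.0],
])
filled = fill_isolated_nans(data)
print(filled)
```

The isolated NaN in column 0 becomes 2.0, the mean of 1.0 and 3.0, while the four entries of the 2x2 block remain NaN because each has a NaN neighbor in its own column.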
participants (3)
- Brett Olsen
- David Reed
- Thomas Goebel