Mailman 3 Optimize removing nan-values of dataset - NumPy-Discussion

13 Aug 2013

Hi,

I am trying to remove nan values from an array of shape (40, 6).
A nan value at data[x] should be replaced by the mean of data[x-1]
and data[x+1] if neither of those two values is nan. The function
nan_to_mean (see below) works, but I wonder whether the code could
be optimized.

I thought about something like this:
  1. Find all nan values in the array:
     nans = np.isnan(dataarray)
  2. Check that the values before and after each nan index are not nan
  3. Calculate the mean of those two values
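
The three steps above might be sketched with boolean masks roughly as
follows. This is only a sketch under assumptions: the name
fill_isolated_nans is made up for illustration, the input is assumed to
be a 1-D or 2-D float array with the samples running along axis 0, and
only nan values whose two direct neighbours are both valid are filled.
The first and last rows and longer runs of nan are left untouched, and
the result is not rounded to one decimal.

```python
import numpy as np


def fill_isolated_nans(data):
    """Hypothetical helper: replace data[x] by the mean of data[x-1] and
    data[x+1] wherever data[x] is nan and both neighbours are not nan."""
    data = np.array(data, dtype=float)  # work on a float copy of the input
    nans = np.isnan(data)               # step 1: locate the nan values

    # step 2: interior nan values whose previous and next rows are valid
    fill = np.zeros_like(nans)
    fill[1:-1] = nans[1:-1] & ~nans[:-2] & ~nans[2:]

    # step 3: mean of the two neighbours, computed for every interior row
    means = np.full_like(data, np.nan)
    means[1:-1] = 0.5 * (data[:-2] + data[2:])

    # only the positions selected in step 2 are overwritten
    data[fill] = means[fill]
    return data
```

The call is the same for a single column and for the whole array, for
example cleaned = fill_isolated_nans(dataarray). Because the work is
done on whole-array slices rather than in a Python loop, the cost per
element should be much lower, though I have not timed it on the real
dataset.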

With my original dataset of shape (63856, 6), the script takes
139.343 seconds to run, and some datasets are even bigger. I have
attached example_dataset.txt and the example.py script.

Thanks for any help,
Tom

import numpy as np


def nan_to_mean(arr):
    for cnt, value in enumerate(arr):
        # Check if first value is nan, if so continue
        if cnt == 0 and np.isnan(value):
            continue
        # Check if last value is nan:
        #     If x-1 value is nan, don't do anything!
        #     If x-1 is float, last value will be value of x-1
        elif cnt == (len(arr)-1):
            if np.isnan(value) and not np.isnan(arr[cnt-1]):
                arr[cnt] = arr[cnt-1]
        # If the first values of file are nan ignore them all
        elif np.isnan(value) and np.isnan(arr[cnt-1]):
            continue
        # Found nan value and x-1 value is of type float
        elif np.isnan(value) and not np.isnan(arr[cnt-1]):
            # Check if x+1 value is not nan
            if not np.isnan(arr[cnt+1]):
                arr[cnt] = '%.1f' % np.mean((
                        arr[cnt-1],arr[cnt+1]))
            # If x+1 value is nan, go to next value
            else:
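                # Look up to 29 positions ahead for the next non-nan value
                for N in range(2, 30):
                    if cnt+N == (len(arr)):
                        break
                    elif not np.isnan(arr[cnt+N]):
                        arr[cnt] = '%.1f' % np.mean(
                                (arr[cnt-1], arr[cnt+N]))
                        # Stop at the first non-nan value found
                        break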
    return arr
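
For runs of several nan values, one loop-free alternative is linear
interpolation with np.interp. This is a different rule from the loop
above, not a drop-in replacement: each gap is filled along a straight
line between the valid values on either side, nothing is rounded, and
there is no limit on how far ahead it looks. The name interp_nans_1d
is hypothetical, and the choice to leave leading nan values as nan
while trailing ones take the last valid value is an assumption made
here for the sketch.

```python
import numpy as np


def interp_nans_1d(col):
    """Hypothetical helper: linearly interpolate over nan gaps in a
    1-D array and return the result as a new float array."""
    col = np.array(col, dtype=float)   # float copy, input is not modified
    valid = ~np.isnan(col)
    if not valid.any():
        return col                     # nothing to interpolate from
    idx = np.arange(col.size)
    # left=np.nan keeps leading nan values; by default np.interp fills
    # positions past the last valid sample with the last valid value
    return np.interp(idx, idx[valid], col[valid], left=np.nan)
```

For a 2-D array such as the (40, 6) example, it could be applied per
column with np.apply_along_axis(interp_nans_1d, 0, dataarray).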

Thomas Goebel

David Reed

Brett Olsen
