[Numpy-discussion] member1d and unique elements

Sun Aug 3 12:42:19 EDT 2008

I have two arrays of integers, and would like to know _where_ they
have elements in common, not just _which_ elements are in common.
This is because the entries in the integer array are aligned with
other arrays.  This seems very close to what member1d advertises as
its function.  However, member1d says that it expects arrays with only
unique elements.

First of all, my desired operation is well-posed:  I'd like f(ar1,
ar2) to return something in the shape of ar1 with True if the value at
that position appears anywhere in ar2 (regardless of duplication) and
False otherwise.

So I looked at the code and have two questions:
1) What is this code trying to achieve?
    aux = perm[ii+1]
    perm[ii+1] = perm[ii]
    perm[ii] = aux

Here perm is the stable argsort of the two concatenated arguments:
perm = concatenate((ar1, ar2)).argsort(kind='mergesort').
arr is the array of combined inputs in sorted order:
arr = concatenate((ar1, ar2))[perm]
and ii is a list of indices into arr where the value of arr is equal
to the next value in the array (arr[ii] == arr[ii+1]) _and_ arr[ii]
came from the _second_ input (ar2).

Now, this last bit (looking for elements of arr that are equal and
both came from the second array) is clearly trying to deal with
duplication, which is why I'm interested...

So, the code snippet is trying to swap perm[ii+1] with perm[ii], but I
don't see why.  Furthermore, there are funny results if a value is
duplicated three times, not just twice -- perm is no longer a
permutation vector.  Eg, member1d([1], [2,2,2]) results perm=[0,1,2,3]
and ii=[1,2] before the above snippet, and the above snippet makes
perm into [0,2,3,2]

I've commented those three lines, and I've never seen any changes to
the output of member1d.  The new value of perm is used to compute the
expression: perm.argsort(kind='mergesort')[:len( ar1 )], but the
changes to that expression as a result of the above three lines are
always at the high end of the array, which is sliced off by the last
[:len(ar1)].

Finally, my second question is:
2) Does anyone have a test case where member1d fails as a result of
duplicates in the input?  So far I haven't found any, with the above
lines commented or not.

Upon reflection and review of the changelog, another theory occurs to
me: member1d did not originally use a stable sort.  What I've written
above for interpretation of the value ii (indicates duplication within
ar2) is true for a stable sort, but for an unstable sort the same
condition has the interpretation that ii holds the values where the
sorting algorithm swapped the order of equal values unstably.  Then
the code snippet in question 1) looks like an attempt to swap those
values in the permutation array to make the sort stable again.  The
attempt would fail if there was duplication in either array.

So, I would propose deleting those three lines (since they seem to be
a non-functional relic) and declaring in the docstring that member1d
doesn't require unique elements.

Also, if this is correct, then the function simplifies considerably
since several values don't need to be computed anymore:

def setmember1d( ar1, ar2 ):
    ar = nm.concatenate( (ar1, ar2 ) )
    perm = ar.argsort(kind='mergesort')
    aux = ar[perm]
    flag = nm.concatenate( (aux[1:] == aux[:-1], [False] ) )
    indx = perm.argsort(kind='mergesort')[:len( ar1 )]
    return flag[indx]

Corrections to the above are welcome since I'm going to start using
member1d without regard for uniqueness, and I'd like to know if I'm
making a big mistake...

Thanks,
Greg