Mailman 3 Curious performance different with np.unique on arrays of characters - NumPy-Discussion

Sept. 14, 2023

      Hello -

In the course of some genomics simulations, I seem to have come across a curious (to me at least) performance difference in np.unique that I wanted to share. (If this is not the right forum for this, please let me know!)

With a np.array of characters (U1), np.unique seems to be much faster when doing np.view as int -> np.unique -> np.view as U1 for arrays of decent size. I would not have expected this since np.unique knows what's coming in as S1 and could handle the view-stuff internally. I've played with this a number of ways (e.g. S1 vs U1; int32 vs int64; return_counts = True vs False; 100, 1000, or 10k elements) and seem to notice the same pattern. A short illustration below with U1, int32, return_counts = False, 10 vs 10k.

I wonder if this is actually intended behavior, i.e. the view-stuff is actually a good idea for the user to think about and implement if appropriate for their usecase (as it is for me).

Best regards,
Shyam

import numpy as np

charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1')
charlist_10k = np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')

def unique_basic(x):
    return np.unique(x)

def unique_view(x):
    return np.unique(x.view(np.int32)).view(x.dtype)

In [27]: %timeit unique_basic(charlist_10)
2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [28]: %timeit unique_view(charlist_10)
2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [29]: %timeit unique_basic(charlist_10k)
204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit unique_view(charlist_10k)
66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [31]: np.__version__
Out[31]: '1.25.2'

--
Shyam Saladi
https://shyam.saladi.org

Curious performance different with np.unique on arrays of characters

saladi＠caltech.edu

Nathan

Devulapalli, Raghuveer

Charles R Harris

saladi＠caltech.edu

Lyla Watts

Klaus Zimmermann

Sebastian Berg

Nathan

Devulapalli, Raghuveer

Charles R Harris

saladi＠caltech.edu

Lyla Watts

Klaus Zimmermann

Sebastian Berg

tags

participants (7)