Curious performance difference with np.unique on arrays of characters
Hello,

In the course of some genomics simulations, I seem to have come across a curious (to me at least) performance difference in np.unique that I wanted to share. (If this is not the right forum for this, please let me know!)

With an np.array of characters (U1), np.unique seems to be much faster when doing view as int -> np.unique -> view back as U1 for arrays of decent size. I would not have expected this, since np.unique knows the input dtype (U1 here) and could handle the view trick internally. I've played with this a number of ways (e.g. S1 vs U1; int32 vs int64; return_counts=True vs False; 100, 1000, or 10k elements) and notice the same pattern. A short illustration below with U1, int32, return_counts=False, 10 vs 10k elements.

I wonder if this is actually intended behavior, i.e. whether the view trick is something the user is meant to think about and implement if appropriate for their use case (as it is for me).

Best regards,
Shyam

import numpy as np

charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1')
charlist_10k = np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')

def unique_basic(x):
    return np.unique(x)

def unique_view(x):
    return np.unique(x.view(np.int32)).view(x.dtype)

In [27]: %timeit unique_basic(charlist_10)
2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [28]: %timeit unique_view(charlist_10)
2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [29]: %timeit unique_basic(charlist_10k)
204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit unique_view(charlist_10k)
66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [31]: np.__version__
Out[31]: '1.25.2'

--
Shyam Saladi
https://shyam.saladi.org
Looking at a py-spy profile of a slightly modified version of the code you shared, it seems the difference comes down to NumPy's sorting implementation simply being faster for ints than for unicode strings. In particular, it looks like string_quicksort_<npy::unicode_tag, char> is two or three times slower than quicksort_<npy::int_tag, int> when passed the same data.

We could probably add a special case in the sorting code to improve performance for sorting single-character arrays. I have no idea whether that would be complicated or would make the code difficult to maintain. I'll also note that string sorting is a more general problem than integer sorting, since a generic string sort can't assume that it is handed single-character strings.

Note also that U1 arrays are arrays of a single *unicode* character, which NumPy stores as a 4-byte codepoint. If all you care about is ASCII or Latin-1 characters, an S1 encoding will be a bit faster. On my machine, using S1 takes unique_basic(charlist_10k) from 466 µs to 400 µs. I can also rewrite unique_view in that case to view as int8, which takes the runtime of unique_view(charlist_10k) from 172 µs to 155 µs.

On Thu, Sep 14, 2023 at 8:10 AM <saladi@caltech.edu> wrote:
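The S1/int8 variant described in the reply above can be sketched as follows (an illustrative sketch; the name unique_view_s1 is made up for this example):

```python
import numpy as np

# Same data as before, but stored as 1-byte ASCII (S1) rather than
# 4-byte unicode codepoints (U1).
charlist_10k_s1 = np.array(list('ASDFGHJKLZ' * 1000), dtype='S1')

def unique_view_s1(x):
    # Reinterpret the 1-byte strings as int8, deduplicate, view back.
    return np.unique(x.view(np.int8)).view(x.dtype)

# The view-based version returns the same unique values.
assert set(unique_view_s1(charlist_10k_s1)) == set(np.unique(charlist_10k_s1))
```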
What processor are you running this on? np.sort uses AVX-512 accelerated sorting for np.int32, so I'm just wondering if that is the reason for this difference.

Raghuveer
-----Original Message-----
From: saladi@caltech.edu <saladi@caltech.edu>
Sent: Wednesday, September 13, 2023 6:14 PM
To: numpy-discussion@python.org
Subject: [Numpy-discussion] Curious performance difference with np.unique on arrays of characters
On Thu, Sep 14, 2023 at 11:34 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
What processor are you running this on? np.sort uses AVX-512 accelerated sorting for np.int32, so I'm just wondering if that is the reason for this difference.
Raghuveer
We also have radix sort for stable sorting of int8 and int16, which should be quite fast. Hmm, I wonder if radix sort could be vectorized?

When we dropped Python 2.7, there were some folks who ended up using a uint8 array subtype for storing data. All they needed to add was automatic translation to strings for certain accesses. This gave them a 4x advantage in storage space.
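The uint8 storage trick mentioned above might look roughly like this (a hypothetical sketch, not the actual subtype those users wrote):

```python
import numpy as np

# Hypothetical sketch: keep sequence data as raw uint8 codes (1 byte
# per base) and decode to a Python string only when text is needed.
seq = np.frombuffer(b'ACGTACGT', dtype=np.uint8)

def decode(arr):
    # Translate uint8 codes back to str for display.
    return arr.tobytes().decode('ascii')

assert decode(seq) == 'ACGTACGT'
# The same data as a U1 array needs 4 bytes per character,
# hence the 4x storage advantage.
assert np.array(list('ACGTACGT'), dtype='U1').nbytes == 4 * seq.nbytes
```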
Could you share the processor you're currently running this on? I ask because np.sort leverages AVX-512 acceleration for sorting np.int32, and I'm curious if that could be contributing to the observed difference in performance.
Hi,

one thing that's been on my mind about this discussion:

Isn't sorting strings simply a much harder job? Particularly Unicode strings?

Cheers
Klaus

On 27/09/2023 13:12, Lyla Watts wrote:
On Fri, 2023-09-29 at 11:39 +0200, Klaus Zimmermann wrote:
Hi,
one thing that's been on my mind about this discussion:
Isn't sorting strings simply a much harder job? Particularly Unicode strings?
Yes, but in theory, if they are length 1, it is just sorting integers (8- or 64-bit) for the current quirky NumPy fixed-length string dtypes. Modulo complicated stuff that Python doesn't worry about either [1].

But of course, that is in theory. In practice we have a single implementation that deals with arbitrary string lengths, so the code does a lot of extra stuff (and it is harder to use fancy tricks; our implementation for a lot of these things is very basic). Also, while we do now have the flexibility to create such a specialization, we don't actually have an obvious place to add it (of course you can always insert an `if ...` clause somewhere, but that isn't a nice design).

- Sebastian

[1] In principle you are right: sorting unicode is complicated! In practice, that is your problem as a user, though. If you want to deal with weirder things, you have to normalize the unicode first, etc.
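The length-1 equivalence above can be checked directly: sorting the 4-byte codepoints of a U1 array as int32 produces the same order as sorting the strings themselves, since NumPy compares unicode by codepoint (a sketch, assuming no normalization concerns):

```python
import numpy as np

rng = np.random.default_rng(0)
chars = rng.choice(np.array(list('ACGT'), dtype='U1'), size=1000)

# Reinterpret each 4-byte codepoint as an int32, sort, and view back.
via_int = np.sort(chars.view(np.int32)).view('U1')

# Same ordering as sorting the U1 strings directly.
assert (via_int == np.sort(chars)).all()
```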
participants (7)
- Charles R Harris
- Devulapalli, Raghuveer
- Klaus Zimmermann
- Lyla Watts
- Nathan
- saladi@caltech.edu
- Sebastian Berg