
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? Thanks, Andrew

On Thu, May 8, 2008 at 10:51 PM, Andrew Straw <strawman@astraw.com> wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search,
Yes.
which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents are; I have a strict policy of not reading patents. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On Thu, May 8, 2008 at 9:55 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw <strawman@astraw.com> wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search,
Yes.
which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents are; I have a strict policy of not reading patents.
I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup. Chuck

On Thu, May 8, 2008 at 10:30 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, May 8, 2008 at 9:55 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw <strawman@astraw.com> wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search,
Yes.
which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents are; I have a strict policy of not reading patents.
I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup.
One way I can think of doing this is to have two indices. One is the usual sorted list, the second consists of, say, every 1024'th entry in the first list. Then search the second list first to find the part of the first list to search. That won't get you into the very best cache, but it could buy you a factor of 2x-4x in speed. It's sort of splitting the binary tree into two levels. Chuck

On Thu, May 8, 2008 at 11:06 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, May 8, 2008 at 10:30 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, May 8, 2008 at 9:55 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw <strawman@astraw.com> wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search,
Yes.
which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents are; I have a strict policy of not reading patents.
I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup.
One way I can think of doing this is to have two indices. One is the usual sorted list, the second consists of, say, every 1024'th entry in the first list. Then search the second list first to find the part of the first list to search. That won't get you into the very best cache, but it could buy you a factor of 2x-4x in speed. It's sort of splitting the binary tree into two levels.
You could even do it with your data from python, generating the second list and calling searchsorted multiple times. If you are searching for a bunch of values, it's probably good to also sort them first so they bunch together in the same part of the big list. Chuck
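[A minimal sketch of that two-level idea in plain numpy might look like the following. The function name, the block size of 1024, and the synthetic data are illustrative assumptions only, not anything from the fastsearch code discussed later in the thread.]

import numpy as np

def two_level_searchsorted(big, coarse, step, key):
    # `coarse` holds every `step`-th element of the sorted array `big`,
    # so it is small enough to stay cache-resident.
    j = np.searchsorted(coarse, key)        # first coarse entry >= key
    lo = (j - 1) * step if j > 0 else 0     # the answer lies in big[lo:hi]
    hi = min(j * step + 1, len(big))
    # Finish with a binary search over one small, contiguous slice of big.
    return lo + np.searchsorted(big[lo:hi], key)

# Illustrative usage with synthetic data:
big = np.sort(np.random.randint(0, 10**9, size=10**6)).astype(np.int64)
step = 1024
coarse = big[::step].copy()                 # compact first-level index
key = int(big[123456])
assert two_level_searchsorted(big, coarse, step, key) == np.searchsorted(big, key)

[The coarse index costs only about len(big)/step extra elements, which is the space/time trade-off described above.]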

On Friday 09 May 2008, Andrew Straw wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
Well, if you can afford extra space for the hashes you can always use a dictionary for doing the lookups. In pure Python they are around 3x faster (for arrays of 8 million elements) than binary searches. If your space is tight, you can build an extension (for example in Pyrex) for doing binary search for your specific type (int64), for a small improvement. Finally, if you combine this approach with what Charles Harris is suggesting (i.e. creating several levels of caches, but not more than two, which in my experience works best), you can have pretty optimal lookups with relatively low space overhead. See this thread for a discussion of the binary/hash lookup approaches: http://mail.python.org/pipermail/python-list/2007-November/465900.html Hope that helps, -- Francesc Alted
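[A minimal sketch of the dictionary approach, assuming every key you will ever look up is actually present in the array (a hash can answer "at which position is this exact value", but unlike searchsorted it cannot answer "where would this value be inserted"). The variable names and sizes are only illustrative.]

import numpy as np

# Synthetic sorted array of unique int64 values.
haystack = np.unique(np.random.randint(0, 10**9, size=10**6)).astype(np.int64)

# Build the hash table once: value -> position in the sorted array.
lookup = dict((int(v), i) for i, v in enumerate(haystack))

# Each lookup is then O(1) instead of O(log n), at the cost of the dict's memory.
key = int(haystack[54321])
assert lookup[key] == np.searchsorted(haystack, key)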

Hi, I don't know if it helps, but Ulrich Drepper had a 9-part series about memory in Linux Weekly News (http://lwn.net). You can find all 9 linked under his name in the Guest archives (http://lwn.net/Archives/GuestIndex/), as not all are linked together. The first one is "What every programmer should know about memory, Part 1" (September 21, 2007): http://lwn.net/Articles/250967/ In part 5 there was a comment on 'Cache-oblivious algorithms' by akapoor: "i guess it's worth mentioning harald-prokop's 1999 thesis on "cache oblivious algorithms" (http://citeseer.ist.psu.edu/prokop99cacheobliviou.html)."
Regards,
Bruce
Francesc Alted wrote:
On Friday 09 May 2008, Andrew Straw wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
Well, if you can afford extra space for the hashes you can always use a dictionary for doing the lookups. In pure Python they are around 3x faster (for arrays of 8 million elements) than binary searches. If your space is tight, you can build an extension (for example in Pyrex) for doing binary search for your specific type (int64), for a small improvement. Finally, if you combine this approach with what Charles Harris is suggesting (i.e. creating several levels of caches, but not more than two, which in my experience works best), you can have pretty optimal lookups with relatively low space overhead.
See this thread for a discussion of the binary/hash lookup approaches:
http://mail.python.org/pipermail/python-list/2007-November/465900.html
Hope that helps,

On Thu, May 8, 2008 at 9:51 PM, Andrew Straw <strawman@astraw.com> wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
What sort of algorithm is best also depends on the use. If you have a 25e6 sized table that you want to interpolate through with another set of 25e6 indices, then binary search is the wrong approach. In that case you really want to start from the last position and search forward with increasing steps to bracket the next value. Basically, binary search is order n*ln(m), where n is the size of the index list and m the size of the table. The sequential way is nearly n + m, which will be much better if n and m are of comparable size. Chuck
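[A sketch of that sequential/bracketing idea (essentially a galloping search), assuming the query values are already sorted so each search can pick up where the previous one stopped. The function name and data are illustrative.]

import numpy as np

def galloping_searchsorted(table, sorted_keys):
    # Insertion indices of ascending `sorted_keys` into the sorted `table`.
    out = np.empty(len(sorted_keys), dtype=np.intp)
    pos = 0
    n = len(table)
    for i, key in enumerate(sorted_keys):
        # Walk forward with doubling steps until the key is bracketed.
        step = 1
        while pos + step < n and table[pos + step] < key:
            step *= 2
        hi = min(pos + step, n)
        # Finish with a binary search inside the small bracketed window.
        pos += int(np.searchsorted(table[pos:hi], key))
        out[i] = pos
    return out

# Illustrative check against numpy's searchsorted:
table = np.sort(np.random.randint(0, 10**9, size=10**5)).astype(np.int64)
queries = np.sort(np.random.randint(0, 10**9, size=10**4)).astype(np.int64)
assert np.array_equal(galloping_searchsorted(table, queries),
                      np.searchsorted(table, queries))

[This is pure Python and therefore slow as written; the point is only the access pattern, which sweeps the table roughly once from front to back instead of doing log2(m) scattered probes per key.]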

Hi Andrew
2008/5/9 Andrew Straw <strawman@astraw.com>:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I found Francesc Altet's Pyrex implementation at
http://mail.python.org/pipermail/python-list/2007-November/466503.html
I modified it for use with Cython and added some tests:
https://code.launchpad.net/~stefanv/+junk/my_bisect
That may be a good starting point for further experimentation. As it is, it is already about 10 times faster than the built-in version (since I can assume we're working with int64's, so no special type checking is done).
Regards
Stéfan

On Sat, May 10, 2008 at 9:31 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
Hi Andrew
2008/5/9 Andrew Straw <strawman@astraw.com>:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
I found Francesc Altet's Pyrex implementation at
http://mail.python.org/pipermail/python-list/2007-November/466503.html
I modified it for use with Cython and added some tests:
That may be a good starting point for further experimentation. As it is, it is already about 10 times faster than the built-in version.
The built-in version is in C, but not type specific. It could be moved to the generated code base easily enough. The slow part is here:
if (compare(parr + elsize*imid, pkey, key) < 0)
Chuck

Thanks for all the comments on my original question. I was more offline than intended after I sent it until now, so I'm sorry I wasn't immediately able to participate in the discussion.

Anyhow, after working on this a bit more, I came up with a few implementations of search algorithms doing just what I needed with the same interface, available using bazaar and launchpad at http://launchpad.net/~astraw/+junk/fastsearch (MIT license). I have attached the output of the plot_comparisons.py benchmarking script to this email (note that this benchmarking is pretty crude).

For the problem I originally wrote about, I get a nearly unbelievable speedup of ~250x using the fastsearch.downsamp.DownSampledPreSearcher class, which is very similar in spirit to Charles' suggestion. It takes 1000 values from the original array to create a new first-level array that is itself localized in memory and points to a more localized region of the full original array. Also, I get a similar (though slightly slower) result using AVL trees via the fastsearch.avlsearch.AvlSearcher class, which uses pyavl ( http://sourceforge.net/projects/pyavl ).

Using the benchmarking code included in the bzr branch, I don't get anything like this speedup (e.g. the attached figure), so I'm not sure exactly what's going on at this point, but I'm not going to argue with a 250x speedup, so the fastsearch.downsamp code is now being put to use in one of my projects.

Stéfan -- I think your code simply implements the classic binary search -- I don't see how it will reduce cache misses.

Anyhow, perhaps someone will find the above useful. I guess it would still be a substantial amount of work to make a numpy-types-aware implementation of AVL trees or similar algorithms. These sorts of binary search trees seem like the right way to solve this problem, and thus there might be an interesting project in this. I imagine that a numpy-types-aware Cython might make such an implementation significantly easier and still blazingly fast compared to the binary search implemented in searchsorted(), given today's cached memory architectures.

-Andrew

Andrew Straw wrote:
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching was faster, since I'd make the index once but do the searching many times.
Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm?
Thanks, Andrew

On Tue, May 13, 2008 at 5:59 PM, Andrew Straw <strawman@astraw.com> wrote:
Thanks for all the comments on my original question. I was more offline than intended after I sent it until now, so I'm sorry I wasn't immediately able to participate in the discussion.
Anyhow, after working on this a bit more, I came up with a few implementations of search algorithms doing just what I needed with the same interface, available using bazaar and launchpad at http://launchpad.net/~astraw/+junk/fastsearch (MIT license). I have attached the output of the plot_comparisons.py benchmarking script to this email (note that this benchmarking is pretty crude).
For the problem I originally wrote about, I get a nearly unbelievable speedup of ~250x using the fastsearch.downsamp.DownSampledPreSearcher class, which is very similar in spirit to Charles' suggestion. It takes 1000 values from the original array to create a new first-level array that is itself localized in memory and points to a more localized region of the full original array. Also, I get a similar (though slightly slower) result using AVL trees via the fastsearch.avlsearch.AvlSearcher class, which uses pyavl ( http://sourceforge.net/projects/pyavl ).
Using the benchmarking code included in the bzr branch, I don't get anything like this speedup (e.g. the attached figure), so I'm not sure exactly what's going on at this point, but I'm not going to argue with a 250x speedup, so the fastsearch.downsamp code is now being put to use in one of my projects.
Stefan -- I think your code simply implements the classic binary search -- I don't see how it will reduce cache misses.
Anyhow, perhaps someone will find the above useful. I guess it would still be a substantial amount of work to make a numpy-types-aware implementation of AVL trees or similar algorithms. These sorts of binary search trees seem like the right way to solve this problem and thus there might be an interesting project in this. I imagine that a numpy-types-aware Cython might make such implementation significantly easier and still blazingly fast compared to the binary search implemented in searchsorted() given today's cached memory architectures.
That's pretty amazing, but I don't understand the graph. The DownSampled search looks like the worst. Are the curves mislabeled? Are the axes correct? I'm assuming smaller is better here. Chuck

Charles R Harris wrote:
On Tue, May 13, 2008 at 5:59 PM, Andrew Straw <strawman@astraw.com> wrote:
Thanks for all the comments on my original question. I was more offline than intended after I sent it until now, so I'm sorry I wasn't immediately able to participate in the discussion.
Anyhow, after working on this a bit more, I came up with a few implementations of search algorithms doing just what I needed with the same interface, available using bazaar and launchpad at http://launchpad.net/~astraw/+junk/fastsearch (MIT license). I have attached the output of the plot_comparisons.py benchmarking script to this email (note that this benchmarking is pretty crude).
For the problem I originally wrote about, I get a nearly unbelievable speedup of ~250x using the fastsearch.downsamp.DownSampledPreSearcher class, which is very similar in spirit to Charles' suggestion. It takes 1000 values from the original array to create a new first-level array that is itself localized in memory and points to a more localized region of the full original array. Also, I get a similar (though slightly slower) result using AVL trees via the fastsearch.avlsearch.AvlSearcher class, which uses pyavl ( http://sourceforge.net/projects/pyavl ).
Using the benchmarking code included in the bzr branch, I don't get anything like this speedup (e.g. the attached figure), so I'm not sure exactly what's going on at this point, but I'm not going to argue with a 250x speedup, so the fastsearch.downsamp code is now being put to use in one of my projects.
Stefan -- I think your code simply implements the classic binary search -- I don't see how it will reduce cache misses.
Anyhow, perhaps someone will find the above useful. I guess it would still be a substantial amount of work to make a numpy-types-aware implementation of AVL trees or similar algorithms. These sorts of binary search trees seem like the right way to solve this problem and thus there might be an interesting project in this. I imagine that a numpy-types-aware Cython might make such implementation significantly easier and still blazingly fast compared to the binary search implemented in searchsorted() given today's cached memory architectures.
That's pretty amazing, but I don't understand the graph. The DownSampled search looks like the worst. Are the curves mislabeled? Are the axes correct? I'm assuming smaller is better here.
The lines are labeled properly -- the graph is inconsistent with the findings on my real data (not shown), which is what I meant by "Using the benchmarking code included in the bzr branch, I don't get anything like this speedup (e.g. the attached figure)". My guess is that the BinarySearcher's time climbs terribly under some usage pattern that isn't being exhibited with this test. I'm really not sure yet what the important difference is between my real data and these synthetic data. I will keep the list posted as I find out more. Clearly, on the synthetic data for the benchmark, the BinarySearcher does pretty well when N items is large. This is quite contrary to my theory about cache misses being the root of my problem with the binary search, so I don't understand it at the moment, but certainly both of the other searchers perform better on my real data. I will post any new insights as I continue to work on this... -Andrew

I will post any new insights as I continue to work on this...
OK, I've isolated a sample of my data that illustrates the terrible performance with the binarysearch. I have uploaded it as a pytables file to http://astraw.com/framenumbers.h5 in case anyone wants to have a look themselves. Here's an example of the type of benchmark I've been running:

import fastsearch.downsamp
import fastsearch.binarysearch
import tables

h5 = tables.openFile('framenumbers.h5', mode='r')
framenumbers = h5.root.framenumbers.read()
keys = h5.root.keys.read()
h5.close()

def bench( implementation ):
    for key in keys:
        implementation.index( key )

downsamp = fastsearch.downsamp.DownSampledPreSearcher( framenumbers )
binary = fastsearch.binarysearch.BinarySearcher( framenumbers )

# The next two lines are IPython-specific, and the 2nd takes a looong time:
%timeit bench(downsamp)
%timeit bench(binary)

Running the above gives:

In [14]: %timeit bench(downsamp)
10 loops, best of 3: 64 ms per loop

In [15]: %timeit bench(binary)
10 loops, best of 3: 184 s per loop

Quite a difference (a factor of about 3000)! At this point, I haven't delved into the dataset to see what makes it so pathological -- performance is nowhere near this bad for the binary search algorithm with other sets of keys.

Andrew Straw wrote:
I have uploaded it as a pytables file to http://astraw.com/framenumbers.h5
Ahh, forgot to mention a potentially important point -- this data file is 191 MB.

On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <strawman@astraw.com> wrote:
I will post any new insights as I continue to work on this...
OK, I've isolated a sample of my data that illustrates the terrible performance with the binarysearch. I have uploaded it as a pytables file to http://astraw.com/framenumbers.h5 in case anyone wants to have a look themselves. Here's an example of the type of benchmark I've been running:
import fastsearch.downsamp
import fastsearch.binarysearch
import tables

h5 = tables.openFile('framenumbers.h5', mode='r')
framenumbers = h5.root.framenumbers.read()
keys = h5.root.keys.read()
h5.close()

def bench( implementation ):
    for key in keys:
        implementation.index( key )

downsamp = fastsearch.downsamp.DownSampledPreSearcher( framenumbers )
binary = fastsearch.binarysearch.BinarySearcher( framenumbers )

# The next two lines are IPython-specific, and the 2nd takes a looong time:
%timeit bench(downsamp)
%timeit bench(binary)
Running the above gives:
In [14]: %timeit bench(downsamp) 10 loops, best of 3: 64 ms per loop
In [15]: %timeit bench(binary)
10 loops, best of 3: 184 s per loop
Quite a difference (a factor of about 3000)! At this point, I haven't delved into the dataset to see what makes it so pathological -- performance is nowhere near this bad for the binary search algorithm with other sets of keys.
It can't be that bad Andrew, something else is going on. And 191 MB isn't *that* big, I expect it should fit in memory with no problem. Chuck

Charles R Harris wrote:
On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <strawman@astraw.com> wrote:
Quite a difference (a factor of about 3000)! At this point, I haven't delved into the dataset to see what makes it so pathological -- performance is nowhere near this bad for the binary search algorithm with other sets of keys.
It can't be that bad Andrew, something else is going on. And 191 MB isn't *that* big, I expect it should fit in memory with no problem.
I agree the performance difference seems beyond what one would expect due to cache misses alone. I'm at a loss to propose other explanations, though. Ideas?

On Wed, May 14, 2008 at 2:00 PM, Andrew Straw <strawman@astraw.com> wrote:
Charles R Harris wrote:
On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <strawman@astraw.com> wrote:
Quite a difference (a factor of about 3000)! At this point, I haven't delved into the dataset to see what makes it so pathological -- performance is nowhere near this bad for the binary search algorithm with other sets of keys.
It can't be that bad Andrew, something else is going on. And 191 MB isn't *that* big, I expect it should fit in memory with no problem.
I agree the performance difference seems beyond what one would expect due to cache misses alone. I'm at a loss to propose other explanations, though. Ideas?
I just searched for 2**25/10 keys in a 2**25 array of reals. It took less than a second when vectorized. In a python loop it took about 7.7 seconds. The only thing I can think of is that the search isn't getting any cpu cycles for some reason. How much memory is it using? Do you have any nans and such in the data? Chuck
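[For reference, a rough way to reproduce that comparison. The sizes match what is described above; the absolute times are machine-dependent, and the penalty of the loop is mostly per-call Python overhead rather than the search itself.]

import time
import numpy as np

table = np.sort(np.random.random(2**25))   # 2**25 sorted reals
keys = np.random.random(2**25 // 10)       # about 2**25/10 query values

t0 = time.time()
np.searchsorted(table, keys)                # one vectorized call into C code
t1 = time.time()
for k in keys:                              # one Python-level call per key
    np.searchsorted(table, k)
t2 = time.time()

print('vectorized: %.2f s, python loop: %.2f s' % (t1 - t0, t2 - t1))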

Aha, I've found the problem -- my values were int64 and my keys were uint64. Switching to the same data type immediately fixes the issue! It's not a memory cache issue at all. Perhaps searchsorted() should emit a warning if the keys require casting... I can't believe how bad the hit was.
-Andrew
Charles R Harris wrote:
On Wed, May 14, 2008 at 2:00 PM, Andrew Straw <strawman@astraw.com> wrote:
Charles R Harris wrote:
On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <strawman@astraw.com> wrote:
Quite a difference (a factor of about 3000)! At this point, I haven't delved into the dataset to see what makes it so pathological -- performance is nowhere near this bad for the binary search algorithm with other sets of keys.
It can't be that bad Andrew, something else is going on. And 191 MB isn't *that* big, I expect it should fit in memory with no problem.
I agree the performance difference seems beyond what one would expect due to cache misses alone. I'm at a loss to propose other explanations, though. Ideas?
I just searched for 2**25/10 keys in a 2**25 array of reals. It took less than a second when vectorized. In a python loop it took about 7.7 seconds. The only thing I can think of is that the search isn't getting any cpu cycles for some reason. How much memory is it using? Do you have any nans and such in the data?

On Wed, May 14, 2008 at 8:50 PM, Andrew Straw <strawman@astraw.com> wrote:
Aha, I've found the problem -- my values were int64 and my keys were uint64. Switching to the same data type immediately fixes the issue! It's not a memory cache issue at all.
Perhaps searchsorted() should emit a warning if the keys require casting... I can't believe how bad the hit was.
I think it used to fail and that was fixed not so long ago. Was it casting the keys or was it casting the big sorted array? The latter would certainly slow things down and would be a bug, and a bug of the sort that might have slipped in when the casting was added. Chuck
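[A small experiment along these lines can show whether a dtype mismatch between the keys and the sorted array triggers a slow path on a given NumPy version. The data here is synthetic, values are kept small enough that the int64/uint64 comparison is exact, and casting the keys once up front sidesteps the question entirely.]

import numpy as np

haystack = np.sort(np.random.randint(0, 10**9, size=10**6)).astype(np.int64)
keys_mismatched = haystack[:10000].astype(np.uint64)   # uint64 keys vs int64 table
keys_matched = keys_mismatched.astype(haystack.dtype)  # cast once before searching

# Same answers either way; only the mismatched call risks extra casting work
# inside searchsorted (how costly that is depends on the NumPy version).
assert np.array_equal(np.searchsorted(haystack, keys_mismatched),
                      np.searchsorted(haystack, keys_matched))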

On Tue, May 13, 2008 at 6:59 PM, Andrew Straw <strawman@astraw.com> wrote:
easier and still blazingly fast compared to the binary search implemented in searchsorted() given today's cached memory architectures.
Andrew, I looked at your code and I don't quite understand something. Why are you looking up single values? Is it not the case that the overhead of several Python calls completely dominates the actual cost of performing a binary search (e.g. in C code)? For instance, the 'v' argument of searchsorted(a,v) can be an array:

from scipy import *
haystack = rand(1e7)
needles = haystack[:10000].copy()
haystack.sort()

timeit searchsorted(haystack, needles)
100 loops, best of 3: 12.7 ms per loop

Which seems to be much faster than:

timeit for k in needles: x = searchsorted(haystack,k)
10 loops, best of 3: 43.8 ms per loop

The other thing to consider is that a reasonably smart CPU cache manager may retain the first few levels that are explored in a binary search tree. Clearly you could speed this up by increasing the branching factor of the tree, or using a fast index into the large array. However, I think that these effects are being masked by Python overhead in your tests.

I whipped up a weave implementation of searchsorted() that uses the STL. It clocks in at 9.72 ms per loop, so I think NumPy's searchsorted() is fairly good.

import scipy
import scipy.weave

def searchsorted2(a,v):
    N_a = len(a)
    N_v = len(v)
    indices = scipy.empty(N_v, dtype='intc')
    code = """
    for(int i = 0; i < N_v; i++){
        indices(i) = lower_bound(&a(0), &a(N_a), v(i)) - &a(0);
    }
    """
    err = scipy.weave.inline(code, ['a','v','N_a', 'N_v','indices'],
                             type_converters = scipy.weave.converters.blitz,
                             compiler = 'gcc',
                             support_code = '#include <algo.h>')
    return indices

--
Nathan Bell wnbell@gmail.com
http://graphics.cs.uiuc.edu/~wnbell/

Nathan Bell wrote:
On Tue, May 13, 2008 at 6:59 PM, Andrew Straw <strawman@astraw.com> wrote:
easier and still blazingly fast compared to the binary search implemented in searchsorted() given today's cached memory architectures.
Andrew, I looked at your code and I don't quite understand something. Why are you looking up single values?
Hi Nathan, The Python overhead was nothing compared to the speed problems I was having... Now, I'm quite sure that this kind of optimization could take things a little further. Nevertheless, for my motivating use case, it wouldn't be trivial to vectorize this, and a "little further" in this case is too little to justify the investment of my time at the moment. -Andrew
participants (7)
- Andrew Straw
- Bruce Southey
- Charles R Harris
- Francesc Alted
- Nathan Bell
- Robert Kern
- Stéfan van der Walt