efficient way to manage a set of floats?
I have an application that involves managing sets of floats. I can use Python's built-in set type, but a data structure that is optimized for fixed-size objects that can be compared without hashing should be more efficient than a more general set construct. Is something like this available?
On 10 May 2010 18:53, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
I have an application that involves managing sets of floats. I can use Python's built-in set type, but a data structure that is optimized for fixed-size objects that can be compared without hashing should be more efficient than a more general set construct. Is something like this available?
You might not find this as useful as you think. On a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table. And hashing might be worthwhile anyway: you could easily have a series of floats with related bit patterns that you'd want to scatter all over hash space. Plus, Python's set object has seen a fair amount of performance tweaking.

That said, there is support in numpy for many operations that use a sorted 1-D array to represent a set of floats. There's searchsorted for lookup, plus IIRC union and intersection operators; I'm not sure about set difference. The big thing missing is updates, though if you can batch them, concatenation followed by sorting should be reasonable. Removal can be done with fancy indexing, though again batching is recommended. Maybe these should be regarded as analogous to Python's frozensets.

In terms of speed, the numpy functions are obviously not as asymptotically efficient as hash tables, though I suspect memory coherence is more of an issue than O(1) versus O(log n). The numpy functions allow vectorized lookups and vector operations on sets, which could be handy.

Anne

P.S. If you want sets with fuzzy queries, it occurs to me that scipy's kd-trees will actually do an okay job, and in compiled code. No updates there either, though.
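A minimal sketch of that sorted-array approach might look like the following (this is an added illustration, not part of the original message; the sample values are arbitrary, and np.isin is used for the masked removal):

import numpy as np

# a "set" is just a sorted 1-D array of unique values
a = np.unique(np.array([3.5, 1.25, 7.0, 1.25]))
b = np.unique(np.array([7.0, 2.5]))

# membership test via binary search
idx = np.searchsorted(a, 7.0)
present = idx < a.size and a[idx] == 7.0

# set algebra on sorted arrays
union = np.union1d(a, b)
inter = np.intersect1d(a, b)
diff = np.setdiff1d(a, b)

# batched insertion: concatenate, then re-sort and deduplicate
a = np.unique(np.concatenate([a, np.array([0.5, 9.0])]))

# batched removal with a boolean mask (fancy indexing)
a = a[~np.isin(a, np.array([1.25, 9.0]))]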
Anne Archibald wrote:

on a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table.

Hello Anne,

I'm a bit confused by the above. It sounds as though the hash table approach might occupy 4 times as much storage as a single array. Is that right?

Also, I don't understand why hashing would be useful for the set application.

It seems as though a red-black tree might be a good implementation for a set of floats if all that one wants to do is add, delete, and test membership. (I will also need to do unions.)

Thanks for the help!

Phillip
Dr. Phillip M. Feldman wrote:
Anne Archibald wrote:

on a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table.
Hello Anne,
I'm a bit confused by the above. It sounds as though the hash table approach might occupy 4 times as much storage as a single array. Is that right?
Also, I don't understand why hashing would be useful for the set application.
It seems as though a red-black tree might be a good implementation for a set of floats if all that one wants to do is add, delete, and test membership. (I will also need to do unions.)
A couple questions:

How many floats will you be storing?

When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)

Warren
Thanks for the help!
Phillip
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne: Thanks for that absolutely beautiful explanation!!

Warren: I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.

Phillip
On 12 May 2010 20:09, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne Thanks for that absolutely beautiful explanation!!
Warren I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.
If you want one-dimensional "sets" with numerical tolerances, then a sorted-array implementation looks more appealing. A sorted-tree implementation will be a little awkward, since you will often need to explore two branches to find the nearest neighbour of a query point. In fact what you have is a one-dimensional kd-tree, which is helpfully provided by scipy.spatial, albeit without insertion or deletion operators.

I should also point out that once you start wanting approximate matches, which you will as soon as you do any sort of arithmetic on your floats, your idea of a "set" becomes extremely messy. For example, suppose you try to insert a float that's one part in a million different from one that's in the table. Does it get inserted too, or is it "equal" to what's there? When it comes time to remove it, your query will probably have a value slightly different from either previous value; which one, or both, do you remove? Or do you raise an exception? Resolving these questions satisfactorily will probably require you to know the scales that are relevant in your problem and to implement sensible handling of scales larger or smaller than this (but beware of the "teapot in a stadium" problem, of wildly different scales in the same data set). Even so, you will want to write algorithms that are robust to imprecision, duplication, and disappearance of points in your sets.

(If this sounds like the voice of bitter experience, well, I discovered while writing a commercial ray tracer that when you shoot billions of rays into millions of triangles, all sorts of astonishing limitations of floating point turn into graphical artifacts. Which are always *highly* visible. It was during this period that the interval-arithmetic camp nearly gained a convert.)

Anne
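A minimal sketch of tolerance-based membership with scipy.spatial's kd-tree might look like this (added for illustration, not part of Anne's message; the tolerance value and the rebuild-on-update strategy are arbitrary choices):

import numpy as np
from scipy.spatial import cKDTree

values = np.array([0.3, 1.5, 2.25, 10.0])
tree = cKDTree(values[:, None])          # kd-trees want 2-D points, so use shape (n, 1)

def contains(tree, x, tol=1e-9):
    # nearest-neighbour query; an infinite distance means no point within tol
    dist, _ = tree.query([[x]], k=1, distance_upper_bound=tol)
    return bool(np.isfinite(dist[0]))

print(contains(tree, 1 - 0.7))           # True: within tol of the stored 0.3
print(contains(tree, 0.31))              # False

# no insert/delete on the tree itself: batch the changes and rebuild
values = np.append(values, 7.5)
tree = cKDTree(values[:, None])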
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne Thanks for that absolutely beautiful explanation!!
Warren I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.
You may want to explain your higher-level problem. Maintaining sets of floating point numbers is almost never the right approach. With sets, comparison must necessarily be by exact equality, because fuzzy equality is not transitive.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
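A tiny illustration of that non-transitivity (added here; the 1e-8 tolerance is an arbitrary choice for the example):

tol = 1e-8
a, b, c = 0.0, 0.9e-8, 1.8e-8

print(abs(a - b) <= tol)   # True:  a is "fuzzily equal" to b
print(abs(b - c) <= tol)   # True:  b is "fuzzily equal" to c
print(abs(a - c) <= tol)   # False: yet a is not "fuzzily equal" to c

A set keyed on such a comparison has no consistent answer to whether a and c are the same element.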
On Wed, May 12, 2010 at 9:27 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne Thanks for that absolutely beautiful explanation!!
Warren I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.
You may want to explain your higher-level problem. Maintaining sets of floating point numbers is almost never the right approach. With sets, comparison must necessarily be by exact equality, because fuzzy equality is not transitive.
with consistent scaling, shouldn't something like rounding to a fixed precision be enough?
>>> round(1 - 0.7, 14) == round(0.3, 14)
True
>>> 1 - 0.7 == 0.3
False
or approx_equal instead of almost_equal

Josef
On Wed, May 12, 2010 at 8:37 PM, <josef.pktd@gmail.com> wrote:
On Wed, May 12, 2010 at 9:27 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne Thanks for that absolutely beautiful explanation!!
Warren I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.
You may want to explain your higher-level problem. Maintaining sets of floating point numbers is almost never the right approach. With sets, comparison must necessarily be by exact equality, because fuzzy equality is not transitive.
with consistent scaling, shouldn't something like rounding to a fixed precision be enough?
>>> round(1 - 0.7, 14) == round(0.3, 14)
True
>>> 1 - 0.7 == 0.3
False
or approx_equal instead of almost_equal
Josef
I have to agree with Robert. Whenever a fellow student comes to me describing an issue where they need to find a floating-point number in an array, the problem can usually be restated in a way that makes much more sense.

There are many issues with a naive comparison using round(), largely because it is intransitive, as someone else already stated. As a quick-and-dirty solution to a very specific issue it can work, but it is almost never left as the final solution.

Ben
On Wed, May 12, 2010 at 21:37, <josef.pktd@gmail.com> wrote:
On Wed, May 12, 2010 at 9:27 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Warren Weckesser wrote:
A couple questions:
How many floats will you be storing?
When you test for membership, will you want to allow for a numerical tolerance, so that if the value 1 - 0.7 is added to the set, a test for the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while 1 - 0.7 is 0.30000000000000004.)
Warren
Anne Thanks for that absolutely beautiful explanation!!
Warren I had not initially thought about numerical tolerance, but this could potentially be an issue, in which case the management of the data would have to be completely different. Thanks for pointing this out! I might have as many as 50,000 values.
You may want to explain your higher-level problem. Maintaining sets of floating point numbers is almost never the right approach. With sets, comparison must necessarily be by exact equality, because fuzzy equality is not transitive.
with consistent scaling, shouldn't something like rounding to a fixed precision be enough?
Then you might as well convert to integers and do integer sets. The problem is that two floats very close to a border (and hence to each other) would end up rounding into different bins. They will compare unequal to each other and equal to numbers farther away but in the same arbitrary bin. Again, it depends on the higher-level problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
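A small illustration of that border problem (added here; the values and the two-decimal precision are arbitrary):

# two values that differ by only 2e-8 land in different bins...
x, y = 0.11499999, 0.11500001
print(round(x, 2), round(y, 2))      # 0.11 0.12

# ...while y matches a value that is much farther away
z = 0.1199
print(round(y, 2) == round(z, 2))    # True, even though |y - z| >> |x - y|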
On 10 May 2010 21:56, Dr. Phillip M. Feldman <pfeldman@verizon.net> wrote:
Anne Archibald wrote:

on a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table.
Hello Anne,
I'm a bit confused by the above. It sounds as though the hash table approach might occupy 4 times as much storage as a single array. Is that right?
Probably. Hash tables usually operate at about half-full for efficiency, so you'd have twice as many entries as you do objects. If you're using a Python hash table, you also have type tagging and malloc information for each object. If you had a custom implementation, you'd have to have some way to mark hash cells as empty. In any case, expect at least a doubling of the space needed for a simple array.
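For a rough sense of the scale, a measurement sketch along these lines might be used (added for illustration; the exact numbers depend on the Python build and platform, and sys.getsizeof counts only the set's table plus the boxed float objects):

import sys
import numpy as np

n = 50000
values = np.random.random(n)

# Python set: hash table plus one boxed float object per element
py_set = set(values.tolist())
set_bytes = sys.getsizeof(py_set) + sum(sys.getsizeof(v) for v in py_set)

# sorted numpy array "set": 8 bytes per float and nothing else
arr = np.sort(values)
arr_bytes = arr.nbytes

print("python set : ~%.2f MB" % (set_bytes / 1e6))
print("numpy array: ~%.2f MB" % (arr_bytes / 1e6))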
Also, I don't understand why hashing would be useful for the set application.
The reason hash tables rely on hashing is not just to obtain a number for a potentially complex object; a good hash function should produce effectively random numbers for the user's objects. These numbers are reduced modulo the size of the hash table to determine where the object should go. If the user supplies a whole bunch of objects that all happen to hash to the same value, they'll all try to go into the same bin, and the hash table degenerates to an O(n) object as it has to search through the whole list each time.

If you are using floating-point objects, well, for example the exponent may well be all the same, or take only a few values. If, after reduction modulo the table size, it's what gets used to determine where your numbers go, they'll all go in the same bin. Or you could be using all integers, which usually end with lots of zeros in floating-point representation, so that all your numbers go in exactly the same bin. You could try using non-power-of-two table sizes, but that just sort of hides the problem: someone someday will provide a collection of numbers, let's say a, a+b, a+2b, ... that reduce to the same value modulo your table size, and suddenly your hash table is agonizingly slow.

There's kind of an art to designing good general-purpose hash functions; it's very handy that Python provides one.
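The integer case is easy to demonstrate with a deliberately naive hash (added illustration; the 1024-bucket table and the raw bit-pattern "hash" are arbitrary choices for the example):

import struct

def naive_bucket(x, table_size=1024):
    # use the raw IEEE-754 bit pattern as the "hash", reduced modulo
    # a power-of-two table size
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits % table_size

# small integers stored as doubles have mantissas that end in many zero
# bits, and those low bits are the only ones the modulo ever sees
buckets = set(naive_bucket(float(i)) for i in range(1, 1000))
print(len(buckets))    # 1 -- every value lands in the same bucket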
It seems as though a red-black tree might be a good implementation for a set of floats if all that one wants to do is add, delete, and test membership. (I will also need to do unions.)
If you're implementing it from scratch, you could go with a red-black tree, but a hash table is probably faster. I'd go with a simple hash table with linked-list buckets. Managing insertions and deletions should be only a minor pain compared to implementing a whole tree structure. You can probably find a nice hash function for floats with a bit of searching in the literature. I should say, try just using Python sets first, and only go into all this if they prove to be the slowest part of your program.
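A minimal sketch of such a chained hash set might look like this (added illustration, not Anne's code; Python lists stand in for the linked-list buckets, and the bucket count and reliance on Python's built-in float hash are arbitrary choices):

class FloatHashSet(object):
    """Hash set for floats with chained ("linked-list") buckets; sketch only."""

    def __init__(self, n_buckets=1024):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, x):
        # Python's float hash already scatters related values well;
        # a custom hash function would need similar care
        return self._buckets[hash(x) % len(self._buckets)]

    def add(self, x):
        bucket = self._bucket(x)
        if x not in bucket:
            bucket.append(x)

    def discard(self, x):
        bucket = self._bucket(x)
        if x in bucket:
            bucket.remove(x)

    def __contains__(self, x):
        return x in self._bucket(x)

s = FloatHashSet()
s.add(0.3)
print(0.3 in s, 1 - 0.7 in s)    # True False: exact equality only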
Thanks for the help!
Good luck, Anne
participants (6)

- Anne Archibald
- Benjamin Root
- Dr. Phillip M. Feldman
- josef.pktd@gmail.com
- Robert Kern
- Warren Weckesser