Anne Archibald-2 wrote:
On 10 May 2010 21:56, Dr. Phillip M. Feldman wrote:
on a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table.
Hello Anne,
I'm a bit confused by the above. It sounds as though the hash table approach might occupy 4 times as much storage as a single array. Is that right?
Probably. Hash tables usually run at about half full for efficiency, so you'd have roughly twice as many table slots as you have objects. If you're using a Python hash table, each object also carries type-tagging and malloc overhead. A custom implementation would still need some way to mark hash cells as empty. In any case, expect at least a doubling of the space needed for a simple array.
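As a rough illustration (the exact numbers are machine- and version-dependent; this is just a sketch comparing a dense NumPy array against a Python set of boxed floats):

```python
import sys
import numpy as np

n = 100_000
arr = np.arange(n, dtype=np.float64)

# Dense array: 8 bytes per float plus a small fixed header.
array_bytes = arr.nbytes

# Python set: hash-table slots (kept sparse for speed) plus one boxed
# float object per element, each with its own type tag and allocator
# overhead.
s = set(arr.tolist())
set_bytes = sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)

print(array_bytes, set_bytes, set_bytes / array_bytes)
```

On a typical 64-bit CPython the set comes out several times larger than the bare array.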
Also, I don't understand why hashing would be useful for the set application.
The reason hash tables rely on hashing is not just to obtain a number from a potentially complex object; a good hash function should produce effectively random numbers for the user's objects. These numbers are reduced modulo the size of the hash table to determine where each object goes. If the user supplies a whole bunch of objects that all happen to hash to the same value, they all try to go into the same bin, and the hash table degenerates to O(n) lookups, since it has to search through the whole bin each time.

If you are using floating-point objects, the exponent bits may well all be the same, or take only a few values; if those bits are what survive the reduction modulo the table size, all your numbers land in the same bin. Or you could be using all integers, which end with lots of zero bits in floating-point representation, so again everything goes into exactly the same bin. You could try non-power-of-two table sizes, but that just hides the problem: someone someday will provide a collection of numbers, say a, a+b, a+2b, ..., that all reduce to the same value modulo your table size, and suddenly your hash table is agonizingly slow.

There's an art to designing good general-purpose hash functions; it's very handy that Python provides one.
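To see how a naive bit-pattern hash falls into exactly this trap, here is a sketch (naive_hash is an invented strawman, not Python's actual algorithm): small integers stored as doubles have all-zero low mantissa bits, so taking the raw bits modulo a power-of-two table size dumps every value into the same bin, while Python's built-in hash spreads them out.

```python
import struct

def naive_hash(x):
    # Strawman hash: reinterpret the float's IEEE-754 bits as an integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

table_size = 256  # power-of-two table: bin = hash % table_size

# Small integers stored as doubles have all-zero low mantissa bits,
# so the naive hash sends every one of them to bin 0.
naive_bins = {naive_hash(float(i)) % table_size for i in range(1, 100)}
print(len(naive_bins))   # → 1 (all 99 values collide in a single bin)

# Python's built-in hash spreads the same values across many bins.
good_bins = {hash(float(i)) % table_size for i in range(1, 100)}
print(len(good_bins))    # → 99
```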
It seems as though a red-black tree might be a good implementation for a set of floats if all that one wants to do is add, delete, and test membership. (I will also need to do unions).
If you're implementing it from scratch you could go with a red-black tree, but a hash table is probably faster. I'd go with a simple hash table with linked-list buckets; managing insertions and deletions is only a minor pain compared to implementing a whole tree structure. You can probably find a nice hash function for floats with a bit of googling or a look through the literature. That said, try just using Python sets first, and only go down this road if they prove to be the slowest part of your program.
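Concretely, the built-in set already covers add, delete, membership, and union for floats, with Python's float hash doing the work behind the scenes:

```python
a = {1.5, 2.25, 3.0}
b = {3.0, 4.75}

a.add(0.5)        # insert
a.discard(2.25)   # delete (no error if the element is absent)
print(3.0 in a)   # membership test → True
print(a | b)      # union (a.union(b) works too)
```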
Thanks for the help!
Good luck, Anne
Phillip
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion