Anne Archibald-2 wrote:
On 10 May 2010 21:56, Dr. Phillip M. Feldman wrote:
on a 32-bit machine, the space overhead is roughly a 32-bit object pointer or two for each float, plus about twice the number of floats times 32-bit pointers for the table.
Hello Anne,
I'm a bit confused by the above. It sounds as though the hash table approach might occupy 4 times as much storage as a single array. Is that right?
Probably. Hash tables usually run at about half full for efficiency, so you'd have roughly twice as many table slots as you have objects. If you're using a Python hash table, each object also carries type-tagging and malloc overhead. A custom implementation would still need some way to mark hash cells as empty. In any case, expect at least a doubling of the space needed for a simple array.
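As a rough illustration (the exact numbers are machine- and version-dependent; this is just a sketch comparing a dense NumPy array against a Python set of boxed floats):

```python
import sys
import numpy as np

n = 100_000
arr = np.arange(n, dtype=np.float64)

# Dense array: 8 bytes per float plus a small fixed header.
array_bytes = arr.nbytes

# Python set: hash-table slots (kept sparse for speed) plus one boxed
# float object per element, each with its own type tag and allocator
# overhead.
s = set(arr.tolist())
set_bytes = sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)

print(array_bytes, set_bytes, set_bytes / array_bytes)
```

On a typical 64-bit CPython the set comes out several times larger than the bare array.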
Also, I don't understand why hashing would be useful for the set application.
The reason hash tables rely on hashing is not just to obtain a number from a potentially complex object; a good hash function should produce effectively random numbers for the user's objects. These numbers are reduced modulo the size of the hash table to determine where each object goes. If the user supplies a whole bunch of objects that all happen to hash to the same value, they all try to go into the same bin, and the hash table degenerates to O(n) lookups, since it has to search through the whole bin each time.

If you are using floating-point objects, the exponent bits may well all be the same, or take only a few values; if those bits are what survive the reduction modulo the table size, all your numbers land in the same bin. Or you could be using all integers, which end with lots of zero bits in floating-point representation, so again everything goes into exactly the same bin. You could try non-power-of-two table sizes, but that just hides the problem: someone someday will provide a collection of numbers, say a, a+b, a+2b, ..., that all reduce to the same value modulo your table size, and suddenly your hash table is agonizingly slow.

There's an art to designing good general-purpose hash functions; it's very handy that Python provides one.
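To see how a naive bit-pattern hash falls into exactly this trap, here is a sketch (naive_hash is an invented strawman, not Python's actual algorithm): small integers stored as doubles have all-zero low mantissa bits, so taking the raw bits modulo a power-of-two table size dumps every value into the same bin, while Python's built-in hash spreads them out.

```python
import struct

def naive_hash(x):
    # Strawman hash: reinterpret the float's IEEE-754 bits as an integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

table_size = 256  # power-of-two table: bin = hash % table_size

# Small integers stored as doubles have all-zero low mantissa bits,
# so the naive hash sends every one of them to bin 0.
naive_bins = {naive_hash(float(i)) % table_size for i in range(1, 100)}
print(len(naive_bins))   # → 1 (all 99 values collide in a single bin)

# Python's built-in hash spreads the same values across many bins.
good_bins = {hash(float(i)) % table_size for i in range(1, 100)}
print(len(good_bins))    # → 99
```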
It seems as though a red-black tree might be a good implementation for a set of floats if all that one wants to do is add, delete, and test membership. (I will also need to do unions).
If you're implementing it from scratch you could go with a red-black tree, but a hash table is probably faster. I'd go with a simple hash table with linked-list buckets; managing insertions and deletions is only a minor pain compared to implementing a whole tree structure. You can probably find a nice hash function for floats with a bit of googling or a look through the literature. That said, try just using Python sets first, and only go down this road if they prove to be the slowest part of your program.
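Concretely, the built-in set already covers add, delete, membership, and union for floats, with Python's float hash doing the work behind the scenes:

```python
a = {1.5, 2.25, 3.0}
b = {3.0, 4.75}

a.add(0.5)        # insert
a.discard(2.25)   # delete (no error if the element is absent)
print(3.0 in a)   # membership test → True
print(a | b)      # union (a.union(b) works too)
```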
Thanks for the help!
Good luck, Anne
Phillip
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion