Optimizing for binary storage/bit-fiddling

Sat May 4 09:09:45 EDT 2002

Thomas Weholt wrote:

> I'm working on a very simple object storage system ( there are alot of
> alternatives I know :-) ) where I got a "record" referencing to a pickled
> object in a large file. This record is constructed using the
> struct-module. The format consists of 4 integers, 1 float and at the point
> 4 boolean values. Currently I'm using a struct-format like so 4If4F. But
> with bit-fiddling I should get all the boolean values into one integers
> shouldn't I? Of the 4 integers 2 of them will never exceed 256 so I should
> be able to use something else than integers for this also ???
> 
> A simple search on Google gave me a few hits on the subject, but I read
> Python wasn't optimized for bit-fiddling. Would it be much faster to just
> ignore the space issue and use integers for boolean values? Any thoughts?

The struct module, which you are already using, is quite good at BYTE
fiddling, so, if space matters here, you could and should use it in a more
tuned way.  Right now you're using 36 bytes/record (assumign '4If4I' is
your format string -- you say '4If4F', but I don't know anything about that
'F' format-character in module struct?).  If the 3rd and 4th items are
always >=0 and <256, and so are the 6th to last items (which you say are
booleans - 0 or 1), using format '2I2Bf4B' would take just 20 bytes per
record -- substantial saving at no or little cost in terms of complicating
your program, and therefore apparently quite a worthwhile change.

Squeezing the 4 booleans into a single byte would let you save another 3
bytes per record, down to 17 (from an original 36), but I don't know if
that is worth it.  Not hard nor slow, mind you -- you just construct a
tiny auxiliary-dictionary ONCE:

audi = {}
for i in range(16):
    bits = tuple([ (i&bit)==bit for bit in 1,2,4,8 ])
    audi[i]=bits
    audi[bits]=i

and then you just need to go through audi any time you must translate a
0-15 byte into 4 separate bits, or viceversa (you need to make the bits
into a tuple to key into the dictionary, of course).  But this _is_ a
non-null amount of complication, and the further saving is small, so
it's not easy to decide whether it's worth it.  Maybe try with and
without and see what happens to your runtimes (probably go down a tiny
bit with the aux dict -- preparation time is once, at startup, then
heavily amortized -- dict lookup is *FAST*, and you do save a small but
non negligible amount of I/O) and overall program complexity (if it's
big and complex already, then these 5 lines and the extra 2 or 3 in
the functions that read or write records aren't gonna kill you).

> In conclusion: what is the most efficient way to store very large integers
> ( 0, -> ), very small numeric values ( 0-512 ) and boolean values in a
> binary file? Is the bit-fiddling so slow I should avoid it in
> circumstances like this?

Not with auxiliary dictionaries (which can easily work up to 8 bits of
course) -- it's not really slow then.  Just a little bit complicated.
"Do the simplest thing that can possibly work"...

As for other sizes of unsigned integers, do use the struct module's
builtin possibilities -- costs you little or nothing, after all, to
encode the actual expected range of your numbers rather than just a
lazy I everywhere:-).

Alex