reading binary data fast / help with optimizing (again)

Thomas Weholt thomas at gatsoft.no
Mon May 6 07:23:15 EDT 2002


Below is a very simple example of my approach to storing "records" in a
binary fashion. It uses the struct module and I've tried to optimize it as
much as I can, but it still seems slow once the amount of data gets big
(we're talking about 1M+ records, several million records in "production"
state).

How can I optimize this further? Is there any other approach to this?
Using a traditional database is not an option; it has to be pure Python.

I'll be very grateful for any hint or piece of code the group might
provide.
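[Editor's note: one easy win, assuming a Python version with struct.Struct
(2.5 or later), is to precompile the format once and reuse the bound unpack
method, skipping the format-string parse that struct.unpack repeats on every
call. The helper name slicer2 below is illustrative, not from the original
post:]

```python
import struct

rec = struct.Struct('4Iff3I')  # precompile the format once

def slicer2(data):
    # decode each record with the precompiled Struct's bound unpack,
    # avoiding a per-record format-string parse
    size = rec.size
    return [rec.unpack(data[pos:pos + size])
            for pos in range(0, len(data), size)]
```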

Best regards,
Thomas Weholt


## CODE BEGINS ##

import struct, time, profile

unpack = struct.unpack
calcsize = struct.calcsize

fmt = '4Iff3I'
record_size = calcsize(fmt)
desired_buffer_size = 512*1024  # want to read approx. 512k chunks per IO-call

# round the desired buffer-size down to a whole number of records,
# so each IO-call reads complete records only
buffer_size = (desired_buffer_size // record_size) * record_size

f = open('data.dat', 'wb')
# write 1M records to disk, approx 35+ MB
for i in range(0, 1000000):
    # write a generated dummy-set of data to disk
    f.write(struct.pack(fmt, i, i+2, i+3, i+i, time.time(), time.time(),
                        i+5, i+6, i+7))
f.close()

def slicer(_data, _fmt, filter=None):
    # Splits a piece of binary data into record-sized chunks, decodes them
    # using unpack, adds the results to a list and returns the list
    result = []
    start_pos = 0

    # simple check to see if the data is valid -- is there a more
    # efficient and more accurate way?
    #assert len(_data) % record_size == 0, \
    #    'slicer(): len(_data) % record_size ( struct.format ) must be 0'

    for stop_pos in range(record_size, len(_data) + record_size,
                          record_size):
        result.append(unpack(_fmt, _data[start_pos:stop_pos]))
        start_pos = stop_pos

    return result

def ioreader():
    f = open('data.dat','rb')
    while 1:
        d = f.read(buffer_size)
        if not d:
            break
        n = slicer(d, fmt)

    f.close()

    return

profile.run('ioreader()')
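[Editor's note: another common optimization for this pattern, sketched
below and not benchmarked against the original data file, is to decode an
entire buffer with a single unpack() call by repeating the format string,
so Python-level call overhead is paid once per buffer instead of once per
record. The helper name slicer_batch is illustrative:]

```python
import struct

fmt = '4Iff3I'
record_size = struct.calcsize(fmt)
n_fields = 9  # 4 + 2 + 3 fields per record

def slicer_batch(data):
    # unpack every complete record in the buffer with one unpack() call,
    # using a repeated format string, then regroup the flat tuple
    n = len(data) // record_size
    flat = struct.unpack(fmt * n, data[:n * record_size])
    return [flat[k * n_fields:(k + 1) * n_fields] for k in range(n)]
```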


