[Python-Dev] an idea for improving struct.unpack api
Alex Martelli
aleax at aleax.it
Thu Jan 6 09:33:20 CET 2005
On 2005 Jan 06, at 06:27, Ilya Sandler wrote:
...
> We could have an optional offset argument for
>
> unpack(format, buffer, offset=None)
I do agree on one concept here: when a function wants a string argument
S, and the value for that string argument S is likely to come from some
other bigger string Z as a subset Z[O:O+L], being able to optionally
specify Z, O and L (or the endpoint, O+L), rather than having to do the
slicing, can be a simplification and a substantial speedup.
When I had this kind of problem in the past I approached it with the
buffer built-in. Say I've slurped in a whole not-too-huge binary file
into `data', and now need to unpack several pieces of it from different
offsets; rather than:
   somestuff = struct.unpack(fmt, data[offs:offs+struct.calcsize(fmt)])

I can use:

   somestuff = struct.unpack(fmt, buffer(data, offs, struct.calcsize(fmt)))
as a kind of "virtual slicing". Besides the vague-to-me "impending
deprecation" state of the buffer builtin, there is some advantage, but
it's a bit modest. If I could pass data and offs directly to
struct.unpack and thus avoid churning of one-use readonly buffer
objects I'd probably be happier.
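For concreteness, the same "virtual slicing" idea can be spelled today with memoryview, the successor to the buffer built-in; a sliced memoryview references the underlying bytes without copying them (the names fmt/data mirror the snippet above):

```python
import struct

fmt = "ii"
data = struct.pack(fmt, 7, 42) + b"some trailing bytes"
size = struct.calcsize(fmt)

# The sliced memoryview shares data's storage, so no data[0:size]
# substring is materialized before unpacking:
somestuff = struct.unpack(fmt, memoryview(data)[0:size])
print(somestuff)  # (7, 42)
```

The one-use view object is still created, of course, which is exactly the churn an offset argument to unpack() would avoid.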
As for "passing offset implies the length is calcsize(fmt)"
sub-concept, I find that slightly more controversial. It's convenient,
but somewhat ambiguous; in other cases (e.g. string methods) passing a
start/offset and no end/length means to go to the end. Maybe something
more explicit, such as a length= parameter with a default of None
(meaning "go to the end") but which can be explicitly passed as -1 to
mean "use calcsize internally", might go down better.
As for the next part:
> the offset argument is an object which contains a single integer field
> which gets incremented inside unpack() to point to the next byte.
...I find this just too "magical". It's only useful when you're
specifically unpacking data bytes that sit compactly back to back (no
"filler", e.g. for alignment purposes), and it pays a conceptual price:
it introduces a new specialized type to play the role of "mutable int",
and it mutates an argument in place, which is unusual in Python's
standard library.
> so with a new API the above code could be written as
>
>    offset = struct.Offset(0)
>    hdr = unpack("iiii", rec, offset)
>    for i in range(hdr[0]):
>        item = unpack("IIII", rec, offset)
>
> When an offset argument is provided, unpack() should allow some bytes
> to be left unpacked at the end of the buffer.
>
> Does this suggestion make sense? Any better ideas?
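For the record, the proposed behavior is easy to emulate in pure Python today, which makes the "mutable int" objection concrete; Offset and unpack_at below are my own illustrative names, not anything in the struct module:

```python
import struct

class Offset:
    # Illustrative stand-in for the proposed struct.Offset "mutable int".
    def __init__(self, value=0):
        self.value = value

def unpack_at(fmt, buf, offset):
    # Unpack at offset.value and advance it past the consumed bytes,
    # tolerating trailing unparsed bytes in buf.
    size = struct.calcsize(fmt)
    fields = struct.unpack(fmt, buf[offset.value:offset.value + size])
    offset.value += size
    return fields

rec = struct.pack("ii", 1, 2) + struct.pack("ii", 3, 4) + b"trailing"
off = Offset(0)
first = unpack_at("ii", rec, off)   # (1, 2); off.value advanced by 8
second = unpack_at("ii", rec, off)  # (3, 4)
```

Note how the caller's off object changes behind their back on every call -- exactly the in-place mutation that is unusual for a stdlib API.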
All in all, I suspect that something like...:
    # out of the record-by-record loop:
    hdrsize = struct.calcsize(hdr_fmt)
    itemsize = struct.calcsize(item_fmt)
    reclen = length_of_each_record

    # loop record by record
    while True:
        rec = binfile.read(reclen)
        if not rec:
            break
        hdr = struct.unpack(hdr_fmt, rec, 0, hdrsize)
        for offs in itertools.islice(xrange(hdrsize, reclen, itemsize),
                                     hdr[0]):
            item = struct.unpack(item_fmt, rec, offs, itemsize)
            # process item
might be a better compromise. More verbose, because more explicit, of
course. And if you do this kind of thing often, easy to encapsulate in
a generator with 4 parameters -- the two formats (header and item), the
record length, and the binfile -- just yield the hdr first, then each
struct.unpack result from the inner loop.
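That encapsulation might look like the sketch below; unpack_records is my own name, and since the offset/length arguments to struct.unpack proposed above don't exist, it falls back to plain slicing:

```python
import io
import itertools
import struct

def unpack_records(hdr_fmt, item_fmt, reclen, binfile):
    """For each fixed-size record read from binfile, yield the header
    tuple first, then one tuple per item; the header's first field is
    assumed to hold the item count (the hdr[0] of the loop above)."""
    hdrsize = struct.calcsize(hdr_fmt)
    itemsize = struct.calcsize(item_fmt)
    while True:
        rec = binfile.read(reclen)
        if not rec:
            break
        hdr = struct.unpack(hdr_fmt, rec[:hdrsize])
        yield hdr
        for offs in itertools.islice(range(hdrsize, reclen, itemsize),
                                     hdr[0]):
            yield struct.unpack(item_fmt, rec[offs:offs + itemsize])

# One record: a header announcing 2 items, then the two items.
reclen = struct.calcsize("iiii") + 2 * struct.calcsize("IIII")
payload = (struct.pack("iiii", 2, 0, 0, 0) +
           struct.pack("IIII", 1, 2, 3, 4) +
           struct.pack("IIII", 5, 6, 7, 8))
results = list(unpack_records("iiii", "IIII", reclen, io.BytesIO(payload)))
# results == [(2, 0, 0, 0), (1, 2, 3, 4), (5, 6, 7, 8)]
```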
Having the offset and length parameters to struct.unpack might still be
a performance gain worth pursuing (of course, we'd need some
performance measurements from real-life use cases) even though from the
point of view of code simplicity, in this example, there appears to be
little or no gain wrt slicing rec[offs:offs+itemsize] or using
buffer(rec, offs, itemsize).
Alex