[Python-Dev] an idea for improving struct.unpack api
Alex Martelli
aleax at aleax.it
Thu Jan 6 09:33:20 CET 2005
On 2005 Jan 06, at 06:27, Ilya Sandler wrote:
...
> We could have an optional offset argument for
>
> unpack(format, buffer, offset=None)
I do agree on one concept here: when a function wants a string argument
S, and the value for that string argument S is likely to come from some
other bigger string Z as a subset Z[O:O+L], being able to optionally
specify Z, O and L (or the endpoint, O+L), rather than having to do the
slicing, can be a simplification and a substantial speedup.
When I had this kind of problem in the past I approached it with the
buffer built-in. Say I've slurped in a whole not-too-huge binary file
into `data', and now need to unpack several pieces of it from different
offsets; rather than:
   somestuff = struct.unpack(fmt, data[offs:offs+struct.calcsize(fmt)])

I can use:

   somestuff = struct.unpack(fmt, buffer(data, offs, struct.calcsize(fmt)))
as a kind of "virtual slicing". Besides the vague-to-me "impending
deprecation" state of the buffer builtin, there is some advantage, but
it's a bit modest. If I could pass data and offs directly to
struct.unpack and thus avoid churning of one-use readonly buffer
objects I'd probably be happier.
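For concreteness, the same "virtual slicing" idea can be spelled today with memoryview, the successor to the buffer built-in; a sliced memoryview references the underlying bytes without copying them (the names fmt/data mirror the snippet above):

```python
import struct

fmt = "ii"
data = struct.pack(fmt, 7, 42) + b"some trailing bytes"
size = struct.calcsize(fmt)

# The sliced memoryview shares data's storage, so no data[0:size]
# substring is materialized before unpacking:
somestuff = struct.unpack(fmt, memoryview(data)[0:size])
print(somestuff)  # (7, 42)
```

The one-use view object is still created, of course, which is exactly the churn an offset argument to unpack() would avoid.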
As for "passing offset implies the length is calcsize(fmt)"
sub-concept, I find that slightly more controversial. It's convenient,
but somewhat ambiguous; in other cases (e.g. string methods) passing a
start/offset and no end/length means to go to the end. Maybe something
more explicit, such as a length= parameter with a default of None
(meaning "go to the end") but which can be explicitly passed as -1 to
mean "use calcsize internally", might go down better.
As for the next part:
> the offset argument is an object which contains a single integer field
> which gets incremented inside unpack() to point to the next byte.
...I find this just too "magical". It's only useful when you're
specifically unpacking data bytes that sit compactly back to back (no
"filler", e.g. for alignment purposes), and it pays a conceptual price:
it introduces a new specialized type to play the role of "mutable int",
and it mutates an argument in place, which is unusual in Python's
standard library.
> so with a new API the above code could be written as
>
>    offset = struct.Offset(0)
>    hdr = unpack("iiii", rec, offset)
>    for i in range(hdr[0]):
>        item = unpack("IIII", rec, offset)
>
> When an offset argument is provided, unpack() should allow some bytes
> to be left unpacked at the end of the buffer.
>
> Does this suggestion make sense? Any better ideas?
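For the record, the proposed behavior is easy to emulate in pure Python today, which makes the "mutable int" objection concrete; Offset and unpack_at below are my own illustrative names, not anything in the struct module:

```python
import struct

class Offset:
    # Illustrative stand-in for the proposed struct.Offset "mutable int".
    def __init__(self, value=0):
        self.value = value

def unpack_at(fmt, buf, offset):
    # Unpack at offset.value and advance it past the consumed bytes,
    # tolerating trailing unparsed bytes in buf.
    size = struct.calcsize(fmt)
    fields = struct.unpack(fmt, buf[offset.value:offset.value + size])
    offset.value += size
    return fields

rec = struct.pack("ii", 1, 2) + struct.pack("ii", 3, 4) + b"trailing"
off = Offset(0)
first = unpack_at("ii", rec, off)   # (1, 2); off.value advanced by 8
second = unpack_at("ii", rec, off)  # (3, 4)
```

Note how the caller's off object changes behind their back on every call -- exactly the in-place mutation that is unusual for a stdlib API.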
All in all, I suspect that something like...:
    # out of the record-by-record loop:
    hdrsize = struct.calcsize(hdr_fmt)
    itemsize = struct.calcsize(item_fmt)
    reclen = length_of_each_record

    # loop record by record
    while True:
        rec = binfile.read(reclen)
        if not rec:
            break
        hdr = struct.unpack(hdr_fmt, rec, 0, hdrsize)
        for offs in itertools.islice(xrange(hdrsize, reclen, itemsize),
                                     hdr[0]):
            item = struct.unpack(item_fmt, rec, offs, itemsize)
            # process item
might be a better compromise. More verbose, because more explicit, of
course. And if you do this kind of thing often, easy to encapsulate in
a generator with 4 parameters -- the two formats (header and item), the
record length, and the binfile -- just yield the hdr first, then each
struct.unpack result from the inner loop.
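That encapsulation might look like the sketch below; unpack_records is my own name, and since the offset/length arguments to struct.unpack proposed above don't exist, it falls back to plain slicing:

```python
import io
import itertools
import struct

def unpack_records(hdr_fmt, item_fmt, reclen, binfile):
    """For each fixed-size record read from binfile, yield the header
    tuple first, then one tuple per item; the header's first field is
    assumed to hold the item count (the hdr[0] of the loop above)."""
    hdrsize = struct.calcsize(hdr_fmt)
    itemsize = struct.calcsize(item_fmt)
    while True:
        rec = binfile.read(reclen)
        if not rec:
            break
        hdr = struct.unpack(hdr_fmt, rec[:hdrsize])
        yield hdr
        for offs in itertools.islice(range(hdrsize, reclen, itemsize),
                                     hdr[0]):
            yield struct.unpack(item_fmt, rec[offs:offs + itemsize])

# One record: a header announcing 2 items, then the two items.
reclen = struct.calcsize("iiii") + 2 * struct.calcsize("IIII")
payload = (struct.pack("iiii", 2, 0, 0, 0) +
           struct.pack("IIII", 1, 2, 3, 4) +
           struct.pack("IIII", 5, 6, 7, 8))
results = list(unpack_records("iiii", "IIII", reclen, io.BytesIO(payload)))
# results == [(2, 0, 0, 0), (1, 2, 3, 4), (5, 6, 7, 8)]
```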
Having the offset and length parameters to struct.unpack might still be
a performance gain worth pursuing (of course, we'd need some
performance measurements from real-life use cases) even though from the
point of view of code simplicity, in this example, there appears to be
little or no gain wrt slicing rec[offs:offs+itemsize] or using
buffer(rec, offs, itemsize).
Alex