[Python-ideas] Ideas for improving the struct module

Nathaniel Smith njs at pobox.com
Fri Jan 20 18:24:16 EST 2017


On Jan 20, 2017 12:48 PM, "Elizabeth Myers" <elizabeth at interlinked.me>
wrote:

> On 20/01/17 10:59, Paul Moore wrote:
>> On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth at interlinked.me> wrote:
>>> Should I write up a PEP about this? I am not sure if it's justified or
>>> not. It's 3 changes (calcsize and two format specifiers), but it might
>>> be useful to codify it.
>>
>> It feels a bit minor to need a PEP, but having said that did you pick
>> up on the comment about needing to return the number of bytes
>> consumed?
>>
>> str = struct.unpack('z', b'test\0xxx')
>>
>> How do we know where the unpack got to, so that we can continue
>> parsing from there? It seems a bit wasteful to have to scan the string
>> twice to use calcsize for this...
>>
>> A PEP (or at least, a PEP-style design document) might capture the
>> answer to questions like this. OTOH, the tracker discussion could
>> easily be enough - can you put a reference to the bug report here?
>>
>> Paul
>>
>
> Two things:
>
> 1) struct.unpack and struct.unpack_from should remain
> backwards-compatible. I don't want to return extra values from it like
> (length unpacked, (data...)) for that reason. If the calcsize solution
> feels a bit weird (it isn't much less efficient, because strings store
> their length with them, so it's constant-time), there could also be new
> functions that *do* return the length if you need it. To me though, this
> feels like a use case for struct.iter_unpack.


iter_unpack is strictly less powerful - you can easily and efficiently
implement iter_unpack using unpack_from_with_offset (probably not its real
name, but you get the idea). The reverse is not true.
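
To make the "strictly less powerful" claim concrete, here is a minimal
sketch. Both names are hypothetical: unpack_from_with_offset is not part of
the struct module and is built here from the existing struct.unpack_from and
struct.calcsize, and iter_unpack_sketch is a made-up name for the
re-implementation.

```python
import struct


def unpack_from_with_offset(fmt, buffer, offset=0):
    """Hypothetical helper: unpack at `offset` and also return the next offset."""
    values = struct.unpack_from(fmt, buffer, offset)
    return values, offset + struct.calcsize(fmt)


def iter_unpack_sketch(fmt, buffer):
    """Sketch of iter_unpack written on top of the helper above."""
    size = struct.calcsize(fmt)
    if size == 0 or len(buffer) % size != 0:
        raise struct.error("buffer size must be a nonzero multiple of the format size")
    offset = 0
    while offset < len(buffer):
        # Each step hands back the offset where the next record starts.
        values, offset = unpack_from_with_offset(fmt, buffer, offset)
        yield values
```

Going the other way does not work: an iterator that only yields fixed-size
records has no way to report an offset to a caller that wants to switch
formats partway through the buffer.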

And:

val, offset = somefunc(buffer, offset)

is *the* idiomatic signature for functions for unpacking complex binary
formats. I've seen it reinvented independently at least 4 times in real
projects. (It turns out that implementing sleb128 encoding in Python is
sufficiently frustrating that you end up making lots of attempts to find
someone, anyone, who has already done it. Or at least, I did :-).)
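
As an illustration of that signature, here is a small sketch that mixes a
fixed-width field with a NUL-terminated string. The record layout and the
function names (read_u16, read_cstring, read_record) are invented for this
example and do not come from any real format or library.

```python
import struct


def read_u16(buffer, offset):
    """Read a little-endian unsigned 16-bit integer; return (value, new_offset)."""
    (value,) = struct.unpack_from("<H", buffer, offset)
    return value, offset + 2


def read_cstring(buffer, offset):
    """Read a NUL-terminated byte string; return (value, new_offset)."""
    end = buffer.index(b"\0", offset)  # raises ValueError if there is no terminator
    return bytes(buffer[offset:end]), end + 1


def read_record(buffer, offset):
    """Readers compose: each one picks up where the previous one stopped."""
    record_id, offset = read_u16(buffer, offset)
    name, offset = read_cstring(buffer, offset)
    return (record_id, name), offset
```

Because every reader takes an offset and returns the next one, a caller can
loop over a bytes or bytearray buffer until the offset reaches len(buffer),
without slicing or rescanning.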

Here's an example of this idiom used to parse Mach-O binding tables, which
iter_unpack definitely can't do:
  https://github.com/njsmith/machomachomangler/blob/master/machomachomangler/macho.py#L374-L429
Actually this example is a bit extreme since the format is *all*
variable-width stuff, but it gives the idea. There are also lots of formats
that have a mix of struct-style fixed width and variable width fields in a
complicated pattern, e.g.:
  https://zs.readthedocs.io/en/latest/format.html#layout-details

> Definitely would prefer to avoid a bikeshed here, though other
> improvements to the struct module are certainly welcome!


It doesn't necessarily have to be part of the same change, but if struct is
gaining the infrastructure to support variable-width layouts then adding
uleb128/sleb128 format specifiers would make a lot of sense. Implementing
them in pure Python is difficult (all the standard "how to en/decode
u/sleb128" documentation assumes you're working with C-style modulo
integers) and slow, and they turn up all over the place: both of those
links above, in Google protobufs, as a primitive in the .Net equivalent of
the struct module [1], etc.
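
For reference, a minimal pure-Python sketch of LEB128 decoding in the
(value, offset) style. The function names are made up for this example.
Python integers are arbitrary-precision, so the sign extension that falls
out of fixed-width arithmetic in C has to be done explicitly here.

```python
def decode_uleb128(buffer, offset):
    """Decode an unsigned LEB128 integer; return (value, new_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buffer[offset]  # IndexError if the input is truncated
        offset += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        shift += 7
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, offset


def decode_sleb128(buffer, offset):
    """Decode a signed LEB128 integer; return (value, new_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buffer[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if byte & 0x40:
        # Sign bit of the final group is set: extend the sign by hand.
        result -= 1 << shift
    return result, offset
```

This handles one integer per call with a Python-level loop over bytes, which
is where the slowness comes from.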

-n

[1]
https://msdn.microsoft.com/en-us/library/system.io.binarywriter.write7bitencodedint.aspx