On Jan 20, 2017 12:48 PM, "Elizabeth Myers" <elizabeth@interlinked.me> wrote:
On 20/01/17 10:59, Paul Moore wrote:
> On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
>> Should I write up a PEP about this? I am not sure if it's justified or
>> not. It's 3 changes (calcsize and two format specifiers), but it might
>> be useful to codify it.
>
> It feels a bit minor to need a PEP, but having said that did you pick
> up on the comment about needing to return the number of bytes
> consumed?
>
> str = struct.unpack('z', b'test\0xxx')
>
> How do we know where the unpack got to, so that we can continue
> parsing from there? It seems a bit wasteful to have to scan the string
> twice to use calcsize for this...
>
> A PEP (or at least, a PEP-style design document) might capture the
> answer to questions like this. OTOH, the tracker discussion could
> easily be enough - can you put a reference to the bug report here?
>
> Paul
>

Two things:

1) struct.unpack and struct.unpack_from should remain
backwards-compatible. I don't want to return extra values from it like
(length unpacked, (data...)) for that reason. If the calcsize solution
feels a bit weird (it isn't much less efficient, because strings store
their length with them, so it's constant-time), there could also be new
functions that *do* return the length if you need it. To me though, this
feels like a use case for struct.iter_unpack.

iter_unpack is strictly less powerful - you can easily and efficiently implement iter_unpack on top of unpack_from_with_offset (probably not its real name, but you get the idea). The reverse is not true.
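To make that concrete, here's a rough sketch - the name unpack_from_with_offset and its exact signature are made up, and this version fakes the "bytes consumed" part with calcsize, so it only covers fixed-width formats; the real thing would report the bytes actually consumed:

    import struct

    def unpack_from_with_offset(fmt, buffer, offset=0):
        # Hypothetical helper: unpack at the given offset and also
        # report where the next field starts.
        values = struct.unpack_from(fmt, buffer, offset)
        return values, offset + struct.calcsize(fmt)

    def my_iter_unpack(fmt, buffer):
        # iter_unpack falls out for free: keep feeding the new offset
        # back in until the buffer is exhausted.
        offset = 0
        while offset < len(buffer):
            values, offset = unpack_from_with_offset(fmt, buffer, offset)
            yield values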

And:

val, offset = somefunc(buffer, offset)

is *the* idiomatic signature for functions that unpack complex binary formats. I've seen it reinvented independently at least 4 times in real projects. (It turns out that implementing sleb128 encoding in Python is sufficiently frustrating that you end up making lots of attempts to find someone, anyone, who has already done it. Or at least, I did :-).)

Here's an example of this idiom used to parse Mach-O binding tables, which iter_unpack definitely can't do:
  https://github.com/njsmith/machomachomangler/blob/master/machomachomangler/macho.py#L374-L429
Actually, this example is a bit extreme, since the format is *all* variable-width stuff, but it gives the idea. There are also lots of formats that mix struct-style fixed-width fields and variable-width fields in a complicated pattern, e.g.:
  https://zs.readthedocs.io/en/latest/format.html#layout-details
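As a toy sketch of how the idiom handles that kind of mixed layout (read_u32, read_cstring, and read_record are made-up names for illustration, not a proposed API):

    import struct

    def read_u32(buffer, offset):
        # Fixed-width piece: delegate to struct and advance by its size.
        (value,) = struct.unpack_from("<I", buffer, offset)
        return value, offset + 4

    def read_cstring(buffer, offset):
        # Variable-width piece: a NUL-terminated string starting at offset.
        end = buffer.index(b"\0", offset)
        return buffer[offset:end], end + 1

    def read_record(buffer, offset=0):
        # Chaining is mechanical: each call picks up where the last stopped.
        tag, offset = read_u32(buffer, offset)
        name, offset = read_cstring(buffer, offset)
        return (tag, name), offset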

Definitely would prefer to avoid a bikeshed here, though other
improvements to the struct module are certainly welcome!

It doesn't necessarily have to be part of the same change, but if struct is gaining the infrastructure to support variable-width layouts, then adding uleb128/sleb128 format specifiers would make a lot of sense. Implementing them in pure Python is difficult (all the standard "how to en/decode u/sleb128" documentation assumes you're working with C-style modulo integers) and slow, and they turn up all over the place: in both of the formats linked above, in Google protobufs, as a primitive in the .Net equivalent of the struct module [1], etc.
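For reference, a pure-Python ULEB128 decoder in the same (value, new_offset) style looks roughly like this (decode_uleb128 is a made-up name; the signed SLEB128 variant additionally needs a sign-extension step at the end):

    def decode_uleb128(buffer, offset=0):
        # Little-endian base-128: 7 value bits per byte, with the high
        # bit set on every byte except the last.
        result = 0
        shift = 0
        while True:
            byte = buffer[offset]
            offset += 1
            result |= (byte & 0x7F) << shift
            if not (byte & 0x80):
                return result, offset
            shift += 7

    # e.g. decode_uleb128(b"\xe5\x8e\x26") == (624485, 3)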