Re: [Python-ideas] Ideas for improving the struct module

Jan. 19, 2017

On 19/01/17 06:47, Elizabeth Myers wrote:
...
On 19/01/17 05:58, Rhodri James wrote:
...
On 19/01/17 08:31, Mark Dickinson wrote:
...
On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info>
wrote:
...
[...] struct already supports
variable-width formats.
Unfortunately, that's not really true: the Pascal strings it supports
are in some sense variable length, but are stored in a fixed-width
field. The internals of the struct module rely on each field starting
at a fixed offset, computable directly from the format string. I don't
think variable-length fields would be a good fit for the current
design of the struct module.
For the OP's use case, I'd suggest a library that sits on top of the
struct module, rather than an expansion to the struct module itself.
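The fixed-width point about Pascal strings can be checked directly. This is only a small illustration: the format string '!10pI' and the sample values are made up for the demonstration, not taken from the thread.

```python
import struct

# '10p' is a Pascal string field: one length byte followed by up to
# 9 data bytes, always padded out to exactly 10 bytes when packed.
fmt = '!10pI'

# The total size is known from the format string alone, whatever the
# actual string length turns out to be.
size = struct.calcsize(fmt)          # 10 + 4 = 14

packed = struct.pack(fmt, b'hi', 7)  # b'\x02hi' + 7 NUL bytes + 4-byte int
short, number = struct.unpack(fmt, packed)

print(size, len(packed), short, number)
```

The two-byte string still occupies the full ten-byte field, so the integer after it always starts at offset 10.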
Unfortunately as the OP explained, this makes the struct module a poor
fit for protocol decoding, even as a base layer for something.  It's one
of the things I use python for quite frequently, and I always end up
rolling my own and discarding struct entirely.
Yes, for variable-length fields the struct module is worse than useless:
it actually reduces clarity a little. Consider:
...
...
...
test_bytes = b'\x00\x00\x00\x0chello world!'
With this, you can do:
...
...
...
length = int.from_bytes(test_bytes[:4], 'big')
string = test_bytes[4:4 + length]
or you can do:
...
...
...
length = struct.unpack_from('!I', test_bytes)[0]
string = struct.unpack_from('{}s'.format(length), test_bytes, 4)[0]
Which looks more readable without consulting the docs? ;)
Building anything on top of the struct library like this would lead to
worse-looking code for minimal gains in efficiency. To quote Jamie
Zawinski, it is like building a bookshelf out of mashed potatoes as it
stands.
If we had an extension similar to netstruct:
...
...
...
length, string = struct.unpack('!I$', test_bytes)
MUCH improved readability, and also less verbose. :)
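For comparison, here is roughly what a layer on top of the current struct module has to do to get the same result. This is a rough sketch only: unpack_prefixed is a hypothetical helper name, not part of struct or netstruct, and the '$' format code above does not exist in the standard library.

```python
import struct

def unpack_prefixed(prefix_fmt, data, offset=0):
    """Hypothetical helper: read a length prefix, then that many bytes.

    Returns (length, string, offset_after_string).
    """
    (length,) = struct.unpack_from(prefix_fmt, data, offset)
    start = offset + struct.calcsize(prefix_fmt)
    end = start + length
    if end > len(data):
        # Slicing alone would silently return a short string here.
        raise ValueError('truncated string field')
    return length, data[start:end], end

test_bytes = b'\x00\x00\x00\x0chello world!'
length, string, end = unpack_prefixed('!I', test_bytes)
print(length, string, end)
```

The helper has to hand the new offset back to the caller, which is exactly the bookkeeping a built-in variable-length format code would hide.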
I also didn't mention that when you are unpacking iteratively (e.g., you
have multiple strings), the code becomes a bit more hairy:
...
...
...
test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
offset = 0
while offset < len(test_bytes):
...     length = struct.unpack_from('!H', test_bytes, offset)[0]
...     offset += 2
...     string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
...     offset += length
It actually gets a lot worse when you have to unpack a set of strings in
a context-sensitive manner. You have to be sure to update the offset
constantly so you can always unpack strings appropriately. Yuck!
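One way to keep the offset bookkeeping in a single place today is to wrap the loop in a generator. The following is a minimal sketch under the same assumptions as the loop above (a 2-byte big-endian length before each string); iter_strings is a hypothetical name, not an existing API.

```python
import struct

def iter_strings(data, prefix_fmt='!H'):
    """Hypothetical helper: yield each length-prefixed string in data."""
    prefix_size = struct.calcsize(prefix_fmt)
    offset = 0
    while offset < len(data):
        # Read the length prefix, then step past it.
        (length,) = struct.unpack_from(prefix_fmt, data, offset)
        offset += prefix_size
        if offset + length > len(data):
            raise ValueError('truncated string field')
        # Plain slicing replaces the '{}s'.format(length) unpack call.
        yield data[offset:offset + length]
        offset += length

test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
strings = list(iter_strings(test_bytes))
print(strings)
```

This hides the offset updates from the caller, but it only covers a flat sequence of strings; the context-sensitive case still needs hand-written offset tracking.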

It's worth mentioning that a few years ago, a coworker and I found
ourselves needing variable length strings in the context of a binary
protocol (DHCP), and wound up abandoning the struct module entirely
because it was unsuitable. My co-worker said the same thing I did: "it's
like building a bookshelf out of mashed potatoes."

I do understand it might require a major rewrite of, or major
changes to, the struct module, but in the long run, I think it's worth it
(especially because the struct module is not all that big in scope). As
it stands, the struct module simply is not suited for protocols where
you have variable-length strings, and in my experience, that is the vast
majority of modern binary protocols on the Internet.

--
Elizabeth