Ideas for improving the struct module
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
Hello,

I've noticed a lot of binary protocols require variable-length bytestrings (with or without a null terminator), but it is not easy to unpack these in Python without first reading the desired length, or reading bytes until a null terminator is reached.

I've noticed the netstruct library (https://github.com/stendec/netstruct) has a format specifier, $, which assumes the previous type to pack/unpack is the string's length. This is an interesting idea in and of itself, but doesn't handle the null-terminated string case. I know $ is similar to Pascal strings, but sometimes you need more than 255 characters :p.

For null-terminated strings, it may be simpler to have a specifier for those. I propose 0, but this point can be bikeshedded over endlessly if desired ;) (I thought about using n/N but they're taken :P).

It's worth noting that one of Perl's equivalents to the struct module, whose name escapes me atm, can handle this case. I can't remember if it handled variable-length or zero-terminated strings; maybe it did both. Perl is more or less my 10th language. :p

This pain point is an annoyance imo, and implementing this (or something like it) would greatly simplify a lot of code. I'd be happy to take a look at implementing it if the idea is received sufficiently warmly.

-- Elizabeth
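For the null-terminated case, today's workaround is a hand-rolled scan for the terminator, which is roughly the boilerplate a `0` (or similar) specifier would replace. A minimal sketch; the helper name is mine, not a proposed API:

```python
def unpack_cstring(buf, offset=0):
    """Return (string, new_offset) for a null-terminated bytestring.

    This is the kind of helper callers currently write by hand,
    since struct has no format code for C-style strings.
    """
    end = buf.index(b'\x00', offset)  # raises ValueError if no terminator
    return buf[offset:end], end + 1

data = b'GET\x00/index.html\x00'
verb, offset = unpack_cstring(data)
path, offset = unpack_cstring(data, offset)
```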
![](https://secure.gravatar.com/avatar/54d54fbc570660b6b04fb0c6234af007.jpg?s=120&d=mm&r=g)
+1 on the idea of supporting variable-length strings with the length encoded in the preceding packed element!

Several months ago I was trying to write a parser and writer of PostgreSQL's COPY ... WITH BINARY format. I started out trying to implement it in pure Python using the struct module. Due to the existence of variable-length strings encoded in precisely the way you mention, it was not possible to parse an entire row of data without invoking pure-Python-level logic. This made the implementation infeasibly slow. I had to switch to Cython to get it done fast enough (implementation is here: https://github.com/spitz-dan-l/postgres-binary-parser).

I believe that with this single change ($, or whatever format specifier one wishes to use), assuming it were implemented efficiently in C, I could have avoided using Cython and gotten a satisfactory level of performance with the struct module and Python/numpy's already-performant bytestring manipulation facilities.

-Dan Spitz

On Wed, Jan 18, 2017 at 5:32 AM Elizabeth Myers <elizabeth@interlinked.me> wrote:
Hello,
I've noticed a lot of binary protocols require variable length bytestrings (with or without a null terminator), but it is not easy to unpack these in Python without first reading the desired length, or reading bytes until a null terminator is reached.
I've noticed the netstruct library (https://github.com/stendec/netstruct) has a format specifier, $, which assumes the previous type to pack/unpack is the string's length. This is an interesting idea in and of itself, but doesn't handle the null-terminated string case. I know $ is similar to Pascal strings, but sometimes you need more than 255 characters :p.
For null-terminated strings, it may be simpler to have a specifier for those. I propose 0, but this point can be bikeshedded over endlessly if desired ;) (I thought about using n/N but they're taken :P).
It's worth noting that one of Perl's equivalents to the struct module, whose name escapes me atm, can handle this case. I can't remember if it handled variable-length or zero-terminated strings; maybe it did both. Perl is more or less my 10th language. :p
This pain point is an annoyance imo, and implementing this (or something like it) would greatly simplify a lot of code. I'd be happy to take a look at implementing it if the idea is received sufficiently warmly.
-- Elizabeth _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/5615a372d9866f203a22b2c437527bbb.jpg?s=120&d=mm&r=g)
On Wed, Jan 18, 2017 at 04:24:39AM -0600, Elizabeth Myers wrote:
Hello,
I've noticed a lot of binary protocols require variable length bytestrings (with or without a null terminator), but it is not easy to unpack these in Python without first reading the desired length, or reading bytes until a null terminator is reached.
This sounds like a fairly straightforward feature request for the struct module, which probably could go straight to the bug tracker. Unfortunately I can't *quite* work out what the feature request is :-)

If you're asking for struct to support Pascal strings, with a single byte (0...255) for the length, it already does with format code "p". I was going to suggest P for "large" Pascal string, with the length given by *two* bytes rather than one (0...65535), but P is already in use.

Are you proposing the "$" format code from netstruct? That would be interesting, as it would allow format codes:

    B$  standard Pascal string, like p
    H$  Pascal string with a two-byte length
    I$  Pascal string with a four-byte length

4294967295 bytes should be enough for anyone :-)

Another common format is "ASCIIZ", or a one-byte Pascal string including a null terminator. People actually use this: http://stackoverflow.com/questions/11850950/unpacking-a-struct-ending-with-a...

Which just leaves C-style null-terminated strings. c/n/N are all already in use; I guess that C (for C-string) or S (for c-String) are possibilities.

All of these seem like perfectly reasonable formats for the struct module to support. They're all in use, and struct already supports variable-width formats. I think it's just a matter of raising one or more feature requests, and then doing the work.

I guess this is just my long-winded way of saying +1.

-- Steve
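For reference, the existing "p" code already behaves like the one-byte B$ case described above:

```python
import struct

# "p" packs a Pascal string: one length byte, then the data,
# padded out to the declared field width.
packed = struct.pack('6p', b'hello')
assert packed == b'\x05hello'

(s,) = struct.unpack('6p', packed)
assert s == b'hello'
```

The single length byte is why "p" strings cap out at 255 bytes, which is the limit the proposed two- and four-byte variants would lift.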
![](https://secure.gravatar.com/avatar/c49652c88a43a35bbf0095abfdae3515.jpg?s=120&d=mm&r=g)
On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info> wrote:
[...] struct already supports variable-width formats.
Unfortunately, that's not really true: the Pascal strings it supports are in some sense variable-length, but are stored in a fixed-width field. The internals of the struct module rely on each field starting at a fixed offset, computable directly from the format string. I don't think variable-length fields would be a good fit for the current design of the struct module.

For the OP's use case, I'd suggest a library that sits on top of the struct module, rather than an expansion to the struct module itself.

-- Mark
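Mark's fixed-offset point is visible directly in `struct.calcsize`, which computes a format's size from the format string alone:

```python
import struct

# Offsets and the total size follow from the format string alone;
# this is the invariant the struct module's design relies on.
assert struct.calcsize('!IH8s') == 4 + 2 + 8

# A $-style variable-length field would break this: the size would
# depend on the packed data, not just the format string.
```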
![](https://secure.gravatar.com/avatar/3d07afbc6277770ca981b1982d3badb8.jpg?s=120&d=mm&r=g)
On 19/01/17 08:31, Mark Dickinson wrote:
On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info> wrote:
[...] struct already supports variable-width formats.
Unfortunately, that's not really true: the Pascal strings it supports are in some sense variable length, but are stored in a fixed-width field. The internals of the struct module rely on each field starting at a fixed offset, computable directly from the format string. I don't think variable-length fields would be a good fit for the current design of the struct module.
For the OP's use case, I'd suggest a library that sits on top of the struct module, rather than an expansion to the struct module itself.
Unfortunately as the OP explained, this makes the struct module a poor fit for protocol decoding, even as a base layer for something. It's one of the things I use python for quite frequently, and I always end up rolling my own and discarding struct entirely. -- Rhodri James *-* Kynesim Ltd
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 19/01/17 05:58, Rhodri James wrote:
Unfortunately as the OP explained, this makes the struct module a poor fit for protocol decoding, even as a base layer for something. It's one of the things I use python for quite frequently, and I always end up rolling my own and discarding struct entirely.
Yes, for variable-length fields the struct module is worse than useless: it actually reduces clarity a little. Consider:
```python
test_bytes = b'\x00\x00\x00\x0chello world!'
```

With this, you can do:

```python
length = int.from_bytes(test_bytes[:4], 'big')
string = test_bytes[4:4 + length]
```

or you can do:

```python
length = struct.unpack_from('!I', test_bytes)[0]
string = struct.unpack_from('{}s'.format(length), test_bytes, 4)[0]
```

Which looks more readable without consulting the docs? ;) Building anything on top of the struct library like this would lead to worse-looking code for minimal gains in efficiency. As it stands, to quote Jamie Zawinski, it is like building a bookshelf out of mashed potatoes. If we had an extension similar to netstruct:

```python
length, string = struct.unpack('!I$', test_bytes)
```
MUCH improved readability, and also less verbose. :)
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
I also didn't mention that when you are unpacking iteratively (e.g., you have multiple strings), the code becomes a bit more hairy:
```python
test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
offset = 0
while offset < len(test_bytes):
    length = struct.unpack_from('!H', test_bytes, offset)[0]
    offset += 2
    string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
    offset += length
```
It actually gets a lot worse when you have to unpack a set of strings in a context-sensitive manner. You have to be sure to update the offset constantly so you can always unpack strings appropriately. Yuck!

It's worth mentioning that a few years ago, a coworker and I found ourselves needing variable-length strings in the context of a binary protocol (DHCP), and wound up abandoning the struct module entirely because it was unsuitable. My co-worker said the same thing I did: "it's like building a bookshelf out of mashed potatoes."

I do understand it might require a major rewrite of, or major changes to, the struct module, but in the long run, I think it's worth it (especially because the struct module is not all that big in scope). As it stands, the struct module simply is not suited for protocols with variable-length strings, and in my experience, that is the vast majority of modern binary protocols on the Internet.

-- Elizabeth
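The offset bookkeeping described above is mechanical enough that one ends up wrapping it; a sketch of the kind of helper this thread keeps reinventing (the function name is mine, not a proposed API):

```python
import struct

def iter_prefixed(buf, fmt='!H'):
    """Yield length-prefixed bytestrings from buf.

    fmt is the struct format of the length prefix. This is the
    wrapper one currently writes by hand on top of struct.
    """
    size = struct.calcsize(fmt)
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from(fmt, buf, offset)
        offset += size
        yield buf[offset:offset + length]
        offset += length

data = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
strings = list(iter_prefixed(data))
```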
![](https://secure.gravatar.com/avatar/d67ab5d94c2fed8ab6b727b62dc1b213.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 5:08 AM, Elizabeth Myers <elizabeth@interlinked.me> wrote:
I do understand it might require a major rewrite of, or major changes to, the struct module, but in the long run, I think it's worth it (especially because the struct module is not all that big in scope). As it stands, the struct module simply is not suited for protocols with variable-length strings, and in my experience, that is the vast majority of modern binary protocols on the Internet.
To be fair, the name "struct" implies a C-style structure, which _does_ have a fixed size, or at least fixed offsets for its members (the last member can be variable-sized). A quick search of PyPI shows up a struct-variant specifically designed for network protocols: https://pypi.python.org/pypi/netstruct/1.1.2 It even uses the dollar sign as you describe. So perhaps what you're looking for is this module coming into the stdlib? ChrisA
![](https://secure.gravatar.com/avatar/5615a372d9866f203a22b2c437527bbb.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 05:16:28AM +1100, Chris Angelico wrote:
To be fair, the name "struct" implies a C-style structure, which _does_ have a fixed size, or at least fixed offsets for its members
Ah, the old "everyone thinks in C terms" fallacy raises its ugly head again :-)

The name doesn't imply any such thing to me, or to those who haven't been raised on C. It implies the word "structure", which has no implication of being fixed-width. The docs for the struct module describe it as:

    struct — Interpret bytes as packed binary data

which applies equally to the fixed- and variable-width case. The fact that we can sensibly talk about "fixed-width" and "variable-width" structs without confusion shows that the concept is bigger than the C data type. (Even if the most common use will probably remain C-style fixed-width structs.)

Python is not C, and we shouldn't be limited by what C does. If we wanted C, we would use C.

-- Steve
![](https://secure.gravatar.com/avatar/d67ab5d94c2fed8ab6b727b62dc1b213.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 11:38 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jan 20, 2017 at 05:16:28AM +1100, Chris Angelico wrote:
To be fair, the name "struct" implies a C-style structure, which _does_ have a fixed size, or at least fixed offsets for its members
Ah, the old "everyone thinks in C terms" fallacy raises its ugly head again :-)
The name doesn't imply any such thing to me, or those who haven't been raised on C. It implies the word "structure", which has no implication of being fixed-width.
Fair point. Objection retracted - and it was only minor anyway. This would be a handy feature to add. +1. ChrisA
![](https://secure.gravatar.com/avatar/ae579d9b841a67b490920674e2308b6d.jpg?s=120&d=mm&r=g)
Nevertheless the C meaning *is* the etymology of the module name. :-) --Guido (mobile) On Jan 19, 2017 16:54, "Chris Angelico" <rosuav@gmail.com> wrote:
Fair point. Objection retracted - and it was only minor anyway. This would be a handy feature to add. +1.
ChrisA
![](https://secure.gravatar.com/avatar/2240a37aad5f5834a92809a5e5f01fe1.jpg?s=120&d=mm&r=g)
I am for upgrading struct to support these, if possible.

But besides my +1, I am writing in to remind folks that there is another "struct" model in the stdlib: ctypes.Structure. For reading a lot of records with the same structure it is much more handy than struct, since it gives one a suitable Python object on instantiation. However, it also can't handle variable-length fields automatically.

But maybe the improvement could be made on that side, or in another package altogether that works more like it than the current "struct".

On 19 January 2017 at 16:08, Elizabeth Myers <elizabeth@interlinked.me> wrote:
![](https://secure.gravatar.com/avatar/1a71658d81f8a82a8122050f21bb86d3.jpg?s=120&d=mm&r=g)
ctypes.Structure is *literally* the interface to the C struct that, as Chris mentions, has fixed offsets for all members. I don't think that should (can?) be altered.

In file formats (beyond net protocols) the string size + variable-length string motif comes up often, and I am frequently re-implementing the two-line read-an-int + read-{}.format-bytes.

On Thu, Jan 19, 2017 at 12:17 PM, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
I am for upgrading struct to support these, if possible.
But besides my +1, I am writing in to remind folks that there is another "struct" model in the stdlib: ctypes.Structure.
For reading a lot of records with the same structure it is much more handy than struct, since it gives one a suitable Python object on instantiation.
However, it also can't handle variable-length fields automatically.
But maybe the improvement could be made on that side, or in another package altogether that works more like it than the current "struct".
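Joao's ctypes.Structure suggestion handles the fixed-layout part of a record nicely; a minimal sketch (field names are illustrative, not from any real protocol):

```python
import ctypes

# A fixed-layout header maps cleanly onto a Structure subclass.
class Header(ctypes.LittleEndianStructure):
    _pack_ = 1
    _fields_ = [('msg_type', ctypes.c_uint8),
                ('length', ctypes.c_uint16)]

hdr = Header.from_buffer_copy(b'\x01\x0c\x00')

# ...but the variable-length payload declared by hdr.length still
# has to be sliced off the buffer by hand, just as with struct.
```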
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
I haven't had a chance to use it myself yet, but I've heard good things about https://construct.readthedocs.io/en/latest/. It's certainly far more comprehensive than struct for this and other problems.

As usual, there's some tension between adding stuff to the stdlib versus using more specialized third-party packages. The existence of packages like construct doesn't automatically mean that we should stop improving the stdlib, but OTOH not every useful thing can or should be in the stdlib.

Personally, I find myself parsing uleb128-prefixed strings more often than u4-prefixed strings.

On Jan 19, 2017 10:42 AM, "Nick Timkovich" <prometheus235@gmail.com> wrote:
ctypes.Structure is *literally* the interface to the C struct that as Chris mentions has fixed offsets for all members. I don't think that should (can?) be altered.
In file formats (beyond net protocols) the string size + variable length string motif comes up often and I am frequently re-implementing the two-line read-an-int + read-{}.format-bytes.
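The uleb128 prefix Nathaniel mentions is a variable-length integer encoding: each byte carries 7 bits, least-significant group first, with the high bit set when another byte follows. A minimal decoder sketch, not tied to any particular library:

```python
def read_uleb128(buf, offset=0):
    """Decode an unsigned LEB128 integer; return (value, new_offset)."""
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7f) << shift   # low 7 bits of this byte
        if not byte & 0x80:                # high bit clear: last byte
            return result, offset
        shift += 7

# Classic example from the LEB128 literature: 624485 encodes as e5 8e 26.
value, end = read_uleb128(b'\xe5\x8e\x26')
```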
![](https://secure.gravatar.com/avatar/1a71658d81f8a82a8122050f21bb86d3.jpg?s=120&d=mm&r=g)
Construct's API differs radically from struct's, and it should remain separate. It feels to me like a straw man to introduce a large library to the discussion as justification for the feature being too specialized. This proposal seems much more modest: add another format character (or two) to the existing set of a dozen or so, to be packed/unpacked just like the others. It also has demonstrable use in various formats/protocols. On Thu, Jan 19, 2017 at 12:50 PM, Nathaniel Smith <njs@pobox.com> wrote:
I haven't had a chance to use it myself yet, but I've heard good things about
https://construct.readthedocs.io/en/latest/
It's certainly far more comprehensive than struct for this and other problems.
As usual, there's some tension between adding stuff to the stdlib versus using more specialized third-party packages. The existence of packages like construct doesn't automatically mean that we should stop improving the stdlib, but OTOH not every useful thing can or should be in the stdlib.
Personally, I find myself parsing uleb128-prefixed strings more often than u4-prefixed strings.
On Jan 19, 2017 10:42 AM, "Nick Timkovich" <prometheus235@gmail.com> wrote:
ctypes.Structure is *literally* the interface to the C struct that as Chris mentions has fixed offsets for all members. I don't think that should (can?) be altered.
In file formats (beyond net protocols) the string size + variable length string motif comes up often and I am frequently re-implementing the two-line read-an-int + read-{}.format-bytes.
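The two-line motif Nick describes can be sketched as a small helper; this is illustrative only (the name and the 4-byte big-endian prefix are assumptions, not anything from the thread):

```python
import io
import struct

def read_prefixed_bytes(f):
    # Read a 4-byte big-endian length prefix, then that many payload bytes.
    # Hypothetical helper showing the recurring read-an-int + read-bytes motif.
    (length,) = struct.unpack('!I', f.read(4))
    return f.read(length)
```

The same two lines end up re-implemented for every file object and every prefix width, which is exactly the repetition being complained about.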
On Thu, Jan 19, 2017 at 12:17 PM, Joao S. O. Bueno <jsbueno@python.org.br
wrote:
I am for upgrading struct to these, if possible.
But besides my +1, I am writing in to remind folks that there is another "struct" model in the stdlib:
ctypes.Structure -
For reading a lot of records with the same structure it is much more handy than struct, since it gives one a suitable Python object on instantiation.
However, it also can't handle variable length fields automatically.
But maybe the improvement could be made on that side, or in another package altogether that works more like it than the current "struct".
On 19/01/17 06:47, Elizabeth Myers wrote:
On 19/01/17 05:58, Rhodri James wrote:
On 19/01/17 08:31, Mark Dickinson wrote:
> On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info> wrote:
>> [...] struct already supports variable-width formats.
>
> Unfortunately, that's not really true: the Pascal strings it supports are in some sense variable length, but are stored in a fixed-width field. The internals of the struct module rely on each field starting at a fixed offset, computable directly from the format string. I don't think variable-length fields would be a good fit for the current design of the struct module.
>
> For the OPs use-case, I'd suggest a library that sits on top of the struct module, rather than an expansion to the struct module itself.
On 19 January 2017 at 16:08, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Unfortunately as the OP explained, this makes the struct module a poor fit for protocol decoding, even as a base layer for something. It's one of the things I use python for quite frequently, and I always end up rolling my own and discarding struct entirely.
Yes, for variable-length fields the struct module is worse than useless: it actually reduces clarity a little. Consider:
>> test_bytes = b'\x00\x00\x00\x0chello world!'
With this, you can do:
>> length = int.from_bytes(test_bytes[:4], 'big')
>> string = test_bytes[4:length]
or you can do:
>> length = struct.unpack_from('!I', test_bytes)[0]
>> string = struct.unpack_from('{}s'.format(length), test_bytes, 4)[0]
Which looks more readable without consulting the docs? ;)
Building anything on top of the struct library like this would lead to worse-looking code for minimal gains in efficiency. To quote Jamie Zawinski, it is like building a bookshelf out of mashed potatoes as it stands.
If we had an extension similar to netstruct:
>> length, string = struct.unpack('!I$', test_bytes)
MUCH improved readability, and also less verbose. :)
I also didn't mention that when you are unpacking iteratively (e.g., you have multiple strings), the code becomes a bit more hairy:
> test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
> offset = 0
> while offset < len(test_bytes):
... length = struct.unpack_from('!H', test_bytes, offset)[0]
... offset += 2
... string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
... offset += length
It actually gets a lot worse when you have to unpack a set of strings in a context-sensitive manner. You have to be sure to update the offset constantly so you can always unpack strings appropriately. Yuck!
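The offset bookkeeping in the loop above can be pushed into a small generator; this is a sketch of a workaround with today's struct module, not part of the proposal, and the helper name is invented:

```python
import struct

def iter_prefixed(buf):
    # Yield each '!H'-length-prefixed bytestring in buf.
    # Illustrative helper: the offset arithmetic lives in one place.
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from('!H', buf, offset)
        offset += 2
        yield struct.unpack_from('{}s'.format(length), buf, offset)[0]
        offset += length
```

This tames the repetition, but each field still costs a Python-level round trip, which is the performance problem Dan describes upthread.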
It's worth mentioning that a few years ago, a coworker and I found ourselves needing variable length strings in the context of a binary protocol (DHCP), and wound up abandoning the struct module entirely because it was unsuitable. My co-worker said the same thing I did: "it's like building a bookshelf out of mashed potatoes."
I do understand it might require a major rewrite of, or major changes to, the struct module, but in the long run, I think it's worth it (especially because the struct module is not all that big in scope). As it stands, the struct module simply is not suited for protocols where you have variable-length strings, and in my experience, that is the vast majority of modern binary protocols on the Internet.
-- Elizabeth _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/8848a81d538f2fc428934988af5c8b42.jpg?s=120&d=mm&r=g)
On 19Jan2017 12:08, Elizabeth Myers <elizabeth@interlinked.me> wrote:
I also didn't mention that when you are unpacking iteratively (e.g., you have multiple strings), the code becomes a bit more hairy:
test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
offset = 0
while offset < len(test_bytes):
... length = struct.unpack_from('!H', test_bytes, offset)[0]
... offset += 2
... string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
... offset += length
It actually gets a lot worse when you have to unpack a set of strings in a context-sensitive manner. You have to be sure to update the offset constantly so you can always unpack strings appropriately. Yuck!
Whenever I'm doing iterative stuff like this, either variable length binary or lexical stuff, I always end up with a bunch of functions which can be called like this:

    datalen, offset = get_bs(chunk, offset=offset)

The notable thing here is just that they return the data and the new offset, which makes updating the offset impossible to forget, and also makes the calling code more succinct, like the internal call to get_bs() below, such as this decoder for a length encoded field:

    def get_bsdata(chunk, offset=0):
        ''' Fetch a length-prefixed data chunk.
            Decodes an unsigned value from a bytes at the specified `offset`
            (default 0), and collects that many following bytes.
            Return those following bytes and the new offset.
        '''
        ##is_bytes(chunk)
        offset0 = offset
        datalen, offset = get_bs(chunk, offset=offset)
        data = chunk[offset:offset+datalen]
        ##is_bytes(data)
        if len(data) != datalen:
            raise ValueError("bsdata(chunk, offset=%d): insufficient data: expected %d bytes, got %d bytes" % (offset0, datalen, len(data)))
        offset += datalen
        return data, offset

Cheers, Cameron Simpson <cs@zip.com.au>
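Cameron's get_bsdata relies on a get_bs helper that isn't shown. A minimal sketch, assuming a plain 2-byte big-endian prefix (Cameron's real get_bs may well decode something else, e.g. a variable-length integer):

```python
import struct

def get_bs(chunk, offset=0):
    # Decode an unsigned 2-byte big-endian value at `offset`.
    # Return (value, new_offset), following the data-plus-offset convention.
    # Assumed implementation, not Cameron's actual code.
    (value,) = struct.unpack_from('!H', chunk, offset)
    return value, offset + 2
```

With that in place, get_bsdata above composes cleanly: every helper hands back the new offset, so the caller can never forget to advance it.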
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 19/01/17 20:40, Cameron Simpson wrote:
On 19Jan2017 12:08, Elizabeth Myers <elizabeth@interlinked.me> wrote:
I also didn't mention that when you are unpacking iteratively (e.g., you have multiple strings), the code becomes a bit more hairy:
test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
offset = 0
while offset < len(test_bytes):
... length = struct.unpack_from('!H', test_bytes, offset)[0]
... offset += 2
... string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
... offset += length
It actually gets a lot worse when you have to unpack a set of strings in a context-sensitive manner. You have to be sure to update the offset constantly so you can always unpack strings appropriately. Yuck!
Whenever I'm doing iterative stuff like this, either variable length binary or lexical stuff, I always end up with a bunch of functions which can be called like this:
datalen, offset = get_bs(chunk, offset=offset)
The notable thing here is just that they return the data and the new offset, which makes updating the offset impossible to forget, and also makes the calling code more succinct, like the internal call to get_bs() below:
such as this decoder for a length encoded field:
    def get_bsdata(chunk, offset=0):
        ''' Fetch a length-prefixed data chunk.
            Decodes an unsigned value from a bytes at the specified `offset`
            (default 0), and collects that many following bytes.
            Return those following bytes and the new offset.
        '''
        ##is_bytes(chunk)
        offset0 = offset
        datalen, offset = get_bs(chunk, offset=offset)
        data = chunk[offset:offset+datalen]
        ##is_bytes(data)
        if len(data) != datalen:
            raise ValueError("bsdata(chunk, offset=%d): insufficient data: expected %d bytes, got %d bytes" % (offset0, datalen, len(data)))
        offset += datalen
        return data, offset
Gotta be honest, this seems less elegant than just adding something like what netstruct does to the struct module. It's also way more verbose. Perhaps some kind of higher level module could be built on struct at some point, maybe in stdlib, maybe not (construct imo is not that lib, per previously raised objections).
Cheers, Cameron Simpson <cs@zip.com.au> _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/5ce43469c0402a7db8d0cf86fa49da5a.jpg?s=120&d=mm&r=g)
On 2017-01-19 12:47, Elizabeth Myers wrote:
On 19/01/17 05:58, Rhodri James wrote:
On 19/01/17 08:31, Mark Dickinson wrote:
On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info> wrote:
[...] struct already supports variable-width formats.
Unfortunately, that's not really true: the Pascal strings it supports are in some sense variable length, but are stored in a fixed-width field. The internals of the struct module rely on each field starting at a fixed offset, computable directly from the format string. I don't think variable-length fields would be a good fit for the current design of the struct module.
For the OPs use-case, I'd suggest a library that sits on top of the struct module, rather than an expansion to the struct module itself.
Unfortunately as the OP explained, this makes the struct module a poor fit for protocol decoding, even as a base layer for something. It's one of the things I use python for quite frequently, and I always end up rolling my own and discarding struct entirely.
Yes, for variable-length fields the struct module is worse than useless: it actually reduces clarity a little. Consider:
test_bytes = b'\x00\x00\x00\x0chello world!'
With this, you can do:
length = int.from_bytes(test_bytes[:4], 'big')
string = test_bytes[4:length]
Shouldn't that be: string = test_bytes[4:4+length]
or you can do:
length = struct.unpack_from('!I', test_bytes)[0]
string = struct.unpack_from('{}s'.format(length), test_bytes, 4)[0]
Which looks more readable without consulting the docs? ;)
Which is more likely to be correct? :-)
Building anything on top of the struct library like this would lead to worse-looking code for minimal gains in efficiency. To quote Jamie Zawinski, it is like building a bookshelf out of mashed potatoes as it stands.
If we had an extension similar to netstruct:
length, string = struct.unpack('!I$', test_bytes)
MUCH improved readability, and also less verbose. :)
![](https://secure.gravatar.com/avatar/5615a372d9866f203a22b2c437527bbb.jpg?s=120&d=mm&r=g)
On Thu, Jan 19, 2017 at 08:31:03AM +0000, Mark Dickinson wrote:
On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano <steve@pearwood.info> wrote:
[...] struct already supports variable-width formats.
Unfortunately, that's not really true: the Pascal strings it supports are in some sense variable length, but are stored in a fixed-width field. The internals of the struct module rely on each field starting at a fixed offset, computable directly from the format string. I don't think variable-length fields would be a good fit for the current design of the struct module.
I know nothing and care even less (is caring a negative amount possible?) about the internal implementation of the struct module. Since Elizabeth is volunteering to do the work to make it work, will it be accepted? Subject to the usual code quality reviews, contributor agreement, etc.

Are there objections to the *idea* of adding support for null terminated strings to the struct module? Does it require a PEP just to add one more format code? (Maybe it will, if the format code requires a complete re-write of the entire module.)

It seems to me that if Elizabeth is willing to do the work, and somebody to review it, this would be a welcome addition to the module. It would require at least one API change: struct.calcsize won't work for formats containing null-terminated strings. But that's a minor matter.

-- Steve
![](https://secure.gravatar.com/avatar/c49652c88a43a35bbf0095abfdae3515.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 12:30 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Does it require a PEP just to add one more format code? (Maybe it will, if the format code requires a complete re-write of the entire module.)
Yes, I think a PEP would be useful in this case. The proposed change *would* entail some fairly substantial changes to the design of the module (I encourage you to take a look at the source to appreciate what's involved), and if we're going to that level of effort it's probably worth stepping back and seeing whether those changes are compatible with other proposed directions for the struct module, and whether it makes sense to do more than add that one format code. That level of change probably isn't worth it "just to add one more format code", but might be worth it if it allows other possible expansions of the struct module functionality. There are also performance considerations to look at, behaviour of alignment to consider, and other details. -- Mark
![](https://secure.gravatar.com/avatar/61a537f7b31ecf682e3269ea04056e94.jpg?s=120&d=mm&r=g)
This is a neat idea, but this will only work for parsing framed binary protocols. For example, if your protocol prefixes all packets with a length field, you can write an efficient read buffer and use your proposal to decode all of a message's fields in one shot. Which is good.

Not all protocols use framing though. For instance, your proposal won't help to write Thrift or Postgres protocol parsers.

Overall, I'm not sure that this is worth the hassle. With the proposal:

    data, = struct.unpack('!H$', buf)
    buf = buf[2+len(data):]

with the current struct module:

    len, = struct.unpack('!H', buf)
    data = buf[2:2+len]
    buf = buf[2+len:]

Another thing: struct.calcsize won't work with structs that use variable length fields.

Yury

On 2017-01-18 5:24 AM, Elizabeth Myers wrote:
Hello,
I've noticed a lot of binary protocols require variable length bytestrings (with or without a null terminator), but it is not easy to unpack these in Python without first reading the desired length, or reading bytes until a null terminator is reached.
I've noticed the netstruct library (https://github.com/stendec/netstruct) has a format specifier, $, which assumes the previous type to pack/unpack is the string's length. This is an interesting idea in and of itself, but doesn't handle the null-terminated string case. I know $ is similar to pascal strings, but sometimes you need more than 255 characters :p.
For null-terminated strings, it may be simpler to have a specifier for those. I propose 0, but this point can be bikeshedded over endlessly if desired ;) (I thought about using n/N but they're :P).
It's worth noting that (maybe one of?) Perl's equivalent to the struct module, whose name escapes me atm, has a module which can handle this case. I can't remember if it handled variable length or zero-terminated though; maybe it did both. Perl is more or less my 10th language. :p
This pain point is an annoyance imo and would greatly simplify a lot of code if implemented, or something like it. I'd be happy to take a look at implementing it if the idea is received sufficiently warmly.
-- Elizabeth _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/8848a81d538f2fc428934988af5c8b42.jpg?s=120&d=mm&r=g)
On 19Jan2017 16:04, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
This is a neat idea, but this will only work for parsing framed binary protocols. For example, if your protocol prefixes all packets with a length field, you can write an efficient read buffer and use your proposal to decode all of a message's fields in one shot. Which is good.
Not all protocols use framing though. For instance, your proposal won't help to write Thrift or Postgres protocol parsers.
Sure, but a lot of things fit the proposal. Seems a win: both simple and useful.
Overall, I'm not sure that this is worth the hassle. With proposal:
data, = struct.unpack('!H$', buf)
buf = buf[2+len(data):]
with the current struct module:
len, = struct.unpack('!H', buf)
data = buf[2:2+len]
buf = buf[2+len:]
Another thing: struct.calcsize won't work with structs that use variable length fields.
True, but it would be enough for it to raise an exception of some kind. It won't break any code in play, and it will prevent accidents for users of new variable-size formats.

We've all got things we wish struct might cover (I have a few, but strangely the top of the list is nonsemantic: I wish it let me put meaningless whitespace inside the format for readability).

+1 on the proposal from me. Oh: subject to one proviso: reading a struct will need to return how many bytes of input data were scanned, not merely returning the decoded values.

Cheers, Cameron Simpson <cs@zip.com.au>
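Cameron's whitespace wish can already be approximated today with a thin wrapper that strips spaces before handing the format to struct; a sketch (the wrapper name is invented, and this is a workaround, not a proposal for the module itself):

```python
import struct

def unpack_ws(fmt, buf):
    # struct.unpack, but with spaces in fmt ignored, so formats can be
    # grouped for readability: '! H H 4s' instead of '!HH4s'.
    return struct.unpack(fmt.replace(' ', ''), buf)
```

A native version would be nicer, of course, since the stripping must be repeated in calcsize, pack, unpack_from, and friends.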
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 19/01/17 20:54, Cameron Simpson wrote:
On 19Jan2017 16:04, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
This is a neat idea, but this will only work for parsing framed binary protocols. For example, if your protocol prefixes all packets with a length field, you can write an efficient read buffer and use your proposal to decode all of a message's fields in one shot. Which is good.
Not all protocols use framing though. For instance, your proposal won't help to write Thrift or Postgres protocol parsers.
Sure, but a lot of things fit the proposal. Seems a win: both simple and useful.
Overall, I'm not sure that this is worth the hassle. With proposal:
data, = struct.unpack('!H$', buf)
buf = buf[2+len(data):]
with the current struct module:
len, = struct.unpack('!H', buf)
data = buf[2:2+len]
buf = buf[2+len:]
Another thing: struct.calcsize won't work with structs that use variable length fields.
True, but it would be enough for it to raise an exception of some kind. It won't break any in play code, and it will prevent accidents for users of new variable sizes formats.
We've all got things we wish struct might cover (I have a few, but strangely the top of the list is nonsemantic: I wish it let me put meaningless whitespace inside the format for readability).
+1 on the proposal from me.
Oh: subject to one proviso: reading a struct will need to return how many bytes of input data were scanned, not merely returning the decoded values.
This is a little difficult without breaking backwards compatibility, but it is not difficult to compute the lengths yourself. That said, calcsize could require an extra parameter if given a format string with variable-length specifiers in it, e.g.:

    struct.calcsize("z", b'test')

would return 5 (four bytes plus the zero terminator), so you don't have to compute it yourself.

Also, I filed a bug, and proposed use of Z and z.
Cheers, Cameron Simpson <cs@zip.com.au> _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 20/01/17 10:47, Elizabeth Myers wrote:
On 19/01/17 20:54, Cameron Simpson wrote:
On 19Jan2017 16:04, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
This is a neat idea, but this will only work for parsing framed binary protocols. For example, if your protocol prefixes all packets with a length field, you can write an efficient read buffer and use your proposal to decode all of a message's fields in one shot. Which is good.
Not all protocols use framing though. For instance, your proposal won't help to write Thrift or Postgres protocol parsers.
Sure, but a lot of things fit the proposal. Seems a win: both simple and useful.
Overall, I'm not sure that this is worth the hassle. With proposal:
data, = struct.unpack('!H$', buf)
buf = buf[2+len(data):]
with the current struct module:
len, = struct.unpack('!H', buf)
data = buf[2:2+len]
buf = buf[2+len:]
Another thing: struct.calcsize won't work with structs that use variable length fields.
True, but it would be enough for it to raise an exception of some kind. It won't break any in play code, and it will prevent accidents for users of new variable sizes formats.
We've all got things we wish struct might cover (I have a few, but strangely the top of the list is nonsemantic: I wish it let me put meaningless whitespace inside the format for readability).
+1 on the proposal from me.
Oh: subject to one proviso: reading a struct will need to return how many bytes of input data were scanned, not merely returning the decoded values.
This is a little difficult without breaking backwards compatibility, but it is not difficult to compute the lengths yourself. That said, calcsize could require an extra parameter if given a format string with variable-length specifiers in it, e.g.:

    struct.calcsize("z", b'test')

would return 5 (four bytes plus the zero terminator), so you don't have to compute it yourself.
Also, I filed a bug, and proposed use of Z and z.
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
![](https://secure.gravatar.com/avatar/d995b462a98fea412efa79d17ba3787a.jpg?s=120&d=mm&r=g)
On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
It feels a bit minor to need a PEP, but having said that, did you pick up on the comment about needing to return the number of bytes consumed?

    str = struct.unpack('z', b'test\0xxx')

How do we know where the unpack got to, so that we can continue parsing from there? It seems a bit wasteful to have to scan the string twice to use calcsize for this...

A PEP (or at least, a PEP-style design document) might capture the answer to questions like this. OTOH, the tracker discussion could easily be enough - can you put a reference to the bug report here?

Paul
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Jan 20, 2017 09:00, "Paul Moore" <p.f.moore@gmail.com> wrote: On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
It feels a bit minor to need a PEP, but having said that did you pick up on the comment about needing to return the number of bytes consumed? str = struct.unpack('z', b'test\0xxx') How do we know where the unpack got to, so that we can continue parsing from there? It seems a bit wasteful to have to scan the string twice to use calcsize for this... unpack() is OK, because it already has the rule that it raises an error if it doesn't exactly consume the buffer. But I agree that if we do this then we'd really want versions of unpack_from and pack_into that return the new offset. (Further arguments that calcsize is insufficient: it doesn't work for potential other variable length items, e.g. if we added uleb128 support; it quickly becomes awkward if you have multiple strings; in practice I think everyone who needs this would just end up writing a wrapper that calls calcsize and returns the new offset anyway, so should just provide that up front.) For pack_into this is also easy, since currently it always returns None, so if it started returning an integer no one would notice (and it'd be kinda handy in its own right, honestly). unpack_from is the tricky one, because it already has a return value and this isn't it. Ideally it would have worked this way from the beginning, but too late for that now... I guess the obvious solution would be to come up with a new function that's otherwise identical to unpack_from but returns a (values, offset) tuple. What to call this, though, I don't know :-). unpack_at? unpack_next? (Hinting that this is the natural primitive you'd use to implement unpack_iter.) -n
![](https://secure.gravatar.com/avatar/2240a37aad5f5834a92809a5e5f01fe1.jpg?s=120&d=mm&r=g)
On 20 January 2017 at 15:13, Nathaniel Smith <njs@pobox.com> wrote:
On Jan 20, 2017 09:00, "Paul Moore" <p.f.moore@gmail.com> wrote:
On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
It feels a bit minor to need a PEP, but having said that did you pick up on the comment about needing to return the number of bytes consumed?
str = struct.unpack('z', b'test\0xxx')
How do we know where the unpack got to, so that we can continue parsing from there? It seems a bit wasteful to have to scan the string twice to use calcsize for this...
unpack() is OK, because it already has the rule that it raises an error if it doesn't exactly consume the buffer. But I agree that if we do this then we'd really want versions of unpack_from and pack_into that return the new offset. (Further arguments that calcsize is insufficient: it doesn't work for potential other variable length items, e.g. if we added uleb128 support; it quickly becomes awkward if you have multiple strings; in practice I think everyone who needs this would just end up writing a wrapper that calls calcsize and returns the new offset anyway, so should just provide that up front.)
For pack_into this is also easy, since currently it always returns None, so if it started returning an integer no one would notice (and it'd be kinda handy in its own right, honestly).
unpack_from is the tricky one, because it already has a return value and this isn't it. Ideally it would have worked this way from the beginning, but too late for that now... I guess the obvious solution would be to come up with a new function that's otherwise identical to unpack_from but returns a (values, offset) tuple. What to call this, though, I don't know :-). unpack_at? unpack_next? (Hinting that this is the natural primitive you'd use to implement unpack_iter.)
Yes - maybe a PEP. Then we could also, for example, add the suggestion of whitespace in the struct description string - which is nice.

And we could think of things like: the unpack method returns a specialized object - not a tuple - which has attributes with the extra information. So, instead of

    a, str = struct.unpack("IB$", data)

people who want the length can do:

    tmp = struct.unpack("IB$", data)
    do_things_with_len(tmp.tell)
    a, str = tmp

The struct "object" could allow other things as well. Since we are at it, maybe a 0-copy version, that would return items from their in-place buffer positions. But, ok, maybe most of this should just go in a third party package - anyway, a PEP could be open for more improvements than the variable-length fields proposed. (The idea of having attributes with extra information about size, for example - I think that is better than having:

    size, (a, str) = struct.unpack2(...)

)

js -><-
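Joao's specialized result object could be sketched as a tuple subclass carrying the consumed-byte count, so existing tuple-unpacking code keeps working unchanged (the class and attribute names here are hypothetical, matching his .tell example):

```python
class UnpackResult(tuple):
    # Tuple of unpacked values plus a .tell attribute giving the number of
    # bytes consumed. Hypothetical sketch of the proposed return type.
    def __new__(cls, values, tell):
        self = super().__new__(cls, values)
        self.tell = tell
        return self
```

Because it is still a tuple, `a, s = result` works exactly as before, while callers who care can read `result.tell`; this is the backward-compatibility property that makes the idea attractive.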
-n
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/de311342220232e618cb27c9936ab9bf.jpg?s=120&d=mm&r=g)
On 01/20/2017 10:09 AM, Joao S. O. Bueno wrote:
On 20 January 2017 at 16:51, Elizabeth Myers wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
Yes - maybe a PEP.
I agree, especially if the change, simple as it is, requires a lot of rewrite. In that case someone (Elizabeth?) should collect ideas for other improvements and shepherd it through the PEP process. -- ~Ethan~
![](https://secure.gravatar.com/avatar/ae579d9b841a67b490920674e2308b6d.jpg?s=120&d=mm&r=g)
I'd be wary of making a grab-bag of small improvements, it encourages bikeshedding. --Guido (mobile) On Jan 20, 2017 10:16 AM, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/20/2017 10:09 AM, Joao S. O. Bueno wrote:
On 20 January 2017 at 16:51, Elizabeth Myers wrote:
Should I write up a PEP about this? I am not sure if it's justified or
not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
Yes - maybe a PEP.
I agree, especially if the change, simple as it is, requires a lot of rewrite. In that case someone (ELizabeth?) should collect ideas for other improvements and shepherd it through the PEP process.
-- ~Ethan~ _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 20/01/17 10:59, Paul Moore wrote:
On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
It feels a bit minor to need a PEP, but having said that did you pick up on the comment about needing to return the number of bytes consumed?
str = struct.unpack('z', b'test\0xxx')
How do we know where the unpack got to, so that we can continue parsing from there? It seems a bit wasteful to have to scan the string twice to use calcsize for this...
A PEP (or at least, a PEP-style design document) might capture the answer to questions like this. OTOH, the tracker discussion could easily be enough - can you put a reference to the bug report here?
Paul
Two things:

1) struct.unpack and struct.unpack_from should remain backwards-compatible. I don't want to return extra values from it like (length unpacked, (data...)) for that reason. If the calcsize solution feels a bit weird (it isn't much less efficient, because strings store their length with them, so it's constant-time), there could also be new functions that *do* return the length if you need it. To me though, this feels like a use case for struct.iter_unpack.

2) I want to avoid making a weird incongruity, where only variable-length strings return the length actually parsed. This also doesn't really help with length calculations unless you're doing calcsize without the variable-length specifiers, then adding it on. It's just more of an annoyance.

On 20/01/17 12:18, Guido van Rossum wrote:
I'd be wary of making a grab-bag of small improvements, it encourages bikeshedding.
--Guido (mobile)
Definitely would prefer to avoid a bikeshed here, though other improvements to the struct module are certainly welcome! (Though about a better interface, I made a neat little prototype module for an object-oriented interface to struct, but I want to clean it up before I release it to the world... but I'm not sure I want to include it in the standard library, that's for another day and another proposal :p). -- Elizabeth
![](https://secure.gravatar.com/avatar/d995b462a98fea412efa79d17ba3787a.jpg?s=120&d=mm&r=g)
On 20 January 2017 at 20:47, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Two things:
1) struct.unpack and struct.unpack_from should remain backwards-compatible. I don't want to return extra values from it like (length unpacked, (data...)) for that reason. If the calcsize solution feels a bit weird (it isn't much less efficient, because strings store their length with them, so it's constant-time), there could also be new functions that *do* return the length if you need it. To me though, this feels like a use case for struct.iter_unpack.
2) I want to avoid making a weird incongruity, where only variable-length strings return the length actually parsed. This also doesn't really help with length calculations unless you're doing calcsize without the variable-length specifiers, then adding it on. It's just more of an annoyance.
Fair points, both. And you've clearly thought the issues through, so I'm +1 on your decision. You have the actual use case, and I'm just theorising, so I'm happy to defer the decision to you. Paul
![](https://secure.gravatar.com/avatar/8848a81d538f2fc428934988af5c8b42.jpg?s=120&d=mm&r=g)
On 20Jan2017 14:47, Elizabeth Myers <elizabeth@interlinked.me> wrote:
1) struct.unpack and struct.unpack_from should remain backwards-compatible. I don't want to return extra values from it like (length unpacked, (data...)) for that reason.
Fully agree with this.
If the calcsize solution feels a bit weird (it isn't much less efficient, because strings store their length with them, so it's constant-time), there could also be new functions that *do* return the length if you need it. To me though, this feels like a use case for struct.iter_unpack.
Often, maybe, but there are still going to be protocols that the new format doesn't support, where the performant thing to do (in pure Python) is to scan what you can with struct and "hand scan" the special bits with special code. Consider, for example, a format like MP4/ISO14496, where there's a regular block structure (which is somewhat struct-parsable) that can contain embedded arbitrarily weird information. Or the flipside, where struct-parsable data are embedded in a format not supported by struct.

The mixed situation is where you need to know where the parse got up to. Calling calcsize or its variable-size equivalent after a parse seems needlessly repetitive of the parse work.

For myself, I would want there to be some kind of call that returned the parse and the length scanned, with the historic interface preserved for the fixed size formats or for users not needing the length.
2) I want to avoid making a weird incongruity, where only variable-length strings return the length actually parsed.
Fully agree. Arguing for two API calls: the current one and one that also returns the scan length. Cheers, Cameron Simpson <cs@zip.com.au>
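The mixed "struct plus hand scan" situation Cameron describes typically looks something like the sketch below in pure Python today; the helper name `take_cstring` and the record layout are made up for illustration:

```python
import struct

def take_cstring(buf, offset):
    # Hand-scan a NUL-terminated string; return (value, new_offset).
    # This is exactly the bookkeeping a length-returning unpack
    # variant would absorb.
    end = buf.index(b'\x00', offset)
    return buf[offset:end], end + 1

# A record: 4-byte big-endian id, NUL-terminated name, 2-byte flags.
buf = b'\x00\x00\x00\x07alice\x00\x00\x03'
(ident,) = struct.unpack_from('!I', buf, 0)
name, pos = take_cstring(buf, 4)
(flags,) = struct.unpack_from('!H', buf, pos)
```

Every call has to thread an offset through by hand, which is why knowing where the parse got up to matters so much for these mixed formats.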
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 20/01/17 16:46, Cameron Simpson wrote:
On 20Jan2017 14:47, Elizabeth Myers <elizabeth@interlinked.me> wrote:
1) struct.unpack and struct.unpack_from should remain backwards-compatible. I don't want to return extra values from it like (length unpacked, (data...)) for that reason.
Fully agree with this.
If the calcsize solution feels a bit weird (it isn't much less efficient, because strings store their length with them, so it's constant-time), there could also be new functions that *do* return the length if you need it. To me though, this feels like a use case for struct.iter_unpack.
Often, maybe, but there are still going to be protocols that the new format doesn't support, where the performant thing to do (in pure Python) is to scan what you can with struct and "hand scan" the special bits with special code. Consider, for example, a format like MP4/ISO14496, where there's a regular block structure (which is somewhat struct-parsable) that can contain embedded arbitrarily weird information. Or the flipside, where struct-parsable data are embedded in a format not supported by struct.
The mixed situation is where you need to know where the parse got up to. Calling calcsize or its variable-size equivalent after a parse seems needlessly repetitive of the parse work.
For myself, I would want there to be some kind of call that returned the parse and the length scanned, with the historic interface preserved for the fixed size formats or for users not needing the length.
2) I want to avoid making a weird incongruity, where only variable-length strings return the length actually parsed.
Fully agree. Arguing for two API calls: the current one and one that also returns the scan length.
Cheers, Cameron Simpson <cs@zip.com.au>
Some of the responses on the bug are discouraging... mostly seems to boil down to people just not wanting to expand the struct module or discourage its use. Everyone is a critic. I didn't know adding two format specifiers was going to be this controversial. You'd think I proposed adding braces or something :/. I'm hesitant to go forward on this until the bug has a resolution.
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
Also, btw, adding 128-bit length specifiers sounds like a good idea in theory, but the difficulty stems from the fact there's no real native 128-bit type that's portable. I don't know much about how python handles big ints internally, either, but I could learn. I was looking into implementing this already, and it appears it should be possible by teaching the module that "not all data is fixed length" and allowing functions to report back (via a Py_ssize_t *) how much data was actually unpacked/packed. But again, waiting on that bug to have a resolution before I do anything. I don't want to waste hours of effort on something the developers ultimately decide they don't want and will just reject. -- Elizabeth
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 3:37 PM, Elizabeth Myers <elizabeth@interlinked.me> wrote: [...]
Some of the responses on the bug are discouraging... mostly seems to boil down to people just not wanting to expand the struct module or discourage its use. Everyone is a critic. I didn't know adding two format specifiers was going to be this controversial. You'd think I proposed adding braces or something :/.
I'm hesitant to go forward on this until the bug has a resolution.
Also, btw, adding 128-bit length specifiers sounds like a good idea in theory, but the difficulty stems from the fact there's no real native 128-bit type that's portable. I don't know much about how python handles big ints internally, either, but I could learn.
The "b128" in "uleb128" is short for "base 128"; it refers to how each byte contains one 7-bit "digit" of the integer being encoded -- so just like decimal needs 1 digit for 0-9, 2 digits for 10 - 99 = (10**2 - 1), etc., uleb128 uses 1 byte for 0-127, 2 bytes for 128 - 16383 = (128**2 - 1), etc. In practice most implementations are written in C and use some kind of native fixed width integer as the in-memory representation, and just error out if asked to decode a uleb128 that doesn't fit. In Python I suppose we could support encoding and decoding arbitrary size integers if we really wanted, but I also doubt anyone would be bothered if we were restricted to "only" handling numbers between 0 and 2**64 :-).
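To make the encoding concrete, here is a minimal pure-Python sketch of uleb128; since Python ints are arbitrary precision, it handles any non-negative integer without the fixed-width error cases C implementations have:

```python
def uleb128_encode(value):
    # Emit 7 bits per byte, least significant group first;
    # a set high bit means "more bytes follow".
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def uleb128_decode(buf, offset=0):
    # Return (value, new_offset) so the caller can keep parsing.
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        if not (byte & 0x80):
            return result, offset
        shift += 7
```

For example, 300 needs two 7-bit digits (300 = 2*128 + 44), so it encodes to b'\xac\x02'.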
I was looking into implementing this already, and it appears it should be possible by teaching the module that "not all data is fixed length" and allowing functions to report back (via a Py_ssize_t *) how much data was actually unpacked/packed. But again, waiting on that bug to have a resolution before I do anything. I don't want to waste hours of effort on something the developers ultimately decide they don't want and will just reject.
That's not really how Python bugs work in practice. For better or worse (and it's very often both), CPython development generally follows a traditional open-source model in which new proposals are only accepted if they have a champion who's willing to run the gauntlet of first proposing them, and then keep pushing forward through the hail of criticism and bike-shedding from random kibbitzers. This is at least in part a test to see how dedicated/stubborn you are about this feature. If you stop posting, then what will happen is that everyone else stops posting too, and the bug will just sit there unresolved indefinitely until you get (more) frustrated and give up.

On the one hand, this does tend to guarantee that accepted proposals are very high quality and solve some important issue (b/c if the champion didn't *really care* about the issue then they wouldn't put up with this). On the other hand, it's often pretty hellish for the individuals involved, and probably drives away all kinds of helpful contributions. But maybe it helps to know it's not personal? Having social capital definitely helps, but well-known/experienced contributors get put through this wringer too; the main difference is that we do it with eyes open and have strategies for navigating the system (at least until we get burned out).

Some of these strategies that you might find helpful (or not):

- realize that it's really common for someone to be all like "this is TERRIBLE and should definitely not be merged because of <issue> which is a TOTAL SHOW-STOPPER", but then if you ignore the histrionics and move forward anyway, it often turns out that all that person *actually* wanted was to see a brief paragraph in your design summary that acknowledges that you are aware of the existence of <issue>, and once they see this they're happy.
(See also: [1]) - speaking of which, it is often very helpful to write up a short document to summarize and organize the different ideas proposed, critiques raised, and what you conclude based on them! That's basically what a "PEP" is - just an email in a somewhat standard format that reviews all the important issues that were raised and then says what you conclude and why, and which eventually also ends up on the website as a record. If you decide to try this then there are some guidelines [2][3] and a sample PEP [4] to start with. (The guidelines make it sound much more formal and scary than it really is, though -- e.g. when they say "your submission may be AUTOMATICALLY REJECTED" then in my experience what they actually mean is you might get a reply back saying "hey fyi the formatter script barfed on your document because you were missing a quote so I fixed it for you".) This particular proposal is really on the lower boundary of needing a PEP and you might well be able to push it through without one, but it might be easier to go this way than not. - sift through the responses to pick the ones that seem actually useful to you, then close the browser tab and go off and implement what you actually think should be implemented and come back with a patch. This does a few things: (a) it helps get everyone on the same page and make the discussion much more concrete, which tends to eliminate a lot of idle criticism/bikeshedding, (b) it tends to attract higher-quality responses because it demonstrates you're serious and makes it look more like this is a thing that will actually happen (see again the "trial by combat" thing above), (c) many of the experts whose good opinion is important are attention-scattered volunteers who are bad at time management and prioritization (I include myself in this category!), so if you stick a patch in front of their faces then you can trick them into switching into code review mode instead of design critique mode. 
And it's much easier to respond to "your semicolon is in the wrong place" than "but what is the struct module really *for*, I mean, in its heart of hearts?", you know? - join the core-mentorship list [5] and ask for help there. Actually this should probably be the first suggestion on this list! it's a group of folks who're specifically volunteering to help people like you get through this process :-) Hope that helps, -n [1] http://www.ftrain.com/wwic.html [2] https://www.python.org/dev/peps/pep-0001/#submitting-a-pep [3] https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep [4] https://github.com/python/peps/blob/master/pep-0012.txt [5] https://mail.python.org/mailman/listinfo/core-mentorship -- Nathaniel J. Smith -- https://vorpus.org
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Fri, Jan 20, 2017 at 7:39 PM, Nathaniel Smith <njs@pobox.com> wrote:
[...] Some of these strategies that you might find helpful (or not):
Oh right, and of course just after I hit send I realized I forgot one of my favorites! - come up with a real chunk of code from a real project that would benefit from the change being proposed, and show what it looks like before/after the feature is added. This can be incredibly persuasive *but* it's *super important* that the code be as real as possible. The ideal is for it to solve a *concrete* *real-world* problem that can be described in a few sentences, and be drawn from a real code base that faces that problem. One of the biggest challenges for maintainers is figuring out how Python is actually used in the real world, because we all have very little visibility outside our own little bubbles, so people really appreciate this -- but at the same time, python-ideas is absolutely awash with people coming up with weird hypothetical situations where their pet idea would be just the ticket, so anything that comes across as cherry-picked like that tends to be heavily discounted. Sure, there *are* situations where the superpower of breathing underwater can help you fight crime, but... http://strongfemaleprotagonist.com/issue-6/page-63-3/ http://strongfemaleprotagonist.com/issue-6/page-64-3/ -n -- Nathaniel J. Smith -- https://vorpus.org
![](https://secure.gravatar.com/avatar/f3ba3ecffd20251d73749afbfa636786.jpg?s=120&d=mm&r=g)
On 21 January 2017 at 14:51, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Jan 20, 2017 at 7:39 PM, Nathaniel Smith <njs@pobox.com> wrote:
[...] Some of these strategies that you might find helpful (or not):
Oh right, and of course just after I hit send I realized I forgot one of my favorites!
- come up with a real chunk of code from a real project that would benefit from the change being proposed, and show what it looks like before/after the feature is added. This can be incredibly persuasive *but* it's *super important* that the code be as real as possible. The ideal is for it to solve a *concrete* *real-world* problem that can be described in a few sentences, and be drawn from a real code base that faces that problem. One of the biggest challenges for maintainers is figuring out how Python is actually used in the real world, because we all have very little visibility outside our own little bubbles, so people really appreciate this -- but at the same time, python-ideas is absolutely awash with people coming up with weird hypothetical situations where their pet idea would be just the ticket, so anything that comes across as cherry-picked like that tends to be heavily discounted.
In the specific case of this proposal, an interesting stress test of any design proposal would be to describe the layout of a CPython tuple in memory. If you trace through the struct and macro definition details in https://hg.python.org/cpython/file/tip/Include/object.h and https://hg.python.org/cpython/file/tip/Include/tupleobject.h you'll find that the last two fields in PyTupleObject are: Py_ssize_t ob_size; PyObject *ob_item[1]; So this is a C struct definition *in the CPython code base* that the struct module currently cannot describe (other PyVarObject definitions are similar to tuples, but don't necessarily guarantee that ob_size is the last field before the variable length section). Similarly, PyASCIIObject and PyCompactUnicodeObject append a data buffer to the preceding struct in order to include both a string's metadata and its contents into the same memory allocation. In that case, the buffer is also null-terminated in addition to having its length specified in the string metadata. So both of the proposals Elizabeth is making reflect ways that CPython itself uses C structs (matching the heritage of the module's name), even though the primary practical motivation and use case is common over-the-wire protocols. Cheers, Nick. P.S. "the reference interpreter does this" and "the standard library does this" can be particularly compelling sources of real world example code :) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
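As a sketch of why a PyVarObject-style layout defeats a single format string today: with the current module you need one unpack to read ob_size, then a second, dynamically built format for the trailing array. The field sizes below are made up for illustration, not CPython's real ABI:

```python
import struct

# A PyVarObject-ish layout: a fixed header ending in ob_size,
# followed by ob_size pointer-sized items (like ob_item[]).
header = struct.Struct('=qq')                    # e.g. ob_refcnt, ob_size
buf = struct.pack('=qqqqq', 1, 3, 10, 20, 30)    # header + 3 items

_, ob_size = header.unpack_from(buf, 0)
items = struct.unpack_from('=%dq' % ob_size, buf, header.size)
```

A length-prefix specifier would collapse those two calls into one format string.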
![](https://secure.gravatar.com/avatar/8848a81d538f2fc428934988af5c8b42.jpg?s=120&d=mm&r=g)
On 20Jan2017 17:26, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Some of the responses on the bug are discouraging... mostly seems to boil down to people just not wanting to expand the struct module or discourage its use. Everyone is a critic. I didn't know adding two format specifiers was going to be this controversial. You'd think I proposed adding braces or something :/.
I'm hesitant to go forward on this until the bug has a resolution.
Yes, they are, but I think they're being overly negative myself. The struct module _is_ explicitly targeted at C structs, and maybe its internals are quite rigid (I haven't looked). But as you say, both NUL-terminated strings and run-length-encoded strings are very common, and struct does not support them.

Waiting for a bug resolution seems unrealistic to me; plenty of bugs don't get resolutions at all, and to resolve this someone needs to take ownership of the bug and decide on something, concluding that the opposing views don't carry enough weight.

Why not write a PEP? If nothing else, even if it gets rejected (plenty of PEPs are rejected, and kept on record to preserve the arguments) it will be visible and on the record. And it will be a concrete proposal, not awash in bikeshedding. You can update the PEP to reflect the salient parts of the bikeshedding as it happens. Make it narrow in focus, explicitly the variable length thing, just like your issue. List the arguments for it (real world use cases, perhaps example real world code now and how it would be with the new feature) and the arguments against. Describe the additional API (at the least it needs an additional calcsize-like function that will return the data length scanned). Make it clear that the current API will continue to work unchanged.

Have you read the struct module? Do you think your additions would be very intrusive to it, or relatively simple? Is the present performance likely to remain the same with your additions (not necessarily the cost to parse the new formats, but the performance with any existing fixed-length structs)?

Cheers, Cameron Simpson <cs@zip.com.au>
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Jan 20, 2017 12:48 PM, "Elizabeth Myers" <elizabeth@interlinked.me> wrote: On 20/01/17 10:59, Paul Moore wrote:
On 20 January 2017 at 16:51, Elizabeth Myers <elizabeth@interlinked.me> wrote:
Should I write up a PEP about this? I am not sure if it's justified or not. It's 3 changes (calcsize and two format specifiers), but it might be useful to codify it.
It feels a bit minor to need a PEP, but having said that, did you pick up on the comment about needing to return the number of bytes consumed?
str = struct.unpack('z', b'test\0xxx')
How do we know where the unpack got to, so that we can continue parsing from there? It seems a bit wasteful to have to scan the string twice to use calcsize for this...
A PEP (or at least, a PEP-style design document) might capture the answer to questions like this. OTOH, the tracker discussion could easily be enough - can you put a reference to the bug report here?
Paul
Two things: 1) struct.unpack and struct.unpack_from should remain backwards-compatible. I don't want to return extra values from it like (length unpacked, (data...)) for that reason. If the calcsize solution feels a bit weird (it isn't much less efficient, because strings store their length with them, so it's constant-time), there could also be new functions that *do* return the length if you need it. To me though, this feels like a use case for struct.iter_unpack.

iter_unpack is strictly less powerful - you can easily and efficiently implement iter_unpack using unpack_from_with_offset (probably not its real name, but you get the idea). The reverse is not true. And:

val, offset = somefunc(buffer, offset)

is *the* idiomatic signature for functions for unpacking complex binary formats. I've seen it reinvented independently at least 4 times in real projects. (It turns out that implementing sleb128 encoding in Python is sufficiently frustrating that you end up making lots of attempts to find someone, anyone, who has already done it. Or at least, I did :-).) Here's an example of this idiom used to parse Mach-O binding tables, which iter_unpack definitely can't do: https://github.com/njsmith/machomachomangler/blob/master/machomachomangler/macho.py#L374-L429

Actually this example is a bit extreme since the format is *all* variable-width stuff, but it gives the idea. There are also lots of formats that have a mix of struct-style fixed width and variable width fields in a complicated pattern, e.g.: https://zs.readthedocs.io/en/latest/format.html#layout-details

Definitely would prefer to avoid a bikeshed here, though other improvements to the struct module are certainly welcome!

It doesn't necessarily have to be part of the same change, but if struct is gaining the infrastructure to support variable-width layouts then adding uleb128/sleb128 format specifiers would make a lot of sense.
Implementing them in pure Python is difficult (all the standard "how to en/decode u/sleb128" documentation assumes you're working with C-style modulo integers) and slow, and they turn up all over the place: both of those links above, in Google protobufs, as a primitive in the .Net equivalent of the struct module [1], etc. -n [1] https://msdn.microsoft.com/en-us/library/system.io.binarywriter.write7bitenc...
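For what it's worth, the decode half of sleb128 is not too bad once you lean on Python's arbitrary-precision ints instead of C-style modulo arithmetic; a sketch, using the (value, new_offset) idiom from above:

```python
def sleb128_decode(buf, offset=0):
    # Decode one signed LEB128 integer; return (value, new_offset).
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        shift += 7
        if not (byte & 0x80):
            if byte & 0x40:          # final byte's sign bit set: extend
                result -= 1 << shift
            return result, offset
```

The only fiddly part is the sign extension on the last byte, which is where most of the C-oriented descriptions get confusing in Python.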
![](https://secure.gravatar.com/avatar/aedec50baca048bb0121d9bb59c87d67.jpg?s=120&d=mm&r=g)
On 19/01/17 15:04, Yury Selivanov wrote:
This is a neat idea, but this will only work for parsing framed binary protocols. For example, if your protocol prefixes all packets with a length field, you can write an efficient read buffer and use your proposal to decode all of a message's fields in one shot. Which is good.
Not all protocols use framing though. For instance, your proposal won't help to write Thrift or Postgres protocols parsers.
It won't help them, no, but it will help others who have to do similar tasks, or help people build things on top of the struct module.
Overall, I'm not sure that this is worth the hassle. With proposal:
data, = struct.unpack('!H$', buf)
buf = buf[2+len(data):]
with the current struct module:
length, = struct.unpack('!H', buf)
data = buf[2:2+length]
buf = buf[2+length:]
I find such a construction is not really needed most of the time if I'm dealing with repeated frames; I could just use struct.iter_unpack. It's not useful in all cases, but as it stands, neither is the present struct module. Just because it is not useful to everyone does not mean it is not useful to others, perhaps immensely so. I think the existence of third-party libraries that implement a portion of my rather modest proposal already justifies it.
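For the framed case, struct.iter_unpack (available since Python 3.4) already covers repeated fixed-size records:

```python
import struct

# Two fixed-size records of two unsigned big-endian shorts each:
buf = struct.pack('!HH', 1, 2) + struct.pack('!HH', 3, 4)
records = list(struct.iter_unpack('!HH', buf))
```

What it can't do is exactly what this thread is about: frames whose size varies per record.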
Another thing: struct.calcsize won't work with structs that use variable length fields.
Should probably raise an error if the format has a variable-length string in it. If you're using variable-length strings, you probably aren't a consumer of struct.calcsize anyway.
Yury
On 2017-01-18 5:24 AM, Elizabeth Myers wrote:
Hello,
I've noticed a lot of binary protocols require variable length bytestrings (with or without a null terminator), but it is not easy to unpack these in Python without first reading the desired length, or reading bytes until a null terminator is reached.
I've noticed the netstruct library (https://github.com/stendec/netstruct) has a format specifier, $, which assumes the previous type to pack/unpack is the string's length. This is an interesting idea in and of itself, but doesn't handle the null-terminated string case. I know $ is similar to pascal strings, but sometimes you need more than 255 characters :p.
For null-terminated strings, it may be simpler to have a specifier for those. I propose 0, but this point can be bikeshedded over endlessly if desired ;) (I thought about using n/N but they're taken :P).
It's worth noting that (maybe one of?) Perl's equivalent to the struct module, whose name escapes me atm, has a module which can handle this case. I can't remember if it handled variable length or zero-terminated though; maybe it did both. Perl is more or less my 10th language. :p
This pain point is an annoyance imo and would greatly simplify a lot of code if implemented, or something like it. I'd be happy to take a look at implementing it if the idea is received sufficiently warmly.
-- Elizabeth
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
participants (16)
- Cameron Simpson
- Chris Angelico
- Daniel Spitz
- Elizabeth Myers
- Ethan Furman
- Guido van Rossum
- Joao S. O. Bueno
- Mark Dickinson
- MRAB
- Nathaniel Smith
- Nick Coghlan
- Nick Timkovich
- Paul Moore
- Rhodri James
- Steven D'Aprano
- Yury Selivanov