Lazy unpacking for struct module
Hi.

We extensively use the struct module to crunch large amounts of binary data. There are basically two operations for us that only look like one to the naked eye: filtering (see if certain fields have certain values, throw everything away if not) and inspection (care about all the fields' values). The filtering part is very important, as most of the binary data can actually be thrown away and never has to be inspected any further. When thinking about how to increase performance, one thought was that a lot of objects are generated by the struct module that we never need: unpacking six fields in order to look at one and then throwing everything away is inefficient concerning the five other fields. It also makes filtering and inspecting basically the same operation regarding the (slow) unpacking of values, so we don't really benefit from filtering. This is a huge problem when crunching gigabytes of data and creating millions of fields.

One solution to this is using two format strings instead of only one (e.g. '4s4s i 4s2s2s'): one that unpacks just the filtered fields (e.g. '8x i 8x') and one that unpacks all the fields except the one already created by the filter (e.g. '4s4s 4x 4s2s2s'). This solution works very well and increases throughput by far. It does, however, also create complexity in the code, as we have to keep track of and combine field values that came from the filtering pass with the ones unpacked during the inspection pass (we don't want to simply unpack twice).

I'd like to propose an enhancement to the struct module that should solve this dilemma, and to ask for your comments.

The function s_unpack_internal() inside _struct.c currently unpacks all values from the buffer object passed to it and returns a tuple holding these values. Instead, the function could create a tuple-like object that holds a reference to its own Struct object (which holds the format) and a copy of the memory it is supposed to unpack. This object allows access to the unpacked values through the sequence protocol, unpacking a field if - and only if - it is accessed through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] == 'ab'). The object also needs to unpack each field only once: as all unpacked objects are immutable, we can hold references to them and return these instead once known. This approach is possible because there are no further error conditions inside the unpacking functions that we would *have* to deal with at the time .unpack() is called; in other words, unpacking can't fail if the format string's syntax was correct and can therefore be deferred (while packing can't).

I understand that this may seem like a single-case optimization. We can, however, assume that most people will benefit from the new behavior unknowingly, while everyone else takes no harm: the object mimicking the otherwise-returned tuple is immutable (therefore it doesn't suddenly become part of GC), and the memory overhead caused by holding references to the original memory a little longer (reclaimed after the result becomes unreachable) should be comparable to the memory used by unneeded fields (reclaimed directly after creation).

I'd like to hear your thoughts, and am perfectly willing to provide a patch if it has a chance of inclusion.

Best regards
Lukas
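A minimal sketch of the two-pass approach described above, using the example formats from the post (the filter predicate, the field names and the recombination order are illustrative):

    import struct

    FULL   = struct.Struct('4s4s i 4s2s2s')   # the full record layout (20 bytes)
    FILTER = struct.Struct('8x i 8x')         # unpacks only the field we filter on
    REST   = struct.Struct('4s4s 4x 4s2s2s')  # unpacks everything except that field

    def process(record):
        (key,) = FILTER.unpack(record)
        if key != 42:                # hypothetical filter predicate
            return None              # most records are cheaply discarded here
        # Only interesting records pay for full unpacking; the value
        # unpacked during filtering must be merged back in by hand.
        a, b, c, d, e = REST.unpack(record)
        return a, b, key, c, d, e

The bookkeeping at the end of process() is exactly the complexity the post complains about: the filter value and the inspection values come from two separate unpack calls and have to be recombined manually.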
On Jun 12, 2011, at 8:29 AM, Lukas Lueg wrote:
Hi.
We extensively use the struct module to crunch large amounts of binary data. There are basically two operations for us that only look like one to the naked eye: filtering (see if certain fields have certain values, throw everything away if not) and inspection (care about all the fields' values). The filtering part is very important, as most of the binary data can actually be thrown away and never has to be inspected any further. When thinking about how to increase performance, one thought was that a lot of objects are generated by the struct module that we never need: unpacking six fields in order to look at one and then throwing everything away is inefficient concerning the five other fields. It also makes filtering and inspecting basically the same operation regarding the (slow) unpacking of values, so we don't really benefit from filtering. This is a huge problem when crunching gigabytes of data and creating millions of fields.
One solution to this is using two format strings instead of only one (e.g. '4s4s i 4s2s2s'): one that unpacks just the filtered fields (e.g. '8x i 8x') and one that unpacks all the fields except the one already created by the filter (e.g. '4s4s 4x 4s2s2s'). This solution works very well and increases throughput by far.
This is what people normally do (unpack just the values they need, when they need them).
It does, however, also create complexity in the code, as we have to keep track of and combine field values that came from the filtering pass with the ones unpacked during the inspection pass (we don't want to simply unpack twice).
I'd like to propose an enhancement to the struct module that should solve this dilemma, and to ask for your comments.
The function s_unpack_internal() inside _struct.c currently unpacks all values from the buffer object passed to it and returns a tuple holding these values. Instead, the function could create a tuple-like object that holds a reference to its own Struct object (which holds the format) and a copy of the memory it is supposed to unpack. This object allows access to the unpacked values through the sequence protocol, unpacking a field if - and only if - it is accessed through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] == 'ab'). The object also needs to unpack each field only once: as all unpacked objects are immutable, we can hold references to them and return these instead once known. This approach is possible because there are no further error conditions inside the unpacking functions that we would *have* to deal with at the time .unpack() is called; in other words, unpacking can't fail if the format string's syntax was correct and can therefore be deferred (while packing can't).
I understand that this may seem like a single-case optimization.
Yes, it does.
We can, however, assume that most people will benefit from the new behavior unknowingly, while everyone else takes no harm: the object mimicking the otherwise-returned tuple is immutable (therefore it doesn't suddenly become part of GC), and the memory overhead caused by holding references to the original memory a little longer (reclaimed after the result becomes unreachable) should be comparable to the memory used by unneeded fields (reclaimed directly after creation).
I'd like to hear your thoughts and am perfectly willing to provide a patch if it has a chance of inclusion.
The problem you're trying to solve isn't unique to structs. That's why we get periodic requests for ropes-like behaviors for strings, for example. Someone could also request lazy extraction of fields in regular expressions, or lazy parses of JSON objects.

I don't think there is a net win from adding complexity to the struct module. Introducing lazy behaviors creates its own overhead that would compete with code optimized using the traditional approach (unpack what you need, when you need it). Also, the new behaviors add to the cognitive load when learning and remembering how to use this module.

In general, Python has opted for the most straightforward, least magical implementations of objects (why we use array-based lists instead of blist, for example). There are exceptions, such as Python 3's version of super(), but this isn't the norm.

I do suggest that you publish your code as a third-party module, to make the option available and to validate whether there is any real interest in this.

Raymond
This is what people normally do (unpack just the values they need, when they need them).

Because there are hundreds of format strings which are dynamically compiled from a more verbose language at runtime, we will have significant complexity in the code in order to generate format strings that parse just the fields that are needed for filtering. It's not just put-a-string-here-and-there.
I don't think there is a net win from adding complexity to the struct module. Introducing lazy behaviors creates its own overhead that would compete with code optimized using the traditional approach (unpack what you need, when you need it). Also, the new behaviors add to the cognitive load when learning and remembering how to use this module.
The complexity is very well contained. Remember that the interface to the module does not change at all, and the documentation would be exactly the same. There is no special case introduced here that the user has to know about. I also think this case has very little black magic in it, since we are dealing only with immutable objects and do not have delayed error conditions (both usually being the primary sources of headaches when introducing lazy behavior).
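For illustration, the proposed behaviour could be modelled in pure Python roughly like this (a sketch only -- the actual proposal is a C object inside _struct.c; the class name, the per-field format list and the standard-size '<' assumption are all mine):

    import struct

    class LazyUnpack:
        """Tuple-like view that unpacks a field only when it is accessed.

        Takes the format as a list of single-field codes and assumes
        standard sizes ('<', no alignment) so per-field offsets are easy
        to compute; the C version would share the compiled Struct instead.
        """

        def __init__(self, fields, buf):
            self._fields = []
            offset = 0
            for code in fields:
                self._fields.append((struct.Struct('<' + code), offset))
                offset += struct.calcsize('<' + code)
            if offset != len(buf):
                raise struct.error('unpack requires a buffer of %d bytes' % offset)
            self._buf = bytes(buf)   # private copy of the memory, as proposed
            self._cache = {}         # unpacked values are immutable: memoize

        def __len__(self):
            return len(self._fields)

        def __getitem__(self, i):    # unpack if - and only if - accessed
            if i not in self._cache:
                s, offset = self._fields[i]
                self._cache[i] = s.unpack_from(self._buf, offset)[0]
            return self._cache[i]

    foo = LazyUnpack(['2s', '2s'], b'abcd')
    assert foo[0] == b'ab'           # the second field is never unpacked

Construction is cheap because nothing is unpacked up front; the cost is paid per accessed field, which is why the idea only pays off when many fields go unread.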
Raymond Hettinger wrote:
The problem you're trying to solve isn't unique to structs. That's why we get periodic requests for ropes-like behaviors
I don't think he's asking for rope-like behaviour here. Rather, he's asking for iterator-like or view-like behaviour -- for the same reasons we have both zip() and izip(), for example, or that dict.items() became a view in Py3. It seems like a reasonable request to me. However, I'm wondering whether the ctypes module might already provide what he wants. -- Greg
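For what it's worth, a sketch of what that might look like with ctypes -- Structure fields are only converted to Python objects when accessed, which is the on-demand behaviour being asked for (the layout mirrors the '4s4s i 4s2s2s' example; the field names and packing are my assumptions):

    import ctypes

    class Record(ctypes.LittleEndianStructure):
        _pack_ = 1                   # byte-packed, like struct's standard mode
        _fields_ = [('a',   ctypes.c_char * 4),
                    ('b',   ctypes.c_char * 4),
                    ('key', ctypes.c_int32),
                    ('c',   ctypes.c_char * 4),
                    ('d',   ctypes.c_char * 2),
                    ('e',   ctypes.c_char * 2)]

    data = b'HEADNAME' + (42).to_bytes(4, 'little') + b'PAYLp0p1'
    rec = Record.from_buffer_copy(data)
    if rec.key == 42:                # only this field is converted here
        print(rec.a, rec.b)          # other fields are converted on access

Note that the endianness and _pack_ setting here are explicit choices; struct's native mode would correspond to a plain Structure without them.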
On Jun 12, 2011, at 3:18 PM, Greg Ewing wrote:
Raymond Hettinger wrote:
The problem you're trying to solve isn't unique to structs. That's why we get periodic requests for ropes-like behaviors
I don't think he's asking for rope-like behaviour here.
How would you describe the creation of a lazy result that keeps a reference to the underlying buffer? That is my understanding of how ropes work. Raymond
Raymond Hettinger wrote:
How would you describe the creation of a lazy result that keeps a reference to the underlying buffer?
I'd call it a view. There is plenty of precedent for this kind of object in Python -- I gave a few examples before.

The distinguishing feature of ropes, as I understand the term, is that you get a lazy object when you *combine* other objects, e.g. concatenating strings. That's not being asked for here -- there is only taking apart going on, not putting together.

It's also different from the oft-rejected idea of lazy string slicing, because extracting a field would give you a separate object, e.g. an int or string, not a reference to the original data object being unpacked.

So I really can't see what harm it could do, except for maybe a tiny performance reduction in the case where you extract all the fields, or refer to extracted fields repeatedly. Even if that turned out to be the case, and was considered objectionable, it could still be useful to provide view-oriented unpacking as an option.

-- Greg
So I really can't see what harm it could do, except for maybe a tiny performance reduction in the case where you extract all the fields, or refer to extracted fields repeatedly.
Referring to the view-object multiple times should not be a problem since the object can create and hold references to the unpacked values it created; remember that they are all immutable.
The followup: I've implemented a new StructView object for the 3.3a trunk. The object takes an existing Struct object and provides on-access unpacking. The breaking point where this object is faster than calling Struct.unpack seems to be somewhere around 12 fields in the format string. Format strings with fewer fields expose too much of the overhead of entering the C code, and staying there a little longer to unpack all fields is actually faster. Having fifteen or more fields in a format string seems unlikely, and I'll therefore abandon the idea of providing this mechanism.
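The kind of measurement behind that breakeven claim can be approximated with the public struct API alone, e.g. by timing a full unpack against a pad-heavy filter format over the same buffer (StructView itself never landed, so this only illustrates the method, not its numbers):

    import struct
    import timeit

    payload = bytes(64)
    full = struct.Struct('16i')        # sixteen 4-byte fields
    one  = struct.Struct('28x i 32x')  # same 64 bytes, a single field

    # With many fields, unpacking only the one you filter on wins; with
    # few fields, the fixed per-call cost dominates and eager unpacking wins.
    print(timeit.timeit(lambda: full.unpack(payload), number=1000000))
    print(timeit.timeit(lambda: one.unpack(payload), number=1000000))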
On Sunday, June 12, 2011 at 10:27 -0700, Raymond Hettinger wrote:
I do suggest that you publish your code as a third-party module, to make the option available and to validate whether there is any real interest in this.
See the Hachoir project: it is a lazy parser supporting sub-structures (whereas struct only supports flat structures). It is difficult to implement a lazy parser; I chose to use Python generators to do it. Hachoir should not enter the Python standard library: it evolves too fast and it is too big (60K lines of Python).

See also:

bdec: http://www.protocollogic.com/
Construct: http://construct.wikispaces.com/
FileAlyzer: http://www.safer-networking.org/fr/filealyzer/index.html
DataWorkshop: http://www.dataworkshop.de/

The DataWorkshop project has been dead since 2005. I don't remember whether FileAlyzer is free software or not.

I agree with Raymond: the struct module should be kept simple, and if you want a lazy parser, it should be a third-party project.

Victor
On 6/12/2011 11:29 AM, Lukas Lueg wrote:

This sort of speculative idea might fit the python-ideas list better. [Summary: we often need to extract a field or two from a binary record in order to decide whether to toss it, or unpack it all and process it.]
One solution to this is using two format-strings instead of only one (e.g. '4s4s i 4s2s2s'): One that unpacks just the filtered fields (e.g. '8x i 8x') and one that unpacks all the fields except the one already created by the filter (e.g. '4s4s 4x 4s2s2s'). This solution works very well and increases throughput by far. It however also creates complexity in the code as we have to keep track and combine field-values that came from the filtering-part with the ones unpacked during inspection-part (we don't want to simply unpack twice).
With just 1 or 2 filter fields, and very many other fields, I would just unpack everything, including the filter field. I expect the extra time to do that would be comparable to the extra time to combine. It certainly would make your code easier. I suspect you could write a function that creates the filter-fields-only format, given the field numbers, from the full format.
I'd like to propose an enhancement to the struct module that should solve this dilemma and ask for your comments.
The function s_unpack_internal() inside _struct.c currently unpacks all values from the buffer object passed to it and returns a tuple holding these values. Instead, the function could create a tuple-like object that holds a reference to its own Struct object (which holds the format) and a copy of the memory it is supposed to unpack. This object allows access to the unpacked values through the sequence protocol, unpacking a field if - and only if - it is accessed through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] == 'ab'). The object also needs to unpack each field only once: as all unpacked objects are immutable, we can hold references to them and return these instead once known. This approach is possible because there are no further error conditions inside the unpacking functions that we would *have* to deal with at the time .unpack() is called; in other words, unpacking can't fail if the format string's syntax was correct and can therefore be deferred (while packing can't).
I understand that this may seem like a single-case optimization.
Yep.
We can, however, assume that most people will benefit from the new behavior unknowingly, while everyone else takes no harm:
I will not assume that without code and timings. I would expect that unpacking one field at a time would take longer than all at once. To me, this is the sort of thing that should be written, listed on PyPI, and tested by multiple users on multiple systems first. -- Terry Jan Reedy
On Mon, Jun 13, 2011 at 6:31 AM, Terry Reedy <tjreedy@udel.edu> wrote:
With just 1 or 2 filter fields, and very many other fields, I would just unpack everything, including the filter field. I expect the extra time to do that would be comparable to the extra time to combine. It certainly would make your code easier. I suspect you could write a function that creates the filter-fields-only format, given the field numbers, from the full format.
Indeed, while the "filter format" part makes sense to me, the decision to go with field combination rather than just extracting the filter fields a second time isn't such an obvious improvement.

OTOH, it also seems like this is something that could be published as a cookbook recipe that generates the appropriate filtered formats on the fly from an existing struct definition. So given the format "a b c d e", it could either extract each field individually, or else be asked to generate specific formats and their complements (e.g., asking for the 2nd and 3rd fields could return a 2-tuple of formats "x b c x x" and "a x x d e").

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
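A possible shape for such a recipe (a sketch under the assumption that the format is written one field per whitespace token and compiled in standard-size '=' mode, so pad bytes line up exactly; the function name is made up):

    import struct

    def split_format(fmt, wanted):
        """Return (filter_fmt, rest_fmt) for a whitespace-separated format.

        'wanted' is a set of field indexes; every unwanted field is
        replaced by an equally sized run of pad bytes ('x'), and
        rest_fmt is the complement of filter_fmt.
        """
        fields = fmt.split()

        def select(keep):
            out = []
            for i, f in enumerate(fields):
                if i in keep:
                    out.append(f)
                else:
                    out.append('%dx' % struct.calcsize('=' + f))
            return '=' + ''.join(out)

        wanted = set(wanted)
        rest = set(range(len(fields))) - wanted
        return select(wanted), select(rest)

    # e.g. the formats from the original post (modulo the '=' prefix):
    # split_format('4s 4s i 4s 2s 2s', {2})
    #   -> ('=4x4xi4x2x2x', '=4s4s4x4s2s2s')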
participants (6)
- Greg Ewing
- Lukas Lueg
- Nick Coghlan
- Raymond Hettinger
- Terry Reedy
- Victor Stinner