[Python-Dev] Lazy unpacking for struct module

Raymond Hettinger raymond.hettinger at gmail.com
Sun Jun 12 19:27:06 CEST 2011


On Jun 12, 2011, at 8:29 AM, Lukas Lueg wrote:

> Hi.
> 
> We extensively use the struct module to crunch large amounts of binary
> data. There are basically two operations for us that only look like
> one to the naked eye: filtering (check whether certain fields have
> certain values, throw the record away if not) and inspection (look at
> all the fields' values). The filtering part is very important, as most
> of the binary data can actually be thrown away and never has to be
> inspected any further. When thinking about how to increase
> performance, one thought was that the struct module generates a lot of
> objects we never need: unpacking six fields in order to look at one
> and then throwing everything away wastes the work spent on the other
> five. It also makes filtering and inspection basically the same
> operation with regard to the (slow) unpacking of values, so we don't
> really benefit from filtering. This is a huge problem when crunching
> gigabytes of data and creating millions of fields.
> 
> One solution is to use two format strings instead of only one
> (e.g. '4s4s i 4s2s2s'): one that unpacks just the filtered fields
> (e.g. '8x i 8x') and one that unpacks all the fields except those
> already created by the filter (e.g. '4s4s  4x  4s2s2s'). This solution
> works very well and increases throughput considerably.

This is what people normally do (unpack just the values they need,
when they need them).
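
For instance, a minimal sketch of that two-pass filtering, using the
format strings from your example (the 20-byte record layout, the
`wanted` key and the record source are made up for illustration):

    import struct

    # Layout from the example: two 4-byte tags, a native int used for
    # filtering, then three trailing byte fields (20 bytes per record).
    FILTER = struct.Struct('8x i 8x')         # unpack only the filter field
    REST   = struct.Struct('4s4s 4x 4s2s2s')  # unpack everything else

    def interesting(records, wanted):
        for rec in records:                   # each rec: one 20-byte record
            (key,) = FILTER.unpack(rec)
            if key != wanted:
                continue                      # cheap rejection path
            yield key, REST.unpack(rec)       # full unpack only for survivors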

> It does, however, add complexity to the code, as we have to keep track
> of field values that came from the filtering pass and combine them
> with the ones unpacked during the inspection pass (we don't want to
> simply unpack everything twice).
> 
> I'd like to propose an enhancement to the struct module that should
> solve this dilemma and ask for your comments.
> 
> The function s_unpack_internal() inside _struct.c currently unpacks
> all values from the buffer object passed to it and returns a tuple
> holding these values. Instead, the function could create a tuple-like
> object that holds a reference to its own Struct object (which holds
> the format) and a copy of the memory it is supposed to unpack. This
> object allows access to the unpacked values through the sequence
> protocol, unpacking a field if - and only if - it is accessed
> through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] ==
> 'ab'). The object can also make sure each field is unpacked at most
> once: as all unpacked objects are immutable, we can hold references to
> them and return those on later accesses. This approach is possible
> because there are no further error conditions inside the unpacking
> functions that we would *have* to deal with at the time .unpack() is
> called; in other words, unpacking can't fail if the format string's
> syntax was correct, and can therefore be deferred (while packing
> can't).
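
If I understand the proposal correctly, in pure Python the result
object would behave roughly like this sketch (the real thing would be
done in C inside _struct.c; for simplicity the model assumes a
standard-size format, i.e. a leading '<', '>', '=' or '!', so field
offsets are just the cumulative sizes of the preceding fields):

    import re
    import struct

    class LazyUnpack:
        # Pure-Python model of the proposed lazy result object.
        _FIELD = re.compile(r'(\d*)([xcbB?hHiIlLqQfdsp])')

        def __init__(self, fmt, buf):
            order = fmt[0]                 # '<', '>', '=' or '!'
            self._fields = []              # (offset, one-field Struct)
            offset = 0
            for count, code in self._FIELD.findall(fmt[1:]):
                n = int(count) if count else 1
                if code == 'x':            # pad bytes yield no value
                    offset += n
                elif code in 'sp':         # '4s' is ONE 4-byte field
                    s = struct.Struct(order + (count or '') + code)
                    self._fields.append((offset, s))
                    offset += s.size
                else:                      # '2i' means two separate values
                    s = struct.Struct(order + code)
                    for _ in range(n):
                        self._fields.append((offset, s))
                        offset += s.size
            self._buf = bytes(buf)         # private copy, as proposed
            self._cache = {}               # index -> already-unpacked value

        def __len__(self):
            return len(self._fields)

        def __getitem__(self, i):          # the sq_item slot, in effect
            if i not in self._cache:       # unpack on first access only
                offset, s = self._fields[i]
                self._cache[i] = s.unpack_from(self._buf, offset)[0]
            return self._cache[i]

    row = LazyUnpack('<4s4si4s2s2s', b'aaaabbbb\x07\x00\x00\x00ccccddee')
    key = row[2]                           # unpacks only the int field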
> 
> I understand that this may seem like a single-case-optimization.

Yes, it does.

> We can however assume that most people would benefit from the new
> behavior without noticing, while everyone else comes to no harm: the
> object mimicking the otherwise-returned tuple is immutable (so it
> doesn't suddenly become part of GC), and the memory overhead of
> holding a reference to the original memory a little longer (reclaimed
> once the result becomes unreachable) should be comparable to the
> memory used by the unneeded fields (currently reclaimed right after
> creation).
> 
> I'd like to hear your thoughts and am perfectly willing to provide a
> patch if it has a chance of inclusion.

The problem you're trying to solve isn't unique to structs.
That's why we get periodic requests for rope-like behaviors
for strings, for example.  Someone could also request lazy
extraction of fields in regular expressions or lazy parsing of
JSON objects.

I don't think there is a net win from adding complexity to the struct
module.  Introducing lazy behaviors creates its own overhead
that would compete with code optimized using the traditional
approach (unpack what you need, when you need it).  Also,
the new behaviors add to the cognitive load when learning
and remembering how to use this module.  
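
As a rough way to see that overhead (just a sketch; the exact numbers
vary by build and machine), compare one eager unpack against the six
piecewise unpacks a lazy result would end up doing once every field is
eventually accessed:

    import struct
    import timeit

    REC = struct.Struct('<4s4si4s2s2s')
    data = REC.pack(b'aaaa', b'bbbb', 7, b'cccc', b'dd', b'ee')

    # Eager: one C-level call creates all six objects in a single pass.
    eager = timeit.timeit(lambda: REC.unpack(data), number=10**5)

    # What a lazy result pays once every field gets touched: six small
    # unpacks (plus, in a real lazy object, a Python-level sq_item
    # dispatch and a cache check per access, not counted here).
    FIELDS = [(0, struct.Struct('<4s')), (4, struct.Struct('<4s')),
              (8, struct.Struct('<i')), (12, struct.Struct('<4s')),
              (16, struct.Struct('<2s')), (18, struct.Struct('<2s'))]

    def piecewise():
        return [s.unpack_from(data, off)[0] for off, s in FIELDS]

    lazy_like = timeit.timeit(piecewise, number=10**5)
    print(eager, lazy_like)   # piecewise is usually noticeably slower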

In general, Python has opted for the most straightforward, least
magical implementation of objects (which is why we use array-based
lists instead of blist, for example).  There are exceptions,
such as Python 3's version of super(), but this isn't the norm.

I do suggest that you publish your code as a third-party module
to make the option available and to find out whether there
is any real interest in it.


Raymond