Vstruct module: unpack heterogenous, variable-length binary data

Sat May 3 05:57:04 EDT 2003

This is very much a work-in-progress. Comments welcome.

--
Francis Avila

---vstruct.py---
"""Unpack heterogenous, variable-length binary data.

This module, vstruct, contains functions to unpack heterogenous binary
data.  Unlike struct, it can also unpack structures of variable
length, by determining the repeat count of a structure from the value
of some slice of bytes in the data itself.  That is, vstruct
understands indexed, heterogenous arrays.  No more fiddling with struct
on network data!

The general paradigm for such structure unpacking functions is to
apply a template--which describes the data types--to a string of raw
binary data.  In struct, this template is a string of characters,
called a "format string."  In vstruct, this template is a tuple or
list, called a "format sequence," which may consist of format
sequences or format strings.

Format sequences are S-expressions, for you lisp folks: the car
defines the data type of a multiplier for the cdr, while the cdr is
the data structure of an element in the array.

The car is either None or a single struct format string of an integer
datatype.  None means that there is nothing in the raw data stream
that specifies a repeat count: just unpack the cdr once.  The struct
format string that specifies the repeat count may be any integer
type--negative values are treated as 0 (i.e., don't unpack the cdr),
although arguably they should throw an error.

The elements of the cdr are either a struct format string, or another
sequence.

Some examples:

(None, '=b', '=h')

This is a simple case, and can be represented by the struct format
string '=bh'.

('=b', '=s'), which is the same as (None, ('=b', '=s'))

This defines a pascal-like string: the value of an initial byte
specifies the number of characters to follow. Applied to the data
'\x02Hi!', sequnpack returns (('Hi',), '!'). '\x04Hi!' would raise an
error.

This is not the same as struct's 'p', for two reasons: the repeat of p
is the length of the whole structure *including* the repeat count
byte, and p will silently pad or truncate with nulls when packing.

(None, (None, '!H', '!H', '!H'), '!H',
    ('!B', '!H'), ('!B', ('!H', '!s')),
    ('!H', '!s'), ('!H', '!s'),
    ('!B', ('!H', '!s')), ('!H', '!s'))

This is the structure of an XDMCP Request packet. The following might
be more readable:

>>> S = '!s'; C8 = '!B'; C16 = '!H'; A8 = (C16, S); A16 = (C8, C16)
>>> AA8 = (C8, A8); HEADER = (None, C16, C16, C16)
>>> (None, HEADER, C16, A16, AA8, A8, A8, AA8, A8)
(None, (None, '!H', '!H', '!H'), '!H', ('!B', '!H'), ('!B', ('!H', '!s')),
('!H', '!s'), ('!H', '!s'), ('!B', ('!H', '!s')), ('!H', '!s'))

Cool, no?

For now, there are two limitations concerning format strings. These
will hopefully be eliminated later, or some nice interface will be
made to create format sequences which are guaranteed to be
well-formed.

* Byte order must be explicit for ALL format strings, and @ (native
  order and alignment) is not allowed.  The byte order prefix
  requirement can be fixed fairly easily (albeit tediously), but I
  truly have no idea how to deal with alignment issues. On the plus
  side, unlike struct, you can specify different byte orders within
  the same format sequence. (Whether there is any such data structure
  on the face of the earth is another question.)

* Format strings must be atomic, that is, must represent only one
  occurrence of one data type. ('!hh' is illegal, for instance, and
  must be written '!h', '!h'.)  And don't use whitespace, even though
  struct allows it. (We're talking really brittle here.)

These two taken together mean that every format string is exactly two
characters long.  Format strings must additionally conform to any
well-formededness requirements of the underlying struct module
implementation (i.e., use valid datatype characters). Vstruct itself
is ignorant of the semantics of the format strings and datatypes,
except in the case of 's' and 'p'.  See the struct module
documentation for details.

FUNCTIONS IN THIS MODULE

The unpack algorithm is recursive.  The entry point is
sequnpack, which calls other functions in the following way:

sequnpack
  splitcarcdr                           # Split sequence into car/cdr
  getrc(car)                  # Determine repeat count of cdr from car
  [if cdr is atomic]             # Mostly to handle things like '=10s'
    strunpack(rc+cdr)
  [else]
    getres(cdr) * rc                    # Unpack items in cdr rc times
      [if is seq]
        sequnpack                   # Recurse into the nested sequence
          ...
      [elif is string]
        strunpack                       # struct.unpack can unpack it
[Done]

BUGS

* This is a recursive algorithm, so maybe there might be stack
  problems for really deeply nested format sequences?  I don't really
  know.  Perhaps generators would yield a cleaner implementation?

* No unit tests (yet)!  And not thoroughly tested.

* No native alignment. I have no idea how to deal with alignment, but
  it seems like it would be a mess.

* No corresponding format sequence packer. Yet.

* Negative repeat counts are treated as 0.  I'm not certain whether it
  would be better that these raise an error.

* No checking for exceptions yet.  It seems like exceptions in helper
  functions should be caught and rethrown as a module-wide exception
  class, with a more context-sensitive and helpful error message.

* The algorithm is sound (I think), but the implementation is very
  brittle.  Format sequences that are not perfectly formed will
  silently produce undefined behavior.  Frankly, it is easier to write
  something that outputs well-formed sequences than to add all the
  necessary checks to this module.  Plus, it'll slow everything down.

* Requiring the byte-order prefix in fmtstrings is a kludge. Using a
  prefix without a global variable (i.e., passing it around from
  function to function) would make things horrendously complex.  But
  using some sort of global demands that we put this whole thing into
  a class (which is also more complex, plus contrary to the
  batch-oriented nature of the operation).

These last two would be best resolved by writing an interface to
create format sequences, rather than by changing the algorithm.
Fixing these latter two in the algorithm would either slow things down
(string catenation for every strunpack call, for example), or make
things more complex than they already are.

So I guess the last bug is that there is no such interface written....

"""
# $Id: vstruct.py,v 1.4 2003/05/03 09:40:59 franga Exp $

__author__ = 'Francis Avila (franga at shentel dot net)'
__version__ = '$Revision: 1.4 $'
__date__ = '$Date: 2003/05/03 09:40:59 $'

import struct
from types import StringType, TupleType, ListType

_debug = 0

def __pd(*x):
    global _debug
    if _debug > 0:
        import pprint
        pprint.pprint(x)

def splitcarcdr(fmt, data):
    """Return (carfmt, cardata), (cdrfmt, cdrdata)."""
    __pd('>splitcarcdr:', fmt, data)
    carfmt, cdrfmt = fmt[0], fmt[1:]
    try:
        ln = struct.calcsize(carfmt)
    except TypeError:                   # carfmt==None
        ln = 0
    cardata, cdrdata = data[:ln], data[ln:]
    __pd('<splitcarcdr:', (carfmt, cardata), (cdrfmt, cdrdata))
    return (carfmt, cardata), (cdrfmt, cdrdata)

def getrc(carfmt, cardata):
    """Return the cdr's repeat count."""
    if carfmt==None:
        rc = 1
    else:
        rc = struct.unpack(carfmt, cardata)[0]
    __pd('<getrc:', rc)
    return rc

def getres(cdrfmt, cdrdata):
    """Return unpacked cdrfmt and unused data."""
    __pd('>getres:', cdrfmt, cdrdata)
    res = []
    for s in cdrfmt:

        # StringType elements are atomic, fixed-length struct-format
        # strings, which can be fed to struct.unpack.  Tuple or
        # ListType elements are nested fmt sequences, and must be
        # descended into to unpack them.  Semantically, all values of
        # the same level of nesting ought to be grouped together.
        # Thus the values of fmtstrings should be extended, and of
        # fmtsequences should be appended.

        if type(s) is StringType:
            unp, cdrdata = strunpack(s, cdrdata)
            res += unp
        elif type(s) is TupleType or \
                 type(s) is ListType:
            unp, cdrdata = sequnpack(s, cdrdata)
            res.append(unp)
    __pd('<getres:', res, cdrdata)
    return tuple(res), cdrdata

def sequnpack(fmt, data):
    """Return an unpacked sequence and unused data.

    This is the entry point. See this module's docstring for more
    information on format sequences.

    """
    __pd('>sequnpack:', fmt, data)
    result = []
    car, cdr = splitcarcdr(fmt, data)
    cdrfmt, cdrdata = cdr
    del(cdr)                            # cdrdata will be altered,
                                        # so cdr[1] won't be reliable
    n = getrc(*car)

    # The following is an optimization to reduce the number of struct
    # calls by using a repeat count in the struct format string (e.g.,
    # '=10b').  This is necessary for strings to come out as ('ab',)
    # instead of (('a',), ('b',)). In order to do this, however, the
    # entire cdr must be atomic.

    if len(cdrfmt)==1 and type(cdrfmt[0]) is StringType:
        unp, cdrdata = strunpack(cdrfmt[0][0]+`n`+cdrfmt[0][1],
                                 cdrdata)
        # Do not append unp to result!
        # Do not return unp as a tuple!
        # See comments in strunpack for why.
        return unp, cdrdata
    else:
        for i in range(n):
            unp, cdrdata = getres(cdrfmt, cdrdata)
            result.append(unp)
    __pd('<sequnpack:', result, cdrdata)
    return tuple(result), cdrdata

def strunpack(sfmt, cdrdata):
    """Return the value of atomic string sfmt, and unused data.

    sfmt must be very carefully crafted.  See this module's docstring
    for more details.

    """
    __pd('>strunpack:', sfmt, cdrdata)
    ln = struct.calcsize(sfmt)
    data, cdrdata = cdrdata[:ln], cdrdata[ln:]
    res = struct.unpack(sfmt, data)

    # For string-like datatypes, we have the behavior where the result
    # of 'ssss' != the result of '4s'.  This is a huge PITA, and a big
    # reason why I demand that all fmtstrings in a fmtsequence be
    # atomic.  For 's' and 'p' datatypes, the python-sequence (not
    # fmtsequence) type we're really interested in receiving is not
    # the tuple, but the string itself.  But struct always returns
    # tuples, which in this case (because only atomic values are
    # allowed) will always be of length 1.  We need to strip off the
    # tuple and return the string itself.

    if sfmt[-1] == 's' or sfmt[-1] == 'p': # These two lines took
        res = res[0]                       # a few hours...
    __pd('<strunpack:', res, cdrdata)
    return res, cdrdata

def _test():
    r = sequnpack(F, D)
    print 'Expect:', R[0][0]
    print '   Got:', r[0][0]

D = '\x00\x00\x03\x00\x02Hi\x00\x05There\x01\x01!\x02\x02\x00\x00\x00'
F = (None, '=h', ('=h', ('=b', '=s'), '=b'), '=l')
R = (((0, (('Hi', 0), ('There', 1), ('!', 2)), 2),), '')

if __name__ == '__main__': _test()