[Python-Dev] proposed struct module format code addition

Sun Oct 3 03:08:57 CEST 2004

Good day everyone,

I have produced a patch against the latest CVS to add support for two 
new formatting characters in the struct module.  It is currently filed 
as an RFE; a link to it is at the end of this post.  Please read the 
whole email before you respond to it.

Generally, the struct module is for packing and unpacking binary 
data.  It includes support for packing and unpacking the C types 
byte, char, short, long, long long, char[], *, and certain variants of 
those (signed/unsigned, big/little endian, etc.).

Purpose
-------
I have proposed two new formatting characters, 'g' and 'G' (for biGint 
or lonG int).

The primary purpose is to let users specify their own integer lengths 
(very useful for cryptography, and for real-world applications that 
involve non-standard sized integers).  Current solutions involve 
shifting, masking, and multiple passes over data.
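
For comparison, here is a minimal sketch of that shift-and-mask status 
quo for a single 3-byte unsigned big-endian integer.  It is written for 
a current Python 3 interpreter (which postdates this post), and the 
helper names are made up for illustration:

```python
def pack_u24_be(value):
    """Pack an unsigned 24-bit integer into 3 big-endian bytes by hand."""
    if not 0 <= value < (1 << 24):
        raise OverflowError("value does not fit in 3 unsigned bytes")
    # Peel off one byte at a time, most significant byte first.
    return bytes([(value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF])


def unpack_u24_be(data):
    """Rebuild the integer from exactly 3 big-endian bytes."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    return (data[0] << 16) | (data[1] << 8) | data[2]
```

Every additional width, sign convention, or byte order needs another 
pair of helpers like these.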

There is a secondary purpose, and that is that future n-byte integers 
(like 16-byte/128-bit integers as supported by SSE2) are already taken 
care of.

It also places packing and unpacking of these larger integers in the 
same module as packing and unpacking of other integers, floats, etc.  
This makes documentation easy.

Functionality-wise, it merely uses the two C functions 
_PyLong_FromByteArray() and _PyLong_ToByteArray(), with a few lines to 
handle interfacing with the pack and unpack functions in the struct module.

An example of use is as follows:
 >>> struct.pack('>3g', -1)
'\xff\xff\xff'
 >>> struct.pack('>3g', 2**23-1)
'\x7f\xff\xff'
 >>> struct.pack('>3g', 2**23)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
OverflowError: long too big to convert
 >>> struct.pack('>3G', 2**23)
'\x80\x00\x00'

It follows the struct module standard 'lowercase for signed, uppercase 
for unsigned'.
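
The proposed codes do not exist in any released struct module, so the 
behaviour above can only be approximated.  A rough sketch using 
int.to_bytes() from current Python 3 (an assumption; it is not part of 
the patch, and pack_int is a hypothetical helper name):

```python
def pack_int(value, nbytes, signed, byteorder="big"):
    """Approximate '>Ng' (signed=True) or '>NG' (signed=False) for one integer.

    Raises OverflowError when the value does not fit in nbytes.
    """
    return value.to_bytes(nbytes, byteorder, signed=signed)
```

With this helper, pack_int(-1, 3, True) gives b'\xff\xff\xff', 
pack_int(2**23, 3, False) gives b'\x80\x00\x00', and 
pack_int(2**23, 3, True) raises OverflowError, which mirrors the 
session above.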

Arguments
---------
There seem to be a few arguments against its inclusion into 
structmodule.c...

Argument:
    The size specifier is variable, so you must know the size/magnitude 
of the thing you are (un)packing before you (un)pack it.

My Response:
    All use cases I have for this particular mechanism involve not using 
'variable' sized structs, but fixed structs with integers of 
non-standard byte-widths.  Specifically, I have a project in which I use 
some 3 and 5 byte unsigned integers.  One of my (un)pack format 
specifiers is '>H3G3G', and another is '>3G5G' (I have others, but these 
are the most concise).
    Certainly this does not fit the pickle/cPickle long (un)packing 
use-case, but that problem relies on truly variable long integer 
lengths, which this specifier does not seek to solve.
    Really, the proposed 'g' and 'G' format specifiers are only as 
variable as the previously existing 's' format specifier.
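
To illustrate the comparison with 's': the count is written into the 
format string, so the record size is known before any data is seen.  A 
small sketch using only the existing 's' code (the variable names are 
illustrative):

```python
import struct

# A count in front of 's' fixes the field width in the format itself.
layout = ">H3s3s"
record_size = struct.calcsize(layout)  # 2 + 3 + 3 bytes

# Data of any other length is rejected rather than silently accepted.
try:
    struct.unpack(layout, b"\x00" * (record_size - 1))
    short_input_rejected = False
except struct.error:
    short_input_rejected = True
```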

Argument:
    The new specifiers are not standard C types.

My Response:
    Certainly they are not standard C types, but they are flexible 
enough to subsume all current integer C type specifiers.  The point was 
to allow a user to have the option of specifying their own integer 
lengths.  This supports use cases involving certain kinds of large 
dataset processing (my use case, which I may discuss after we release) 
and cryptography, specifically in the case of PKC...
while 1:
     blk = get_block()
     iblk = struct.unpack('>128G', blk)[0]
     uiblk = pow(iblk, power, modulus)
     write_block(struct.pack('>128G', uiblk))
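
Since '>128G' is only proposed, the same loop can be sketched today 
with int.from_bytes() and int.to_bytes() from current Python 3.  The 
function name and its block_size parameter are hypothetical, and the 
blocks are passed in as an iterable instead of coming from 
get_block():

```python
def transform_blocks(blocks, power, modulus, block_size=128):
    """Yield pow(block, power, modulus) for each fixed-size big-endian block."""
    for blk in blocks:
        if len(blk) != block_size:
            raise ValueError("unexpected block length")
        iblk = int.from_bytes(blk, "big")        # role of unpack('>128G')
        uiblk = pow(iblk, power, modulus)
        # The result fits whenever modulus <= 256 ** block_size;
        # otherwise to_bytes() raises OverflowError.
        yield uiblk.to_bytes(block_size, "big")  # role of pack('>128G')
```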

    The 'p' format specifier is also not a standard C type, and yet it 
is included in struct, specifically because it is useful.

Argument:
    You can already do the same thing with:
     pickle.encode_long(long_int)
     pickle.decode_long(packed_long)
and some likely soon-to-be included additions to the binascii module.

My Response:
    That is not the same.  Nontrivial problems require multiple passes 
over your data with multiple calls.  A simple:
     struct.unpack('H3G3G', st)
becomes:
     pickle.decode_long(st[:2]) #or an equivalent struct call
     pickle.decode_long(st[2:5])
     pickle.decode_long(st[5:8])
That approach has no endian or sign options, or else falls back on the 
status quo of masks and shifts to get the job done.  As previously 
stated, one point of this addition is to reduce the amount of bit 
shifting and masking required.
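
As a point of reference, the closest single-call approximation I can 
sketch with the existing format codes reads the 3-byte fields as raw 
strings and converts them afterwards.  This assumes big-endian unsigned 
fields and current Python 3's int.from_bytes(); unpack_h3g3g is a 
hypothetical name:

```python
import struct


def unpack_h3g3g(record):
    """Decode an unsigned short followed by two 3-byte unsigned integers."""
    # One struct call splits the record; 's' leaves the odd-width fields raw.
    head, raw_a, raw_b = struct.unpack(">H3s3s", record)
    # A second step is still needed to turn the raw bytes into integers.
    return head, int.from_bytes(raw_a, "big"), int.from_bytes(raw_b, "big")
```

The extra conversion step after the struct call is exactly what a 
format like '>H3G3G' would fold into the unpack itself.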

Argument:
    We could just document a method for packing/unpacking these kinds of 
things in the struct module, if this really is where people would look 
for such a thing.

My Response:
    I am not disputing that there are other methods of doing this; I am 
saying that the struct module already provides a framework and a 
documentation location that can absorb this particular modification 
with little trouble, which is far better than any other proposed 
location for equivalent functionality.
    Note that functionality equivalent to pickle.encode/decode_long is 
NOT what this proposed enhancement is for.

Argument:
    The struct module has a steep learning curve already, and this new 
format specifier doesn't help it.

My Response:
    I can't see how a new format specifier would necessarily make the 
learning curve any more difficult, if it was even difficult in the first 
place.

Why am I even posting
---------------------
Raymond has threatened to close this RFE due to the fact that only I 
have been posting to state that I would find such an addition useful.

If you believe this functionality is useful, or even if you think that I 
am full of it, tell us: http://python.org/sf/1023290

  - Josiah