
Good day everyone. I have produced a patch against the latest CVS to add support for two new formatting characters in the struct module. It is currently an RFE, to which I include a link at the end of this post. Please read the email before you respond to it.

Generally, the struct module is for packing and unpacking binary data. It includes support to pack and unpack the C types byte, char, short, long, long long, char[], *, and certain variants of those (signed/unsigned, big/little endian, etc.).

Purpose
-------
I had proposed two new formatting characters, 'g' and 'G' (for biGint or lonG int). The primary purpose is to offer users the opportunity to specify their own integer lengths (very useful for cryptography, and for real-world applications that involve non-standard sized integers). Current solutions involve shifting, masking, and multiple passes over the data. A secondary purpose is that future n-byte integers (like the 16-byte/128-bit integers supported by SSE2) are already taken care of. It also places packing and unpacking of these larger integers in the same module as packing and unpacking of other integers, floats, etc., which makes documentation easy.

Functionality-wise, it merely uses the two C functions _PyLong_FromByteArray() and _PyLong_AsByteArray(), with a few lines to handle interfacing with the pack and unpack functions in the struct module. An example of use is as follows:
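(What follows is a rough sketch of my own rather than the exact example from the patch: the hypothetical helpers pack_u24_be() and unpack_u24_be() below show the ord()/chr() shifting and masking that a single struct.pack('>3G', n) / struct.unpack('>3G', s)[0] call would replace for a 3-byte unsigned big-endian integer.)

    def pack_u24_be(n):
        # hypothetical helper: what the proposed struct.pack('>3G', n) would produce
        return chr((n >> 16) & 0xFF) + chr((n >> 8) & 0xFF) + chr(n & 0xFF)

    def unpack_u24_be(s):
        # hypothetical helper: what the proposed struct.unpack('>3G', s)[0] would return
        return (ord(s[0]) << 16) | (ord(s[1]) << 8) | ord(s[2])

    # e.g. pack_u24_be(1000000) == '\x0fB@' and unpack_u24_be('\x0fB@') == 1000000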
It follows the struct module convention of 'lowercase for signed, uppercase for unsigned'.

Arguments
---------
There seem to be a few arguments against its inclusion into structmodule.c...

Argument: The size specifier is variable, so you must know the size/magnitude of the thing you are (un)packing before you (un)pack it.

My Response: All use cases I have for this particular mechanism involve not 'variable' sized structs, but fixed structs with integers of non-standard byte widths. Specifically, I have a project in which I use some 3- and 5-byte unsigned integers. One of my (un)pack format specifiers is '>H3G3G', and another is '>3G5G' (I have others, but these are the most concise). Certainly this does not fit the pickle/cPickle long (un)packing use case, but that problem relies on truly variable long integer lengths, which this specifier does not seek to solve. Really, the proposed 'g' and 'G' format specifiers are only as variable as the previously existing 's' format specifier.

Argument: The new specifiers are not standard C types.

My Response: Certainly they are not standard C types, but they are flexible enough to subsume all of the current integer C type specifiers. The point is to give users the option of specifying their own integer lengths. This supports use cases involving certain kinds of large-dataset processing (my use case, which I may discuss after we release) and cryptography, specifically in the case of PKC...

    while 1:
        blk = get_block()
        iblk = struct.unpack('>128G', blk)[0]
        uiblk = pow(iblk, power, modulus)
        write_block(struct.pack('>128G', uiblk))

The 'p' format specifier is also not a standard C type, and yet it is included in struct, specifically because it is useful.

Argument: You can already do the same thing with:

    pickle.encode_long(long_int)
    pickle.decode_long(packed_long)

and some likely soon-to-be-included additions to the binascii module.

My Response: That is not the same. Nontrivial problems require multiple passes over your data with multiple calls. A simple:

    struct.unpack('H3G3G', st)

becomes:

    pickle.decode_long(st[:2])   # or an equivalent struct call
    pickle.decode_long(st[2:5])
    pickle.decode_long(st[5:8])

and has no endian or sign options, or else requires the status quo use of masks and shifts to get the job done. As previously stated, one point of the struct module is to reduce the amount of bit shifting and masking required.

Argument: We could just document a method for packing/unpacking these kinds of things in the struct module documentation, if this really is where people would look for such a thing.

My Response: I am not disputing that there are other methods of doing this; I am saying that the struct module includes a framework and a documentation location that can accommodate this particular modification with little trouble, which is far better than any other proposed location for equivalent functionality. Note that functionality equivalent to pickle.encode_long/decode_long is NOT what this proposed enhancement is for.

Argument: The struct module has a steep learning curve already, and this new format specifier doesn't help it.

My Response: I can't see how a new format specifier would necessarily make the learning curve any more difficult, if it was even difficult in the first place.

Why am I even posting
---------------------
Raymond has threatened to close this RFE because only I have been posting to state that I would find such an addition useful.
If you believe this functionality is useful, or even if you think that I am full of it, tell us: http://python.org/sf/1023290 - Josiah

On Oct 3, 2004, at 4:29 AM, Josiah Carlson wrote:
I'm +1 on this. I've dealt with 24-, 48-, and 128-bit integers before, and it's always been a pain with the struct module. Another addition I'd like to see is bit-length struct fields, but that opens an entirely different can of worms. -bob

Those functions ought to exist whether or not this RFE is accepted.

Here's the crux, I think. Is this used often enough in a context where (a) the length of the number is fixed (not determined by a count in a previous field), and (b) it is preceded or followed by other fixed-length fields, so that it makes sense to use the struct module for parsing or formatting those other fields?

I have often found that amongst less-experienced programmers there is a great mystery about the correspondence between the "binary" representation of numbers (in strings of bytes) and the numeric objects that Python makes available (int, long). Often the struct module is considered the only way to cross this boundary, while in fact there are many other approaches; often using the built-in functions ord() or chr() plus shifting and masking works just as well, but you have to think about it the right way.

I apologize for not having read the entire post before responding; in case the motivation is already there, that's great, and let it be a response to Raymond. If it is not there, I like Raymond's proposal better. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

No argument here. I believe the binascii and struct additions have orthogonal use-cases.
All use-cases I have for it, yes. Seemingly this is a yes with Bob Ippolito as well (he stated he uses 3, 6 and 16 byte integers, which certainly seem fixed).
In my case, yes. As provided in the email, two that I use right now are 'H3G3G' and '3G5G'.
Indeed, this /can/ be the case, but it is not in my case. Before I knew of struct, I had written my own binary packers and unpackers using ord and chr. While it worked, I nearly wept with joy when I discovered struct, which did 90% of what I needed it to do, in a much cleaner fashion.
I apologize for not having read the entire post before responding; in
Yeah, it was a bit long, I believed there was a lot to say. - Josiah

[Josiah Carlson]
I never used it <wink>.
That isn't an argument I've seen before, although I pointed out these pickle functions in the tracker item. The argument there is that pickle demonstrates actual long<->bytes use cases in the core, implemented painfully in Python, and that the struct patch wouldn't make them easier. The argument was not that you can already use the pickle functions; it was that the pickle functions shouldn't need to exist -- they're isolated hacks that address only one part of the whole problem, and even that part is addressed painfully now. The workalike implementation in cPickle.c was actually easier than the Python implementation in pickle.py, because the former gets to use the flexible C API functions I added for "this kind of thing".
and some likely soon-to-be included additions to the binascii module.
The pickle functions *may* argue in favor of binascii functions, if those were actually specified. I'm not sure Raymond has ever spelled out what their signatures would be, so I don't know. If suitable functions in binascii will exist, then that (as Guido said) raises the bar for adding similar functionality to struct too, but does not (as Guido also said) preclude adding similar functionality to struct.
My problem with that use case is that I've never had it, and have never seen an app that had it. Since I've been around for decades, that triggers a suspicion that it's not a common need. ...
That's another argument I haven't seen before, but bears an hallucinatory resemblance to one I made. People have wondered how to convert between long and bytestring in Python for years, and prior to this iteration, they have always asked whether there's "a function" to do it. Like all the use cases I ever had, they have one long to convert, or one bytestring, at a time. "Ah, see the functions in binascii" would be a direct answer to their question, and one that wouldn't require that they climb any part of struct's learning curve. IOW, if it *could* be done without struct, I'm sure that would make life easier for most people who ever asked about it. For people who already know struct, the marginal learning burden of adding another format code is clearly minor.
I can't see how a new format specifier would necessarily make the learning curve any more difficult,
Neither can I, for people who already know struct.
if it was even difficult in the first place.
It is difficult. There are many format codes, they don't all work the same way, and there are distinctions among:
- native, standard, and forced endianness
- native and standard (== no) alignment
- native and standard sizes for some types
Newbie confusion about how to use struct is common on c.l.py, and is especially acute among those who don't know C (just *try* to read the struct docs from the POV of someone who hasn't used C).
I certainly would like to see more people say they'd *use* the g and G codes in struct even if "one shot" functions in binascii were available. I'd also like to see a specific design for the binascii functions. I don't think "simple" would be an accurate description of those, if they're flexible enough to handle the common use cases I know about. They'd be more like:

    long2bytes(n, length=None, unsigned=False, msdfirst=False)

where, by default (length is None), long2bytes(n) is the same as pickle.encode_long(n), except that long2bytes(0) would produce '\0' instead of an empty string. Specifying length <= 0 is a ValueError. length > 0 would be akin to the C API

    _PyLong_AsByteArray(n, buf, length, not msdfirst, not unsigned)

with ValueError if n < 0 and unsigned=True, and OverflowError if length > 0 and n's magnitude is too large to fit in length bytes.

    bytes2long(bytes, unsigned=False, msdfirst=False)

would be akin to the C API

    _PyLong_FromByteArray(bytes, len(bytes), not msdfirst, not unsigned)

except that bytes=="" would raise ValueError instead of returning 0.
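To make the shape of those concrete, here's a rough pure-Python sketch of just the unsigned, fixed-length behaviour described above (the signed and default-length cases, and the real C implementations, are deliberately left out):

    def long2bytes(n, length, msdfirst=False):
        # sketch: unsigned, fixed-length case only
        if length <= 0:
            raise ValueError("length must be positive")
        if n < 0:
            raise ValueError("negative value with unsigned encoding")
        if n >> (8 * length):
            raise OverflowError("value will not fit in %d bytes" % length)
        out = [chr((n >> (8 * i)) & 0xFF) for i in range(length)]  # LSB first
        if msdfirst:
            out.reverse()
        return ''.join(out)

    def bytes2long(s, msdfirst=False):
        # sketch: unsigned case only; empty input is an error, as described above
        if not s:
            raise ValueError("empty byte string")
        if not msdfirst:
            s = s[::-1]      # put the most significant byte first
        n = 0
        for ch in s:
            n = (n << 8) | ord(ch)
        return n

E.g. long2bytes(1000000, 3, msdfirst=True) == '\x0fB@' and bytes2long('\x0fB@', msdfirst=True) == 1000000.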

Since "long" is supposed to be a full-fledged member of the Python building blocks, I'm +1 on functions being added to both binascii and struct. One of the things I use struct for most is packing (and unpacking) the Python building blocks for "external use" -- network, database, and (usually C) libraries. I think it would be best if all the building blocks could be packed and unpacked from one module. Of the two additions, the binascii functions would be the more convenient to use. But truth to tell, I rarely use binascii; I tend to prefer struct.pack with str.encode. What do you think about adding long.tobytes()/long.frombytes() to go with the new bytes() type? <wink> -Shane Holloway

On Sun, 03 Oct 2004 13:28:30 -0600, Shane Holloway (IEEE) <shane.holloway@ieee.org> wrote:
Sorry for introducing my not-very-qualified words on this topic, but... I've read the thread up to this point wondering why the bytes() type was not being thought of as a clean and definitive solution to this problem. It would allow us to greatly simplify everything regarding struct, binascii, and arbitrary low-level data manipulation for networking and similar stuff. I also agree with Tim Peters' comments regarding struct's C heritage -- I never really liked C even when I *had* to use it daily, and the struct syntax still reads alien to me. I know this is another timeframe entirely, but *if* my vote counted, I would be +1 for a future struct implementation tightly integrated with the bytes() type. But that's me anyway. -- Carlos Ribeiro Consultoria em Projetos blog: http://rascunhosrotos.blogspot.com blog: http://pythonnotes.blogspot.com mail: carribeiro@gmail.com mail: carribeiro@yahoo.com

Carlos Ribeiro wrote:
No, it wouldn't. If you have a 'long' value and you want to convert it to 'bytes', how exactly would you do that? Two's complement, I suppose -- but that would shut out people who want unsigned numbers. Also, do you want big-endian or little-endian? What about a minimum width, and what about overflows? Tim has proposed a signature for binascii that covers all these scenarios, and I doubt it could get simpler than that and still be useful.
I think you will find that the struct module *already* supports the bytes type. The bytes type will be just a synonym for the current string type, except that people will stop associating characters with the individual bytes; plus, the bytes type will possibly be mutable. As the struct module creates (byte) strings today, it will trivially support the bytes type. Regards, Martin

On Sun, 03 Oct 2004 23:18:20 +0200, Martin v. Löwis <martin@v.loewis.de> wrote:
Well, I don't intend to get too far off topic. But in my mind, it makes sense to have a few methods that allow any struct-type hack to work *directly* with the bytes() type. For example, the bytes() type could have a constructor that takes a struct-type format string, as in:

    bytes(data, 'format-string-in-struct-format')

or

    bytes.fromstruct(data, 'format-string-in-struct-format')

Alternatively, an append method could take two parameters, one being the data to be appended and the other the 'transformation rule' -- big endian, little endian, etc.:

    bytes.append(data, 'format-string-in-struct-format')

The interface for the data specification could also be a little cleaner; I don't see great value in specifying everything with single-character codes. It may look obvious to old C coders, but that doesn't mean it's the only, or the best, way to do it. Besides concatenation, a few other transformations are useful for bytes -- shifting and rotation in both directions (working at the bit level, perhaps?). That's how I thought it should work. (... and, as far as binascii is concerned, I see it more as a way to encode/decode binary data to and from true string formats than anything else.)
As I've explained above, it's a good first step, but a true bytes() type could have a little bit more functionality than char strings have. -- Carlos Ribeiro Consultoria em Projetos blog: http://rascunhosrotos.blogspot.com blog: http://pythonnotes.blogspot.com mail: carribeiro@gmail.com mail: carribeiro@yahoo.com

If you have used PostgreSQL, you know that all of its strings are a variant of Pascal strings. The same may be the case in other databases, but I have little experience with them.
It may be domain specific. I've only been using Python for 4 1/2 years, yet I have used such structs to build socket protocols, databases, and search engines, for both class work and contract work. Heck, I find it useful enough that I have considered donating to the PSF just to see this feature included.
This is one argument that Raymond has offered a few times. As for the native alignment issues that seem to be the cause of much frustration among new struct learners, this particular format specifier doesn't much apply there; it is not native.
"Ah, see the functions in binascii" would be a direct answer to their question, and one that ...
So there is no confusion, I agree with Raymond, Guido, and you, that the binascii function additions should occur.
Good point about the docs regarding not using C. Does it make sense to include documentation regarding C structs (perhaps in an appendix) to help those who have otherwise not experienced C?
<raise hand> <wink>
[snip conversion operations] Those would work great for packing and unpacking of single longs. - Josiah

On Sun, 3 Oct 2004 14:14:35 -0400, Tim Peters <tim.peters@gmail.com> wrote:
I have an application where I have to read and write a series of 24-bit integers in a binary file. The g and G codes would make this more convenient, as well as making all the reading and writing code more consistent (as the rest of it uses struct).

On Mon, 04 Oct 2004 08:16:33 +0200, Martin v. Löwis <martin@v.loewis.de> wrote:
There are a fixed number of them -- though it's somewhere in the thousands range... The array module would handle them quite nicely if it supported 3-byte integers; but in general, a more generic struct module will be handier than a more generic array module (I've never dealt with a tuple with thousands of entries before -- is it likely to be a problem? Anyway, wrapping it in a function to read all the ints in blocks and put them in a list is very little trouble).
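For instance, such a wrapper might look roughly like this (a sketch of my own, assuming 3-byte big-endian unsigned values; read_u24_list is a made-up name):

    import struct

    def read_u24_list(f, count):
        # read `count` 3-byte big-endian unsigned ints from file f into a list;
        # with the proposed codes the loop would collapse to
        #     struct.unpack('>' + '3G' * count, data)
        data = f.read(3 * count)
        result = []
        for i in range(0, 3 * count, 3):
            hi, mid, lo = struct.unpack('>3B', data[i:i+3])
            result.append((hi << 16) | (mid << 8) | lo)
        return result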

participants (9):
- "Martin v. Löwis"
- Andrew Durdin
- Bob Ippolito
- Carlos Ribeiro
- Guido van Rossum
- Josiah Carlson
- Raymond Hettinger
- Shane Holloway (IEEE)
- Tim Peters