Access to bits for a PyLongObject

I'm working on PEP 3101, Advanced String Formatting. About the only built-in numeric formatting I have left to do is converting a PyLongObject to binary.

I need to know how to access the bits in a PyLong. After reading longobject.c, I can figure it out, but I'm looking for something preferable to poking around directly in ob_size and ob_digit[]. I'm looking for something along the lines of:

    for (i = 0; i < _PyLong_NumBits(v); i++) {
        // get i-th bit of v
    }

I don't care if it's increasing or decreasing bits, I can handle either. I realize the actual code will likely involve two nested loops, but logically this is what I need.

I can certainly do this myself, by looping over ob_digit and then over each bit. But there's a comment in longintrepr.h that says the internals are published only for marshal.c. Should I go ahead and include longintrepr.h and loop over ob_digit myself, or is there some other method? I really don't want to use _PyLong_AsByteArray, since I don't want to do the copy. If I'm missing some PyLongObject API, please let me know.

By the way, while doing this I noticed a bug in stringobject.c and unicodeobject.c, relating to a missing check for 'G' precision. I'm not sure whether it could be a buffer overflow without spending some more time analyzing it, but the potential is certainly there. The bug is at http://python.org/sf/1673757 and a patch at http://python.org/sf/1673759.

(I'll save my comments on how "approachable" python-dev is after I have this question answered!)

Thanks for your time. Eric.
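The loop asked for above can be sketched at the Python level without touching the internal representation. This is only an illustration, not the C code under discussion: it assumes int.bit_length() as a rough stand-in for the bit count, and iter_bits and to_binary are hypothetical helper names.

```python
def iter_bits(value: int):
    """Yield the bits of abs(value), least significant bit first."""
    magnitude = abs(value)
    # bit_length() is the number of bits needed for the magnitude (0 for 0).
    for i in range(magnitude.bit_length()):
        yield (magnitude >> i) & 1


def to_binary(value: int) -> str:
    """Build a base-2 string from the bits, most significant bit first."""
    digits = "".join(str(bit) for bit in iter_bits(value))[::-1] or "0"
    return ("-" if value < 0 else "") + digits
```

The sign is handled separately from the magnitude, so the caller never needs to know how the integer stores its digits.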

Eric V. Smith wrote:
I think it would be a major flaw in PEP 3101 if you really needed it. The long int representation should be absolutely opaque - even the fact that it is a sign+magnitude representation should be hidden. Looking at the PEP, I see that a class can implement __format__. Wouldn't it be appropriate if the long type implemented that? Implementation-wise, I would expect that long_format already does the bulk of what you need to do. OTOH, also look at _PyString_FormatLong. Regards, Martin
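To illustrate the suggestion that the type itself implement __format__, here is a minimal Python sketch. Bits is a hypothetical wrapper class invented for this example; it handles only the bare 'b' type code itself and hands every other format spec to the wrapped int.

```python
class Bits:
    """Hypothetical int wrapper that formats itself; callers never see internals."""

    def __init__(self, value: int):
        self.value = value

    def __format__(self, spec: str) -> str:
        if spec == "b":
            # The type does its own base-2 conversion from sign and magnitude.
            sign = "-" if self.value < 0 else ""
            magnitude = abs(self.value)
            digits = ""
            while magnitude:
                digits = str(magnitude & 1) + digits
                magnitude >>= 1
            return sign + (digits or "0")
        # Every other spec is delegated to the wrapped int.
        return format(self.value, spec)
```

With this in place, "{:b}".format(Bits(10)) goes through Bits.__format__, and the formatting machinery only ever sees the resulting string.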

Martin v. Löwis wrote:
Yes, I think that would be appropriate. However, it conflicts with the current implementation strategy, which is to make a stand-alone module until we can flesh out all of the issues. Not that our current implementation should drive the correct decision, of course. Also, it would either mean duplicating lots of code from the int formatter, or having a formatter library that both can call. This is because __format__ must implement all formats, including padding, parentheses for negatives, etc., not just the "missing" binary format. Not that that's necessarily bad, either. But see the next point.
OTOH, also look at _PyString_FormatLong.
I think a solution would be to add 'b' to _PyString_FormatLong, which I'm already calling for hex, octal, and decimal formatting. Does that sound reasonable? It seems to me that if binary is useful enough for PEP 3101, it should generally be available in _PyString_FormatLong. The obvious implementation of this would require adding a nb_binary to PyNumberMethods. I'm not sure what the impact of that change would be, but it sounds really big and probably a show-stopper. Maybe a direct call to a binary formatter would be better.

OTOH, this approach isn't as efficient as I'd like (for all formatting outputs, not just binary), because it has to build a string object and then copy data out of it.

Having written all of this, I'm now thinking that Nick's suggestion of _PyLong_AsByteArray might be the way to go. I would use that for all of my formatting for longs. I think I can use my output buffer as the buffer for _PyLong_AsByteArray, since all formats (binary, decimal, octal, hex) are less "bit dense" than the byte array. As long as I read, format, and write the data in the correct order, I'd be okay. So even though I'd copy the data into my buffer, at least I wouldn't be allocating more memory or another object just to extract data from the long.

Maybe I'm over-emphasizing performance, given the early status of the implementation. But I'd like PEP 3101 to be as efficient as possible, because once it's available I'll replace all of the '%' string formatting in my code with it. I think others will consider that as well.

Thank you for your insights. I apologize for the length of this message, but as I believe Pascal said, I did not have time to make it shorter. Eric.
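The idea of reusing the output buffer as the byte-array buffer can be sketched in Python for the binary case. This is only a model of the ordering argument, assuming int.to_bytes() as a stand-in for the byte-array export; binary_digits_in_place is a hypothetical name, and the sketch handles non-negative values only.

```python
def binary_digits_in_place(value: int) -> str:
    """Expand raw bytes into '0'/'1' characters inside a single buffer."""
    if value < 0:
        raise ValueError("magnitude only; handle the sign separately")
    nbytes = max(1, (value.bit_length() + 7) // 8)
    # One output character per bit, so the buffer is 8x the raw byte count.
    buf = bytearray(8 * nbytes)
    # The raw big-endian bytes occupy the front of the output buffer.
    buf[:nbytes] = value.to_bytes(nbytes, "big")
    # Walk from the last raw byte to the first. Byte j expands into
    # positions 8*j .. 8*j+7, which never overlap an unread byte (< j).
    for j in range(nbytes - 1, -1, -1):
        byte = buf[j]  # read before the slot is overwritten
        for k in range(8):
            bit = (byte >> (7 - k)) & 1
            buf[8 * j + k] = ord("1") if bit else ord("0")
    return buf.decode("ascii").lstrip("0") or "0"
```

Processing the bytes back to front is what makes the in-place expansion safe: the output is wider than the input, so each write lands at or beyond the byte that was just read.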

Eric V. Smith wrote:
Ah, I had missed the point that it's just binary formatting that you are concerned with (and then I missed that binary is "base 2", rather than "sequence of bits").
That sounds fine.
Sure, introducing _PyLong_Dual (or _PyLong_AsDualString) would be appropriate, that can then forward to long_format.
Ah, but that's a proof-of-concept implementation only, right? A "true" implementation should use __format__ (or whatever it's called). If *that* then isn't efficient, you should be worried (and consider introduction of a slot in the type object).
How would you do negative numbers, then? AsByteArray gives you two's complement.
Maybe I'm over-emphasizing performance, given the early status of the implementation.
Most definitely.
That is fine. However, don't trade maintainability for efficiency. Keep encapsulation of types, this is what OO is for. Modularize along type boundaries. If that loses efficiency, come up with interfaces that still modularize in that way but are efficient. Don't "hack" to achieve performance. (Any other way I can formulate the same objective :-?) Regards, Martin
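The point about negative numbers can be seen in a small Python sketch. It assumes int.to_bytes(..., signed=True) behaves like the two's-complement byte-array export mentioned above; both function names are hypothetical.

```python
def twos_complement_bits(value: int, nbytes: int) -> str:
    """Bits of the fixed-width two's-complement encoding of value."""
    raw = value.to_bytes(nbytes, "big", signed=True)
    return "".join(format(byte, "08b") for byte in raw)


def sign_magnitude_bits(value: int) -> str:
    """Base-2 text built from the sign and the magnitude separately."""
    sign = "-" if value < 0 else ""
    return sign + format(abs(value), "b")
```

For -5 in one byte, the two's-complement form is 11111011, while a base-2 formatter wants -101, so a formatter working from the byte array has to take the sign separately and convert the magnitude.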

Martin v. Löwis wrote:
Apologies for not being clear. It's easy to forget that others don't share the context of something you've been immersed in.
_PyLong_Sign
Point taken. I currently have it using PyLong internals, just to get our tests to pass and so Pat can work on his part. As we still want to be a standalone module for a while, I'm going to modify the code to use AsByteArray and Sign to do the binary formatting only. When/if we integrate this into 3.0 (and 2.6, I hope), I'll look at adding __format__ to long, and possibly the other built-in types. To do so we'll need to factor some code out to a library, because it doesn't make sense for all the built-in types to understand how to parse and operate on the format specifiers (the [[fill]align][sign][width][.precision][type] stuff). Thanks for your comments! Eric.
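A shared parser for the [[fill]align][sign][width][.precision][type] grammar might look roughly like the following Python sketch. The FormatSpec tuple and parse_spec function are hypothetical names, and the sketch covers only the fields listed above, assuming '<', '>', '=' and '^' as the alignment characters.

```python
import re
from typing import NamedTuple, Optional


class FormatSpec(NamedTuple):
    fill: Optional[str]
    align: Optional[str]
    sign: Optional[str]
    width: Optional[int]
    precision: Optional[int]
    type: Optional[str]


_SPEC = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>=^]))?"  # fill is only valid before an align
    r"(?P<sign>[-+ ])?"
    r"(?P<width>\d+)?"
    r"(?:\.(?P<precision>\d+))?"
    r"(?P<type>[a-zA-Z%])?",
    re.DOTALL,
)


def parse_spec(spec: str) -> FormatSpec:
    """Split a format specifier into its fields; absent fields are None."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError("invalid format specifier: %r" % spec)
    width = match.group("width")
    precision = match.group("precision")
    return FormatSpec(
        fill=match.group("fill"),
        align=match.group("align"),
        sign=match.group("sign"),
        width=int(width) if width is not None else None,
        precision=int(precision) if precision is not None else None,
        type=match.group("type"),
    )
```

Each built-in type's __format__ could then call one parser and act only on the fields it understands, rather than each type re-implementing the grammar.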
