bitwise operations on bytes

Hello,
As previously mentioned on python-ideas [1] (circa 2006), it would make sense to be able to perform bitwise operations on bytes/bytearray. Stealing the example from the original suggestion:
Suppose I have a string (say, read in from a binary file) that has a 4-byte field in it. Bit 11 indicates whether the accompanying data is in glorped form (something I'm just making up for this example). For example, if the field has 0x000480c4, the data is not glorped. If the data is glorped, bit 11 would be on, i.e., 0x001480c4. Let's say I want to turn on the glorp bit; what I have to do now:
    GLORPED = 0x10
    newchar = flags[1] | GLORPED
    flags = flags[0] + newchar + flags[2:]
What I'd like to be able to do is something like:
    GLORPED = b"\x00\x10\x00\x00"
    flags |= GLORPED
    # test if the glorped bit is on
    any(flags & GLORPED)
I have run into this a few times, at least when reading/writing binary formats etc. This approach is more intuitive than the typical/archaic way of converting to an integer, performing a bitwise operation on the integer, converting back to bytes. Arguably, bitwise operations on a high-level integer type don't make sense, as base 2 is an implementation detail.
At the very least, bytes and bytearray should be usable with the ~ ^ | & operators.
Example behavior:
Unresolved problems:
If the two arguments are of different lengths, it could either raise a ValueError or mimic the behavior of ints.
Xoring an int to a byte seems less than well defined in general, due to endianness ambiguity of the int and size ambiguity. I would think this should not be allowed.
Also conceivable is using the shift operators >> and << on bytes, but I personally would use that less often, and the result of such an operation is ambiguous due to endianness.
-Eric Eisner
[1] http://mail.python.org/pipermail/python-ideas/2006-December/000001.html
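For comparison, the integer round-trip that the message calls the typical way can be sketched as below. This assumes a Python version that provides int.from_bytes and int.to_bytes (3.2 and later, so not what was available when this was posted). The helper name set_bits is made up for illustration, and big-endian order is an assumption about the file format.

```python
def set_bits(field: bytes, mask: int) -> bytes:
    """Return a copy of `field` with the bits in `mask` turned on.

    Hypothetical helper: treats `field` as a big-endian unsigned integer.
    """
    value = int.from_bytes(field, "big")
    # Convert back using the original length so leading zero bytes survive.
    return (value | mask).to_bytes(len(field), "big")


flags = bytes.fromhex("000480c4")
glorped = set_bits(flags, 0x00100000)
```

The caller has to pick a byte order even though the operation itself does not depend on one, which is the awkwardness the proposal is pointing at.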

Eric Eisner wrote:
You can already do bitwise operations on bytes, and retrieving the relevant byte and using single-byte operations on it is simple enough. As I see it, the major underlying problem here is that byte-arrays are immutable, since what you really want is to be able to change a single byte of the array in-place. In general, Python's byte arrays and strings are very ill suited for dealing with data which isn't byte-based, and definitely horrible at building or modifying data streams. I recommend using an external library for working with data structures and/or data streams. Construct[1] is my personal favorite, since it's especially Pythonic and easy to use. In any case, if you wish to propose a mutable byte/bit array which supports array-wise binary operators, I can say I would certainly be glad to have such a class at my disposal. - Tal [1] http://construct.wikispaces.com/

On Fri, Aug 07, 2009 at 02:43:54PM +0300, Tal Einat wrote:
Like this http://pypi.python.org/pypi/BitVector/ ? Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

Tal Einat wrote:
Careful with your terminology here. In Python 3.x, the following two types both exist:
    bytes: immutable sequence of integers between 0 and 255 inclusive
    bytearray: similar to bytes, but mutable
The operations ~x, x&y, x|y and x^y make sense for both bytes and bytearray. The augmented assignment operators only make sense for bytearray.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
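The distinction between the two types can be seen with single-byte operations, which already work today. A small sketch (the variable names are only illustrative):

```python
flags = bytearray(b"\x00\x04\x80\xc4")
flags[1] |= 0x10  # bytearray is mutable: one byte is updated in place

frozen = bytes(flags)
try:
    frozen[1] |= 0x10  # bytes is immutable: item assignment is rejected
except TypeError:
    bytes_is_immutable = True
else:
    bytes_is_immutable = False
```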

On Fri, 7 Aug 2009 09:43:54 pm Tal Einat wrote:
You can already do bitwise operations on bytes,
Are you sure about that? I tried looking it up in the docs, but python.org is down at the moment, so excuse me if I've missed something obvious. In Python 3.0:
Have I missed something?
I'm not speaking for the original poster, but for myself, not necessarily. While it would be nice to be able to flip bits directly, I'd be happy with bitwise operators to return new instances, as they currently do for ints. Something like: # not actual code
This could work for bit-flipping operations, although I'd welcome an easier to use API: # still not actual code
-- Steven D'Aprano

On Fri, Aug 7, 2009 at 11:04 AM, Eric Eisner<ede@mit.edu> wrote:
+1 from me. I'd say that it makes just as much sense to do bit operations on bytes or bytearrays as it does on integers. I've always felt a little bit dirty when I've abused ints as bitstrings in the past; byte strings seem better suited to this task, especially where input and output are also involved.
I'd say ValueError, since it's not really clear what 'mimic the behaviour of ints' means here. E.g., should bytes([1,2,3,4]) | bytes([16, 32, 64]) be equal to bytes([17, 34, 67, 4]) or bytes([1, 18, 35, 68])? It seems better to restrict the use to the cases where there's a single obvious interpretation.
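The two readings of an unequal-length OR can be made concrete with a short sketch. Both helper names are hypothetical; the only difference between them is which end of the shorter operand is padded with zero bytes.

```python
def or_left_aligned(a: bytes, b: bytes) -> bytes:
    """OR two byte strings, padding the shorter one with zeros on the right."""
    n = max(len(a), len(b))
    return bytes(x | y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


def or_right_aligned(a: bytes, b: bytes) -> bytes:
    """OR two byte strings, padding the shorter one with zeros on the left."""
    n = max(len(a), len(b))
    return bytes(x | y for x, y in zip(a.rjust(n, b"\0"), b.rjust(n, b"\0")))


left = or_left_aligned(bytes([1, 2, 3, 4]), bytes([16, 32, 64]))
right = or_right_aligned(bytes([1, 2, 3, 4]), bytes([16, 32, 64]))
```

Neither result is obviously the right one, which is the argument for raising ValueError instead.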
Agreed.
Agreed. To make sense of the shift operators you effectively have to give 'position' interpretations for the individual bits, and there's no single obvious way of doing this; for the plain bitwise operations this isn't necessary. One other question: if a has type bytes and b has type bytearray, what types would you propose that a & b and b & a should have? Mark

On Fri, 7 Aug 2009 11:19:20 pm Mark Dickinson wrote:
To me, the single obvious meaning of left- and right-shift is to shift to the left and the right :) E.g.
    b"abcd" >> 8 => b"abc"
    b"abcd" << 8 => b"abcd\0"
which would have the benefit of matching what ints already do:
I'm not sure what other "obvious" meanings you could give them. Have I missed something? -- Steven D'Aprano

On Fri, Aug 7, 2009 at 10:54 PM, Steven D'Aprano<steve@pearwood.info> wrote:
Yes, byte order. It is not at all "obvious" whether the lowest-order byte is on the left or on the right. Your interpretation is big-endian. But mathematically speaking, a little-endian interpretation is somewhat easier, because the value in byte number i corresponds to that value multiplied by 256**i. Another way to look at it is: is b'abc' supposed to be equal to b'\0abc' (big-endian) or b'abc\0' (little-endian)? I find ignoring trailing nulls more logical than ignoring leading nulls, since the indexes of the significant digits are the same in the little-endian case.
In the grander scheme of things, I worry that interpreting byte strings as integers and implementing bitwise operators on them is going to cause more confusion and isn't generally useful enough to warrant the extra code. I'd be okay with a standard API to transform a byte array into an integer and vice versa -- there you can be explicit about byte order and what to do about negative numbers. I can't remember right now if we already have such an API for arbitrary sizes -- the struct module only handles sizes 2, 4 and 8. I can hack it by going via a hex representation:
    import binascii
    i = 10**100
    b = bytes.fromhex(hex(i)[2:])
    j = int(binascii.hexlify(b), 16)
    assert j == i
but this is a pretty gross hack. Still, most likely faster than writing out a loop in Python.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
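The byte-order point can be checked with the int.from_bytes and int.to_bytes methods that exist in Python 3.2 and later (they were not available when this message was written). The sketch shows that the same bytes give different integers, that different null padding is ignored in each reading, and that a shift by 8 bits drops a different byte depending on the interpretation.

```python
data = b"abc"

big = int.from_bytes(data, "big")        # first byte is most significant
little = int.from_bytes(data, "little")  # first byte is least significant

# Leading nulls vanish under big-endian, trailing nulls under little-endian.
same_big = int.from_bytes(b"\0abc", "big") == big
same_little = int.from_bytes(b"abc\0", "little") == little

# A right shift by 8 bits drops the last byte in one reading, the first in the other.
shifted_big = (int.from_bytes(b"abcd", "big") >> 8).to_bytes(3, "big")
shifted_little = (int.from_bytes(b"abcd", "little") >> 8).to_bytes(3, "little")
```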

Guido van Rossum schrieb:
What about operations that don't require the bytes to be interpreted as anything else? The OP proposed bytes_a | bytes_b to mean bytes(b_a | b_b for (b_a, b_b) in zip(bytes_a, bytes_b)) except that (probably) an equal length would have to be asserted. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
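Written out as plain functions, the element-wise reading with an equal-length check could look like the sketch below. The names bytes_or and bytes_and are hypothetical stand-ins for the proposed | and & operators, and raising ValueError on a length mismatch is the option favoured earlier in the thread, not settled behaviour.

```python
def _check_lengths(a: bytes, b: bytes) -> None:
    if len(a) != len(b):
        raise ValueError("operands must have the same length")


def bytes_or(a: bytes, b: bytes) -> bytes:
    """Element-wise OR of two equal-length byte strings."""
    _check_lengths(a, b)
    return bytes(x | y for x, y in zip(a, b))


def bytes_and(a: bytes, b: bytes) -> bytes:
    """Element-wise AND of two equal-length byte strings."""
    _check_lengths(a, b)
    return bytes(x & y for x, y in zip(a, b))


GLORPED = b"\x00\x10\x00\x00"
flags = b"\x00\x04\x80\xc4"
was_glorped = any(bytes_and(flags, GLORPED))
flags = bytes_or(flags, GLORPED)
is_glorped = any(bytes_and(flags, GLORPED))
```

No byte order or signedness is involved anywhere in this version, since each byte is combined only with the byte at the same index.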

I think it's a very rarely used feature that is more likely to baffle the casual reader. The question of what to do with byte strings of unequal length is just the first issue that crops up. The (still legitimate) question why we would support | and & but not << and >> is another. It's a slippery slope... --Guido On Sat, Aug 8, 2009 at 3:09 PM, Georg Brandl<g.brandl@gmx.net> wrote:
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I think it's a very rarely used feature that is more likely to baffle the casual reader.
However there's currently no way to efficiently do bitwise operations en masse on raw bytes without converting to and from long integers, which doesn't seem very satisfactory. Refusing to provide this ability for the above reason sounds like purity beating practicality to me.
The question of what to do with byte strings of unequal length is just the first issue that crops up.
I would raise an exception.
Ambiguity due to byte order seems like a good enough reason not to implement shifts as operators. Maybe provide methods that specify a byte order explicitly? -- Greg

On Sun, 09 Aug 2009 10:41:55 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
It seems to me that bytes are a container type. How many builtin operators are there that operate on every element in any kind of container (other than queries, many of which already work on bytes)? It seems to me that what you really want is a high-performance array module of some flavor. Isn't that part of what NumPy provides? <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Sat, Aug 8, 2009 at 6:31 PM, Guido van Rossum<guido@python.org> wrote:
Indeed. But let me rephrase this. I don't know Python, I'm just starting with Py3k, and I see a bytes object. I don't know what a "byte string" is (and I even feel that the term is strange), but I understand that b"b3\xdd" is an array of three bytes, which, of course, is an array of 24 bits (unless I know that bytes are not always octets, but I think it's the same for this case). So, I think that providing bit semantics is a must for bytes objects. I don't care (or I don't know enough to know that this exists) about big/little endian, I just assume that this "just works" (it's the same case with line ending bytes: I may not know that different OSs finish lines with different marks). *I* (me as Facundo), don't really know enough to be able to propose "what could work and keep surprise to a minimum", but just wanted to present an "end user case" about this.
Yes. Bit arrays and integers suffer (suffer?) from the same issue as Unicode. An integer, and a Unicode character, are encoded into bits... and if you have bits, you need to know how to decode them to get your Unicode character, or your integer, back. Maybe we could use the same names? What about something like b"....".decode("little_endian__two's_complement") --> int? (please, very please, better encoding names) Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

On Sat, Aug 8, 2009 at 10:29 PM, Facundo Batista<facundobatista@gmail.com> wrote:
Sounds like a non-sequitur if anything.
From the perspective of "I don't know Python" you cannot expect to draw valid conclusions.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

For some background, I think the separation of text and data is a powerful idea, and now that I have my nice builtin object for data (and its mutable cousin), I want to be able to manipulate the raw data directly. Currently, all of the bitwise manipulations live in the int type (for what I like to think of as historical reasons). Currently, when manipulating a byte at a time, the bytes type does the right thing and shows the elements as ints, giving me access to these bitwise manipulations. However, builtin functionality to manipulate multi-byte chunks is currently lacking. In the grander scheme of things, this is what I would like to see addressed. The most general solution, which Guido mentions, and which I have also been thinking about, is a builtin conversion from bytes to ints that explicitly requires resolving ambiguities (eg endianness, negatives, data length). I personally use those pretty gross hacks to fake the functionality now, but they assume big-endian, two's complement etc. I proposed direct bitwise operators (~ & ^ |) for bytes because these seemed to be the subset of all bitwise manipulations that were (mostly) unambiguous for raw bytes. Even if python gains conversions between bytes and ints, I still think these unambiguous operations would be useful to have (especially ~, the most unambiguous among them). -Eric
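To illustrate why ~ is singled out as the least ambiguous case: inverting each byte needs no second operand, no length rule and no byte order. A minimal sketch, with invert as a hypothetical name standing in for the proposed operator:

```python
def invert(data: bytes) -> bytes:
    """Flip every bit of every byte; the result has the same length as the input."""
    return bytes(b ^ 0xFF for b in data)


inverted = invert(b"\x00\xff\x0f")

# For contrast, ~ on a plain int has no fixed width and yields a negative number.
int_inverted = ~0x0F
```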

On Sat, Aug 8, 2009 at 10:31 PM, Guido van Rossum<guido@python.org> wrote:
That would also be a welcome addition. It's been requested on bugs.python.org at least a couple of times[1][2], and the C code to do the conversions already exists (_PyLong_{As,From}ByteArray in longobject.c), so it wouldn't be too much work to implement. The main problem would be deciding exactly what the API should be and where to put it.
Mark
[1] http://bugs.python.org/issue1023290
[2] http://bugs.python.org/issue923643

Mark Dickinson wrote:
My suggestion would be to provide the relevant constructors as class methods on int, bytes and bytearray:
    bytes.from_int
    bytearray.from_int
    int.from_bytes
Alternatively, the int.from_bytes classmethod could be replaced with a "to_int" instance method on bytes and bytearray. The method signatures would need to closely resemble the C API. In particular, for the conversion from int to bytes being able to state a desired size would both allow detection of cases where the value is too large as well as proper padding of the two's complement sign bit.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Mon, Aug 10, 2009 at 19:30, Nick Coghlan<ncoghlan@gmail.com> wrote:
For completeness, any function converting from int to bytes needs to accept the arguments size, endianness, and two's complement handling. By default, size and two's complement could be inferred from the int's value. Any function converting from bytes to int needs to accept the arguments endianness and two's complement. The function cannot really make a reasonable assumption about either. Having this much to specify in bytes.from_int or bytes.to_int seems a little overwhelming (to me anyway). It is already more complicated than the analogous bytes.decode for strings. What about putting this in a support module, and putting each of these options in its own named method? Some dummy examples that may need better names:
    base256.encode_le(0x123456, size=5)  # returns b'\x56\x34\x12\x00\x00'
    base256.decode_le_positive(b'\xff\xce')  # returns 0xceff
-Eric
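The two dummy examples can be mocked up as thin wrappers to show what such named helpers would do. The base256 module does not exist; the function names come from the message, and the bodies are an assumption built on the int methods available in Python 3.2 and later.

```python
def encode_le(value: int, size: int) -> bytes:
    """Hypothetical base256.encode_le: unsigned, little-endian, fixed size."""
    return value.to_bytes(size, "little", signed=False)


def decode_le_positive(data: bytes) -> int:
    """Hypothetical base256.decode_le_positive: unsigned, little-endian."""
    return int.from_bytes(data, "little", signed=False)


encoded = encode_le(0x123456, size=5)
decoded = decode_le_positive(b"\xff\xce")
```

One named function per combination of byte order and signedness keeps each call short, at the cost of a larger set of names to remember.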

Eric Eisner wrote:
Keep in mind that Mark's suggestion isn't to completely reinvent the wheel - it is more about exposing an existing internal C API to Python code. Since that C API is provided by the int object, we should also consider the option of providing it in Python purely as methods of that object.
Bytes to int:
    bytes pointer: pointer to first byte to be converted
    size: number of bytes to be converted
    little_endian: flag indicating LSB is at offset 0
    signed: flag indicating whether or not the value is signed
For a Python version of this conversion, the size argument is unnecessary, reducing the required parameters to a bytes/bytearray/memoryview reference, an endianness marker and a 'signed' flag (to indicate whether the buffer contains an unsigned value or a signed two's complement value). One slight quirk of the C API that probably shouldn't be replicated is a size of 0 translating to an integer result of zero. For the Python API, passing in an empty byte sequence should trigger a ValueError.
In the C API, int to bytes takes the same parameters as the bytes to int conversion, but the meaning is slightly different.
Int to bytes:
    bytes pointer: pointer to first byte to be converted
    size: number of bytes in result
    little_endian: flag indicating to write LSB to offset 0
    signed: flag indicating negative value can be converted
In this case, the size matters even for a Python version as an OverflowError will be raised if the integer won't fit in the specified size and sufficient sign bits are added to pad the result out to the requested size otherwise.
That suggests to me the following signatures for the conversion functions (regardless of the names they might be given):
    int.from_bytes(data, *, little_endian=None, signed=True)
little_endian would become a three-valued parameter for the Python version: None = native; False = big-endian; True = little-endian. The three valued parameter is necessary since Python code can't get at the "IS_LITTLE_ENDIAN" macro that the PyLong code uses to determine the endianness of the system when calling the C API functions.
signed would just be an ordinary boolean flag
    int.to_bytes(data, *, size=0, little_endian=None, signed=True)
A size <= 0 would mean to produce as many bytes as are needed to represent the integer and no more. Otherwise it would represent the maximum number of bytes allowed in the response (raising OverflowError if the value won't fit). little_endian and signed would be interpreted as for the conversion from bytes to an integer
Sure, these could be moved into a module of their own, but I'm not sure what would be gained by doing so.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
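The sign padding and overflow behaviour described here can be tried with the methods that Python 3.2 and later provide. Note that those differ from the signatures sketched in this message: they take a byteorder string ('big' or 'little') and signed defaults to False. The snippet is only an illustration of the semantics, not of the API as proposed here.

```python
# A negative value is padded with sign bits up to the requested size.
minus_one = (-1).to_bytes(2, "big", signed=True)

# The same two bytes decode differently depending on the signed flag.
as_signed = int.from_bytes(b"\xff\xff", "big", signed=True)
as_unsigned = int.from_bytes(b"\xff\xff", "big", signed=False)

# A value that does not fit in the requested size raises OverflowError.
try:
    (256).to_bytes(1, "big")
except OverflowError:
    overflowed = True
else:
    overflowed = False
```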

Nick Coghlan <ncoghlan@...> writes:
int.to_bytes(data, *, size=0, little_endian=None, signed=True)
It would be better IMO if this method belonged to the target type, so that you can write `bytearray.from_int()` without going through an intermediate bytes object. What if `data` is equal to 0? Does it return an empty bytes object? It doesn't look dangerous to do so, so I'd say go for it.

Antoine Pitrou wrote:
Actually, that's just a typo after copying the from_bytes signature line when writing my message. int.to_bytes would just be a normal instance method. If the value was 0, then the result would be a single byte with the value zero (or all zeroes if a size was specified). As far as bytearray goes, my original post did suggest that approach (i.e. class methods on the target types), but that then gets tedious as every new type that wants to support conversion from an integer needs to either go through one of the existing types or else have its own C extension call the underlying C API directly. Something analogous to the file.readinto() method seems like a more appropriate solution for populating the mutable types. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Nick Coghlan wrote:
I'd suggest that int.to_bytes should take an optional destination parameter, and it should accept any object having a mutable buffer interface. That would keep all the int-related methods with the int class, without restricting the source or destination to any particular type. -- Greg
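A rough idea of what a destination parameter would do can be emulated in pure Python. The function write_int_into is hypothetical, and unlike a real implementation in C it still creates an intermediate bytes object before copying into the buffer.

```python
def write_int_into(value: int, dest, byteorder: str) -> None:
    """Fill the writable buffer `dest` with the unsigned encoding of `value`."""
    view = memoryview(dest).cast("B")  # treat the buffer as raw bytes
    view[:] = value.to_bytes(view.nbytes, byteorder)


buf = bytearray(4)
write_int_into(0x0480C4, buf, "big")
```

The size of the result is taken from the destination buffer, so no separate size argument is needed in this variant.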

Nick Coghlan wrote:
Why? This seems like a perfectly logical limiting case to me.
I don't like the idea of a three-valued boolean. I also don't like boolean parameters whose sense is arbitrary (why is it called "little_endian" and not "big_endian", and how do I remember which convention was chosen?). My suggestion would be to use the same characters that the struct module uses to represent endianness (">" for big-endian, "<" for little-endian, etc.) -- Greg
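For readers who have not used them, the struct module's byte-order characters work as in this short sketch, shown here for a fixed 4-byte unsigned field:

```python
import struct

big = struct.pack(">I", 0x000480C4)     # ">" selects big-endian
little = struct.pack("<I", 0x000480C4)  # "<" selects little-endian

(value,) = struct.unpack(">I", big)     # unpack always returns a tuple
```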

On Mon, Aug 10, 2009 at 2:13 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
Alexandre Vassalotti has posted a patch at http://bugs.python.org/issue1023290 that implements methods very much like the ones that Nick describes. Mark

This patch seems very complete, with only the API to be hammered out. Here is my summary of some options:
method names:
    patch behavior: int.as_bytes / int.frombytes
    int.as_bytes / int.from_bytes
    int.asbytes / int.frombytes
Endianness:
    patch behavior: default flag: little_endian=False
    byteorder option accepting 'big' or 'little', this can also accept sys.byteorder
sign:
    patch behavior: default flag: signed=True
    maybe unsigned as the default?
byte length:
    patch behavior: fixed_length=None
    other names: length, bytelength
As for my own opinions: I think the method names should have consistent underscore usage. I think it is important to have all of big, little, and native as byteorder options, but I would be against having native as the default. I think it is unclean for core functionality to be platform dependent (there may be some examples of this that I'm not thinking of though). One option would be to have the defaults of int.as_bytes mirror the hex builtin function, eg big endian and unsigned. This way it could be more consistent with related functionality already in the core. Thoughts? -Eric On Sat, Aug 15, 2009 at 20:23, Mark Dickinson<dickinsm@gmail.com> wrote:

Eric Eisner wrote:
Having the byte conversion mirror the hex builtin is probably the least arbitrary style guide we could come up with for the defaults. The other option would be to not *have* defaults for either of these settings and force the programmer to make a deliberate choice. Then once it has been out in the field for a while and we have evidence about the way people use it, add sensible defaults in the following release. I would prefer either of those two options to attempting to just guess what the appropriate defaults would be in the absence of a wide selection of use cases. I also like the idea of using sys.byteorder as the model for the byteorder selection option (i.e. byteorder=sys.byteorder to select native layout and 'big'/'little' to choose a specific one) I think Mark also makes a good point that in cases where the size doesn't matter, marshal or pickle is probably a better choice than this direct conversion so +0 on requiring an explicit size on the conversion to bytes. Again, this is a case where *adding* a default later would be easy, but changing a default or removing a variable size feature would be difficult. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

[Oops, I forgot to CC python-ideas. Sorry Eric for the double post.] On Sat, Aug 15, 2009 at 8:13 AM, Eric Eisner<ede@mit.edu> wrote:
In my patch, I chose to use the name 'as_bytes' because it was consistent with 'float.as_integer_ratio'. Similarly, I chose the name 'frombytes' because it was consistent with 'float.fromhex'. I don't mind the inconsistent use of the underscore in the names, but I admit there is room for improvement. So, what do you think of `int.frombytes` and `int.tobytes`?
I like the byteorder option better. I believe the byteorder option shouldn't default to the native byte order, however. As you mentioned, it would be a bad choice to encourage the default behaviour to be platform-dependent. And since the primary purpose of the API is long serialization, it would be short-sighted to choose the option that cannot be used for serialization as the default. Whether it should default to 'little' or 'big' is pretty much an arbitrary choice. In my patch, I chose to default to big-endian since it is the standard network byte order. But maybe the option should default to little-endian instead since it is more widely used. In addition, the patch is slightly more efficient with little-endian.
Either is fine by me. The advantage with 'signed' as the default is 'signed' works with all longs (and not only with non-negative ones).
I still like `fixed_length` better than proposed alternatives. The name `fixed_length` makes it clear that the returned object has a fixed and constant length. And, I find `fixed_length=None` is more telling than `length=None`. -- Alexandre

On Sat, Aug 15, 2009 at 4:45 PM, Alexandre Vassalotti<alexandre@peadrop.com> wrote:
Given all this, it sounds like byteorder should be a required argument. 'In the face of ambiguity... ' and all that. As Nick pointed out, we can always add a default later when a larger set of use-cases has emerged. You say that the 'primary purpose of the API is long serialization'. I'd argue that that's not quite true. That is, I see two separate uses: (1) fixed-size conversions: e.g., interpreting a three-byte sequence as an integer for the purposes of bit operations, or converting an int generated by random.getrandbits(k) to a random byte sequence. (See http://bugs.python.org/msg69285 and http://bugs.python.org/msg54262.) (2) Provide a primitive operation that's useful for serialization protocols. Here I'd guess that the details of e.g., how a negative integer is serialized would vary from protocol to protocol, so that the serialization code would in most cases still end up having to specify the fixed_length argument. I *don't* see int.tobytes and int.frombytes (or whatever the names turn out to be) as providing integer serialization by themselves. There's no need for this, since pickle and marshal already do this job. Incidentally, the commenters in http://bugs.python.org/issue467384 have quite a lot to say on this subject. It's on this basis that I'm suggesting that the size argument should be required for int.tobytes.
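Both fixed-size uses from point (1) can be sketched with the methods available in Python 3.2 and later. The helper name random_bytes is made up for illustration; the byte order is arbitrary here because the bits are random anyway.

```python
import random


def random_bytes(n: int) -> bytes:
    """Turn 8*n random bits into a byte string of exactly n bytes (n >= 1)."""
    return random.getrandbits(8 * n).to_bytes(n, "big")


token = random_bytes(3)

# Interpreting a three-byte sequence as an integer for bit operations.
field = int.from_bytes(b"\x01\x02\x03", "big")
low_nibble = field & 0x0F
```

In both cases the size is known in advance, which fits the argument for making it a required parameter.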
Agreed. Of course, this advantage disappears if the size argument is mandatory. I don't have any strong opinions about the method and parameter names, so I'll keep quiet on that subject. :) Mark

On Sat, Aug 15, 2009 at 12:25 PM, Mark Dickinson<dickinsm@gmail.com> wrote:
And you are totally right. Honestly, the only reason I was thinking about long serialization is because I have my hand full with pickle guts presently. :-) -- Alexandre

On Sat, Aug 15, 2009 at 5:48 PM, Alexandre Vassalotti<alexandre@peadrop.com> wrote:
Eh? But I came here for an argument! Isn't this room 12? A couple of other things: If these additions to int go in, then presumably the _PyLong_AsBytes and _PyLong_FromBytes functions should be documented and made public (and have their leading underscores removed, too). Those functions have been stable for a good while, and are well-used within the Python source; I think they're robust enough for public consumption. There may be some additional argument validation required; I'll take a look at this. In the issue tracker, Josiah Carlson asked about the possibility of backporting to 2.7. I can't see any problem with this, though there would be some small extra work involved in making things work for int as well as long. Does anyone else see any issues with this? Mark

Mark Dickinson wrote:
What would we be converting them to in that case? 2.x strings? (I don't have a problem with that, just pointing out there may be some additional work due to changing the target type). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Sat, Aug 15, 2009 at 2:42 PM, Mark Dickinson<dickinsm@gmail.com> wrote:
Oh, oh, I am sorry. This is agreement. You want 12A, next door. =)
You are referring to _PyLong_FromByteArray and _PyLong_AsByteArray, right? -- Alexandre

[Nick Coghlan]
Yes, the 2.x 'str' type. I haven't thought about whether this could cause 2-to-3 translation problems for people using this in 2.x. (I don't immediately see why it would.) Might there be political reasons not to backport this to 2.x? I seem to recall it being suggested at the PyCon language summit that we should consider making new features 3.x only, but I don't entirely remember what the rationale for this was. [Alexandre Vassalotti]
Whoops! Yes, that's what I meant. Thanks. :) Sorry for my silence on the issue tracker. I'll try to find time to look at your new patch this weekend. Mark

Mark Dickinson wrote:
I wasn't there, but there were rumbles about having a few nice carrots in 3.x to help people think it was worthwhile to switch. That said, I think the final consensus was that new features *must* go into 3.x, but if they can be backported and someone is happy to do the work then backporting isn't an issue. And in this case, the fact that the methods will produce/consume strings in 2.x shouldn't cause any more problems than any other cases that involve 2.x storing binary data in text strings. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

(I'm repeating some of the comments already made in the bug-tracker; as Antoine pointed out, discussion should probably remain here until the API is settled.) On Mon, Aug 10, 2009 at 2:13 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
Sounds good to me. I'm not sure about the 'signed=True' default; to me, a default of unsigned seems more natural. But this is bikeshedding, and I'd happily accept either default. I agree with other posters that there seems little reason not to accept the empty string. It's a natural end-case for unsigned input; whether it's natural for signed input (where there should really be at least one 'sign bit', and hence at least one byte) is arguable, but I can't see the harm in accepting it.
I'm not convinced that it's valuable to have a variable-size version of this; I'd make size a required argument. The problem with the variable size version is that the choice of byte-length for the output for a given integer input is a little bit arbitrary. For a particular requirement (producing code to conform with some existing serialization protocol, for example) it seems likely that the choice Python makes will disagree with what's required by that protocol, so that size still has to be given explicitly. On the other hand, if a user just wants a quick and easy way to serialize ints, without caring about the exact form of the serialization, then there are a number of solutions already available within Python. +1 on raising OverflowError for out-of-range inputs, instead of wrapping modulo 2**whatever. This also fits with the way that the struct module currently behaves. Does anyone see other use-cases for variable-size conversion? [Greg Ewing]
How about a parameter byteorder=None, accepting values 'big' and 'little'? Then one could use byteorder=sys.byteorder to explicitly specify native byteorder. Mark
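With a string-valued byteorder argument, native order is requested by passing sys.byteorder explicitly. A small sketch using the methods from Python 3.2 and later, cross-checked against struct's "=" (native order, standard size) format:

```python
import struct
import sys

raw = b"\x00\x04\x80\xc4"

# sys.byteorder is the string 'little' or 'big' for the running platform.
native = int.from_bytes(raw, sys.byteorder)

# struct's "=" prefix also means native byte order, so the two must agree.
(expected,) = struct.unpack("=I", raw)
```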

On Sat, Aug 15, 2009 at 8:14 AM, Mark Dickinson<dickinsm@gmail.com> wrote:
Well, the only use-case in the standard library I found (i.e., simplifying encode_long() and decode_long() in pickle.py) needed the variable-length version. However, contrary to what I originally thought, the variable-length version is not difficult to emulate using `int.bit_length()`. For example, with my patch I can rewrite:
    def encode_long(x):
        if x == 0:
            return b""
        return x.as_bytes(little_endian=True)
as:
    def encode_long(x):
        if x == 0:
            return b""
        nbytes = (x.bit_length() >> 3) + 1
        result = x.as_bytes(nbytes, little_endian=True)
        if x < 0 and nbytes > 1:
            if result[-1] == 0xff and (result[-2] & 0x80) != 0:
                result = result[:-1]
        return result
I usually hate with passion APIs that require you to know the length of the result in advance. But this doesn't look bad. The only use-case for the variable-length version I have is the encode_long() function in pickle.py. In addition, it sounds reasonable to leave the duty of long serialization to pickle. So, +1 from me. -- Alexandre
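The same emulation can be run today by swapping the patch's as_bytes for int.to_bytes from Python 3.2 and later. This is an adaptation under that assumption, not the patch itself; decode_long is added only so the round trip can be checked.

```python
def encode_long(x: int) -> bytes:
    """Encode x as a minimal little-endian two's complement byte string."""
    if x == 0:
        return b""
    # Upper bound on the size: always leaves room for the sign bit.
    nbytes = (x.bit_length() >> 3) + 1
    result = x.to_bytes(nbytes, "little", signed=True)
    # For some negative values the bound is one byte too generous: drop a
    # trailing 0xff when the byte before it already carries the sign bit.
    if x < 0 and nbytes > 1:
        if result[-1] == 0xFF and (result[-2] & 0x80) != 0:
            result = result[:-1]
    return result


def decode_long(data: bytes) -> int:
    """Inverse of encode_long; an empty byte string decodes to 0."""
    return int.from_bytes(data, "little", signed=True)
```

For example, encode_long(255) needs two bytes because a single 0xff byte would read back as -1 in two's complement.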

On Sat, Aug 8, 2009 at 10:31 PM, Guido van Rossum<guido@python.org> wrote:
The first part also doesn't work if hex(i) has odd length. [py3k]:
I think the fact that it's non-trivial to get this right first time is further evidence that it would be useful to have built-in int <-> bytes conversions somewhere. Mark
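The odd-length failure is easy to reproduce, and padding the hex digits to an even count is one way around it. The helper name int_to_bytes_via_hex is made up for illustration and only handles non-negative integers.

```python
import binascii


def int_to_bytes_via_hex(i: int) -> bytes:
    """Big-endian bytes of a non-negative int, going through its hex digits."""
    digits = hex(i)[2:]
    if len(digits) % 2:
        digits = "0" + digits  # bytes.fromhex needs two digits per byte
    return bytes.fromhex(digits)


try:
    bytes.fromhex(hex(1)[2:])  # hex(1)[2:] is '1', a single digit
except ValueError:
    odd_length_fails = True
else:
    odd_length_fails = False

round_trip = int(binascii.hexlify(int_to_bytes_via_hex(10**100)), 16)
```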

Mark Dickinson wrote:
Are there going to possibly be other conversions to bytes and back? (float, string, struct, ...) It seems to me the type conversion to and from bytes should be on the encoded non-byte type, and other types including user created ones could follow that pattern. That may allow bytes to work with any type that has the required special methods to do the conversions. Then most of the methods on bytes would be for manipulating bytes in various ways. The constructor for the int type already does base and string conversions, extending it to bytes seems like it would be natural. int(bytes) # just like int(string) bytes = bytes(int) # calls int.__to_bytes__() to do the actual work. Ron

The struct module already handles all those -- long ints are pretty much the only common type that it doesn't cover, because it only deals with fixed-length values. On Mon, Aug 10, 2009 at 10:16 AM, Ron Adam<rrr@ronadam.com> wrote:
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

On Tue, Aug 11, 2009 at 02:16, Ron Adam<rrr@ronadam.com> wrote:
The constructor for bytes currently accepts a single int as an argument, producing that many zero bytes. As far as I can tell this behavior is undocumented, but I don't know if that can be changed easily... The first thing I did when I tried out the bytes type in 3.x was to try to convert an int to a byte via the constructor, and the current behavior surprised me. -Eric
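A short illustration of the behavior described here, as it works in Python 3: an int argument is a length, while an iterable of ints gives the byte values.

```python
# An int argument is a length: three zero bytes.
zeros = bytes(3)

# To get a single byte with value 3, pass an iterable of ints.
single = bytes([3])

# To get the decimal digits as bytes, encode the string form.
text = str(3).encode("ascii")
```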

Eric Eisner wrote:
It seems that bytes are quite different here in 2.6 and 3.0. In 2.6, bytes is an alias for str, so the bytes constructor just converts the int to a string.

    ra@Gutsy:~$ python
    Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

    class str(basestring)
     |  str(object) -> string
     |
     |  Return a nice string representation of the object.
     |  If the argument is a string, the return value is the same object.
     |  (clipped)

Ron

Eric Eisner schrieb:
It is documented, see http://docs.python.org/py3k/library/functions#bytearray. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.

Eric Eisner wrote:
You can already do bitwise operations on bytes, and retrieving the relevant byte and using single-byte operations on it is simple enough. As I see it, the major underlying problem here is that byte-arrays are immutable, since what you really want is to be able to change a single byte of the array in-place. In general, Python's byte arrays and strings are very ill suited for dealing with data which isn't byte-based, and definitely horrible at building or modifying data streams. I recommend using an external library for working with data structures and/or data streams. Construct[1] is my personal favorite, since it's especially Pythonic and easy to use. In any case, if you wish to propose a mutable byte/bit array which supports array-wise binary operators, I can say I would certainly be glad to have such a class at my disposal. - Tal [1] http://construct.wikispaces.com/

On Fri, Aug 07, 2009 at 02:43:54PM +0300, Tal Einat wrote:
Like this http://pypi.python.org/pypi/BitVector/ ? Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

Tal Einat wrote:
Careful with your terminology here. In Python 3.x, the following two types both exist:

    bytes: immutable sequence of integers between 0 and 255 inclusive
    bytearray: similar to bytes, but mutable

The operations ~x, x&y, x|y and x^y make sense for both bytes and bytearray. The augmented assignment operators only make sense for bytearray. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, 7 Aug 2009 09:43:54 pm Tal Einat wrote:
You can already do bitwise operations on bytes,
Are you sure about that? I tried looking it up in the docs, but python.org is down at the moment, so excuse me if I've missed something obvious. In Python 3.0:
Have I missed something?
I'm not speaking for the original poster, but for myself, not necessarily. While it would be nice to be able to flip bits directly, I'd be happy with bitwise operators to return new instances, as they currently do for ints. Something like: # not actual code
This could work for bit-flipping operations, although I'd welcome an easier to use API: # still not actual code
-- Steven D'Aprano

On Fri, Aug 7, 2009 at 11:04 AM, Eric Eisner<ede@mit.edu> wrote:
+1 from me. I'd say that it makes just as much sense to do bit operations on bytes or bytearrays as it does on integers. I've always felt a little bit dirty when I've abused ints as bitstrings in the past; byte strings seem better suited to this task, especially where input and output are also involved.
I'd say ValueError, since it's not really clear what 'mimic the behaviour of ints' means here. E.g., should bytes([1,2,3,4]) | bytes([16, 32, 64]) be equal to bytes([17, 34, 67, 4]) or bytes([1, 18, 35, 68])? It seems better to restrict the use to the cases where there's a single obvious interpretation.
Agreed.
Agreed. To make sense of the shift operators you effectively have to give 'position' interpretations for the individual bits, and there's no single obvious way of doing this; for the plain bitwise operations this isn't necessary. One other question: if a has type bytes and b has type bytearray, what types would you propose that a & b and b & a should have? Mark
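The two readings of an unequal-length OR can be spelled out by padding the shorter operand on one side or the other. This is only a sketch with hypothetical helper names, meant to show why no single answer is obvious.

```python
def or_left_aligned(a, b):
    """Pad the shorter operand with zero bytes on the right."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x | y for x, y in zip(a, b))


def or_right_aligned(a, b):
    """Pad the shorter operand with zero bytes on the left."""
    n = max(len(a), len(b))
    a, b = a.rjust(n, b"\0"), b.rjust(n, b"\0")
    return bytes(x | y for x, y in zip(a, b))


left = or_left_aligned(bytes([1, 2, 3, 4]), bytes([16, 32, 64]))
right = or_right_aligned(bytes([1, 2, 3, 4]), bytes([16, 32, 64]))
# list(left)  is [17, 34, 67, 4]
# list(right) is [1, 18, 35, 68]
```

Raising ValueError for unequal lengths avoids having to pick one of these.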

One other question: if a has type bytes and b has type bytearray, what types would you propose that a & b and b & a should have?
For consistency they should probably do what they currently do with concatenation: the type of the first expression is the type of the result:
-Eric
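The concatenation behavior referred to here can be checked directly in Python 3; the result takes the type of the left operand.

```python
a = b"\x0f\x0f"
b = bytearray(b"\xf0\xf0")

bytes_first = a + b        # left operand is bytes
bytearray_first = b + a    # left operand is bytearray

print(type(bytes_first).__name__, type(bytearray_first).__name__)
# prints: bytes bytearray
```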

On Fri, 7 Aug 2009 11:19:20 pm Mark Dickinson wrote:
To me, the single obvious meaning of left- and right-shift is to shift to the left and the right :) E.g.

    b"abcd" >> 8 => "abc"
    b"abcd" << 8 => "abcd\0"

which would have the benefit of matching what ints already do:
I'm not sure what other "obvious" meanings you could give them. Have I missed something? -- Steven D'Aprano

On Fri, Aug 7, 2009 at 10:54 PM, Steven D'Aprano<steve@pearwood.info> wrote:
Yes, byte order. It is not at all "obvious" whether the lowest-order byte is on the left or on the right. Your interpretation is big-endian. But mathematically speaking, a little-endian interpretation is somewhat easier, because the value in byte number i corresponds to that value multiplied by 256**i. Another way to look at it: is b'abc' supposed to be equal to b'\0abc' (big-endian) or b'abc\0' (little-endian)? I find ignoring trailing nulls more logical than ignoring leading nulls, since the indexes of the significant digits are the same in the little-endian case.

In the grander scheme of things, I worry that interpreting byte strings as integers and implementing bitwise operators on them is going to cause more confusion and isn't generally useful enough to warrant the extra code. I'd be okay with a standard API to transform a byte array into an integer and vice versa -- there you can be explicit about byte order and what to do about negative numbers. I can't remember right now if we already have such an API for arbitrary sizes -- the struct module only handles sizes 2, 4 and 8. I can hack it by going via a hex representation:

    import binascii
    i = 10**100
    b = bytes.fromhex(hex(i)[2:])
    j = int(binascii.hexlify(b), 16)
    assert j == i

but this is a pretty gross hack. Still, most likely faster than writing out a loop in Python. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
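The byte-order ambiguity can be made concrete by routing a shift through an integer with an explicit byte order. This is a minimal sketch assuming current Python 3's `int.from_bytes` and `int.to_bytes`; `shift_right` is a hypothetical helper that keeps the length fixed.

```python
def shift_right(data, nbits, byteorder):
    """Shift a byte string right by nbits under the given byte order."""
    value = int.from_bytes(data, byteorder) >> nbits
    return value.to_bytes(len(data), byteorder)


big = shift_right(b"abcd", 8, "big")        # b"\x00abc"
little = shift_right(b"abcd", 8, "little")  # b"bcd\x00"
```

The same input and the same shift count give two different byte strings, which is why an operator with no byte-order argument would have to pick a convention silently.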

Guido van Rossum schrieb:
What about operations that don't require the bytes to be interpreted as anything else? The OP proposed bytes_a | bytes_b to mean bytes(b_a | b_b for (b_a, b_b) in zip(bytes_a, bytes_b)) except that (probably) an equal length would have to be asserted. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
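The elementwise reading described here, with the equal-length check made explicit, might look like the sketch below. The helper names (`bytes_or`, `bytes_and`, `bytes_invert`) are hypothetical; the example values reuse the "glorped" flag field from the start of the thread.

```python
def _same_length(a, b):
    if len(a) != len(b):
        raise ValueError("operands must have the same length")


def bytes_or(a, b):
    _same_length(a, b)
    return bytes(x | y for x, y in zip(a, b))


def bytes_and(a, b):
    _same_length(a, b)
    return bytes(x & y for x, y in zip(a, b))


def bytes_invert(a):
    # ~x on an int is negative, so mask each byte back into range.
    return bytes(x ^ 0xFF for x in a)


GLORPED = b"\x00\x10\x00\x00"
flags = bytes.fromhex("000480c4")
glorped_flags = bytes_or(flags, GLORPED)   # 0x001480c4
```

None of these functions needs to interpret the data as a number, so byte order never comes into it.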

I think it's a very rarely used feature that is more likely to baffle the casual reader. The question of what to do with byte strings of unequal length is just the first issue that crops up. The (still legitimate) question why we would support | and & but not << and >> is another. It's a slippery slope... --Guido On Sat, Aug 8, 2009 at 3:09 PM, Georg Brandl<g.brandl@gmx.net> wrote:
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I think it's a very rarely used feature that is more likely to baffle the casual reader.
However there's currently no way to efficiently do bitwise operations en masse on raw bytes without converting to and from long integers, which doesn't seem very satisfactory. Refusing to provide this ability for the above reason sounds like purity beating practicality to me.
The question of what to do with byte strings of unequal length is just the first issue that crops up.
I would raise an exception.
Ambiguity due to byte order seems like a good enough reason not to implement shifts as operators. Maybe provide methods that specify a byte order explicitly? -- Greg

On Sun, 09 Aug 2009 10:41:55 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
It seems to me that bytes are a container type. How many builtin operators are there that operate on every element in any kind of container (other than queries, many of which already work on bytes)? It seems to me that what you really want is a high-performance array module of some flavor. Isn't that part of what NumPy provides? <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Sat, Aug 8, 2009 at 6:31 PM, Guido van Rossum<guido@python.org> wrote:
Indeed. But let me rephrase this. I don't know Python, I'm just starting with Py3k, and I see a bytes object. I don't know what a "byte string" is (and I even feel that the term is strange), but I understand that b"b3\xdd" is an array of three bytes, which, of course, is an array of 24 bits (unless I know that bytes are not always octets, but I think it's the same for this case). So, I think that providing bit semantics is a must for bytes objects. I don't care (or I don't know enough to know that this exists) about big/little endian, I just assume that this "just works" (it's the same case with line ending bytes: I may not know that different OSs finish lines with different marks). *I* (me as Facundo), don't really know enough to be able to propose "what could work and keep surprise to a minimum", but just wanted to present an "end user case" about this.
Yes. Bit arrays and integers suffer (suffer?) from the same issue as Unicode. An integer, and a Unicode character, are encoded into bits... and if you have bits, you need to know how to decode them to get again your Unicode character, or your integer. Maybe we could use the same names? What about something like b"....". decode("little_endian__two's_complement") --> int? (please, very please, better encoding names) Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

On Sat, Aug 8, 2009 at 10:29 PM, Facundo Batista<facundobatista@gmail.com> wrote:
Sounds like a non-sequitur if anything.
From the perspective of "I don't know Python" you cannot expect to draw valid conclusions.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

For some background, I think the separation of text and data is a powerful idea, and now that I have my nice builtin object for data (and its mutable cousin), I want to be able to manipulate the raw data directly. Currently, all of the bitwise manipulations live in the int type (for what I like to think of as historical reasons). When manipulating a byte at a time, the bytes type does the right thing and shows the elements as ints, giving me access to these bitwise manipulations. However, builtin functionality to manipulate multi-byte chunks is currently lacking. In the grander scheme of things, this is what I would like to see addressed. The most general solution, which Guido mentions, and which I have also been thinking about, is a builtin conversion from bytes to ints that explicitly requires resolving ambiguities (e.g. endianness, negatives, data length). I personally use those pretty gross hacks to fake the functionality now, but they assume big-endian, two's complement etc. I proposed direct bitwise operators (~ & ^ |) for bytes because these seemed to be the subset of all bitwise manipulations that were (mostly) unambiguous for raw bytes. Even if Python gains conversions between bytes and ints, I still think these unambiguous operations would be useful to have (especially ~, the most unambiguous among them). -Eric

On Sat, Aug 8, 2009 at 10:31 PM, Guido van Rossum<guido@python.org> wrote:
That would also be a welcome addition. It's been requested on bugs.python.org at least a couple of times[1][2], and the C code to do the conversions already exists (_PyLong_{As,From}ByteArray in longobject.c), so it wouldn't be too much work to implement. The main problem would be deciding exactly what the API should be and where to put it. Mark [1] http://bugs.python.org/issue1023290 [2] http://bugs.python.org/issue923643

Mark Dickinson wrote:
My suggestion would be to provide the relevant constructors as class methods on int, bytes and bytearray:

    bytes.from_int
    bytearray.from_int
    int.from_bytes

Alternatively, the int.from_bytes classmethod could be replaced with a "to_int" instance method on bytes and bytearray. The method signatures would need to closely resemble the C API. In particular, for the conversion from int to bytes being able to state a desired size would both allow detection of cases where the value is too large as well as proper padding of the two's complement sign bit. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Mon, Aug 10, 2009 at 19:30, Nick Coghlan<ncoghlan@gmail.com> wrote:
For completeness, any function converting from int to bytes needs to accept the arguments size, endianness, and two's complement handling. By default, size and two's complement could be inferred from the int's value. Any function converting from bytes to int needs to accept the arguments endianness and two's complement. The function cannot really make a reasonable assumption about either. Having this much to specify in bytes.from_int or bytes.to_int seems a little overwhelming (to me anyway). It is already more complicated than the analogous bytes.decode for strings. What about putting this in a support module, and putting each of these options in its own named method? Some dummy examples that may need better names:

    base256.encode_le(0x123456, size=5)      # returns b'\x56\x34\x12\x00\x00'
    base256.decode_le_positive(b'\xff\xce')  # returns 0xceff

-Eric
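The two dummy examples can be written as plain functions on top of current Python 3's `int.to_bytes` and `int.from_bytes`. This is a sketch only: the `base256` module is hypothetical, so the functions are defined at top level here, and the unsigned handling in `encode_le` is an assumption.

```python
def encode_le(value, size):
    """Little-endian, unsigned encoding padded to exactly `size` bytes."""
    return value.to_bytes(size, "little", signed=False)


def decode_le_positive(data):
    """Little-endian decoding that never produces a negative result."""
    return int.from_bytes(data, "little", signed=False)


encoded = encode_le(0x123456, size=5)         # b'\x56\x34\x12\x00\x00'
decoded = decode_le_positive(b"\xff\xce")     # 0xceff
```

One function per combination of byte order and signedness multiplies quickly, which is part of the case for keyword arguments on a single method instead.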

Eric Eisner wrote:
Keep in mind that Mark's suggestion isn't to completely reinvent the wheel - it is more about exposing an existing internal C API to Python code. Since that C API is provided by the int object, we should also consider the option of providing it in Python purely as methods of that object.

Bytes to int:

    bytes pointer: pointer to first byte to be converted
    size: number of bytes to be converted
    little_endian: flag indicating LSB is at offset 0
    signed: flag indicating whether or not the value is signed

For a Python version of this conversion, the size argument is unnecessary, reducing the required parameters to a bytes/bytearray/memoryview reference, an endianness marker and a 'signed' flag (to indicate whether the buffer contains an unsigned value or a signed two's complement value). One slight quirk of the C API that probably shouldn't be replicated is a size of 0 translating to an integer result of zero. For the Python API, passing in an empty byte sequence should trigger a ValueError.

In the C API, int to bytes takes the same parameters as the bytes to int conversion, but the meaning is slightly different.

Int to bytes:

    bytes pointer: pointer to first byte to be converted
    size: number of bytes in result
    little_endian: flag indicating to write LSB to offset 0
    signed: flag indicating negative value can be converted

In this case, the size matters even for a Python version as an OverflowError will be raised if the integer won't fit in the specified size and sufficient sign bits are added to pad the result out to the requested size otherwise.

That suggests to me the following signatures for the conversion functions (regardless of the names they might be given):

    int.from_bytes(data, *, little_endian=None, signed=True)

little_endian would become a three-valued parameter for the Python version: None = native; False = big-endian; True = little-endian. The three-valued parameter is necessary since Python code can't get at the "IS_LITTLE_ENDIAN" macro that the PyLong code uses to determine the endianness of the system when calling the C API functions. signed would just be an ordinary boolean flag.

    int.to_bytes(data, *, size=0, little_endian=None, signed=True)

A size <= 0 would mean to produce as many bytes as are needed to represent the integer and no more. Otherwise it would represent the maximum number of bytes allowed in the response (raising OverflowError if the value won't fit). little_endian and signed would be interpreted as for the conversion from bytes to an integer.

Sure, these could be moved into a module of their own, but I'm not sure what would be gained by doing so. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
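The overflow and sign-padding behavior described above can be observed with the `int.to_bytes` method of current Python 3. Its signature differs from the one sketched in this message (it takes a `byteorder` string and defaults to `signed=False`), so the block below illustrates the semantics rather than the proposed API; `fits` is a hypothetical helper.

```python
def fits(value, size, signed):
    """Return True if value can be encoded in `size` bytes."""
    try:
        value.to_bytes(size, "big", signed=signed)
    except OverflowError:
        return False
    return True


# Sign bits are replicated to pad a negative value out to the size.
padded = (-2).to_bytes(4, "big", signed=True)   # b'\xff\xff\xff\xfe'

# The method that shipped accepts an empty sequence and returns 0.
empty = int.from_bytes(b"", "big")
```

A negative value with `signed=False` also raises OverflowError, so the same check covers both range and sign errors.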

Nick Coghlan <ncoghlan@...> writes:
int.to_bytes(data, *, size=0, little_endian=None, signed=True)
It would be better IMO if this method belonged to the target type, so that you can write `bytearray.from_int()` without going through an intermediate bytes object. What if `data` is equal to 0? Does it return an empty bytes object? It doesn't look dangerous to do so, so I'd say go for it.

Antoine Pitrou wrote:
Actually, that's just a typo after copying the from_bytes signature line when writing my message. int.to_bytes would just be a normal instance method. If the value was 0, then the result would be a single byte with the value zero (or all zeroes if a size was specified). As far as bytearray goes, my original post did suggest that approach (i.e. class methods on the target types), but that then gets tedious as every new type that wants to support conversion from an integer needs to either go through one of the existing types or else have its own C extension call the underlying C API directly. Something analogous to the file.readinto() method seems like a more appropriate solution for populating the mutable types. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Nick Coghlan wrote:
I'd suggest that int.to_bytes should take an optional destination parameter, and it should accept any object having a mutable buffer interface. That would keep all the int-related methods with the int class, without restricting the source or destination to any particular type. -- Greg

Nick Coghlan wrote:
Why? This seems like a perfectly logical limiting case to me.
I don't like the idea of a three-valued boolean. I also don't like boolean parameters whose sense is arbitrary (why is it called "little_endian" and not "big_endian", and how do I remember which convention was chosen?) My suggestion would be to use the same characters that the struct module uses to represent endianness (">" for big-endian, "<" for little-endian, etc.) -- Greg

On Mon, Aug 10, 2009 at 2:13 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
Alexandre Vassalotti has posted a patch at http://bugs.python.org/issue1023290 that implements methods very much like the ones that Nick describes. Mark

This patch seems very complete, with only the API to be hammered out. Here is my summary of some options:

method names:
    patch behavior: int.as_bytes / int.frombytes
    int.as_bytes / int.from_bytes
    int.asbytes / int.frombytes

Endianness:
    patch behavior: default flag: little_endian=False
    byteorder option accepting 'big' or 'little'; this can also accept sys.byteorder

sign:
    patch behavior: default flag: signed=True
    maybe unsigned as the default?

byte length:
    patch behavior: fixed_length=None
    other names: length, bytelength

As for my own opinions: I think the method names should have consistent underscore usage. I think it is important to have all of big, little, and native as byteorder options, but I would be against having native as the default. I think it is unclean for core functionality to be platform dependent (there may be some examples of this that I'm not thinking of though). One option would be to have the defaults of int.as_bytes mirror the hex builtin function, e.g. big endian and unsigned. This way it could be more consistent with related functionality already in the core. Thoughts? -Eric On Sat, Aug 15, 2009 at 20:23, Mark Dickinson<dickinsm@gmail.com> wrote:

Eric Eisner wrote:
Having the byte conversion mirror the hex builtin is probably the least arbitrary style guide we could come up with for the defaults. The other option would be to not *have* defaults for either of these settings and force the programmer to make a deliberate choice. Then once it has been out in the field for a while and we have evidence about the way people use it, add sensible defaults in the following release. I would prefer either of those two options to attempting to just guess what the appropriate defaults would be in the absence of a wide selection of use cases. I also like the idea of using sys.byteorder as the model for the byteorder selection option (i.e. byteorder=sys.byteorder to select native layout and 'big'/'little' to choose a specific one) I think Mark also makes a good point that in cases where the size doesn't matter, marshal or pickle is probably a better choice than this direct conversion so +0 on requiring an explicit size on the conversion to bytes. Again, this is a case where *adding* a default later would be easy, but changing a default or removing a variable size feature would be difficult. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

[Oops, I forgot to CC python-ideas. Sorry Eric for the double post.] On Sat, Aug 15, 2009 at 8:13 AM, Eric Eisner<ede@mit.edu> wrote:
In my patch, I chose to use the name 'as_bytes' because it was consistent with 'float.as_integer_ratio'. Similarly, I chose the name 'frombytes' because it was consistent with 'float.fromhex'. I don't mind the inconsistent use of the underscore in the names, but I admit there is room for improvement. So, what do you think of `int.frombytes` and `int.tobytes`?
I like the byteorder option better. I believe the byteorder option shouldn't default to the native byte order, however. As you mentioned, it would be a bad choice to encourage the default behaviour to be platform-dependent. And since the primary purpose of the API is long serialization, it would be short-sighted to choose the option that cannot be used for serialization as the default. Whether it should default to 'little' or 'big' is pretty much an arbitrary choice. In my patch, I chose to default to big-endian since it is the standard network byte order. But maybe the option should default to little-endian instead since it is more widely used. In addition, the patch is slightly more efficient with little-endian.
Either is fine by me. The advantage with 'signed' as the default is 'signed' works with all longs (and not only with non-negative ones).
I still like `fixed_length` better than proposed alternatives. The name `fixed_length` makes it clear that the returned object has a fixed and constant length. And, I find `fixed_length=None` is more telling than `length=None`. -- Alexandre

On Sat, Aug 15, 2009 at 4:45 PM, Alexandre Vassalotti<alexandre@peadrop.com> wrote:
Given all this, it sounds like byteorder should be a required argument. 'In the face of ambiguity... ' and all that. As Nick pointed out, we can always add a default later when a larger set of use-cases has emerged. You say that the 'primary purpose of the API is long serialization'. I'd argue that that's not quite true. That is, I see two separate uses:

(1) Fixed-size conversions: e.g., interpreting a three-byte sequence as an integer for the purposes of bit operations, or converting an int generated by random.getrandbits(k) to a random byte sequence. (See http://bugs.python.org/msg69285 and http://bugs.python.org/msg54262.)

(2) Provide a primitive operation that's useful for serialization protocols. Here I'd guess that the details of e.g., how a negative integer is serialized would vary from protocol to protocol, so that the serialization code would in most cases still end up having to specify the fixed_length argument.

I *don't* see int.tobytes and int.frombytes (or whatever the names turn out to be) as providing integer serialization by themselves. There's no need for this, since pickle and marshal already do this job. Incidentally, the commenters in http://bugs.python.org/issue467384 have quite a lot to say on this subject. It's on this basis that I'm suggesting that the size argument should be required for int.tobytes.
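Both fixed-size uses mentioned under (1) can be sketched with the conversions as they exist in current Python 3 (`int.to_bytes` and `int.from_bytes`, not the names from the patch). The helper names and the choice of big-endian order are assumptions made for the example.

```python
import random


def random_bytes(nbytes, rng=random):
    """Turn getrandbits() output into a byte string of a known size."""
    return rng.getrandbits(8 * nbytes).to_bytes(nbytes, "big")


def set_bit(field, bit):
    """Set one bit of a fixed-size field; bit 0 is the least significant."""
    value = int.from_bytes(field, "big") | (1 << bit)
    return value.to_bytes(len(field), "big")


token = random_bytes(16, random.Random(0))
low = set_bit(b"\x00\x00\x00", 0)     # b"\x00\x00\x01"
high = set_bit(b"\x00\x00\x00", 23)   # b"\x80\x00\x00"
```

In both cases the caller already knows the output length, so a required size argument costs nothing.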
Agreed. Of course, this advantage disappears if the size argument is mandatory. I don't have any strong opinions about the method and parameter names, so I'll keep quiet on that subject. :) Mark

On Sat, Aug 15, 2009 at 12:25 PM, Mark Dickinson<dickinsm@gmail.com> wrote:
And you are totally right. Honestly, the only reason I was thinking about long serialization is because I have my hands full with pickle guts presently. :-) -- Alexandre

On Sat, Aug 15, 2009 at 5:48 PM, Alexandre Vassalotti<alexandre@peadrop.com> wrote:
Eh? But I came here for an argument! Isn't this room 12? A couple of other things: If these additions to int go in, then presumably the _PyLong_AsBytes and _PyLong_FromBytes functions should be documented and made public (and have their leading underscores removed, too). Those functions have been stable for a good while, and are well-used within the Python source; I think they're robust enough for public consumption. There may be some additional argument validation required; I'll take a look at this. In the issue tracker, Josiah Carlson asked about the possibility of backporting to 2.7. I can't see any problem with this, though there would be some small extra work involved in making things work for int as well as long. Does anyone else see any issues with this? Mark

Mark Dickinson wrote:
What would we be converting them to in that case? 2.x strings? (I don't have a problem with that, just pointing out there may be some additional work due to changing the target type). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Sat, Aug 15, 2009 at 2:42 PM, Mark Dickinson<dickinsm@gmail.com> wrote:
Oh, oh, I am sorry. This is agreement. You want 12A, next door. =)
You are referring to _PyLong_FromByteArray and _PyLong_AsByteArray, right? -- Alexandre

[Nick Coghlan]
Yes, the 2.x 'str' type. I haven't thought about whether this could cause 2-to-3 translation problems for people using this in 2.x. (I don't immediately see why it would.) Might there be political reasons not to backport this to 2.x? I seem to recall it being suggested at the PyCon language summit that we should consider making new features 3.x only, but I don't entirely remember what the rationale for this was. [Alexandre Vassalotti]
Whoops! Yes, that's what I meant. Thanks. :) Sorry for my silence on the issue tracker. I'll try to find time to look at your new patch this weekend. Mark

Mark Dickinson wrote:
I wasn't there, but there were rumbles about having a few nice carrots in 3.x to help people think it was worthwhile to switch. That said, I think the final consensus was that new features *must* go into 3.x, but if they can be backported and someone is happy to do the work then backporting isn't an issue. And in this case, the fact that the methods will produce/consume strings in 2.x shouldn't cause any more problems than any other cases that involve 2.x storing binary data in text strings. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

(I'm repeating some of the comments already made in the bug-tracker; as Antoine pointed out, discussion should probably remain here until the API is settled.) On Mon, Aug 10, 2009 at 2:13 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
Sounds good to me. I'm not sure about the 'signed=True' default; to me, a default of unsigned seems more natural. But this is bikeshedding, and I'd happily accept either default. I agree with other posters that there seems little reason not to accept the empty string. It's a natural end-case for unsigned input; whether it's natural for signed input (where there should really be at least one 'sign bit', and hence at least one byte) is arguable, but I can't see the harm in accepting it.
I'm not convinced that it's valuable to have a variable-size version of this; I'd make size a required argument. The problem with the variable size version is that the choice of byte-length for the output for a given integer input is a little bit arbitrary. For a particular requirement (producing code to conform with some existing serialization protocol, for example) it seems likely that the choice Python makes will disagree with what's required by that protocol, so that size still has to be given explicitly. On the other hand, if a user just wants a quick and easy way to serialize ints, without caring about the exact form of the serialization, then there are a number of solutions already available within Python. +1 on raising OverflowError for out-of-range inputs, instead of wrapping modulo 2**whatever. This also fits with the way that the struct module currently behaves. Does anyone see other use-cases for variable-size conversion? [Greg Ewing]
How about a parameter byteorder=None, accepting values 'big' and 'little'? Then one could use byteorder=sys.byteorder to explicitly specify native byteorder. Mark
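With a string-valued `byteorder` argument, as in current Python 3's `int.to_bytes`, the native case needs no third value because `sys.byteorder` is already `'big'` or `'little'`. A small sketch comparing it against the struct module's byte-order prefixes:

```python
import struct
import sys

value = 0x01020304

big = value.to_bytes(4, "big")             # same as struct.pack(">I", value)
little = value.to_bytes(4, "little")       # same as struct.pack("<I", value)
native = value.to_bytes(4, sys.byteorder)  # same as struct.pack("=I", value)
```

Passing `sys.byteorder` makes the platform dependence visible at the call site instead of hiding it in a default.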

participants (15)
- Alexandre Vassalotti
- Antoine Pitrou
- Eric Eisner
- Facundo Batista
- Georg Brandl
- Greg Ewing
- Guido van Rossum
- Mark Dickinson
- Mike Meyer
- MRAB
- Nick Coghlan
- Oleg Broytmann
- Ron Adam
- Steven D'Aprano
- Tal Einat