String and bytes bitwise operations

Hi all, We all know the bitwise operators: & (and), | (or), ^ (xor), and ~ (not). We know how they work with numbers: 420 ^ 502 110100100 111110110 == XOR == 001010010 = 82 But it might be useful in some cases to (let's say) xor a string (or bytestring): HELLO ^ world 01001000 01000101 01001100 01001100 01001111 01110111 01101111 01110010 01101100 01100100 =================== XOR ==================== 00111111 00101010 00111110 00100000 00101011 = ?*> + Currently, that's done with this expression for strings: >>> ''.join(chr(ord(a) ^ ord(b)) for a, b in zip('HELLO', 'world')) '?*> +' and this expression for bytestrings: >>> bytes(a ^ b for a, b in zip(b'HELLO', b'world')) b'?*> +' It would be much more convenient, however, to allow a simple xor of a string: >>> 'HELLO' ^ 'world' '?*> +' or bytestring: >>> b'HELLO' ^ b'world' b'?*> +' (All of this applies to other bitwise operators, of course.) Compatibility issues are a no-brainer - currently, bitwise operators for strings raise TypeErrors. Thanks. Suggesting, Ken Hilton ;

17.05.18 13:53, Ken Hilton пише:
The question is how common a need of these operations? If it is not common enough, they are better be implemented as functions in a third-party library.
Are you aware that this can raise a ValueError for some input strings? For example for '\U00100000' and '\U00010000'.

On Thu, May 17, 2018 at 02:14:10PM +0300, Serhiy Storchaka wrote:
The real question is, what does it mean to XOR a text string? What is 'π' XOR '≠'? I can think of at least three possibilities: '-\t' # treat the strings as UTF-8 '↠' # treat them as UTF-16 or UTF-32 '\x14' # treat them as MacRoman What if the strings are unequal lengths? But XORing equal length byte strings makes sense to me. There's no ambiguity: it is equivalent to just XORing each byte with the corresponding byte.
That works for me. py> ''.join(chr(ord(a) ^ ord(b)) for a, b in zip('\U00100000', '\U00100000')) '\x00' -- Steve

On Thu, May 17, 2018 at 03:49:02PM +0300, Serhiy Storchaka wrote:
No, he didn't explain the meaning. He gave an example, but not a reason why it should do what he showed. Why should the *abstract character* 'H' XORed with the abstract character 'w' return the abstract character '?'? Why shouldn't the result be '>' instead? (For the record that's using 'EBCDIC-CP-BE'.) The point is, XORing abstract characters seems meaningless to me. If the OP has an explanation for why 'H'^'?' must mean '?', he should explain it. XORing code points could easily generate invalid Unicode sequences containing lone surrogates, say, or undefined characters. Or as you point out, out of range values. But XORing bytes seems perfectly reasonable. Bytes are numbers, even if we display them as ASCII characters.
Oops, sorry. I misread your post. -- Steve

On 18/05/2018 07:07, Greg Ewing wrote:
Personally I would suggest that the bitwise operations would only make sense between pairs of characters, (or bytes), of equal size. (It is possible to make a rule that the returned value is the same size as the larger of the two - i.e. the smaller is promoted and left padded with 0s). But such operations don't make much sense in most cases. However the bitwise operators could sensibly be used for set-wise operations on strings, i.e.: - "123xyz" | "abc" = "123abcxyz", - "Spam" & "Eggs" = "", "Spam" & "Scam" = "Sam" - "Spam" ^ "Scam" = "cp" This __might__ have some practical use? -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. --- This email has been checked for viruses by AVG. http://www.avg.com

On Thu, May 17, 2018 at 11:07 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
actually, bytes are, well, bytes ;-) -- that is, 8 bits. But the point is that "bitwise" operations make all the sense in the world for bytes, but not for unicode text -- did anyone have a use case for bitwise operation on unicode strings? I can't imagine one, even if you could agree on a definition... Yep. Implement it for bytes, What exactly would be implemented? bytes is a sequence type, so would a bitwise operator perform the operation "elementwise"? (i.e. like numpy) or would it operate on the whole sequence as a single collection of bits? Would it be any different? Without thinking hard, it seems some operations, like AND and OR and XOR would be the same, but bit shifting would be different. And then what do you do if the two bytes objects are not the same length? If "elementwise", then we should think carefully about that -- no where else does Python do things elementwise in the standard library -- and we already have numpy if you want to do it now. and a bytes object can be representing any type of data -- so one byte at a time might make sense, but maybe two bytes at a time makes more sense -- and if the data is representing, say an integer, then endian-ness matters... All this is handled by numpy by having multiple data types that can be "mapped" to a buffer. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

2018-05-18 13:32 GMT-03:00 Chris Barker via Python-ideas <python-ideas@python.org>:
or would it operate on the whole sequence as a single collection of bits?
Once you think as the whole sequence of bytes as a sequence of bits (eight times longer, of course), all questions are easly answered, see below...
AND, XOR and OR are simple. Bit shifting will shift bits for the whole bit sequence.
And then what do you do if the two bytes objects are not the same length?
Pad with 0s to the left. Exactly the same that happen in
1 ^ 1231312312 1231312313
Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ Twitter: @facundobatista

On 5/18/18 12:32 PM, Chris Barker via Python-ideas wrote:
The one case I could think of is a very crude form of encryption. Probably most makes sense if you assume the string is really just ASCII. Not sure if that is a good enough case to be worth adding it, as often you special case some characters (like control characters) to avoid messing things up too much for display. -- Richard Damon

On Sat, May 19, 2018 at 2:32 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
Grammatically, you appear to be disagreeing with the assertion that bytes are numbers. Is that the case? If you want to be extremely technical, an "octet" is a group of eight bits (or eight musicians, but I haven't yet figured out how to send musicians down an ethernet cable), and a "byte" isn't as rigidly defined. But on modern PCs, you can fairly safely assume that they're synonymous. I suppose you could argue that a "byte" is a patch of storage capable of holding a number from 0 to 255, as opposed to being the number itself, but that's getting rather existential :) In Python, a "bytes" object represents a sequence of eight-bit units. When you subscript a bytes [1], you get back an integer with the value at that position. So if a collection of them is called a "bytes" and one of them is an integer in range(0, 256), doesn't it stand to reason that a byte is a number? Maybe I'm completely misunderstanding your statement here. ChrisA [1] Yes, I'm aware that older versions of Python behaved differently. [2] [2] I'm also aware that putting square brackets after the word "bytes" could be interpreted as subscripting the 'bytes' type itself, rather than creating a reference to a footnote. Ain't contextual grammar fun?

Um, yes. Though I think for the rest of the conversation, it’s a distinction that doesn’t matter.
Sure — a byte, I think, is the smallest unit of memory that can be addressed. It could be more or less that 8 bytes, but that wasn’t the point either.
No, I’m making the distinction that an eight bit byte is, well, eight bits, that CAN represent a number from 0 to 255, or it can represent any other data type — like one eighth of the bits in a float, for instance. Or a bit field, or 1/2 a 16 bit int.
And when you print it, you get the ascii characters corresponding to each byte.... So one element in a bytes object is no more an integer than a character....
We use a decimal number (and ascii) to represent the bytes, as it’s more human readable and consistent with other python types.
Maybe I'm completely misunderstanding your statement here.
Again, it doesn’t much matter, until you get to deciding how to bitshift an entire bytes object. -CHB

On Sat, May 19, 2018 at 8:25 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
Since "bit" simply means "binary digit", that's like saying that a four-digit number isn't a number; it MIGHT represent a number, but it might represent one quarter of your credit card. Is "4564" less of a number for that reason?
That's because those numbers can often be used to represent characters. But they are really and truly numbers. (If you want to get down to brass tacks, a Unicode string could be treated as a sequence of 21-bit numbers. And in some languages, the "string" type is actually a highly-optimized version of 21-bit-number-array - or 32-bit, perhaps - with fully supported use-cases involving numerical (non-textual) data.)
So one element in a bytes object is no more an integer than a character....
Except that the bytestring b"\x00\x80\xff\x99" very clearly represents four numbers, but doesn't clearly represent any characters.
Bitshifting a sequence of bytes has nothing whatsoever to do with characters. It has to do with the individual numbers, and then you have to decide how you represent those as a collective: little-endian or big-endian. That's still a matter of numbers, not characters. ChrisA

On Fri, May 18, 2018 at 06:25:19PM -0400, Chris Barker - NOAA Federal via Python-ideas wrote:
That might be the case if we were talking about chunks of memory in RAM, but we're talking about *bytes objects in Python* where the individual items in the object most certainly represent ints between 0 and 255: py> b"abcde"[3] 100 Philosophical arguments about the nature of computer memory aside, byte objects in Python are collections of ints. If you want those ints to represent something else, you're responsible for handling that (say, using the struct module). -- Steve

On Sat, May 19, 2018 at 6:52 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Philosophical arguments about the nature of computer memory aside, byte objects in Python are collections of ints.
not when you start talking about bit-wise operations :-) If a "byte" in python was an integer, then we'd use b**2 rather than b << 1 (yes, I know bit shifting can be more efficient, but I don't think that's why it's there in python) The entire point of bitwise operators is so that the bits themselves can be accessed and manipulated.
If you want those ints to represent something else, you're responsible for handling that (say, using the struct module).
yup -- with struct, and, hmm, maybe bitwise operators? Anyway, as you say, this is a Philosophical (or semantic) point -- I don't think it effects the discussion at hand. However, when you talk about bit-shifting a bytes object, you do need to decide if each byte is handled individually, or if they are one big collection. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, May 21, 2018 at 11:22:17AM -0700, Chris Barker wrote:
Why? Ints support bit-shift operations: py> 45 << 1 90 as well as other bitwise operations. (And b<<1 isn't the same as b**2.) I trust that you aren't going to argue that 45 isn't an int.
Then where are the primitives for accessing and manipulating individual bits?
Right. Which is why we want to add support for bitwise operators to bytes.
True, we have to decide that. Is there a good reason to handle each byte individually? -- Steve

On 5/17/2018 6:53 AM, Ken Hilton wrote:
https://bugs.python.org/issue19251 bitwise ops for bytes of equal length -- Terry Jan Reedy

Bitwise xor is used for "masking" code like these: https://github.com/PyMySQL/PyMySQL/blob/37eba60439039eff17b32ef1a63b45c25ea2... https://github.com/tornadoweb/tornado/blob/0b2b055061eb4754c80a8d6bc28614b86... https://github.com/tornadoweb/tornado/blob/master/tornado/speedups.c#L5 I think implementing it in C is really helpful for protocol library authors. On Thu, May 17, 2018 at 7:54 PM Ken Hilton <kenlhilton@gmail.com> wrote:
-- INADA Naoki <songofacandy@gmail.com>

On 6/22/2018 7:08 AM, INADA Naoki wrote:
Bitwise xor is used for "masking" code like these:
https://github.com/PyMySQL/PyMySQL/blob/37eba60439039eff17b32ef1a63b45c25ea2...
This points to a function _my_crypt that is O(n*n) because of using bytes.append. Using bytearray.append makes it O(n). --- import random import struct import timeit range_type = range def _my_crypt(message1, message2): length = len(message1) result = b'' for i in range_type(length): x = (struct.unpack('B', message1[i:i+1])[0] ^ struct.unpack('B', message2[i:i+1])[0]) result += struct.pack('B', x) return result def _my_crypt2(message1, message2): length = len(message1) result = bytearray() for i in range_type(length): x = (struct.unpack('B', message1[i:i+1])[0] ^ struct.unpack('B', message2[i:i+1])[0]) result += struct.pack('B', x) return bytes(result) def make(n): result = bytearray() for i in range(n): result.append(random.randint(0, 255)) return result for m in (10, 100, 1000, 10_000, 100_000, 1000_000): m1 = make(m) m2 = make(m) n = 1000_000 // m print(f'bytes len {m}, timeit reps {n}') print('old ', timeit.timeit('_my_crypt(m1, m2)', number = n, globals=globals())) print('new ', timeit.timeit('_my_crypt2(m1, m2)', number = n, globals=globals())) --- prints bytes len 10, timeit reps 100000 old 1.2277594129999998 new 1.2174212309999999 bytes len 100, timeit reps 10000 old 1.145566423 new 1.0924002120000003 bytes len 1000, timeit reps 1000 old 1.2860306190000002 new 1.1168685839999999 bytes len 10000, timeit reps 100 old 1.6543344650000003 new 1.118191714 bytes len 100000, timeit reps 10 old 4.2568492110000005 new 1.1266137560000011 bytes len 1000000, timeit reps 1 old 60.651238144000004 new 1.1315020199999992 I tried to submit this to https://github.com/PyMySQL/PyMySQL/issues/new but [Submit] does not work for me. -- Terry Jan Reedy

Hi Terry, Thanks, but I didn't care because my password is not so long. I just want to illustrate real world bytes xor usage. BTW, New MySQL auth methods (sha256 and caching_sha2) use bytes xor too. For performance point of view, websocket masking is performance critical. Tornado uses extension module only for it. If bytearray ^= bytes is supported, websocket frame masking may look like: frame ^= mask * ((len(frame)+3)//4) # mask is 4 bytes long On Sat, Jun 23, 2018 at 1:26 AM Terry Reedy <tjreedy@udel.edu> wrote:
-- INADA Naoki <songofacandy@gmail.com>

17.05.18 13:53, Ken Hilton пише:
The question is how common a need of these operations? If it is not common enough, they are better be implemented as functions in a third-party library.
Are you aware that this can raise a ValueError for some input strings? For example for '\U00100000' and '\U00010000'.

On Thu, May 17, 2018 at 02:14:10PM +0300, Serhiy Storchaka wrote:
The real question is, what does it mean to XOR a text string? What is 'π' XOR '≠'? I can think of at least three possibilities: '-\t' # treat the strings as UTF-8 '↠' # treat them as UTF-16 or UTF-32 '\x14' # treat them as MacRoman What if the strings are unequal lengths? But XORing equal length byte strings makes sense to me. There's no ambiguity: it is equivalent to just XORing each byte with the corresponding byte.
That works for me. py> ''.join(chr(ord(a) ^ ord(b)) for a, b in zip('\U00100000', '\U00100000')) '\x00' -- Steve

On Thu, May 17, 2018 at 03:49:02PM +0300, Serhiy Storchaka wrote:
No, he didn't explain the meaning. He gave an example, but not a reason why it should do what he showed. Why should the *abstract character* 'H' XORed with the abstract character 'w' return the abstract character '?'? Why shouldn't the result be '>' instead? (For the record that's using 'EBCDIC-CP-BE'.) The point is, XORing abstract characters seems meaningless to me. If the OP has an explanation for why 'H'^'?' must mean '?', he should explain it. XORing code points could easily generate invalid Unicode sequences containing lone surrogates, say, or undefined characters. Or as you point out, out of range values. But XORing bytes seems perfectly reasonable. Bytes are numbers, even if we display them as ASCII characters.
Oops, sorry. I misread your post. -- Steve

I agree with Steven. XORing unicode strings doesn't make sense, and is pointless anyway. The only interesting question is whether we want to add bytewise operations to the stdlib. Regards Antoine. On Thu, 17 May 2018 23:13:22 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

On 18/05/2018 07:07, Greg Ewing wrote:
Personally I would suggest that the bitwise operations would only make sense between pairs of characters, (or bytes), of equal size. (It is possible to make a rule that the returned value is the same size as the larger of the two - i.e. the smaller is promoted and left padded with 0s). But such operations don't make much sense in most cases. However the bitwise operators could sensibly be used for set-wise operations on strings, i.e.: - "123xyz" | "abc" = "123abcxyz", - "Spam" & "Eggs" = "", "Spam" & "Scam" = "Sam" - "Spam" ^ "Scam" = "cp" This __might__ have some practical use? -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. --- This email has been checked for viruses by AVG. http://www.avg.com

On Thu, May 17, 2018 at 11:07 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
actually, bytes are, well, bytes ;-) -- that is, 8 bits. But the point is that "bitwise" operations make all the sense in the world for bytes, but not for unicode text -- did anyone have a use case for bitwise operation on unicode strings? I can't imagine one, even if you could agree on a definition... Yep. Implement it for bytes, What exactly would be implemented? bytes is a sequence type, so would a bitwise operator perform the operation "elementwise"? (i.e. like numpy) or would it operate on the whole sequence as a single collection of bits? Would it be any different? Without thinking hard, it seems some operations, like AND and OR and XOR would be the same, but bit shifting would be different. And then what do you do if the two bytes objects are not the same length? If "elementwise", then we should think carefully about that -- no where else does Python do things elementwise in the standard library -- and we already have numpy if you want to do it now. and a bytes object can be representing any type of data -- so one byte at a time might make sense, but maybe two bytes at a time makes more sense -- and if the data is representing, say an integer, then endian-ness matters... All this is handled by numpy by having multiple data types that can be "mapped" to a buffer. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

2018-05-18 13:32 GMT-03:00 Chris Barker via Python-ideas <python-ideas@python.org>:
or would it operate on the whole sequence as a single collection of bits?
Once you think as the whole sequence of bytes as a sequence of bits (eight times longer, of course), all questions are easly answered, see below...
AND, XOR and OR are simple. Bit shifting will shift bits for the whole bit sequence.
And then what do you do if the two bytes objects are not the same length?
Pad with 0s to the left. Exactly the same that happen in
1 ^ 1231312312 1231312313
Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ Twitter: @facundobatista

On 5/18/18 12:32 PM, Chris Barker via Python-ideas wrote:
The one case I could think of is a very crude form of encryption. Probably most makes sense if you assume the string is really just ASCII. Not sure if that is a good enough case to be worth adding it, as often you special case some characters (like control characters) to avoid messing things up too much for display. -- Richard Damon

On Sat, May 19, 2018 at 2:32 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
Grammatically, you appear to be disagreeing with the assertion that bytes are numbers. Is that the case? If you want to be extremely technical, an "octet" is a group of eight bits (or eight musicians, but I haven't yet figured out how to send musicians down an ethernet cable), and a "byte" isn't as rigidly defined. But on modern PCs, you can fairly safely assume that they're synonymous. I suppose you could argue that a "byte" is a patch of storage capable of holding a number from 0 to 255, as opposed to being the number itself, but that's getting rather existential :) In Python, a "bytes" object represents a sequence of eight-bit units. When you subscript a bytes [1], you get back an integer with the value at that position. So if a collection of them is called a "bytes" and one of them is an integer in range(0, 256), doesn't it stand to reason that a byte is a number? Maybe I'm completely misunderstanding your statement here. ChrisA [1] Yes, I'm aware that older versions of Python behaved differently. [2] [2] I'm also aware that putting square brackets after the word "bytes" could be interpreted as subscripting the 'bytes' type itself, rather than creating a reference to a footnote. Ain't contextual grammar fun?

Um, yes. Though I think for the rest of the conversation, it’s a distinction that doesn’t matter.
Sure — a byte, I think, is the smallest unit of memory that can be addressed. It could be more or less that 8 bytes, but that wasn’t the point either.
No, I’m making the distinction that an eight bit byte is, well, eight bits, that CAN represent a number from 0 to 255, or it can represent any other data type — like one eighth of the bits in a float, for instance. Or a bit field, or 1/2 a 16 bit int.
And when you print it, you get the ascii characters corresponding to each byte.... So one element in a bytes object is no more an integer than a character....
We use a decimal number (and ascii) to represent the bytes, as it’s more human readable and consistent with other python types.
Maybe I'm completely misunderstanding your statement here.
Again, it doesn’t much matter, until you get to deciding how to bitshift an entire bytes object. -CHB

On Sat, May 19, 2018 at 8:25 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
Since "bit" simply means "binary digit", that's like saying that a four-digit number isn't a number; it MIGHT represent a number, but it might represent one quarter of your credit card. Is "4564" less of a number for that reason?
That's because those numbers can often be used to represent characters. But they are really and truly numbers. (If you want to get down to brass tacks, a Unicode string could be treated as a sequence of 21-bit numbers. And in some languages, the "string" type is actually a highly-optimized version of 21-bit-number-array - or 32-bit, perhaps - with fully supported use-cases involving numerical (non-textual) data.)
So one element in a bytes object is no more an integer than a character....
Except that the bytestring b"\x00\x80\xff\x99" very clearly represents four numbers, but doesn't clearly represent any characters.
Bitshifting a sequence of bytes has nothing whatsoever to do with characters. It has to do with the individual numbers, and then you have to decide how you represent those as a collective: little-endian or big-endian. That's still a matter of numbers, not characters. ChrisA

On Fri, May 18, 2018 at 06:25:19PM -0400, Chris Barker - NOAA Federal via Python-ideas wrote:
That might be the case if we were talking about chunks of memory in RAM, but we're talking about *bytes objects in Python* where the individual items in the object most certainly represent ints between 0 and 255: py> b"abcde"[3] 100 Philosophical arguments about the nature of computer memory aside, byte objects in Python are collections of ints. If you want those ints to represent something else, you're responsible for handling that (say, using the struct module). -- Steve

On Sat, May 19, 2018 at 6:52 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Philosophical arguments about the nature of computer memory aside, byte objects in Python are collections of ints.
not when you start talking about bit-wise operations :-) If a "byte" in python was an integer, then we'd use b**2 rather than b << 1 (yes, I know bit shifting can be more efficient, but I don't think that's why it's there in python) The entire point of bitwise operators is so that the bits themselves can be accessed and manipulated.
If you want those ints to represent something else, you're responsible for handling that (say, using the struct module).
yup -- with struct, and, hmm, maybe bitwise operators? Anyway, as you say, this is a Philosophical (or semantic) point -- I don't think it effects the discussion at hand. However, when you talk about bit-shifting a bytes object, you do need to decide if each byte is handled individually, or if they are one big collection. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, May 21, 2018 at 11:22:17AM -0700, Chris Barker wrote:
Why? Ints support bit-shift operations: py> 45 << 1 90 as well as other bitwise operations. (And b<<1 isn't the same as b**2.) I trust that you aren't going to argue that 45 isn't an int.
Then where are the primitives for accessing and manipulating individual bits?
Right. Which is why we want to add support for bitwise operators to bytes.
True, we have to decide that. Is there a good reason to handle each byte individually? -- Steve

On 5/17/2018 6:53 AM, Ken Hilton wrote:
https://bugs.python.org/issue19251 bitwise ops for bytes of equal length -- Terry Jan Reedy

Bitwise xor is used for "masking" code like these: https://github.com/PyMySQL/PyMySQL/blob/37eba60439039eff17b32ef1a63b45c25ea2... https://github.com/tornadoweb/tornado/blob/0b2b055061eb4754c80a8d6bc28614b86... https://github.com/tornadoweb/tornado/blob/master/tornado/speedups.c#L5 I think implementing it in C is really helpful for protocol library authors. On Thu, May 17, 2018 at 7:54 PM Ken Hilton <kenlhilton@gmail.com> wrote:
-- INADA Naoki <songofacandy@gmail.com>

On 6/22/2018 7:08 AM, INADA Naoki wrote:
Bitwise xor is used for "masking" code like these:
https://github.com/PyMySQL/PyMySQL/blob/37eba60439039eff17b32ef1a63b45c25ea2...
This points to a function _my_crypt that is O(n*n) because of using bytes.append. Using bytearray.append makes it O(n). --- import random import struct import timeit range_type = range def _my_crypt(message1, message2): length = len(message1) result = b'' for i in range_type(length): x = (struct.unpack('B', message1[i:i+1])[0] ^ struct.unpack('B', message2[i:i+1])[0]) result += struct.pack('B', x) return result def _my_crypt2(message1, message2): length = len(message1) result = bytearray() for i in range_type(length): x = (struct.unpack('B', message1[i:i+1])[0] ^ struct.unpack('B', message2[i:i+1])[0]) result += struct.pack('B', x) return bytes(result) def make(n): result = bytearray() for i in range(n): result.append(random.randint(0, 255)) return result for m in (10, 100, 1000, 10_000, 100_000, 1000_000): m1 = make(m) m2 = make(m) n = 1000_000 // m print(f'bytes len {m}, timeit reps {n}') print('old ', timeit.timeit('_my_crypt(m1, m2)', number = n, globals=globals())) print('new ', timeit.timeit('_my_crypt2(m1, m2)', number = n, globals=globals())) --- prints bytes len 10, timeit reps 100000 old 1.2277594129999998 new 1.2174212309999999 bytes len 100, timeit reps 10000 old 1.145566423 new 1.0924002120000003 bytes len 1000, timeit reps 1000 old 1.2860306190000002 new 1.1168685839999999 bytes len 10000, timeit reps 100 old 1.6543344650000003 new 1.118191714 bytes len 100000, timeit reps 10 old 4.2568492110000005 new 1.1266137560000011 bytes len 1000000, timeit reps 1 old 60.651238144000004 new 1.1315020199999992 I tried to submit this to https://github.com/PyMySQL/PyMySQL/issues/new but [Submit] does not work for me. -- Terry Jan Reedy

Hi Terry, Thanks, but I didn't care because my password is not so long. I just want to illustrate real world bytes xor usage. BTW, New MySQL auth methods (sha256 and caching_sha2) use bytes xor too. For performance point of view, websocket masking is performance critical. Tornado uses extension module only for it. If bytearray ^= bytes is supported, websocket frame masking may look like: frame ^= mask * ((len(frame)+3)//4) # mask is 4 bytes long On Sat, Jun 23, 2018 at 1:26 AM Terry Reedy <tjreedy@udel.edu> wrote:
-- INADA Naoki <songofacandy@gmail.com>
participants (14)
-
Antoine Pitrou
-
Chris Angelico
-
Chris Barker
-
Chris Barker - NOAA Federal
-
Facundo Batista
-
Greg Ewing
-
INADA Naoki
-
Ken Hilton
-
Richard Damon
-
Serhiy Storchaka
-
Stephan Houben
-
Steve Barnes
-
Steven D'Aprano
-
Terry Reedy