Add an option for delimiters in bytes.hex()

The bytes.hex() function is the inverse function of bytes.fromhex(). But fromhex can process spaces (which is much more readable), while hex() provides no way to include spaces. My proposal would be to add an optional delimiter that allows specifying a string that will be inserted between the digit pairs of a byte:

    def hex(self, delimiter=''): ...

This would allow one to write:

    assert b'abc'.hex(' ') == '61 62 63'

Sent from Mail for Windows 10
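
For readers trying this out: no such parameter exists in bytes.hex() at the time of this thread, but the requested behaviour can be approximated with a small helper. This is only an illustrative sketch - the name hex_delimited is invented here and is not part of the proposal:

    def hex_delimited(data, delimiter=''):
        """Return data.hex() with *delimiter* inserted between byte pairs."""
        digits = data.hex()
        # Each byte becomes two hex digits, so step through the string two
        # characters at a time and join the pairs with the delimiter.
        return delimiter.join(digits[i:i + 2] for i in range(0, len(digits), 2))

    assert hex_delimited(b'abc', ' ') == '61 62 63'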

On 1 May 2017 at 17:19, <robert.hoelzl@posteo.de> wrote:
The bytes.hex() function is the inverse function of bytes.fromhex().
But fromhex can process spaces (which is much more readable), while hex() provides no way to include spaces.
My proposal would be to add an optional delimiter that allows specifying a string that will be inserted between the digit pairs of a byte:
def hex(self, delimiter=''): ...
We're definitely open to offering better formatting options for bytes.hex(). My proposal in https://bugs.python.org/issue22385 was to define a new formatting mini-language (akin to the way strftime works, but with a much simpler formatting mini-language): http://bugs.python.org/issue22385#msg292663

However, a much simpler alternative would be to just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk. Cheers, -- Juancarlo *Añez*

On 05/01/2017 07:04 AM, Juancarlo Añez wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
I was also surprised by that. Also, should Python be used on a machine with, say, 24-bit words then a chunk size of three makes more sense than one of 1.5. ;)

-- ~Ethan~

On 2017-05-01 01:34 PM, Ethan Furman wrote:
On 05/01/2017 07:04 AM, Juancarlo Añez wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
I was also surprised by that. Also, should Python be used on a machine with, say, 24-bit words then a chunk size of three makes more sense than one of 1.5. ;)
-- ~Ethan~

A hex digit is 4 bits long. To separate into words, the 24-bit word Python would use 3 (counting in bytes as initially proposed), or 6 (counting in hex digits). Neither option would result in a 1.5 chunk_size for 24-bit chunks.

Counting chunk_size either in nibbles or bytes seems equally intuitive to me (as long as it's documented).

On 2017-05-01 01:41 PM, Alexandre Brault wrote:
On 2017-05-01 01:34 PM, Ethan Furman wrote:
On 05/01/2017 07:04 AM, Juancarlo Añez wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
I was also surprised by that. Also, should Python be used on a machine with, say, 24-bit words then a chunk size of three makes more sense than one of 1.5. ;)
-- ~Ethan~
A hex digit is 4 bits long. To separate into words, the 24-bit word Python would use 3 (counting in bytes as initially proposed), or 6 (counting in hex digits). Neither option would result in a 1.5 chunk_size for 24-bit chunks.
Counting chunk_size either in nibbles or bytes seems equally intuitive to me (as long as it's documented).

And I only just realised your main concern was about the 12-bit byte of that 24-bit word architecture. Carry on.

On 5/1/2017 1:41 PM, Alexandre Brault wrote:
On 2017-05-01 01:34 PM, Ethan Furman wrote:
On 05/01/2017 07:04 AM, Juancarlo Añez wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
I was also surprised by that. Also, should Python be used on a machine with, say, 24-bit words then a chunk size of three makes more sense than one of 1.5. ;)
-- ~Ethan~
A hex digit is 4 bits long. To separate into words, the 24-bit word Python would use 3 (counting in bytes as initially proposed), or 6 (counting in hex digits). Neither option would result in a 1.5 chunk_size for 24-bit chunks.
Counting chunk_size either in nibbles or bytes seems equally intuitive to me (as long as it's documented).

Call the parameter 'octets' and it should be clear that it means 8-bit chunks. Do any machines now use anything else?

-- Terry Jan Reedy

On 2 May 2017 at 03:34, Ethan Furman <ethan@stoneleaf.us> wrote:
On 05/01/2017 07:04 AM, Juancarlo Añez wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
I was also surprised by that. Also, should Python be used on a machine with, say, 24-bit words then a chunk size of three makes more sense than one of 1.5. ;)

I came up with a possible alternative scheme on the issue tracker:

    def hex(self, *, group_digits=None, delimiter=" "):
        """B.hex() -> string of hex digits
        B.hex(group_digits=N) -> hex digits in groups separated by *delimiter*

        Create a string of hexadecimal numbers from a bytes object::

            >>> b'\xb9\x01\xef'.hex()
            'b901ef'
            >>> b'\xb9\x01\xef'.hex(group_digits=2)
            'b9 01 ef'
        """

Advantages of this approach:

- grouping by digits generalises more obviously to other bases (e.g. if similar arguments were ever added to the hex/oct/bin builtins)
- by using "group_digits=None" to indicate "no grouping", the default delimiter can be a space rather than the empty string

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1 May 2017 at 11:04, Juancarlo Añez <apalala@gmail.com> wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
So do I. Moreover, if "1" is for two digits, there is no way to specify single digits - for little use we can perceive for that. Maybe it does not need to be named "chunk_size" - "digits_per_block" is too big, but is precise. Also, whatever we think is good for "hex" could also be done to "bin" .
Cheers,
-- Juancarlo Añez

Couldn't it just be named `str.delimit`? I totally agree with Steve for what it's worth. Thanks for everything guys. Best,

On Tue, 2 May 2017 13:02 Joao S. O. Bueno, <jsbueno@python.org.br> wrote:
On 1 May 2017 at 11:04, Juancarlo Añez <apalala@gmail.com> wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
So do I. Moreover, if "1" is for two digits, there is no way to specify single digits - for little use we can perceive for that.
Maybe it does not need to be named "chunk_size" - "digits_per_block" is too big, but is precise.
Also, whatever we think is good for "hex" could also be done to "bin" .
Cheers,
-- Juancarlo Añez

Sorry. I meant to be terse, but wasn't clear enough. I meant the method name. If it takes a `delimiter` kwarg, it would be consistent to call the operation `delimit`.

On Tue, 2 May 2017 13:06 Carl Smith, <carl.input@gmail.com> wrote:
Couldn't it just be named `str.delimit`? I totally agree with Steve for what it's worth. Thanks for everything guys. Best,
On Tue, 2 May 2017 13:02 Joao S. O. Bueno, <jsbueno@python.org.br> wrote:
On 1 May 2017 at 11:04, Juancarlo Añez <apalala@gmail.com> wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
So do I. Moreover, if "1" is for two digits, there is no way to specify single digits - for little use we can perceive for that.
Maybe it does not need to be named "chunk_size" - "digits_per_block" is too big, but is precise.
Also, whatever we think is good for "hex" could also be done to "bin" .
Cheers,
-- Juancarlo Añez

On the block size arg, couldn't it just be named `index`?

On Tue, 2 May 2017 13:12 Carl Smith, <carl.input@gmail.com> wrote:
Sorry. I meant to be terse, but wasn't clear enough. I meant the method name. If it takes a `delimiter` kwarg, it would be consistent to call the operation `delimit`.
On Tue, 2 May 2017 13:06 Carl Smith, <carl.input@gmail.com> wrote:
Couldn't it just be named `str.delimit`? I totally agree with Steve for what it's worth. Thanks for everything guys. Best,
On Tue, 2 May 2017 13:02 Joao S. O. Bueno, <jsbueno@python.org.br> wrote:
On 1 May 2017 at 11:04, Juancarlo Añez <apalala@gmail.com> wrote:
On Mon, May 1, 2017 at 9:38 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I'd expect "chunk_size" to mean the number of hex digits (not bytes) per chunk.
So do I. Moreover, if "1" is for two digits, there is no way to specify single digits - for little use we can perceive for that.
Maybe it does not need to be named "chunk_size" - "digits_per_block" is too big, but is precise.
Also, whatever we think is good for "hex" could also be done to "bin" .
Cheers,
-- Juancarlo Añez

For a name, I think "group" would be better than "chunk". We talk about grouping the digits of a number, not chunking them. -- Greg

On 3 May 2017 at 08:10, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
For a name, I think "group" would be better than "chunk". We talk about grouping the digits of a number, not chunking them.
As soon as I added an intermediate variable to my example, I came to the same conclusion: >>> digit_groups = b'\xb9\x01\xef'.hex().splitgroups(2) >>> ' '.join(digit_groups) 'b9 01 ef' (from http://bugs.python.org/issue22385#msg292900) And for David's telephone number examples: >>> digit_groups = str(4135559414).rsplitgroups(4,3) >>> '-'.join(digit_groups) '413-555-9414' >>> digit_groups = "0113225551212".rsplitgroups(2,2,3,1,2,3) >>> '-'.join(digit_groups) '011-32-2-555-12-12' Another example would be generating numeric literals with underscores: >>> digit_groups = str(int(1e6).rsplitgroups(3) >>> '_'.join(digit_groups) '1_000_000' While a generalised reversed version wouldn't be possible, a corresponding "itertools.itergroups" function could be used to produce groups of defined lengths as islice iterators, similar to the way itertools.groupby works (i.e. producing subiterators of variable length rather than a fixed length tuple the way the grouper() recipe in the docs does). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
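
The splitgroups()/rsplitgroups() methods used above are hypothetical - they exist only in this tracker sketch. One rough pure-Python rendering of the semantics implied by the examples, written as plain functions rather than str methods, might be (assuming at least one group size is given):

    def splitgroups(s, size):
        """Split *s* into groups of *size* characters, working left to right."""
        return [s[i:i + size] for i in range(0, len(s), size)]

    def rsplitgroups(s, *sizes):
        """Split *s* from the right into groups of the given sizes,
        repeating the last size for whatever is left over."""
        groups = []
        end = len(s)
        it = iter(sizes)
        size = None
        while end > 0:
            size = next(it, size)  # once *sizes* is exhausted, reuse the last one
            groups.append(s[max(end - size, 0):end])
            end -= size
        groups.reverse()
        return groups

    assert ' '.join(splitgroups(b'\xb9\x01\xef'.hex(), 2)) == 'b9 01 ef'
    assert '-'.join(rsplitgroups(str(4135559414), 4, 3)) == '413-555-9414'
    assert '_'.join(rsplitgroups(str(int(1e6)), 3)) == '1_000_000'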

On Mon, May 01, 2017 at 11:38:20PM +1000, Nick Coghlan wrote:
We're definitely open to offering better formatting options for bytes.hex().
My proposal in https://bugs.python.org/issue22385 was to define a new formatting mini-language (akin to the way strftime works, but with a much simpler formatting mini-language): http://bugs.python.org/issue22385#msg292663
However, a much simpler alternative would be to just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I disagree with this approach. There's nothing special about bytes.hex() here, perhaps we want to format the output of hex() or bin() or oct(), or for that matter "%x" and any of the other string templates? In fact, this is a string operation that could apply to any character string, including decimal digits.

Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:

    "1234ABCDEF".chunk(4) => returns "1234 ABCD EF"

rchunk will be useful for money or other situations where we group from the right rather than from the left:

    "$" + str(10**6).rchunk(3, ',') => returns "$1,000,000"

And if we want to add bells and whistles, we could accept a tuple for the size argument:

    # Format mobile phone number in the Australian style
    "0412345678".rchunk((4, 3)) => returns "0412 345 678"

    # Format an integer in the Indian style
    str(123456789).rchunk((3, 2), ",") => returns "12,34,56,789"

In the OP's use-case:

    bytes("abcde", "ascii").hex().chunk(2) => returns '61 62 63 64 65'
    bytes("abcde", "ascii").hex().chunk(4) => returns '6162 6364 65'

I don't see any advantage to adding this to bytes.hex(), hex(), oct(), bin(), and I really don't think it is helpful to be grouping the characters by the number of bits. It's a string formatting operation, not a bit operation.

-- Steve
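
Since str.chunk() and str.rchunk() are only proposals, here is a minimal sketch of the single-size form as plain functions; the exact semantics are of course still up for discussion, and the tuple-of-sizes form is sketched further below, after David Mertz's reply:

    def chunk(s, size, delimiter=' '):
        """Group *s* into blocks of *size* characters, counting from the left."""
        return delimiter.join(s[i:i + size] for i in range(0, len(s), size))

    def rchunk(s, size, delimiter=' '):
        """Group *s* into blocks of *size* characters, counting from the right."""
        first = len(s) % size or size  # length of the (possibly short) leftmost block
        blocks = [s[:first]] + [s[i:i + size] for i in range(first, len(s), size)]
        return delimiter.join(blocks)

    assert chunk("1234ABCDEF", 4) == "1234 ABCD EF"
    assert "$" + rchunk(str(10**6), 3, ',') == "$1,000,000"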

On 2 May 2017 at 21:31, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, May 01, 2017 at 11:38:20PM +1000, Nick Coghlan wrote:
However, a much simpler alternative would be to just support two keyword arguments to hex(): "delimiter" (as you suggest) and "chunk_size" (defaulting to 1, so you get per-byte chunking by default)
I disagree with this approach. There's nothing special about bytes.hex() here, perhaps we want to format the output of hex() or bin() or oct(), or for that matter "%x" and any of the other string templates?
In fact, this is a string operation that could apply to any character string, including decimal digits.
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF"
rchunk will be useful for money or other situations where we group from the right rather than from the left:
"$" + str(10**6).rchunk(3, ',') => returns "$1,000,000"
Nice. That proposal also addresses one of the problems I raised in the issue tracker, which is that the decimal equivalent to hex/oct/bin is just str, so anything based on keyword arguments to the display functions is hard to apply to ordinary decimal numbers.

Attempting to align the terminology with existing string methods and other stdlib APIs:

1. the programming FAQ uses "chunks" as the accumulation variable prior to calling str.join(): https://docs.python.org/3/faq/programming.html#what-is-the-most-efficient-wa...
2. the most analogous itertools recipe is the "grouper" recipe, which describes its purpose as "Collect data into fixed-length chunks or blocks"
3. there's a top level "chunk" module for working with audio file formats (today-I-learned...)
4. multiprocessing uses "chunksize" to manage the dispatching of work to worker processes
5. various networking, IO and serialisation libraries use "chunk" to describe data blocks for incremental reads and writes

I think a couple of key problems are illustrated by that survey:

1. we don't have any current APIs or documentation that use "chunk" in combination with any kind of delimiter
2. we don't have any current APIs or documentation that use "chunk" as a verb - they all use it as a noun

So if we went with this approach, then Carl Smith's suggestion of "str.delimit()" likely makes sense.

However, the other question worth asking is whether we might want a "string slice splitting" operation rather than a string delimiting option: once you have the slices, then combining them again with str.join is straightforward, but extracting the slices in the first place is currently a little fiddly (especially for the reversed case):

    def splitslices(self, size):
        return [self[start:start+size] for start in range(0, len(self), size)]

    def rsplitslices(self, size):
        blocks = [self[max(start, 0):start+size]
                  for start in range(len(self) - size, -size, -size)]
        blocks.reverse()
        return blocks

Given those methods, the split-and-rejoin use case that started the thread would look like:

    " ".join("1234ABCDEF".splitslices(4)) => "1234 ABCD EF"
    "$" + ",".join(str(10**6).rsplitslices(3)) => "$1,000,000"

Which is the same pattern that can be used to change a delimiter with str.split() and str.splitlines().

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, May 02, 2017 at 11:45:35PM +1000, Nick Coghlan wrote:
Attempting to align the terminology with existing string methods and other stdlib APIs: [...] 1. we don't have any current APIs or documentation that use "chunk" in combination with any kind of delimiter 2. we don't have any current APIs or documentation that use "chunk" as a verb - they all use it as a noun
English has a long and glorious tradition of verbing nouns, and nouning verbs. Group can mean the action of putting things into a group, join likewise refers to both the action of attaching two things and the seam or joint where they have been joined. Likewise for chunking: https://duckduckgo.com/html/?q=chunking

"Chunk" has been used as a verb since at least 1890 (albeit with a different meaning). None of my dictionaries give a date for the use of chunking to mean dividing something up into chunks, so that could be quite recent, but it's well-established in education (chunking as a technique for doing long division), psychology, linguistics and more. I remember using "chunking" as a verb to describe Hyperscript's text handling back in the mid 1980s, e.g. "word 2 of line 6 of text".

The nltk library handles chunk as both a noun and verb in a similar sense: http://www.nltk.org/howto/chunk.html
So if we went with this approach, then Carl Smith's suggestion of "str.delimit()" likely makes sense.
The problem with "delimit" is that in many contexts it refers to marking both the start and end boundaries, e.g. people often refer to string delimiters '...' and list delimiters [...]. That doesn't apply here, where we're adding separators between chunks/groups. The term delimiter can be used in various ways, and some of them do not match the behaviour we want here: http://stackoverflow.com/questions/9118769/when-to-use-the-terms-delimiter-t... In this case, we are not adding delimiters, we're adding separators. We're chunking (or grouping) characters by counting them, then separating the groups. The test here is what happens if the string is shorter than the group size? "xyz".chunk(5, '*') If we're delimiting the boundaries of the group, then I expect that we should get "*xyz*", but if we're separating groups, I expect that we should get "xyz" unchanged.
However, the other question worth asking is whether we might want a "string slice splitting" operation rather than a string delimiting option: once you have the slices, then combining them again with str.join is straightforward, but extracting the slices in the first place is currently a little fiddly (especially for the reversed case):
Let me think about that :-) -- Steve

On Tue, May 2, 2017 at 4:31 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF"
rchunk will be useful for money or other situations where we group from the right rather than from the left:
"$" + str(10**6).rchunk(3, ',') => returns "$1,000,000"
# Format mobile phone number in the Australian style "0412345678".rchunk((4, 3)) => returns "0412 345 678"
# Format an integer in the Indian style str(123456789).rchunk((3, 2), ",") => returns "12,34,56,789"
I like this general idea very much. Dealing with lakh and crore is a very nice feature (and one that the `.format()` mini-language sadly fails to handle; it assumes numeric delimiters can only be commas, and only ever three positions).

But I'm not sure the semantics you propose is flexible enough. I take it that the tuple means (<first-delimiter>, <other-delimiters>) from your examples. But I don't think that suffices for every common format. It would be fine to get a USA phone number like:

    str(4135559414).rchunk((4,3),'-')  # -> 413-555-9414

But for example, looking somewhat at random at an international call (https://en.wikipedia.org/wiki/Telephone_numbers_in_Belgium):

    Dialing from New York to Brussels: 011-32-2-555-12-12 - omitting the leading "0".

Maybe your API is for any length tuple, with the final element repeated. So I guess maybe this example could be:

    "0113225551212".rchunk((2,2,3,1,2,3),'-')

I don't care about this method being called .chunk() vs. .delimit() vs. something else.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
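
Purely to illustrate the semantics being debated, here is one possible reading of the tuple form - group sizes applied from the right, with the final element repeated - extending the single-size sketch given earlier; this is an assumption, not an agreed behaviour:

    def rchunk(s, sizes, delimiter=' '):
        """Group *s* from the right; *sizes* is an int or a tuple of group
        sizes whose last element repeats for whatever remains."""
        if isinstance(sizes, int):
            sizes = (sizes,)
        groups, end, it, size = [], len(s), iter(sizes), None
        while end > 0:
            size = next(it, size)  # reuse the final size once the tuple runs out
            groups.append(s[max(end - size, 0):end])
            end -= size
        return delimiter.join(reversed(groups))

    assert rchunk(str(123456789), (3, 2), ",") == "12,34,56,789"
    assert rchunk("0113225551212", (2, 2, 3, 1, 2, 3), '-') == "011-32-2-555-12-12"

Note that this reading reproduces the Indian-style and Belgian examples but not the Australian mobile example, which groups from the left - pinning down exactly this kind of detail is what the thread is discussing.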

The main reason for naming it `delimit` was to be consistent with the kwarg `delimiter`, so `str.delimit(index, delimiter)`. You could call it `chop` I guess, but I'm just bikeshedding, so will leave it while you guys figure out the important stuff.

-- Carl Smith carl.input@gmail.com

On 2 May 2017 at 18:48, David Mertz <mertz@gnosis.cx> wrote:
On Tue, May 2, 2017 at 4:31 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF"
rchunk will be useful for money or other situations where we group from the right rather than from the left:
"$" + str(10**6).rchunk(3, ',') => returns "$1,000,000"
# Format mobile phone number in the Australian style "0412345678".rchunk((4, 3)) => returns "0412 345 678"
# Format an integer in the Indian style str(123456789).rchunk((3, 2), ",") => returns "12,34,56,789"
I like this general idea very much. Dealing with lakh and crore is a very nice feature (and one that the `.format()` mini-language sadly fails to handle; it assumes numeric delimiters can only be commas, and only ever three positions).
But I'm not sure the semantics you propose is flexible enough. I take it that the tuple means (<first-delimiter>, <other-delimiters>) from your examples. But I don't think that suffices for every common format. It would be fine to get a USA phone number like:
str(4135559414).rchunk((4,3),'-') # -> 413-555-9414
But for example, looking somewhat at random at an international call (https://en.wikipedia.org/wiki/Telephone_numbers_in_Belgium)
Dialing from New York to Brussels: 011-32-2-555-12-12 - omitting the leading "0".
Maybe your API is for any length tuple, with the final element repeated. So I guess maybe this example could be:
"0113225551212".rchunk((2,2,3,1,2,3),'-')
I don't care about this method being called .chunk() vs. .delimit() vs. something else.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On 02/05/17 12:31, Steven D'Aprano wrote:
I disagree with this approach. There's nothing special about bytes.hex() here, perhaps we want to format the output of hex() or bin() or oct(), or for that matter "%x" and any of the other string templates?
In fact, this is a string operation that could apply to any character string, including decimal digits.
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF"
FWIW, I implemented a version of something similar as a fixed-length "chunk" method in itertoolsmodule.c (it was similar to izip_longest - it had a "fill" keyword to pad the final chunk). It was ~100 LOC including the structure definitions. The chunk method was an iterator (so it returned a sequence of "chunks" as defined by the API).

Then I read that "itertools" should consist of primitives only and that we should defer to "more_itertools" for anything that is of a higher level (which this is - it can be done in terms of itertools functions). So I didn't propose it, although the processing of my WAV files (in which the sample data are groups of bytes - frames - of a fixed length) was significantly faster with it :(

I also looked at implementing itertools.chunk as a function that would make use of a "__chunk__" method on the source object if it existed (which allowed a class to support an even more efficient version of chunking - things like range() etc).
I don't see any advantage to adding this to bytes.hex(), hex(), oct(), bin(), and I really don't think it is helpful to be grouping the characters by the number of bits. Its a string formatting operation, not a bit operation.
Why do you want to limit it to strings? Isn't something like this potentially useful for all sequences (where the result is a tuple of objects that are the same as the source sequence - be that strings or lists or lazy ranges or whatever)? Why aren't the chunks returned via an iterator?

E.
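
For reference, the pure-Python baseline Erik is comparing against is essentially the "grouper" recipe from the itertools documentation (mentioned elsewhere in the thread), built on zip_longest with a fill value for the final chunk - a minimal sketch:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        """Collect data into fixed-length chunks, padding the last chunk with *fillvalue*."""
        # zip_longest pulls from the *same* iterator n times per output tuple,
        # so each tuple holds the next n items of the original iterable.
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    assert list(grouper('b901ef', 2)) == [('b', '9'), ('0', '1'), ('e', 'f')]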

On Tue, May 02, 2017 at 11:39:48PM +0100, Erik wrote:
On 02/05/17 12:31, Steven D'Aprano wrote:
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
For the record, I now think the second argument should be called "sep", for separator, and I'm okay with Greg's suggestion we call the method "group".
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF" [...]
Why do you want to limit it to strings?
I'm not stopping anyone from proposing a generalisation of this that works with other sequence types. As somebody did :-) I've also been thinking about generalisations such as grouping lines into paragraphs, words into lines, etc. In text processing, chunking can refer to more than just characters. But here we have a specific, concrete use-case that involves strings. Anything else is YAGNI until a need is demonstrated :-)
Isn't something like this potentially useful for all sequences (where the result is a tuple of objects that are the same as the source sequence - be that strings or lists or lazy ranges or whatever?). Why aren't the chunks returned via an iterator?
String methods should return strings. That's not to argue against a generic iterator solution, but the barrier to use of an iterator solution is higher than just calling a method. You have to learn about importing, you need to know there is an itertools module (or a third party module to install first!), you have to know how to convert the iterator back to a string... -- Steve

On Tue, May 02, 2017 at 09:07:41PM -0400, Juancarlo Añez wrote:
On Tue, May 2, 2017 at 8:43 PM, Steven D'Aprano <steve@pearwood.info> wrote:
String methods should return strings.
"A-B-C".split("-") ['A', 'B', 'C']
Yes, thank you. And don't forget:

    py> 'abcd'.index('c')
    2

But in context, I was responding to the question of why this proposed chunk()/group() method returns a string rather than an iterator. I worded my answer badly, but the intention was clear, at least in my own head *wink*

Given that we were discussing a method that both groups the characters of a string and inserts the separators, it makes sense to return a string, like other string methods:

    'foo'.upper() returns 'FOO', not iter(['F', 'O', 'O'])
    'cheese and green eggs'.replace('green', 'red') returns a string, not iter(['cheese and ', 'red', ' eggs'])
    'xyz'.zfill(5) returns '00xyz' not iter(['00', 'xyz'])

etc, and likewise:

    'abcdef'.chunk(2, sep='-') should return 'ab-cd-ef' rather than iter(['ab', '-', 'cd', '-', 'ef'])

If we're talking about a different API, one where only the grouping is done and inserting separators is left for join(), then my answer will be different. In that case, then it is a matter of taste whether to return a list (like split()) or an iterator. I lean slightly towards returning a list, but I can see arguments for and against both.

-- Steve

On 03/05/17 01:43, Steven D'Aprano wrote:
On Tue, May 02, 2017 at 11:39:48PM +0100, Erik wrote:
On 02/05/17 12:31, Steven D'Aprano wrote:
Rather than duplicate the API and logic everywhere, I suggest we add a new string method. My suggestion is str.chunk(size, delimiter=' ') and str.rchunk() with the same arguments:
For the record, I now think the second argument should be called "sep", for separator, and I'm okay with Greg's suggestion we call the method "group".
"1234ABCDEF".chunk(4) => returns "1234 ABCD EF" [...]
Why do you want to limit it to strings?
I'm not stopping anyone from proposing a generalisation of this that works with other sequence types. As somebody did :-)
Who? I didn't spot that in the thread - please give a reference. Thanks.

Anyway, I know you can't stop anyone from *proposing* something like this, but as soon as they do you may decide to quote the recipe from "https://docs.python.org/3/library/functions.html#zip" and try to block their proposition. There are already threads on fora that do that.

That was my sticking point at the time when I implemented a general solution. Why bother to propose something that (although it made my code significantly faster) had already been blocked as being something that should be a python-level operation and not something to be included in a built-in?
String methods should return strings.
In that case, we need to fix this ASAP ;) :
'foobarbaz'.split('o') ['f', '', 'barbaz']
Where the result is reasonably a sequence, a method should return a sequence (but I would agree that it should generally be a sequence of objects of the source type - which I think is what I effectively said: "Isn't something like this potentially useful for all sequences (where the result is a [sequence] of objects that are the same [type] as the source sequence)").
That's not to argue against a generic iterator solution, but the barrier to use of an iterator solution is higher than just calling a method.
Knowing which sequence classes have a "chunk" method and which don't is a higher barrier than knowing that all sequences can be "chunked" by a single imported function. E.

On 3 May 2017 at 02:48, Erik <python@lucidity.plus.com> wrote:
Anyway, I know you can't stop anyone from *proposing* something like this, but as soon as they do you may decide to quote the recipe from "https://docs.python.org/3/library/functions.html#zip" and try to block their proposition. There are already threads on fora that do that.
That was my sticking point at the time when I implemented a general solution. Why bother to propose something that (although it made my code significantly faster) had already been blocked as being something that should be a python-level operation and not something to be included in a built-in?
It sounds like you have a reasonable response to the suggestion of using zip - that you have a use case where performance matters, and your proposed solution is of value in that case. Whether it's a *sufficient* response remains to be seen, but unless you present the argument we won't know.

IMO, the idea behind itertools being building blocks is not to deter proposals for new tools, but to make sure that people focus on providing important low-level tools, and not on high level operations that can just as easily be written using those tools - essentially the guideline "not every 3-line function needs to be a builtin". So it's to make people think, not to block innovation.

Hope this clarifies,
Paul

Hi Paul,

On 03/05/17 08:57, Paul Moore wrote:
On 3 May 2017 at 02:48, Erik <python@lucidity.plus.com> wrote:
Anyway, I know you can't stop anyone from *proposing* something like this, but as soon as they do you may decide to quote the recipe from "https://docs.python.org/3/library/functions.html#zip" and try to block their proposition. There are already threads on fora that do that.
That was my sticking point at the time when I implemented a general solution. Why bother to propose something that (although it made my code significantly faster) had already been blocked as being something that should be a python-level operation and not something to be included in a built-in?
It sounds like you have a reasonable response to the suggestion of using zip - that you have a use case where performance matters, and your proposed solution is of value in that case.
I don't think so, though. I had a use-case where splitting an iterable into a sequence of same-sized chunks efficiently improved the performance of my code significantly (processing a LOT of 24-bit, multi-channel - 16 to 32 - PCM streams from a WAV file).

Having thought "I need to split this stream by a fixed number of bytes" and then found more_itertools.chunked() (and the zip_longest(*([iter(foo)] * num)) trick) it turned out they were not quick enough so I implemented itertools.chunked() in C. That worked well for me, so when I was done I did a search in case it was worth proposing as an enhancement to feed it back to the community.

Then I came across things such as the following: http://bugs.python.org/issue6021

I am specifically referring to the "It has been rejected before" comment, also mentioned here: https://mail.python.org/pipermail/python-dev/2012-July/120885.html

See this entire thread, too: https://mail.python.org/pipermail/python-ideas/2012-July/015671.html

This is the reason why I really just didn't care enough to go through the process of proposing it in the end (even though the more_itertools.chunked function was one of the first 3 implemented in V1.0 and seems to _still_ be cropping up all the time in different guises - so is perhaps more fundamental than people recognise). The strong implication of the discussions linked to above is that if it had been mentioned before it would be immediately rejected, and that was supported by several members of the community in good standing.

So I didn't propose it. I have no idea now what I spent my saved hours doing, but I imagine that it was fun.
Whether it's a *sufficient* response remains to be seen, but unless you present the argument we won't know.
Summary: I didn't present the argument because I'm not a masochist.

Regards,
E.

On Thu, May 04, 2017 at 12:13:25AM +0100, Erik wrote:
I had a use-case where splitting an iterable into a sequence of same-sized chunks efficiently improved the performance of my code [...] So I didn't propose it. I have no idea now what I spent my saved hours doing, but I imagine that it was fun
Summary: I didn't present the argument because I'm not a masochist
I'm not sure what the point of that anecdote was, unless it was "I wrote some useful code, and you missed out".

Your comments come across as a passive-aggressive chastisement of the core devs and the Python-Ideas community for being too quick to reject useful code: we missed out on something good, because you don't have the time or energy to deal with our negativity and knee-jerk rejection of everything good. That's the way your series of posts come across to me.

Not every piece of useful code has to go into the std lib, and even if it should, it doesn't necessarily have to go into it from day 1. If you wanted to give back to the community, there are a number of options apart from "std lib or nothing":

- you could have offered it to the more_itertools project;
- you could have published it on PyPy;
- you could have proposed it on Python-Ideas with an explicit statement that you didn't have the time or energy to get into a debate about including the function, "here's my implementation and an appropriate licence for you to use it: use it yourself, or if somebody else wants to champion putting it into the std lib, go right ahead, but I won't";

and possibly more. I'm not suggesting that you have any obligation to do any of these things, but you don't *have* to get into a long-winded, energy-sapping debate over inclusion unless you *really* care about having it added. If you care so little that you can't be bothered even to propose it, why do you care if it is rejected?

-- Steve

On 04/05/17 01:24, Steven D'Aprano wrote:
On Thu, May 04, 2017 at 12:13:25AM +0100, Erik wrote:
I had a use-case where splitting an iterable into a sequence of same-sized chunks efficiently improved the performance of my code [...] So I didn't propose it. I have no idea now what I spent my saved hours doing, but I imagine that it was fun
Summary: I didn't present the argument because I'm not a masochist
I'm not sure what the point of that anecdote was, unless it was "I wrote some useful code, and you missed out".
Then you have misunderstood me. Paul suggested that my use-case (chunking could be faster) was perhaps enough to propose that my patch may be considered. I responded with historical/empirical evidence that perhaps that would actually not be the case. I was responding, honestly, to the questions raised by Paul's email.
Your comments come across as a passive-aggressive chastisement of the core devs and the Python-Ideas community for being too quick to reject useful code: we missed out on something good, because you don't have the time or energy to deal with our negativity and knee-jerk rejection of everything good. That's the way your series of posts come across to me.
I apologise if my words or my turn of phrase do not appeal to you. I am trying to be constructive with everything I post. If you choose to interpret my messages in a different way then I'm not sure what I can do about that. Back to the important stuff though:
- you could have offered it to the more_itertools project;
A more efficient version of more_itertools.chunked() is what we're talking about.
- you could have published it on PyPy;
Does PyPy support C extension modules? If so, that's a possibility.
- you could have proposed it on Python-Ideas with an explicit statement
I may well do that - my current patch (because of when I did it) is against a Py2 codebase, but I could port it to Py3. I still have a nagging doubt that I'd be wasting my time though ;)
If you care so little that you can't be bothered even to propose it, why do you care if it is rejected?
You are mistaking not caring enough about the functionality for not caring enough to enter into an argument about including that functionality ... I didn't propose it at the time because of the reasons I mentioned. But when I saw something being discussed yet again that I had a general solution for already written, I thought I'd mention it in case it was useful. As I said, I'm _trying_ to be constructive.

E.

On Wed, May 03, 2017 at 02:48:03AM +0100, Erik wrote:
On 03/05/17 01:43, Steven D'Aprano wrote:
I'm not stopping anyone from proposing a generalisation of this that works with other sequence types. As somebody did :-)
Who? I didn't spot that in the thread - please give a reference. Thanks.
https://mail.python.org/pipermail/python-ideas/2017-May/045568.html [...]
Knowing which sequence classes have a "chunk" method and which don't is a higher barrier than knowing that all sequences can be "chunked" by a single imported function.
At the moment, we're only talking about strings. That's the only actual use-case that has been presented so far. Everything else is at best Nice To Have, if not YAGNI.

Let's not kill this idea by over-generalising it. We can always extend the idea in the future once it is proven. Or for those who really want a general purpose group-any-iterable function, it can start life as a third party module, and we can discuss adding it to the language when it is mature and the kinks are ironed out.

-- Steve
participants (13)

- Alexandre Brault
- Carl Smith
- David Mertz
- Erik
- Ethan Furman
- Greg Ewing
- Joao S. O. Bueno
- Juancarlo Añez
- Nick Coghlan
- Paul Moore
- robert.hoelzl@posteo.de
- Steven D'Aprano
- Terry Reedy