Add a .chunks() method to sequences

Steven D'Aprano gave me an idea (in the bytes.hex delimiter discussion): I very often have the use case that I want to split sequences into subsequences of the same size. How about adding a chunks() and rchunks() function to sequences:

    [1,2,3,4,5,6,7].chunks(3) => [[1,2,3], [4,5,6], [7]]
    "1234".chunks(2) => ["12", "34"]

(This could then be used to emulate Steven's proposal: " ".join("1234567".chunks(2)) => "12 34 56 7")

Robert

On Tue, May 2, 2017 at 8:10 AM <robert.hoelzl@posteo.de> wrote:
Changing the definition of the Sequence ABC to avoid needing to use a 2-line function from the itertools recipes seems like a pretty drastic change. I don't think there's even a compelling argument for adding grouper() to itertools, let alone to every single sequence.
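For reference, the two-line recipe being referred to here is presumably grouper() from the recipes section of the itertools documentation; unlike the chunks() proposal above, it pads the final group with a fill value rather than returning a shorter chunk:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    # list(grouper('ABCDEFG', 3, 'x')) -> [('A','B','C'), ('D','E','F'), ('G','x','x')]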

On 2 May 2017 at 22:10, <robert.hoelzl@posteo.de> wrote:
While there may be a case for a "splitslices()" method on strings for text formatting purposes, that case is weaker for generic sequences - we don't offer general purpose equivalents to split() or partition() either. That said, the possibility of a sequence or container focused counterpart to "itertools" has come up before, and it's conceivable such an algorithm might find a home there. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

How about adding a chunks() and rchunks() function to sequences:
[1,2,3,4,5,6,7].chunks(3) => [[1,2,3], [4,5,6], [7]]
I prefer the str.join() approach: write a single chunks() function which takes a sequence, instead of modifying all sequence types around the world ;-) It's less natural to write chunks(seq, n), but it's much simpler to implement and will work on all Python versions. Enjoy! Victor
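A minimal sketch of such a standalone function, plus a companion rchunks(); neither name exists in the stdlib, the input is assumed to support len() and slicing, and since the thread never spells out rchunks() semantics, this assumes, by analogy with rsplit(), that any short chunk ends up at the front:

    def chunks(seq, n):
        # Successive slices of length n; the last one may be shorter.
        return [seq[i:i + n] for i in range(0, len(seq), n)]

    def rchunks(seq, n):
        # Group from the right, so a short chunk (if any) comes first.
        rem = len(seq) % n
        head = [seq[:rem]] if rem else []
        return head + [seq[i:i + n] for i in range(rem, len(seq), n)]

    # chunks([1, 2, 3, 4, 5, 6, 7], 3)  -> [[1, 2, 3], [4, 5, 6], [7]]
    # rchunks([1, 2, 3, 4, 5, 6, 7], 3) -> [[1], [2, 3, 4], [5, 6, 7]]
    # " ".join(chunks("1234567", 2))    -> '12 34 56 7'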

Victor Stinner wrote:
Even if a general sequence-chunking function is thought useful, it might be worth providing a special-purpose one as a string method in the interests of efficiency. Splitting a string into a sequence of characters, messing with it and then joining it back into a string is a pretty expensive way to do things.

While most uses would probably be for short strings, I can think of use cases involving large ones. For example, to format a hex dump into lines with 8 bytes per line and spaces between the bytes:

    data.group(2, ' ').group(24, '\n')

And even for short strings, processing lots of them in a loop could get expensive with the general-purpose approach. -- Greg
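To illustrate the behaviour Greg has in mind, here is a minimal sketch that emulates the hypothetical str.group(n, sep) method as a plain function (group() is not an existing str method; the sample data is arbitrary):

    def group(s, n, sep):
        # Join successive n-character slices of s with sep.
        return sep.join(s[i:i + n] for i in range(0, len(s), n))

    # Hex dump of 32 bytes: space-separated bytes, 8 bytes per 24-character line.
    data = bytes(range(32))
    print(group(group(data.hex(), 2, ' '), 24, '\n'))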

On 5 May 2017 at 08:20, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I don't think performance is a good argument for combining the split/join operations, but I do think it's a decent argument for offering native support for decomposition of regularly structured data without the conceptual complexity of going through memoryview and reshaping the data that way.

That is, approaching the problem of displaying regular data from a "formatting text" point of view would look like:

    BYTES_PER_LINE = 8
    DIGITS_PER_BYTE = 2
    hex_lines = data.hex().splitgroups(BYTES_PER_LINE * DIGITS_PER_BYTE)
    for line in hex_lines:
        print(' '.join(line.splitgroups(DIGITS_PER_BYTE)))

This has the benefit of working with the hex digits directly, so it doesn't specifically require access to the original data. By being a string method, it can handle all the complexities of str's variable width internal storage, while still taking advantage of direct access to those data structures.

The corresponding "data view" mindset would be:

    NUM_LINES, remainder = divmod(len(data), BYTES_PER_LINE)
    if remainder:
        ...  # Pad the data with zeros to give it a regular shape
    view = memoryview(data).cast('c', (NUM_LINES, BYTES_PER_LINE))
    for row in view.tolist():
        print(' '.join(entry.hex() for entry in row))

This approaches the task at hand as an array-rendering issue rather than as a string or bytes formatting problem, but that's actually a pretty big mental leap to make if you're thinking of your input as a stream of text to be formatted rather than as an array of data to be displayed.

And then, given the proposed str.splitgroups() on the one hand and the existing memoryview.cast() on the other, offering itertools.itergroups() as a corresponding building block specifically for working with streams of regular data would make sense to me - that's a standard approach in time-division multiplexing protocols, and it also shows up in areas like digital audio processing (where you're often doing things like shuffling incoming data chunks into FFT buffers).

Cheers, Nick.

P.S. As evidence for "this is a problem for memoryview" being a tricky leap to make, note that it's the first time it has come up in this thread as a possibility, even though it already works in at least 3.5+:

    >>> data = b'abcdefghijklmnop'
    >>> data.hex()
    '6162636465666768696a6b6c6d6e6f70'
    >>> view = memoryview(data).cast('c', (4, 4))
    >>> for row in view.tolist():
    ...     print(' '.join(entry.hex() for entry in row))
    ...
    61 62 63 64
    65 66 67 68
    69 6a 6b 6c
    6d 6e 6f 70

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
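To run the "formatting text" snippet above today, the hypothetical str.splitgroups() can be emulated as a plain function (splitgroups() is not an existing str method; data here is just sample input):

    def splitgroups(s, n):
        # Successive substrings of length n; the last one may be shorter.
        return [s[i:i + n] for i in range(0, len(s), n)]

    BYTES_PER_LINE = 8
    DIGITS_PER_BYTE = 2
    data = b'abcdefghijklmnop'

    for line in splitgroups(data.hex(), BYTES_PER_LINE * DIGITS_PER_BYTE):
        print(' '.join(splitgroups(line, DIGITS_PER_BYTE)))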

Hi Nick,

On 05/05/17 08:29, Nick Coghlan wrote:
It looks to me like your "itertools.itergroups()" is similar to more_itertools.chunked() - with at least one obvious change, see below (*).

If anyone wants to pursue this (or any itertools) enhancement, then please be aware of the following thread (and in particular the message being linked to - and the bug and discussion that it is replying to): https://mail.python.org/pipermail/python-dev/2012-July/120885.html

I have been told off for bringing this up already, but I do so again in direct response to your suggestion because it seems there is a bar to getting something included in itertools, and something like "chunked()" has already failed to make it. The thing to do is probably to talk directly to Raymond to see if there's an acceptable solution before too much work is put into something that may be rejected as being too high level.

It may be that a C version of "more_itertools" for the things where people would find a speedup useful might be a solution (where the more_itertools package defers to those built-ins if they exist on the version of Python it's executing on, and otherwise falls back to its existing implementation). I am not suggesting implementing the _whole_ of more_itertools in C - it's quite large now.

(*) I had implemented itertools.chunked in C before (also for audio processing, as it happens), and one thing that I didn't like is the way strings get unpacked:
>>> tuple(more_itertools.chunked("foo bar baz", 2))
(['f', 'o'], ['o', ' '], ['b', 'a'], ['r', ' '], ['b', 'a'], ['z'])
If the chunked/itergroups method checked for the presence of a __chunks__ or similar dunder method in the source sequence which returns an iterator, then the string class could efficiently yield substrings rather than individual characters which then had to be wrapped in a list or tuple (which I think is what you wanted itergroups() to do):
>>> tuple(itertools.chunked("foo bar baz", 2))
('fo', 'o ', 'ba', 'r ', 'ba', 'z')
Similarly, for objects which _represent_ a lot of data but do not actually hold those data literally (for example, range objects or even memoryviews), the returned chunks can also be representations of the data (subranges or subviews) and not the actual rendered data. For example, the existing:

>>> tuple(more_itertools.chunked(range(10), 3))
([0, 1, 2], [3, 4, 5], [6, 7, 8], [9])

becomes:

>>> tuple(itertools.chunked(range(10), 3))
(range(0, 3), range(3, 6), range(6, 9), range(9, 10))
Obviously, with those short strings and ranges one could argue that there's no point, but the principle of doing it this way scales better than the version that collects all of the data in lists - for things like chunks of some sort of "view" object, you would still only have the actual data stored once, in the original object.

I suppose that one thing to consider is what happens when an iterator is passed to the chunked() function. An iterator could have a __chunks__ method which returned chunks of the source sequence from the existing point in the iteration; however, the difference between such an iterator and one that _doesn't_ have a __chunks__ method is that in the second case the iterator would be consumed by the fall-back code (which just does what more_itertools.chunked() does now), but in the first it would not. Perhaps there is a precedent for that particular edge case with iterators in a different context.

Hope that helps, E.
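A minimal sketch of the dispatch Erik describes, assuming a hypothetical __chunks__ protocol (neither __chunks__ nor itertools.chunked() exists today; the generic fallback mirrors what more_itertools.chunked() does):

    from itertools import islice

    def chunked(iterable, n):
        # Hypothetical protocol: let the type produce its own chunks, e.g. a
        # str could yield substrings and a range could yield subranges.
        chunks_hook = getattr(type(iterable), '__chunks__', None)
        if chunks_hook is not None:
            return chunks_hook(iterable, n)
        # Generic fallback: consume the iterable n items at a time.
        def generic():
            it = iter(iterable)
            while True:
                chunk = list(islice(it, n))
                if not chunk:
                    return
                yield chunk
        return generic()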

participants (6)
- Erik
- Geoffrey Spear
- Greg Ewing
- Nick Coghlan
- robert.hoelzl@posteo.de
- Victor Stinner