[Python-ideas] Add a .chunks() method to sequences

Fri May 5 03:29:55 EDT 2017

On 5 May 2017 at 08:20, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Victor Stinner wrote:
>>
>> I prefer str.join() approach: write a single chunks() function which
>> takes a sequence, instead of modifying all sequence types around the
>> world ;-)
>
>
> Even if a general sequence-chunking function is thought useful,
> it might be worth providing a special-purpose one as a string
> method in the interests of efficiency. Splitting a string into
> a sequence of characters, messing with it and then joining it
> back into a string is a pretty expensive way to do things.
>
> While most uses would probably be for short strings, I can
> think of use cases involving large ones. For example, to
> format a hex dump into lines with 8 bytes per line and spaces
> between the bytes:
>
>    data.group(2, ' ').group(24, '\n')
>
> And even for short strings, processing lots of them in a loop
> could get expensive with the general-purpose approach.

I don't think performance is a good argument for combining the
split/join operations, but I do think it's a decent argument for
offering native support for decomposition of regularly structured data
without the conceptual complexity of going through memoryview and
reshaping the data that way.

That is, approaching the problem of displaying regular data from a
"formatting text" point of view would look like:

    BYTES_PER_LINE = 8
    DIGITS_PER_BYTE = 2
    hex_lines = data.hex().splitgroups(BYTES_PER_LINE * DIGITS_PER_BYTE)
    for line in hex_lines:
        print(' '.join(line.splitgroups(DIGITS_PER_BYTE)))

This has the benefit of working with the hex digits directly, so it
doesn't specifically require access to the original data.

By being a string method, it can handle all the complexities of str's
variable width internal storage, while still taking advantage of
direct access to those data structures.
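
As a rough illustration of the intended behaviour, here is a pure-Python
stand-in. str.splitgroups() does not exist today, so the function name,
its signature, and the choice to leave a short final group unpadded are
all assumptions made for this sketch, not a specification:

```python
def splitgroups(text, size):
    """Hypothetical stand-in for the proposed str.splitgroups() method."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    # Slice at regular offsets; the final group may be shorter than size
    # (an assumption, the proposal does not settle this).
    return [text[i:i + size] for i in range(0, len(text), size)]


BYTES_PER_LINE = 8
DIGITS_PER_BYTE = 2

data = bytes(range(20))  # sample input chosen for the sketch
hex_lines = splitgroups(data.hex(), BYTES_PER_LINE * DIGITS_PER_BYTE)
rendered = [' '.join(splitgroups(line, DIGITS_PER_BYTE)) for line in hex_lines]
for line in rendered:
    print(line)
```

A real str method could avoid building the intermediate slices in Python
code, which is where the efficiency argument comes from; the stand-in only
shows the shape of the API.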

The corresponding "data view" mindset would be:

    NUM_LINES, remainder = divmod(len(data), BYTES_PER_LINE)
    if remainder:
        # Pad the data with zeros to give it a regular shape
        data += bytes(BYTES_PER_LINE - remainder)
        NUM_LINES += 1
    view = memoryview(data).cast('c', (NUM_LINES, BYTES_PER_LINE))
    for row in view.tolist():
        print(' '.join(entry.hex() for entry in row))

This approaches the task at hand as an array-rendering issue rather
than as a string or bytes formatting problem, but that's actually a
pretty big mental leap to make if you're thinking of your input as a
stream of text to be formatted rather than as an array of data to be
displayed.

Then, given the proposed str.splitgroups() on the one hand and the
existing memoryview.cast() on the other, offering
itertools.itergroups() as a corresponding building block specifically
for working with streams of regular data would make sense to me.
That's a standard approach in time-division multiplexing protocols,
and it also shows up in areas like digital audio processing (where
you're often doing things like shuffling incoming data chunks into
FFT buffers).
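
A minimal sketch of what such a building block might look like, built on
itertools.islice. itertools.itergroups() is only a suggestion at this
point, so the name, the tuple return type, and the handling of a short
final group are assumptions, and the two-channel sample stream is made up
for the example:

```python
from itertools import islice


def itergroups(iterable, size):
    """Hypothetical stand-in for the suggested itertools.itergroups()."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    iterator = iter(iterable)
    while True:
        # Pull up to `size` items; an empty tuple means the stream is done.
        group = tuple(islice(iterator, size))
        if not group:
            return
        yield group


# Demultiplex an interleaved two-channel stream (L, R, L, R, ...).
samples = iter([10, 20, 11, 21, 12, 22])
frames = list(itergroups(samples, 2))
left = [frame[0] for frame in frames]
right = [frame[1] for frame in frames]
print(left, right)
```

Because it only relies on the iterator protocol, this works on streams of
unknown length, which neither str.splitgroups() nor memoryview.cast() can
handle.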

Cheers,
Nick.

P.S. As evidence for "this is a problem for memoryview" being a tricky
leap to make, note that it's the first time it has come up in this
thread as a possibility, even though it already works in at least
3.5+:

    >>> data = b'abcdefghijklmnop'
    >>> data.hex()
    '6162636465666768696a6b6c6d6e6f70'
    >>> view = memoryview(data).cast('c', (4, 4))
    >>> for row in view.tolist():
    ...     print(' '.join(entry.hex() for entry in row))
    ...
    61 62 63 64
    65 66 67 68
    69 6a 6b 6c
    6d 6e 6f 70

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia