[Python-ideas] Add an option for delimiters in bytes.hex()

Tue May 2 09:45:35 EDT 2017

On 2 May 2017 at 21:31, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, May 01, 2017 at 11:38:20PM +1000, Nick Coghlan wrote:
>> However, a much simpler alternative would be to just support two
>> keyword arguments to hex(): "delimiter" (as you suggest) and
>> "chunk_size" (defaulting to 1, so you get per-byte chunking by
>> default)
>
> I disagree with this approach. There's nothing special about bytes.hex()
> here, perhaps we want to format the output of hex() or bin() or oct(),
> or for that matter "%x" and any of the other string templates?
>
> In fact, this is a string operation that could apply to any character
> string, including decimal digits.
>
> Rather than duplicate the API and logic everywhere, I suggest we add a
> new string method. My suggestion is str.chunk(size, delimiter=' ') and
> str.rchunk() with the same arguments:
>
> "1234ABCDEF".chunk(4)
> => returns "1234 ABCD EF"
>
> rchunk will be useful for money or other situations where we group from
> the right rather than from the left:
>
> "$" + str(10**6).rchunk(3, ',')
> => returns "$1,000,000"

Nice. That proposal also addresses one of the problems I raised in the
issue tracker, which is that the decimal equivalent to hex/oct/bin is
just str, so anything based on keyword arguments to the display
functions is hard to apply to ordinary decimal numbers.
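
To make that concrete, here is a minimal sketch of the proposed behaviour as free functions. Neither str.chunk() nor str.rchunk() exists; the names and signatures below simply mirror Steven's proposal and are assumptions for illustration only. The point is that a string-level operation treats hex, binary and decimal digits identically:

```python
def chunk(text, size, delimiter=" "):
    """Group text into size-character blocks from the left (hypothetical)."""
    return delimiter.join(text[i:i + size] for i in range(0, len(text), size))


def rchunk(text, size, delimiter=" "):
    """Group from the right, so any short block comes first (hypothetical)."""
    offset = len(text) % size
    head = [text[:offset]] if offset else []
    tail = [text[i:i + size] for i in range(offset, len(text), size)]
    return delimiter.join(head + tail)


# The same helpers work on any digit string, whatever produced it.
print(chunk("1234ABCDEF", 4))                   # 1234 ABCD EF
print("$" + rchunk(str(10**6), 3, ","))         # $1,000,000
print(rchunk(format(0xDEADBEEF, "x"), 4, "_"))  # dead_beef
print(rchunk(format(1000, "b"), 4))             # 11 1110 1000
```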

Attempting to align the terminology with existing string methods and
other stdlib APIs:

1. the programming FAQ uses "chunks" as the accumulation variable
prior to calling str.join():
https://docs.python.org/3/faq/programming.html#what-is-the-most-efficient-way-to-concatenate-many-strings-together
2. the most analogous itertools recipe is the "grouper" recipe, which
describes its purpose as "Collect data into fixed-length chunks or
blocks"
3. there's a top level "chunk" module for working with audio file
formats (today-I-learned...)
4. multiprocessing uses "chunksize" to manage the dispatching of work
to worker processes
5. various networking, IO and serialisation libraries use "chunk" to
describe data blocks for incremental reads and writes

I think a couple of key problems are illustrated by that survey:

1. we don't have any current APIs or documentation that use "chunk" in
combination with any kind of delimiter
2. we don't have any current APIs or documentation that use "chunk" as
a verb - they all use it as a noun

So if we went with this approach, then Carl Smith's suggestion of
"str.delimit()" likely makes sense.

However, the other question worth asking is whether we might want a
"string slice splitting" operation rather than a string delimiting
operation: once you have the slices, then combining them again with
str.join is straightforward, but extracting the slices in the first
place is currently a little fiddly (especially for the reversed case):

    def splitslices(self, size):
        return [self[start:start+size] for start in range(0, len(self), size)]

    def rsplitslices(self, size):
        # Group from the right: any short block ends up at the front
        offset = len(self) % size
        blocks = [self[:offset]] if offset else []
        blocks.extend(self[start:start+size]
                      for start in range(offset, len(self), size))
        return blocks

Given those methods, the split-and-rejoin use case that started the
thread would look like:

    " ".join("1234ABCDEF".splitslices(4))
    => "1234 ABCD EF"

    "$" + ",".join(str(10**6).rsplitslices(3))
    => "$1,000,000"

Which is the same pattern that can be used to change a delimiter with
str.split() and str.splitlines().
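
For comparison, a small sketch of that pattern side by side. The str.split() and str.splitlines() calls are existing methods; split_slices is a hypothetical free function standing in for the proposed method, and the sample data is made up for illustration:

```python
def split_slices(text, size):
    """Return consecutive size-character slices of text (hypothetical helper)."""
    return [text[start:start + size] for start in range(0, len(text), size)]


# Existing methods: split on the old delimiter, join with the new one.
pipe_row = " | ".join("alpha,beta,gamma".split(","))
one_line = "; ".join("first\nsecond\r\nthird".splitlines())

# Same shape for the use case that started the thread: bytes.hex() output.
hex_pairs = ":".join(split_slices(b"\xde\xad\xbe\xef".hex(), 2))

print(pipe_row)   # alpha | beta | gamma
print(one_line)   # first; second; third
print(hex_pairs)  # de:ad:be:ef
```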

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia