[Python-ideas] Run length encoding

Nick Coghlan ncoghlan at gmail.com
Sun Jun 11 00:17:56 EDT 2017


On 11 June 2017 at 13:27, Joshua Morton <joshua.morton13 at gmail.com> wrote:
> David: You're absolutely right, s/2/3 in my prior post!
>
> Neal: As for why zip (at first I thought you meant the zip function, not the
> zip compression scheme) is included and rle is not, zip is (or was), I
> believe, used as part of python's packaging infrastructure, hopefully
> someone else can correct me if that's untrue.

There are a variety of reasons why things end up in the standard library:

- they're part of how the language works (e.g. contextlib, types,
operator, numbers, pickle, abc, copy, sys, importlib, zipimport)
- they're modeling core computing concepts (e.g. math, cmath, decimal,
fractions, re, encodings)
- they're widely useful, and harder to get right than they may first
appear (e.g. collections, itertools, secrets, statistics, struct,
shelve)
- they're widely useful (or at least once were), and just plain hard
to get right (e.g. ssl, http, json, xml, xmlrpc, zip, bz2)
- they provide cross-platform abstractions of operating system level
interfaces (e.g. os, io, shutil, pathlib)
- they're helpful in enabling ecosystem level interoperability between
tools (e.g. logging, unittest, asyncio)
- they're part of the way *nix systems work (e.g. grp, pwd)
- they're useful in working on other parts of the standard library
(e.g. unittest.mock, enum)

"zip" and the other compression libraries check a couple of those
boxes: they're broadly useful *and* they're needed in other parts of
the standard library (e.g. lots of network protocols include
compression support, we support importing from zip archives, and we
support generating them through distutils, shutil, and zipapp).
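
To make that concrete (the directory and module names below are
hypothetical, just for illustration), shutil can build a zip archive
and the import system can then load modules straight out of it:

    import shutil
    import sys

    # bundle the contents of a hypothetical "src" directory into bundle.zip
    archive = shutil.make_archive("bundle", "zip", root_dir="src")

    # zipimport handles archives on sys.path automatically
    sys.path.insert(0, archive)
    import mymodule   # assumes src/mymodule.py existed before archiving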

Run-length-encoding, on the other hand, is one of those things where
the algorithm is pretty simple, and you're almost always going to be
able to get the best results by writing an implementation tailored
specifically to your use case, rather than working through a general
purpose abstraction like the iterator protocol. Even when that isn't
the case, implementing your own utility function is still going to be
competitive time-wise with finding and using a standard
implementation.
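
For example, an encoder that only needs to handle strings (a purely
hypothetical sketch, not part of any proposal here) can be written
directly against the data, without going through a general purpose
building block like groupby:

    def rle_string(text):
        # "aaabcc" -> [("a", 3), ("b", 1), ("c", 2)]
        runs = []
        for char in text:
            if runs and runs[-1][0] == char:
                # extend the current run
                runs[-1] = (char, runs[-1][1] + 1)
            else:
                # start a new run
                runs.append((char, 1))
        return runs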

I suspect the main long term value of offering a recommended
implementation as an itertools recipe wouldn't be in using it
directly, but rather in making it easier for people to:

- find the recipe if they already know the phrase "run length encoding"
- test their own implementations that are more tailored to their
specific use case

The one itertools building block I do sometimes wish we had was a
`counted_len` helper that:

- used `len(obj)` if the given object implemented `__len__`
- fell back to `sum(1 for __ in obj)` otherwise

Otherwise you have to make the choice between:

- use `len(obj)`, and only support sequences
- use `len(list(obj))` and potentially make a pointless copy
- use `sum(1 for __ in obj)` and ignore the possible O(1) fast path
- write your own `counted_len` helper:

    def counted_len(iterable):
        try:
            return len(iterable)
        except TypeError:
            pass
        return sum(1 for __ in iter(iterable))

If there were an itertools.counted_len function then the obvious option
would be "use itertools.counted_len". Such a function would also
become the obvious way to consume an iterator when you don't care
about the individual results - you'd just process them all, and get
the count of how many iterations happened.
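
Using the helper above, that consume-and-count pattern would look
something like this (the values are purely illustrative):

    >>> counted_len([10, 20, 30])   # sized container: O(1) via len()
    3
    >>> counted_len(x for x in range(5) if x % 2)   # generator: counted by iteration
    2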

Given such a helper, the recipe for run-length-encoding would then be:

    from itertools import groupby

    def run_length_encode(iterable):
        return ((k, counted_len(g)) for k, g in groupby(iterable))
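
and a quick interactive check (with an arbitrary example string) would
look like:

    >>> list(run_length_encode("aaabccccd"))
    [('a', 3), ('b', 1), ('c', 4), ('d', 1)]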

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

