Mailman 3 Run length encoding - Python-ideas

Run length encoding

older
ImportError raised for a circular...

Neal Fultz

June 11, 2017

2:20 a.m.

Hello python-ideas, I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :) The general idea is to be able to go back and forth between two representations of a sequence: [1,1,1,1,2,3,4,4,3,3,3] and [(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)] where the first element is the data element, and the second is how many times it is repeated. I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts. Best, -Neal

Attachments:

attachment.htm (text/html — 2.4 KB)

Show replies by date

Bernardo Sulzbach

June 2017

2:55 a.m.

On 2017-06-10 23:20, Neal Fultz wrote:

...

Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

|[1,1,1,1,2,3,4,4,3,3,3]|

and

|[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]|

where the first element is the data element, and the second is how many times it is repeated.

We can currently do it like this in one line: [(k, sum(1 for _ in g)) for k, g in groupby(sequence)] However, it is slower than a "dedicated" solution. Additionally, I don't know if what you are proposing is generic enough for the standard library. -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com

David Mertz

2:59 a.m.

Here's a one-line version: from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])]) Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`. On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...

Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Joshua Morton

3:12 a.m.

Another is [(k, len(list(g))) for k, g in groupby(l)] It might be worth adding it to the list of recipies either at https://docs.python.org/2/library/itertools.html#itertools.groupby or at https://docs.python.org/2/library/itertools.html#recipes, though. On Sat, Jun 10, 2017 at 8:07 PM David Mertz <mertz@gnosis.cx> wrote:

...

Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

David Mertz

3:20 a.m.

God no! Not in the Python 2 docs! ... if the recipe belongs somewhere it's in the Python 3 docs. Although, I suppose it could go under 2 also, since it's not actually a behavior change in the feature-frozen interpreter. But as a Python instructor (and someone who remembers the cool new features of Python 1.5 over 1.4 pretty well), my attitude about Python 2 is "kill it with fire!" Your spelling of the one-liner is prettier, shorter, and more intuitive than mine, and the same speed. On Sat, Jun 10, 2017 at 8:12 PM, Joshua Morton <joshua.morton13@gmail.com> wrote:

...

Another is

[(k, len(list(g))) for k, g in groupby(l)]

It might be worth adding it to the list of recipies either at https://docs.python.org/2/library/itertools.html#itertools.groupby or at https://docs.python.org/2/library/itertools.html#recipes, though.

On Sat, Jun 10, 2017 at 8:07 PM David Mertz <mertz@gnosis.cx> wrote:

...
Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Joshua Morton

3:27 a.m.

David: You're absolutely right, s/2/3 in my prior post! Neal: As for why zip (at first I thought you meant the zip function, not the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue. --Josh On Sat, Jun 10, 2017 at 8:20 PM David Mertz <mertz@gnosis.cx> wrote:

...

God no! Not in the Python 2 docs! ... if the recipe belongs somewhere it's in the Python 3 docs. Although, I suppose it could go under 2 also, since it's not actually a behavior change in the feature-frozen interpreter. But as a Python instructor (and someone who remembers the cool new features of Python 1.5 over 1.4 pretty well), my attitude about Python 2 is "kill it with fire!"

Your spelling of the one-liner is prettier, shorter, and more intuitive than mine, and the same speed.

On Sat, Jun 10, 2017 at 8:12 PM, Joshua Morton <joshua.morton13@gmail.com> wrote:

...
Another is

[(k, len(list(g))) for k, g in groupby(l)]

It might be worth adding it to the list of recipies either at https://docs.python.org/2/library/itertools.html#itertools.groupby or at https://docs.python.org/2/library/itertools.html#recipes, though.

On Sat, Jun 10, 2017 at 8:07 PM David Mertz <mertz@gnosis.cx> wrote:

...
Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Neal Fultz

3:33 a.m.

Yes, I mean zip compression :) Also, everyone's been posting decode functions, but encode is a bit harder :). I think it should be equally easy to go one direction as the other. Hopefully this email chain builds up enough info to update the docs for posterity / future me. On Sat, Jun 10, 2017 at 8:27 PM, Joshua Morton <joshua.morton13@gmail.com> wrote:

...

David: You're absolutely right, s/2/3 in my prior post!

Neal: As for why zip (at first I thought you meant the zip function, not the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue.

--Josh

On Sat, Jun 10, 2017 at 8:20 PM David Mertz <mertz@gnosis.cx> wrote:

...
God no! Not in the Python 2 docs! ... if the recipe belongs somewhere it's in the Python 3 docs. Although, I suppose it could go under 2 also, since it's not actually a behavior change in the feature-frozen interpreter. But as a Python instructor (and someone who remembers the cool new features of Python 1.5 over 1.4 pretty well), my attitude about Python 2 is "kill it with fire!"

Your spelling of the one-liner is prettier, shorter, and more intuitive than mine, and the same speed.

On Sat, Jun 10, 2017 at 8:12 PM, Joshua Morton <joshua.morton13@gmail.com

...
wrote:

...
Another is

[(k, len(list(g))) for k, g in groupby(l)]

It might be worth adding it to the list of recipies either at https://docs.python.org/2/library/itertools.html#itertools.groupby or at https://docs.python.org/2/library/itertools.html#recipes, though.

On Sat, Jun 10, 2017 at 8:07 PM David Mertz <mertz@gnosis.cx> wrote:

...
Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Neal Fultz

3:35 a.m.

Whoops, scratch that part about encode /decode. On Sat, Jun 10, 2017 at 8:33 PM, Neal Fultz <nfultz@gmail.com> wrote:

...

Yes, I mean zip compression :)

Also, everyone's been posting decode functions, but encode is a bit harder :).

I think it should be equally easy to go one direction as the other. Hopefully this email chain builds up enough info to update the docs for posterity / future me.

On Sat, Jun 10, 2017 at 8:27 PM, Joshua Morton <joshua.morton13@gmail.com> wrote:

...
David: You're absolutely right, s/2/3 in my prior post!

Neal: As for why zip (at first I thought you meant the zip function, not the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue.

--Josh

On Sat, Jun 10, 2017 at 8:20 PM David Mertz <mertz@gnosis.cx> wrote:

...
God no! Not in the Python 2 docs! ... if the recipe belongs somewhere it's in the Python 3 docs. Although, I suppose it could go under 2 also, since it's not actually a behavior change in the feature-frozen interpreter. But as a Python instructor (and someone who remembers the cool new features of Python 1.5 over 1.4 pretty well), my attitude about Python 2 is "kill it with fire!"

Your spelling of the one-liner is prettier, shorter, and more intuitive than mine, and the same speed.

On Sat, Jun 10, 2017 at 8:12 PM, Joshua Morton < joshua.morton13@gmail.com> wrote:

...
Another is

[(k, len(list(g))) for k, g in groupby(l)]

It might be worth adding it to the list of recipies either at https://docs.python.org/2/library/itertools.html#itertools.groupby or at https://docs.python.org/2/library/itertools.html#recipes, though.

On Sat, Jun 10, 2017 at 8:07 PM David Mertz <mertz@gnosis.cx> wrote:

...
Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Nick Coghlan

4:24 a.m.

On 11 June 2017 at 13:35, Neal Fultz <nfultz@gmail.com> wrote:

...

Whoops, scratch that part about encode /decode.

Aye, decode is a relatively straightforward nested comprehension: def run_length_decode(iterable): return (item for item, item_count in iterable for __ in range(item_count)) It's only encode that is currently missing a clear self-evidently correct spelling, and I think that's more due to the lack of an obvious spelling for "tell me how many items are in this iterable, exhausting it if necessary" than it is to anything else. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Terry Reedy

3:46 a.m.

On 6/10/2017 11:27 PM, Joshua Morton wrote:

...

Neal: As for why zip (at first I thought you meant the zip function, not the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue.

cpython can run from a zipped version of the stdlib. In fact, sys.path contains 'C:\\Programs\\Python36\\python36.zip' -- Terry Jan Reedy

Neal Fultz

3:56 a.m.

I would also submit there's some value in the obvious readability of z = runlength.encode(sequence) vs z = [(k, len(list(g))) for k, g in itertools.groupby(sequence)] but that's my personal opinion. Everyone is welcome to use my code, but I probably won't submit to pypi for a two function module, it was just an idea :) I do think it's worth adding to the docs, though, if only for future people / me googling "run length encoding python" and only finding stack overflow. On Sat, Jun 10, 2017 at 8:46 PM, Terry Reedy <tjreedy@udel.edu> wrote:

...

On 6/10/2017 11:27 PM, Joshua Morton wrote:

Neal: As for why zip (at first I thought you meant the zip function, not

...
the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue.

cpython can run from a zipped version of the stdlib. In fact, sys.path contains 'C:\\Programs\\Python36\\python36.zip'

-- Terry Jan Reedy

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Nick Coghlan

4:17 a.m.

On 11 June 2017 at 13:27, Joshua Morton <joshua.morton13@gmail.com> wrote:

...

David: You're absolutely right, s/2/3 in my prior post!

Neal: As for why zip (at first I thought you meant the zip function, not the zip compression scheme) is included and rle is not, zip is (or was), I believe, used as part of python's packaging infrastructure, hopefully someone else can correct me if that's untrue.

There are a variety of reasons why things end up in the standard library: - they're part of how the language works (e.g. contextlib, types, operator, numbers, pickle, abc, copy, sys, importlib, zipimport) - they're modeling core computing concepts (e.g. math, cmath, decimal, fractions, re, encodings) - they're widely useful, and harder to get right than they may first appear (e.g. collections, itertools, secrets, statistics, struct, shelve) - they're widely useful (or at least once were), and just plain hard to get right (e.g. ssl, http, json, xml, xmlrpc, zip, bz2) - they provide cross-platform abstractions of operating system level interfaces (e.g. os, io, shutil, pathlib) - they're helpful in enabling ecosystem level interoperability between tools (e.g. logging, unittest, asyncio) - they're part of the way *nix systems work (e.g. grp, pwd) - they're useful in working on other parts of the standard library (e.g. unittest.mock, enum) "zip" and the other compression libraries check a couple of those boxes: they're broadly useful *and* they're needed in other parts of the standard library (e.g. lots of network protocols include compression support, we support importing from zip archives, and we support generating them through distutils, shutil, and zipapp) Run-length-encoding on the other hand is one of those things where the algorithm is pretty simple, and you're almost always going to be able to get the best results by creating an implement tailored specifically to your use case, rather than working through a general purpose abstraction like the iterator protocol. Even when that isn't the case, implementing your own utility function is still going to be competitive time-wise with finding and using a standard implementation. I suspect the main long term value of offering a recommended implementation as an itertools recipe wouldn't be in using it directly, but rather in making it easier for people to: - find the recipe if they already know the phrase "run length encoding" - test their own implementations that are more tailored to their specific use case The one itertools building block I do sometimes wish we had was a `counted_len` helper that: - used `len(obj)` if the given object implemented `__len__` - fell back to `sum(1 for __ in obj)` otherwise Otherwise you have to make the choice between: - use `len(obj)`, and only support sequences - use `len(list(obj))` and potentially make a pointless copy - use `sum(1 for __ in obj)` and ignore the possible O(1) fast path - writing your own `counted_len` helper: def counted_len(iterable): try: return len(iterable) except TypeError: pass return sum(1 for __ in iter(iterable)) If there was an itertools.counted_len function then the obvious option would be "use itertools.counted_len". Such a function would also become the obvious way to consume an iterator when you don't care about the individual results - you'd just process them all, and get the count of how many iterations happened. Given such a helper, the recipe for run-length-encoding would then be: def run_length_encode(iterable): return ((k, counted_len(g)) for k, g in groupby(iterable)) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Greg Ewing

4:39 a.m.

In my experience, RLE isn't something you often find on its own. Usually it's used as part of some compression scheme that also has ways of encoding verbatim runs of data and maybe other things. So I'm skeptical that it can be usefully provided as a library function. It seems more like a design pattern than something you can capture in a library. -- Greg

Neal Fultz

5:08 a.m.

Agreed to a degree about providing it as code, but it may also be worth mentioning also that zlib itself implements rle [1], and if there was ever a desire to go "python all the way down" you need an RLE somewhere anyway :) That said, I'll be pretty happy with anything that replaces an hour of google/coding/testing/(hour later find out I'm an idiot from a random listserv) with 1 minute of googling. Again, my issue isn't that it was difficult to code, but it *was* hard to make the research-y jump from googling for "run length encoding python", where I knew *exactly* what algorithm I wanted, to "itertools.groupby" which appears to be more general purpose and needs a little tweaking. Adjusting the docs/recipes would probably solve that problem. -- To me this is roughly on the same level as googling for 'binary search python' and not having bisect show up. However, the fact that `itertools.groupby` doesn't group over elements that are not contiguous is a bit surprising to me coming from SQL/pandas/R land (that is probably a large part of my disconnect here). This is actually explicitly called out in the current docs, but I wonder how many people search for one thing and find the other: I googled for RLE and the solution was actually groupby, but probably a lot of other people want a SQL group-by accidentally got an RLE and have to work around that... Then again, I don't know if you all can easily change names of functions at this point. -Neal [1] https://github.com/madler/zlib/blob/master/deflate.c#L2057 On Sat, Jun 10, 2017 at 9:39 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...

In my experience, RLE isn't something you often find on its own. Usually it's used as part of some compression scheme that also has ways of encoding verbatim runs of data and maybe other things.

So I'm skeptical that it can be usefully provided as a library function. It seems more like a design pattern than something you can capture in a library.

-- Greg

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

David Mertz

5:19 a.m.

If you understand what iterators do, the fact that itertools.groupby collects contiguous elements is both obvious and necessary. Iterators might be infinitely long... you cannot ask for every "A" that might eventually occur in an infinite sequence of letters. On Sat, Jun 10, 2017 at 10:08 PM, Neal Fultz <nfultz@gmail.com> wrote:

...

Agreed to a degree about providing it as code, but it may also be worth mentioning also that zlib itself implements rle [1], and if there was ever a desire to go "python all the way down" you need an RLE somewhere anyway :)

That said, I'll be pretty happy with anything that replaces an hour of google/coding/testing/(hour later find out I'm an idiot from a random listserv) with 1 minute of googling. Again, my issue isn't that it was difficult to code, but it *was* hard to make the research-y jump from googling for "run length encoding python", where I knew *exactly* what algorithm I wanted, to "itertools.groupby" which appears to be more general purpose and needs a little tweaking. Adjusting the docs/recipes would probably solve that problem.

-- To me this is roughly on the same level as googling for 'binary search python' and not having bisect show up.

However, the fact that `itertools.groupby` doesn't group over elements that are not contiguous is a bit surprising to me coming from SQL/pandas/R land (that is probably a large part of my disconnect here). This is actually explicitly called out in the current docs, but I wonder how many people search for one thing and find the other:

I googled for RLE and the solution was actually groupby, but probably a lot of other people want a SQL group-by accidentally got an RLE and have to work around that... Then again, I don't know if you all can easily change names of functions at this point.

-Neal

[1] https://github.com/madler/zlib/blob/master/deflate.c#L2057

On Sat, Jun 10, 2017 at 9:39 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...
In my experience, RLE isn't something you often find on its own. Usually it's used as part of some compression scheme that also has ways of encoding verbatim runs of data and maybe other things.

So I'm skeptical that it can be usefully provided as a library function. It seems more like a design pattern than something you can capture in a library.

-- Greg

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Neal Fultz

6:17 a.m.

If the consensus is "Let's add ten lines to the recipes" I'm all aboard, ignore the rest: if I could have googled a good answer I would have stopped there. I won't argue the necessity or obviousness of itertools.groupby, just it's name: * I myself am a false negative that wanted the RLE behavior *and couldn't find it easily * so we should update the docs * other people have been false positive and wanted a SQL-type group by, but got burned * hence the warnings in the docs. * If you say explicate "by run", some extra group of them will then know what that means vs the current wording. I would definitely also support adding helper functions though, I think this is a very common use case which turns up in math/optimization applied to geology, biology, ... , and also fax machines: https://en.wikipedia.org/wiki/Run-length_encoding Also, if someone rewrote zip in pure python, would many people actually notice a slow down vs network latency, disk IO, etc? RLE is a building block just like bisect. :) Anyway, I'm not claiming my implementation is some huge gift, but let's at least add a recipe or documentation so people can find y'all's way later without reinventing the wheel. On Sat, Jun 10, 2017 at 10:19 PM, David Mertz <mertz@gnosis.cx> wrote:

...

If you understand what iterators do, the fact that itertools.groupby collects contiguous elements is both obvious and necessary. Iterators might be infinitely long... you cannot ask for every "A" that might eventually occur in an infinite sequence of letters.

On Sat, Jun 10, 2017 at 10:08 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Agreed to a degree about providing it as code, but it may also be worth mentioning also that zlib itself implements rle [1], and if there was ever a desire to go "python all the way down" you need an RLE somewhere anyway :)

That said, I'll be pretty happy with anything that replaces an hour of google/coding/testing/(hour later find out I'm an idiot from a random listserv) with 1 minute of googling. Again, my issue isn't that it was difficult to code, but it *was* hard to make the research-y jump from googling for "run length encoding python", where I knew *exactly* what algorithm I wanted, to "itertools.groupby" which appears to be more general purpose and needs a little tweaking. Adjusting the docs/recipes would probably solve that problem.

-- To me this is roughly on the same level as googling for 'binary search python' and not having bisect show up.

However, the fact that `itertools.groupby` doesn't group over elements that are not contiguous is a bit surprising to me coming from SQL/pandas/R land (that is probably a large part of my disconnect here). This is actually explicitly called out in the current docs, but I wonder how many people search for one thing and find the other:

I googled for RLE and the solution was actually groupby, but probably a lot of other people want a SQL group-by accidentally got an RLE and have to work around that... Then again, I don't know if you all can easily change names of functions at this point.

-Neal

[1] https://github.com/madler/zlib/blob/master/deflate.c#L2057

On Sat, Jun 10, 2017 at 9:39 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...
In my experience, RLE isn't something you often find on its own. Usually it's used as part of some compression scheme that also has ways of encoding verbatim runs of data and maybe other things.

So I'm skeptical that it can be usefully provided as a library function. It seems more like a design pattern than something you can capture in a library.

-- Greg

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Serhiy Storchaka

8:53 a.m.

11.06.17 09:17, Neal Fultz пише:

...

* other people have been false positive and wanted a SQL-type group by, but got burned * hence the warnings in the docs.

This wouldn't help if people don't read the docs.

...

Also, if someone rewrote zip in pure python, would many people actually notice a slow down vs network latency, disk IO, etc?

Definitely yes.

...

RLE is a building block just like bisect.

This is very specific building block. And if ZIP compression be rewrote in pure Python it wouldn't use FYI, there are multiple compression methods supported in ZIP files, but the zipmodule module implements not all of them. In particular simple RLE based methods are not implemented (they almost not used in real world now). I suppose that if the zipmodule module implements these algorithms it wouldn't use any general RLE implementation. https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

David Mertz

1:47 a.m.

As an only semi-joke, I have created a module on GH that meets the needs of this discussion (using the spelling I think are most elegant): https://github.com/DavidMertz/RLE On Sun, Jun 11, 2017 at 1:53 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:

...

11.06.17 09:17, Neal Fultz пише:

...
* other people have been false positive and wanted a SQL-type group by, but got burned * hence the warnings in the docs.

This wouldn't help if people don't read the docs.

Also, if someone rewrote zip in pure python, would many people actually

...
notice a slow down vs network latency, disk IO, etc?

Definitely yes.

RLE is a building block just like bisect.

...
This is very specific building block. And if ZIP compression be rewrote in pure Python it wouldn't use

FYI, there are multiple compression methods supported in ZIP files, but the zipmodule module implements not all of them. In particular simple RLE based methods are not implemented (they almost not used in real world now). I suppose that if the zipmodule module implements these algorithms it wouldn't use any general RLE implementation.

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Erik

11:47 p.m.

On 19/06/17 02:47, David Mertz wrote:

...

As an only semi-joke, I have created a module on GH that meets the needs of this discussion (using the spelling I think are most elegant):

https://github.com/DavidMertz/RLE

It's a shame you have to build that list when encoding. I tried to work out a way to get the number of items in an iterable without having to capture all the values (on the understanding that if the iterable is already an iterator, it would be consumed). The best I came up with so far (not general purpose, but it works in this scenario) is: from iterator import groupby from operator import countOf def rle_encode(it): return ((k, countOf(g, k)) for k, g in groupby(it)) In your test code, this speeds things up quite a bit over building the list, but that's presumably only because both groupby() and countOf() will use the standard class comparison operator methods which in the case of ints will short-circuit with a C-level pointer comparison first. For user-defined classes with complicated comparison methods, getting the length of the group by comparing the items will probably be worse. Is there a better way of implementing a general-purpose "ilen()"? I tried a couple of other things, but they all required at least one lambda function and slowed things down by about 50% compared to the list-building version. (I agree this is sort of a joke, but it's still an interesting puzzle ...). Regards, E.

Tim Peters

4 a.m.

In the other direction, e.g., def expand_rle(rle): from itertools import repeat, chain return list(chain.from_iterable(repeat(x, n) for x, n in rle)) Then

...

...
...
expand_rle([('a', 5), ('bc', 3)]) ['a', 'a', 'a', 'a', 'a', 'bc', 'bc', 'bc']

As to why zip is in the distribution, but not RLE, zip is a very widely used, general purpose, compression standard. RLE is a special-purpose thing.

David Mertz

3:13 a.m.

Bernardo Sulzbach posted a much prettier version than mine that is a bit shorter. But his is also somewhat slower (and I believe asymptotically so as the number of equal elements in subsequence goes up). He needs to sum up a bunch of 1's repeatedly rather than do the O(1) `len()` function. For a list with 1000 run lengths of 1000 each, we get: In [53]: %timeit [(k, sum(1 for _ in g)) for k, g in groupby(lst)] 10 loops, best of 3: 66.2 ms per loop In [54]: %timeit [(k,len(l)) for k, g in groupby(lst) for l in [list(g)]] 100 loops, best of 3: 17.5 ms per loop On Sat, Jun 10, 2017 at 7:59 PM, David Mertz <mertz@gnosis.cx> wrote:

...

Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Bernardo Sulzbach

3:26 a.m.

On 2017-06-11 00:13, David Mertz wrote:

...

Bernardo Sulzbach posted a much prettier version than mine that is a bit shorter. But his is also somewhat slower (and I believe asymptotically so as the number of equal elements in subsequence goes up). He needs to sum up a bunch of 1's repeatedly rather than do the O(1) `len()` function.

Constructing a list from an iterator of size N is in O(N). Summing N elements is in O(N). I don't think it is asymptotically slower, just slower because of implementation details. -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com

David Mertz

3:35 a.m.

You are right. I made a thinko. List construction from an iterator is O(N) just as is `sum(1 for _ in it)`. Both of them need to march through every element. But as a constant multiplier, just constructing the list should be faster than needing an addition (Python append is O(1) because of smart dynamic memory pre-allocation). So the "just read the iterator" is about 2-3 times faster than read-then-accumulate). Of course, it the run-lengths are LARGE, we can start worrying about the extra memory allocation needed as a tradeoff. Your sum uses constant memory. On Sat, Jun 10, 2017 at 8:26 PM, Bernardo Sulzbach < mafagafogigante@gmail.com> wrote:

...

On 2017-06-11 00:13, David Mertz wrote:

...
Bernardo Sulzbach posted a much prettier version than mine that is a bit shorter. But his is also somewhat slower (and I believe asymptotically so as the number of equal elements in subsequence goes up). He needs to sum up a bunch of 1's repeatedly rather than do the O(1) `len()` function.

Constructing a list from an iterator of size N is in O(N).

Summing N elements is in O(N).

I don't think it is asymptotically slower, just slower because of implementation details.

-- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Nick Coghlan

4:50 a.m.

On 11 June 2017 at 13:35, David Mertz <mertz@gnosis.cx> wrote:

...

You are right. I made a thinko.

List construction from an iterator is O(N) just as is `sum(1 for _ in it)`. Both of them need to march through every element. But as a constant multiplier, just constructing the list should be faster than needing an addition (Python append is O(1) because of smart dynamic memory pre-allocation).

So the "just read the iterator" is about 2-3 times faster than read-then-accumulate). Of course, it the run-lengths are LARGE, we can start worrying about the extra memory allocation needed as a tradeoff. Your sum uses constant memory.

This would be another argument in favour of providing an itertools.counted_len function, as that would be able to avoid all the overheads currently associated with the "sum(1 for __ in iterable)" counting strategy. Without that, the best you can do in pure Python is to use __length_hint__ to choose your preferred counting strategy at runtime based on the specific input. Something like: from operator import length_hint # 10k 64-bit pointers ~= 640k max temporary list _USE_COUNTED_SUM = 10_001 def counted_len(iterable): # For sized containers, just return their length try: return len(iterable) except TypeError: pass # For probably-large inputs & those with no length hint, count them hint = length_hint(iterable, default=_USE_COUNTED_SUM) if hint >= _USE_COUNTED_SUM: return sum(1 for __ in iter(iterable)) # Otherwise, make a temporary list and report its length # as the specifics of the CPython implementation make this # faster than using the generator expression return len(list(iterable)) Cheers, Nick. P.S. itertools._grouper objects don't currently provide a length hint, and it would be tricky to add one, since it would need to be based on the number of remaining items in the original sequence, which would it turn depend on *that* defining either __len__ or __length_hint__. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Neal Fultz

3:14 a.m.

Thanks, that's cool. Maybe the root problem is that the docs aren't using the right words when I google. Run-length-encoding is particularly relevant for spare matrices, but there's probably a library for those as well. On the data science side of things, there's a few hundred R packages that use it there[1]. Can you explicate the guiding principle a bit? I'm perplexed that python would come with zip and gzip but not rle. [1] : https://github.com/search?l=R&q=user%3Acran+rle&type=Code&utf8=%E2%9C%93 On Sat, Jun 10, 2017 at 7:59 PM, David Mertz <mertz@gnosis.cx> wrote:

...

Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

David Mertz

3:29 a.m.

If what you really want is sparse matrices, you should use those: https://docs.scipy.org/doc/scipy/reference/sparse.html. Or maybe from the experimental Dask offshoot that I contributed a few lines to: https://github.com/mrocklin/sparse. Either of those will be about two orders of magnitude faster than working with Python lists for numeric data. The reason, I think, that there's no RLE module in Python (standard library, there's probably something on PyPI) is that it's so easy to roll your own with the building blocks in itertools. The `zipfile` and `gzip` modules are written in hundreds or thousands of lines of C code, and more importantly they are for dealing with *files* mostly, not generic sequences... that's the domain of `itertools`, but `itertools` is kept to a bare minimum collection of building blocks from which 1-10 line functions can be built efficiently. On Sat, Jun 10, 2017 at 8:14 PM, Neal Fultz <nfultz@gmail.com> wrote:

...

Thanks, that's cool. Maybe the root problem is that the docs aren't using the right words when I google. Run-length-encoding is particularly relevant for spare matrices, but there's probably a library for those as well. On the data science side of things, there's a few hundred R packages that use it there[1].

Can you explicate the guiding principle a bit? I'm perplexed that python would come with zip and gzip but not rle.

[1] : https://github.com/search?l=R&q=user%3Acran+rle&type=Code& utf8=%E2%9C%93

On Sat, Jun 10, 2017 at 7:59 PM, David Mertz <mertz@gnosis.cx> wrote:

...
Here's a one-line version:

from itertools import groupby rle_encode = lambda it: ( (l[0],len(l)) for g in groupby(it) for l in [list(g[1])])

Since "not every one line function needs to be in the standard library" is a guiding principle of Python, and even moreso of `itertools`, probably this is a recipe in the documentation at most. Or maybe it would have a home in `more_itertools`.

On Sat, Jun 10, 2017 at 7:20 PM, Neal Fultz <nfultz@gmail.com> wrote:

...
Hello python-ideas,

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

[1,1,1,1,2,3,4,4,3,3,3]

and

[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

Best,

-Neal

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Serhiy Storchaka

4:57 a.m.

11.06.17 05:20, Neal Fultz пише:

...

I am very new to this, but on a different forum and after a couple conversations, I really wished Python came with run-length encoding built-in; after all, it ships with zip, which is much more complicated :)

The general idea is to be able to go back and forth between two representations of a sequence:

|[1,1,1,1,2,3,4,4,3,3,3]|

and

|[(1, 4), (2, 1), (3, 1), (4, 2), (3, 3)]|

where the first element is the data element, and the second is how many times it is repeated.

I wrote an encoder/decoder in about 20 lines ( https://github.com/nfultz/rle.py/blob/master/rle.py ) and would like to offer it for the next version; I think it might fit in nicely in the itertools module, for example. I am curious about your thoughts.

RLE is just a general idea. Concrete implementations in file formats and protocols have different limitations and peculiarities. Different schemes are used for encoding lengths and values, short repetition sequences usually are not encoded with RLE, as well as repetition sequences of specific values, there are limitations on the maximal length. The implementation of the general idea is simple, but is not helpful in concrete cases.

2799

Age (days ago)

2807

Last active (days ago)

List overview

Download

26 comments

10 participants

participants (10)

Bernardo Sulzbach
David Mertz
Erik
Greg Ewing
Joshua Morton
Neal Fultz
Nick Coghlan
Serhiy Storchaka
Terry Reedy
Tim Peters

Run length encoding

Neal Fultz

Bernardo Sulzbach

Neal Fultz

Neal Fultz

Neal Fultz

Neal Fultz

Neal Fultz

Erik

Bernardo Sulzbach

Neal Fultz

tags

participants (10)