[New-bugs-announce] [issue34561] Replace list sorting merge_collapse()?

Tim Peters report at bugs.python.org
Fri Aug 31 21:22:04 EDT 2018


New submission from Tim Peters <tim at python.org>:

The invariants on the run-length stack are uncomfortably subtle.  There was a flap a while back when an attempt at a formal correctness proof uncovered that the _intended_ invariants weren't always maintained.  That was easily repaired (as the researchers suggested), but it remains hard to grasp why it's always correct now.  The Java variant didn't take the suggested fix, instead boosting its stack size.  But it turns out that's still not enough, because the original researchers made a different mistake in their own paper ;-)

http://drops.dagstuhl.de/opus/volltexte/2018/9467/

Buss & Knop recently published results on different ways of managing the stack, and I think they're worth investigating for "real life" use:

https://arxiv.org/abs/1801.04641

Offhand, their "2-merge" looks quite appealing to me.  The only invariant it maintains is that each run on the stack is at least twice the length of the run one above it on the stack, which can be maintained just by checking the topmost two stack entries in the loop.  So, e.g., where merge_collapse() is currently happy with runs of lengths 150 and 100 ending the stack, 2-merge would force a merge (it's not the case that 150 >= 2*100).  Both are happy if the stack ends with runs of lengths 200 and 100.

This is much easier to reason about, simpler to code, and would allow reducing the size of the stack (worst-case size is proportional to the log (of the largest possible array) to the base 2 instead of to the current base phi (~1.62)).  They developed some formal measures showing that its "merge cost" (an overall measure abstracting some from real-life behavior) is significantly better than timsort's current strategy.  And a little thought convinced me it preserves that, for random data, it would continue to strongly tend towared triggering perfectly balanced merges (merging runs of the same size).

When I was developing this code, virtually all the time was spent making merging itself as fast as possible; the stack-management code was just the first thing I tried that "worked well".  But turns out that's the only part researchers care about ;-) I'd be pleased if it could be replaced with something both better and simpler, or even just simpler but no worse.  But the devil's in the details ...

----------
messages: 324455
nosy: tim.peters
priority: normal
severity: normal
stage: needs patch
status: open
title: Replace list sorting merge_collapse()?
type: performance
versions: Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34561>
_______________________________________


More information about the New-bugs-announce mailing list