<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Feb 24, 2015 at 4:07 PM, Chris Angelico <span dir="ltr"><<a href="mailto:rosuav@gmail.com" target="_blank">rosuav@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>On Wed, Feb 25, 2015 at 10:50 AM, Skip Montanaro<br>
<<a href="mailto:skip.montanaro@gmail.com" target="_blank">skip.montanaro@gmail.com</a>> wrote:<br>
> Even if/when we get to the point where machines can hold an array of<br>
> 2**49 elements, I suspect people won't be using straight Python to<br>
> wrangle them.<br>
<br>
</span>Looking just at CPython, what is the absolute minimum storage space<br>
for a massive list like that? My guess is that a 64-bit CPython might<br>
get as good as eight bytes per element; playing around with smaller<br>
figures (up to circa a million elements) shows it ranging from 8 to 9<br>
bytes per element, bottoming out at just a smidge over 8, presumably<br>
at the moments when there's no slack space (there's a fixed overhead,<br>
which gets diminishingly significant). So an array of 2**49 elements<br>
would take 2**52 bytes (4 petabytes) of storage - addressable, to be<br>
sure, but an awful lot to be sorting.<br>
<br>
Would it be sufficient to stick a comment into the source saying "This<br>
may have problems with lists in excess of 2**49 elements", and leave<br>
it until that's actually likely to happen?<br></blockquote><div><br></div><div>I would agree that it will be quite a while as it stands before the bug appears, so documenting it MAY be appropriate, as it is likely to be far enough in the future that a number of unforeseen things may block the bug from being an issue (TimSort may be replaced by something better, CPython could die, etc).</div><div><br></div><div>That said, from what I understand, their proposed fix would be reasonably easy to use, and have minimal cost, so it may be worthwhile to use just because the issue is likely to be in use for quite a while before somebody remembers the issue exists. Once somebody does hit it, it may take a while to realize that TimSort is at fault, as it will only occur on some data sets (my understanding is that the elements must be in certain configurations with at-least the minimum number), so it will likely appear as just a random crash occurring.<br></div><div><br></div><div>If it were fully up to me, I would do one of the following, in order of preference:</div><div>1) Implement and test the fix provided on the blog. This requires the most work due to licensing, reviewing, and testing.</div><div>2) Add a noisy warning/error to the code when sorting lists large enough to potentially hit the error (probably 2**48 to leave some wiggle room). This is much easier to do, but merely pushes the can down the road to be dealt with later. It does still make it obvious what is going long when it finally breaks.</div><div>3) Add a comment explaining the issue (likely linking back to the blog). While by far the easiest "solution", again, this merely pushes the can down the road, and it may be very non-obvious what went wrong when it breaks, despite the comment.</div><div><br></div><div>Chris</div><div><br></div></div></div></div>