Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (fewer than lg(N!) comparisons needed, and
as few as N-1), yet is as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes per element worst-case
  (also expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe timsort's efforts at memory frugality also save it
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one fewer compare in \sort.

  Now for the more interesting cases.  lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations).  When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.

      n   lg(n!)    *sort     3sort    +sort    ~sort     !sort
-------  -------   ------  --------  -------  -------  --------
  32768   444255   453084    453604    32908   130484    469132  samplesort
                   449235     33019    33016   188720     65534  timsort
                    0.86%  1273.77%   -0.33%  -30.86%   615.86%  %ch from timsort

  65536   954037   973111    970464    65686   260019   1004597
                   963924     65767    65802   377634    131070
                    0.95%  1375.61%   -0.18%  -31.15%   666.46%

 131072  2039137  2100019   2102708   131232   555035   2161268
                  2058863    131422   131363   755476    262142
                    2.00%  1499.97%   -0.10%  -26.53%   724.46%

 262144  4340409  4461471   4442796   262314  1107826   4584316
                  4380148    262446   262466  1511174    524286
                    1.86%  1592.84%   -0.06%  -26.69%   774.39%

 524288  9205096  9448146   9368681   524468  2218562   9691553
                  9285454    524576   524626  3022584   1048574
                    1.75%  1685.95%   -0.03%  -26.60%   824.26%

1048576 19458756 19950541  20307955  1048766  4430616  20433371
                 19621100   1048854  1048933  6045418   2097150
                    1.68%  1836.20%   -0.02%  -26.71%   874.34%

  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort and !sort:  No contest; there's structure in this data, but not of
  the specific kinds samplesort special-cases.  Note that structure in !sort
  wasn't put there on purpose -- it was crafted as a worst case for a
  previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort:  samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that ~sort runs highly significantly slower under timsort on several
  platforms, it runs highly significantly faster under timsort on others.
  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
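
For concreteness, here is a small Python sketch of the run-counting rule
(the real count_run() is C code in listobject.c; this is an illustrative
sketch, not the actual implementation):

    def count_run(a, lo, hi):
        # Return (length, descending) for the run starting at a[lo].
        # Assumes 0 <= lo < hi <= len(a), and uses only < comparisons.
        if lo + 1 == hi:               # lone trailing element
            return 1, False
        n = 2
        if a[lo + 1] < a[lo]:          # strictly descending: a0 > a1 > ...
            while lo + n < hi and a[lo + n] < a[lo + n - 1]:
                n += 1
            return n, True             # caller reverses this run in place
        else:                          # non-decreasing: a0 <= a1 <= ...
            while lo + n < hi and not (a[lo + n] < a[lo + n - 1]):
                n += 1
            return n, False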

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains fewer than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion
sort applied to the right number of array elements following the short
natural run (a sketch follows the list below).  In a random array, *all*
runs are likely to be minrun long as a result.  This has two primary good
effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
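
Here is a minimal Python sketch of the stable binary insertion sort used
for the boosting (the name and argument convention are illustrative, not
the C routine's):

    def binary_insertion_sort(a, lo, hi, start):
        # Stably sort a[lo:hi] in place, given that a[lo:start] is
        # already sorted (lo < start <= hi).
        for i in range(start, hi):
            pivot = a[i]
            # Bisect to the rightmost valid position so that equal
            # elements keep their original order (stability).
            left, right = lo, i
            while left < right:
                mid = (left + right) >> 1
                if pivot < a[mid]:
                    right = mid
                else:
                    left = mid + 1
            # Shift the tail up by one slot and insert the pivot.
            a[left + 1 : i + 1] = a[left:i]
            a[left] = pivot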


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and (mergesort's
bugaboo relative to samplesort) a lot more data movement: O(N) copies just
to get 64 elements into place.

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r<minrun is small compared to N), or r=0 and q a little larger
than a power of 2 (then we've got a case similar to "2112", again leaving
too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
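
Here is that rule as a short Python sketch (a direct transcription of the
description above; the C version works on C integers the same way):

    def merge_compute_minrun(n):
        # Take the first 6 bits of n, and add 1 if any of the
        # remaining bits are set.
        r = 0                          # 1 if any bit was shifted off
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

It reproduces the "2112" case directly:

>>> merge_compute_minrun(2112)
33
>>>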


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether it
should merge it with preceding run(s).  We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet-merged slices:

1.  A > B+C
2.  B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
for very large arrays.
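
As an illustrative sanity check of that bound (this is not part of the sort
itself), the implied worst-case number of stack entries stays small even
for enormous N:

    import math
    phi = (1 + math.sqrt(5)) / 2              # ~= 1.618
    for n in (10**6, 10**9, 2**64):
        # rough upper bound on pending-run stack entries for length n
        print(n, math.ceil(math.log(n, phi)))  # 29, 44, and 93 entries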

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied.  As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2.  Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
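
Here is a Python sketch of the collapsing loop itself, where pending is a
list of (base, length) run records and merge_at(i) is an assumed helper
that merges pending[i] with pending[i+1] (this mirrors the checks described
above; it is not the C code):

    def merge_collapse(pending, merge_at):
        while len(pending) > 1:
            n = len(pending) - 2           # index of B in ... A, B, C
            if n > 0 and pending[n-1][1] <= pending[n][1] + pending[n+1][1]:
                # Invariant #1 (A > B+C) is broken:  merge B with the
                # smaller of A and C; a tie favors C (fresher in cache).
                if pending[n-1][1] < pending[n+1][1]:
                    n -= 1
                merge_at(n)
            elif pending[n][1] <= pending[n+1][1]:
                # Invariant #2 (B > C) is broken.
                merge_at(n)
            else:
                break                      # both invariants hold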

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use.  But if we have temp memory equal to min(A, B),
it's easy.
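
For example, when the left run A is the shorter one, it's enough to copy A
aside and merge back into the array left to right.  Here is an illustrative
Python sketch of that case (names are made up; the real C merge code is
considerably more elaborate):

    def merge_lo(a, base_a, len_a, base_b, len_b):
        # Merge adjacent runs A = a[base_a:base_a+len_a] and
        # B = a[base_b:base_b+len_b], where base_b == base_a + len_a
        # and len_a <= len_b, using temp space of min(A, B) elements.
        temp = a[base_a : base_a + len_a]  # copy the smaller run aside
        i, j, dest = 0, base_b, base_a
        while i < len_a and j < base_b + len_b:
            if a[j] < temp[i]:             # strict <:  on a tie, take
                a[dest] = a[j]             # from A first, for stability
                j += 1
            else:
                a[dest] = temp[i]
                i += 1
            dest += 1
        # If temp ran out first, the rest of B is already in place;
        # else copy A's leftover tail into the remaining slots.
        a[dest : dest + (len_a - i)] = temp[i:]

The mirror-image case (B shorter) copies B aside instead and merges from
the right end, so min(A, B) temp elements suffice either way.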

