pages tagged sortHiking and Codinghttp://chrismgray.github.com//tags/sort/Hiking and Codingikiwiki2011-12-01T07:45:44ZLazy Sorting in Clojurehttp://chrismgray.github.com//posts/clojure-sorting/2011-12-01T07:45:44Z2011-11-27T14:23:34Z
<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Penance</h2>
<div class="outline-text-2" id="text-1">
<p>
In a job interview last week I said a couple of things related to
sorting in Clojure that turned out to be incorrect when I checked
them. This is my way of making up for them – if I can make them
true, then no one will notice. Though to be fair, I'm sure that I'm
the only one that thinks that way.
</p>
<p>
The first thing I said was that heap-sort is my favorite sorting
algorithm, but then I declined to implement it, preferring to
implement merge-sort instead. I think this is actually pretty
reasonable in an interview situation – heap-sort is a conceptually
more difficult sorting algorithm, so being able to remember it on
the spot is more difficult. The second thing I said was that
Clojure itself uses heap-sort and that given its love affair with
laziness that it would not be unreasonable to assume that <code>(first (sort lst))</code> was a linear-time operation. I might have read
something like this on a mailing list or IRC, but it is not
correct. Clojure currently uses Java's <code>sort</code> function, which is a
slightly-modified merge-sort. There is not much point in making
that algorithm lazy, because getting the first element from the
sorted list requires \(O(n \log n)\) time anyway.
</p>
</div>
</div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Heap Sort</h2>
<div class="outline-text-2" id="text-2">
<p>
For those that are not familiar with it, heap-sort is something that
is usually taught in a second-year undergraduate Computer Science
class. So it's not <b>that</b> difficult an algorithm, but it does
require some thinking, and there is some fancy analysis that goes
into part of it. For a full discussion, see the <a href="http://www.amazon.com/gp/product/0262033844/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=hikiandcodi-20&linkCode=as2&camp=217145&creative=399369&creativeASIN=0262033844">Introduction to Algorithms</a> book by Cormen, Lieserson, Rivest, and Stein.
</p>
</div>
<div id="outline-container-2-1" class="outline-3">
<h3 id="sec-2-1">Heap</h3>
<div class="outline-text-3" id="text-2-1">
<p>
To start with, a <i>heap</i> is conceptually a tree where values are
stored at the nodes. The largest value of all the values stored in
subtrees is stored at the root and the two descendant trees are
heaps. Heaps are usually required to be as close to balanced as
possible – if any level of the tree has leaves, they are all
bunched to the right, and all the rest of the leaves are at the
next level.
</p>
<p>
Such a tree is usually implemented as an array, where the child
nodes of a node can be obtained by some simple arithmetic on the
index of the node in the array.
</p>
</div>
</div>
<div id="outline-container-2-2" class="outline-3">
<h3 id="sec-2-2">Building a Heap</h3>
<div class="outline-text-3" id="text-2-2">
<p>
Given the definition, there is an intuitive algorithm for building
such a heap on a set of numbers: first find the largest number in
your set and have it be the root of the heap, then split the rest
of the numbers in half, and recursively make heaps on those sets.
</p>
<p>
Such an algorithm is clearly correct, but it is also clearly
\(\Theta(n \log n)\). We can do better with a bottom-up algorithm.
If we continue imagining the heap as a tree, we start by putting
the input numbers into the tree willy-nilly. This clearly does not
satisfy the heap properties laid out above. However, some of it
does satisfy the heap properties – the leaves of the tree are
trivially heaps. If we go up one level from the leaves, we can fix
the trees rooted there by exchanging the root of the tree with its
largest child (or not exchanging it if it's already the largest of
the three). Higher levels are a bit more difficult, because if the
root of a tree is exchanged, then we must make sure to fix the tree
that it ends up being the root of. You can imagine the root of a
tree traveling down the heap until it is larger than both of its
children.
</p>
<p>
The correctness of this algorithm is a bit harder to see and it
also appears to take \(O(n \log n)\) time. It does, but there is a
slightly more sophisticated analysis that shows that it is really
\(\Theta(n)\). I won't go into the analysis, but a hint is that
most of the values don't actually travel very far in the tree.
</p>
</div>
</div>
<div id="outline-container-2-3" class="outline-3">
<h3 id="sec-2-3">Using a heap to sort</h3>
<div class="outline-text-3" id="text-2-3">
<p>
With the heap properties in mind, we can easily see how to get the
largest value of a set of numbers – just take the top element from
the heap. How can we use the heap properties to sort? Well, we
want the largest number, then the largest number that remains, then
the largest number that remains after that, and so on. So if we
can take the largest number from the heap and then fix the heap so
that it retains the heap properties, then we'd be done.
</p>
<p>
We just devised a way to fix heaps when building the heap, so we
use that. What we do is to take the very last node in the heap
(which is not necessarily the smallest, but it doesn't hurt to
think about it as the smallest) and put that at the top of the
heap. The resulting tree is clearly not a heap, but if we call the
algorithm to fix heaps on the root of the tree, then we end up with
a heap again. The node that we put on top of the heap might end up
traveling all the way to the bottom, so this update takes
\(\Theta(\log n)\) time. Thus if we sort the entire input set, we
have a \(\Theta(n \log n)\) algorithm.
</p>
</div>
</div>
<div id="outline-container-2-4" class="outline-3">
<h3 id="sec-2-4">Advantages</h3>
<div class="outline-text-3" id="text-2-4">
<p>
That's the best we can do theoretically, which is great, but the
Java <code>sort</code> algorithm is also \(\Theta(n \log n)\), so why is there
any advantage to using heap-sort? In a language where laziness is
embraced (such as Clojure), heap-sort can be made lazy. That is,
the next element of the sorted list can be computed only when it is
needed. Since the build-heap procedure described above takes
linear time, getting the first element from the sorted list takes
\(O(n)\) time. Each subsequent element then takes \(O(\log n)\)
time. Thus, if only a small number of elements from the sorted
list are needed, then this lazy version of heap-sort is
theoretically faster than other sorts.
</p>
<p>
I can think of situations where this would actually have practical
advantages. For example, what if you were writing a search engine
and wanted to obtain the \(k\) best results? You could write an
ad-hoc function that found the best result, removed it and recursed
\(k - 1\) times. Or you could just <code>(take k (heap-sort input))</code>.
The first would take \(O(kn)\) time, whereas the second would take
\(O(k \log n + n)\) time. In many practical situations, \(k\) is
\(\Omega(\log n)\), which means that the first takes \(\Omega(n
\log n)\) time, whereas the second takes only \(O(n)\) time.
Essentially, the first is no better than the second solution with a
non-lazy sorting algorithm.
</p>
</div>
</div>
<div id="outline-container-2-5" class="outline-3">
<h3 id="sec-2-5">Disadvantages</h3>
<div class="outline-text-3" id="text-2-5">
<p>
Heap-sort has some disadvantages compared to other sorts of <code>sort</code>
algorithms. The least theoretically significant is that the
constants hidden in the big-O notation are higher than other
sorting algorithms (tuned versions of quicksort can have extremely
small constants).
</p>
<p>
Another disadvantage can be seen when dealing with data sets so
large that they no longer fit in the computer's main memory.
Something like merge-sort can be modified fairly easily so that the
number I/O operations is minimized. I haven't thought about it too
deeply, but this doesn't seem quite so easy with heap-sort.
However, I think that people dealing with such large datasets
should probably be using specialized libraries anyway, so perhaps
that isn't too bad.
</p>
</div>
</div>
</div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Implementation</h2>
<div class="outline-text-2" id="text-3">
<p>
This whole discussion is a bit useless if it only remains at the
theoretical level. I have an implementation <a href="https://github.com/chrismgray/clojure-heap-sort">in my github repo</a> that
implements most of the ideas that are given above. The code is
highly optimized so that it is somewhat competitive with the native
Java implementation. This makes the code on the HEAD of the
<code>master</code> branch somewhat less than readable. However, the first
commit to the repository used Clojure vectors and a functional
style, so if you would like to understand the code, you might start
there.
</p>
<p>
I (unfortunately) needed to use Java arrays and mutation rather than
the more functional style that I have gotten used to, but the
results speak for themselves. Finding the first few elements of a
sorted array is significantly faster than the Java version. Finding
the entire sorted list is somewhat slower than the Java version, but
not too much. This is not surprising for a couple of reasons.
First, heap-sort tends to have larger constants than other sorting
methods. Secondly, this code is one day old. The Java sort method
has had years to be optimized.
</p>
</div>
</div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">Conclusion</h2>
<div class="outline-text-2" id="text-4">
<p>
It may be dreaming, but I would love to see this idea (if not this
implementation) put into Clojure proper. I think the advantages
from laziness outweigh the small constant slowdown versus using
Java's <code>sort</code>.
</p>
</div>
</div>