Penance

In a job interview last week I said a couple of things related to sorting in Clojure that turned out to be incorrect when I checked them. This is my way of making up for them – if I can make them true after the fact, then no one will notice. Though to be fair, I'm sure I'm the only one who thinks that way.

The first thing I said was that heap-sort is my favorite sorting algorithm, but then I declined to implement it, preferring to implement merge-sort instead. I think this is actually pretty reasonable in an interview situation – heap-sort is conceptually more involved, so it is harder to remember on the spot. The second thing I said was that Clojure itself uses heap-sort and that, given its love affair with laziness, it would not be unreasonable to assume that (first (sort lst)) was a linear-time operation. I might have read something like this on a mailing list or IRC, but it is not correct. Clojure currently uses Java's sort function, which is a slightly-modified merge-sort. There is not much point in making that algorithm lazy, because getting the first element from the sorted list requires \(O(n \log n)\) time anyway.

Heap Sort

For those that are not familiar with it, heap-sort is something that is usually taught in a second-year undergraduate Computer Science class. So it's not that difficult an algorithm, but it does require some thinking, and there is some fancy analysis that goes into part of it. For a full discussion, see the Introduction to Algorithms book by Cormen, Leiserson, Rivest, and Stein.

Heap

To start with, a heap is conceptually a tree where values are stored at the nodes. The largest of all the values in the tree is stored at the root, and the two subtrees of the root are themselves heaps. Heaps are also required to be as close to balanced as possible – every level of the tree is full except possibly the deepest, and the nodes on the deepest level are packed to the left.

Such a tree is usually implemented as an array, where the child nodes of a node can be obtained by some simple arithmetic on the index of the node in the array.
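For instance, with 0-based indexing the arithmetic looks like this (a minimal sketch in Clojure; the helper names are mine, not from any particular library):

    ;; Index arithmetic for a binary heap stored in a 0-based array.
    (defn parent      [i] (quot (dec i) 2)) ; parent of node i (for i > 0)
    (defn left-child  [i] (inc (* 2 i)))    ; left child sits at 2i + 1
    (defn right-child [i] (+ 2 (* 2 i)))    ; right child sits at 2i + 2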

Building a Heap

Given the definition, there is an intuitive algorithm for building such a heap on a set of numbers: first find the largest number in your set and have it be the root of the heap, then split the rest of the numbers in half, and recursively make heaps on those sets.

Such an algorithm is clearly correct, but it is also clearly \(\Theta(n \log n)\). We can do better with a bottom-up algorithm. If we continue imagining the heap as a tree, we start by putting the input numbers into the tree willy-nilly. This clearly does not satisfy the heap properties laid out above. However, some of it does satisfy the heap properties – the leaves of the tree are trivially heaps. If we go up one level from the leaves, we can fix the trees rooted there by exchanging the root of the tree with its largest child (or not exchanging it if it's already the largest of the three). Higher levels are a bit more difficult, because if the root of a tree is exchanged, then we must make sure to fix the tree that it ends up being the root of. You can imagine the root of a tree traveling down the heap until it is larger than both of its children.
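To make that fixing procedure concrete, here is a minimal functional sketch in Clojure, using vectors rather than mutable arrays; sift-down and build-heap are my own names for the two steps, not anything from Clojure's standard library:

    (defn sift-down
      "Moves the value at index i down until it is at least as large as
      both of its children, returning the fixed vector."
      [v i]
      (let [n   (count v)
            l   (inc (* 2 i))                               ; left child
            r   (+ 2 (* 2 i))                               ; right child
            big (if (and (< l n) (> (v l) (v i))) l i)      ; larger of i and l
            big (if (and (< r n) (> (v r) (v big))) r big)] ; ... and r
        (if (= big i)
          v                                            ; already a heap here
          (recur (assoc v i (v big) big (v i)) big)))) ; swap and keep going

    (defn build-heap
      "Heapifies a vector bottom-up, fixing each internal node from the
      deepest level up to the root."
      [v]
      (reduce sift-down v (range (dec (quot (count v) 2)) -1 -1)))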

The correctness of this algorithm is a bit harder to see, and it also appears to take \(O(n \log n)\) time. It does, but a slightly more sophisticated analysis shows that it is really \(\Theta(n)\). The key observation is that most of the values don't actually travel very far in the tree; the standard counting argument is sketched below.
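A heap on \(n\) nodes has at most \(\lceil n / 2^{h+1} \rceil\) nodes at height \(h\), and fixing a node at height \(h\) costs \(O(h)\), so the total work is

\[ \sum_{h=0}^{\lfloor \log_2 n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h) = O\!\left( n \sum_{h=0}^{\infty} \frac{h}{2^h} \right) = O(n), \]

where the last step uses the fact that \(\sum_{h \ge 0} h/2^h\) converges (to 2).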

Using a Heap to Sort

With the heap properties in mind, we can easily see how to get the largest value of a set of numbers – just take the top element from the heap. How can we use the heap properties to sort? Well, we want the largest number, then the largest number that remains, then the largest number that remains after that, and so on. So if we can take the largest number from the heap and then fix the heap so that it retains the heap properties, then we'd be done.

We just devised a way to fix heaps when building the heap, so we use that. What we do is take the very last node in the heap (which is not necessarily the smallest, but it doesn't hurt to think of it as the smallest) and put it at the top of the heap. The resulting tree is clearly not a heap, but if we call the algorithm that fixes heaps on the root of the tree, then we end up with a heap again. The node that we put on top of the heap might end up traveling all the way to the bottom, so this update takes \(\Theta(\log n)\) time. Thus if we sort the entire input set, we have a \(\Theta(n \log n)\) algorithm.
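Continuing the vector sketch from above (extract-max is again a name I made up, and the heap is assumed non-empty):

    (defn extract-max
      "Returns [max rest-of-heap]: moves the last leaf to the root,
      shrinks the vector by one, and sifts the new root down."
      [v]
      (let [m  (v 0)                       ; the largest value sits at the root
            v' (pop (assoc v 0 (peek v)))] ; last leaf to the top, drop the tail
        [m (sift-down v' 0)]))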

Advantages

That's the best we can do theoretically, which is great, but the Java sort algorithm is also \(\Theta(n \log n)\), so why is there any advantage to using heap-sort? In a language where laziness is embraced (such as Clojure), heap-sort can be made lazy. That is, the next element of the sorted list can be computed only when it is needed. Since the build-heap procedure described above takes linear time, getting the first element from the sorted list takes \(O(n)\) time. Each subsequent element then takes \(O(\log n)\) time. Thus, if only a small number of elements from the sorted list are needed, then this lazy version of heap-sort is theoretically faster than other sorts.
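Putting the pieces together, here is a sketch of how the lazy version might look with lazy-seq – this shows the idea only, not the optimized array-based code in the repository mentioned below:

    (defn heap-sort
      "Lazily yields the elements of coll in descending order: O(n) for
      the first element, O(log n) for each element after that."
      [coll]
      (letfn [(drain [h]
                (when (seq h)
                  (let [[m h'] (extract-max h)]
                    (cons m (lazy-seq (drain h'))))))]
        (lazy-seq (drain (build-heap (vec coll))))))

For example, (take 3 (heap-sort [5 1 9 3 7])) yields (9 7 5) after one linear-time build and three logarithmic-time extractions; the remaining elements are never computed.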

I can think of situations where this would actually have practical advantages. For example, what if you were writing a search engine and wanted to obtain the \(k\) best results? You could write an ad-hoc function that found the best result, removed it, and recursed \(k - 1\) times. Or you could just (take k (heap-sort input)). The first would take \(O(kn)\) time, whereas the second would take \(O(k \log n + n)\) time. In many practical situations \(k\) is around \(\log n\), in which case the first takes \(\Theta(n \log n)\) time, whereas the second takes only \(O(n)\) time. Essentially, the ad-hoc approach is no better than running a non-lazy sorting algorithm and taking the first \(k\) elements.

Disadvantages

Heap-sort has some disadvantages compared to other sorting algorithms. The least theoretically significant is that the constants hidden in the big-O notation are higher than those of other sorting algorithms (tuned versions of quicksort can have extremely small constants).

Another disadvantage can be seen when dealing with data sets so large that they no longer fit in the computer's main memory. Something like merge-sort can be modified fairly easily so that the number of I/O operations is minimized. I haven't thought about it too deeply, but this doesn't seem quite so easy with heap-sort. However, people dealing with such large datasets should probably be using specialized libraries anyway, so perhaps that isn't too bad.

Implementation

This whole discussion is a bit useless if it only remains at the theoretical level. I have an implementation in my GitHub repo that covers most of the ideas given above. The code is highly optimized so that it is somewhat competitive with the native Java implementation, which makes the code at the HEAD of the master branch somewhat less than readable. However, the first commit to the repository used Clojure vectors and a functional style, so if you would like to understand the code, you might start there.

I (unfortunately) needed to use Java arrays and mutation rather than the more functional style that I have gotten used to, but the results speak for themselves. Finding the first few elements of a sorted array is significantly faster than the Java version. Finding the entire sorted list is somewhat slower than the Java version, but not by too much. This is not surprising for a couple of reasons. First, heap-sort tends to have larger constants than other sorting methods. Second, this code is one day old; the Java sort method has had years to be optimized.

Conclusion

I may be dreaming, but I would love to see this idea (if not this implementation) put into Clojure proper. I think the advantages from laziness outweigh the small constant slowdown versus using Java's sort.
