CHAPTER 10: MEDIANS AND ORDER STATISTICS
The ith order statistic of a set of n elements is the ith smallest element. For example, the minimum of a set of elements is the first order statistic (i = 1), and the maximum is the nth order statistic (i = n). A median, informally, is the “halfway point” of the set. When n is odd, the median is unique, occurring at i = (n + 1)/2. When n is even, there are two medians, occurring at i = n/2 and i = n/2 + 1. Thus, regardless of the parity of n, medians occur at i = (n + 1)/2 and i = (n + 1)/2 .
This chapter addresses the problem of selecting the ith order statistic from a set of n distinct numbers. We assume for convenience that the set contains distinct numbers, although virtually everything that we do extends to the situation in which a set contains repeated values. The selection problem can be specified formally as follows:
Input: A set A of n (distinct) numbers and a number i, with 1 i n.
Output: The element x A that is larger than exactly i -1 other elements of A.
The selection problem can be solved in O(n lg n) time, since we can sort the numbers using heapsort or merge sort and then simply index the ith element in the output array. There are faster algorithms, however.
In Section 10.1, we examine the problem of selecting the minimum and maximum of a set of elements. More interesting is the general selection problem, which is investigated in the subsequent two sections. Section 10.2 analyzes a practical algorithm that achieves an O(n) bound on the running time in the average case. Section 10.3 contains an algorithm of more theoretical interest that achieves the O(n) running time in the worst case.
10.1 Minimum and maximum
How many comparisons are necessary to determine the minimum of a set of n elements? We can easily obtain an upper bound of n – 1 comparisons: examine each element of the set in turn and keep track of the smallest element seen so far. In the following procedure, we assume that the set resides in array A, where length[A] = n.
1 min A
2 for i 2 to length[A]
3 do if min > A[i]
4 then min A[i]
5 return min
Finding the maximum can, of course, be accomplished with n – 1 comparisons as well.
Is this the best we can do? Yes, since we can obtain a lower bound of n – 1 comparisons for the problem of determining the minimum. Think of any algorithm that determines the minimum as a tournament among the elements. Each comparison is a match in the tournament in which the smaller of the two elements wins. The key observation is that every element except the winner must lose at least one match. Hence, n – 1 comparisons are necessary to determine the minimum, and the algorithm MINIMUM is optimal with respect to the number of comparisons performed.
An interesting fine point of the analysis is the determination of the expected number of times that line 4 is executed. Problem 6-2 asks you to show that this expectation is (lg n).
Simultaneous minimum and maximum
In some applications, we must find both the minimum and the maximum of a set of n elements. For example, a graphics program may need to scale a set of (x, y) data to fit onto a rectangular display screen or other graphical output device. To do so, the program must first determine the minimum and maximum of each coordinate.
It is not too difficult to devise an algorithm that can find both the minimum and the maximum of n elements using the asymptotically optimal (n) number of comparisons. Simply find the minimum and maximum independently, using n – 1 comparisons for each, for a total of 2n – 2 comparisons.
In fact, only 3 n/2 comparisons are necessary to find both the minimum and the maximum. To do this, we maintain the minimum and maximum elements seen thus far. Rather than processing each element of the input by comparing it against the current minimum and maximum, however, at a cost of two comparisons per element, we process elements in pairs. We compare pairs of elements from the input first with each other, and then compare the smaller to the current minimum and the larger to the current maximum, at a cost of three comparisons for every two elements.
Show that the second smallest of n elements can be found with n+ lg n – 2 comparisons in the worst case. (Hint: Also find the smallest element.)
Show that 3n/2 – 2 comparisons are necessary in the worst case to find both the maximum and minimum of n numbers. (Hint: Consider how many numbers are potentially either the maximum or minimum, and investigate how a comparison affects these counts.)
10.2 Selection in expected linear time
The general selection problem appears more difficult than the simple problem of finding a minimum, yet, surprisingly, the asymptotic running time for both problems is the same: (n). In this section, we present a divide-and-conquer algorithm for the selection problem. The algorithm RANDOMIZED-SELECT is modeled after the quicksort algorithm of Chapter 8. As in quicksort, the idea is to partition the input array recursively. But unlike quicksort, which recursively processes both sides of the partition, RANDOMIZED-SELECT only works on one side of the partition. This difference shows up in the analysis: whereas quicksort has an expected running time of (n lg n), the expected time of RANDOMIZED-SELECT is (n).
RANDOMIZED-SELECT uses the procedure RANDOMIZED-PARTITION introduced in Section 8.3. Thus, like RANDOMIZED-QUICKSORT, it is a randomized algorithm, since its behavior is determined in part by the output of a random-number generator. The following code for RANDOMIZED-SELECT returns the ith smallest element of the array A[p . . r].
RANDOMIZED-SELECT(A, p, r, i)
1 if p = r
2 then return A[p]
3 q RANDOMIZED-PARTITION(A, p, r)
4 k q – p + 1
5 if i k
6 then return RANDOMIZED-SELECT(A, p, q, i)
7 else return RANDOMIZED-SELECT(A, q + 1, r, i – k)
After RANDOMIZED-PARTITION is executed in line 3 of the algorithm, the array A[p .. r] is partitioned into two nonempty subarrays A[p .. q] and A[q + 1 .. r] such that each element of A[p .. q] is less than each element of A[q + 1 .. r]. Line 4 of the algorithm computes the number k of elements in the subarray A[p .. q]. The algorithm now determines in which of the two subarrays A[p .. q] and A[q + 1 .. r] the ith smallest element lies. If i k, then the desired element lies on the low side of the partition, and it is recursively selected from the subarray in line 6. If i > k, however, then the desired element lies on the high side of the partition. Since we already know k values that are smaller than the ith smallest element of A[p .. r]–namely, the elements of A[p .. q]–the desired element is the (i – k)th smallest element of A[q + 1 .. r], which is found recursively in line 7.
The worst-case running time for RANDOMIZED-SELECT is (n2), even to find the minimum, because we could be extremely unlucky and always partition around the largest remaining element. The algorithm works well in the average case, though, and because it is randomized, no particular input elicits the worst-case behavior.
We can obtain an upper bound T(n) on the expected time required by RANDOMIZED-SELECT on an input array of n elements as follows. We observed in Section 8.4 that the algorithm RANDOMIZED-PARTITION produces a partition whose low side has 1 element with probability 2/n and i elements with probability 1/n for i = 2, 3, . . . , n – 1. Assuming that T(n) is monotonically increasing, in the worst case RANDOMIZED-SELECT is always unlucky in that the ith element is determined to be on the larger side of the partition. Thus, we get the recurrence
The second line follows from the first since max(1, n – 1 ) = n – 1 and
If n is odd, each term T( n/2 ), T( n/2 + 1), . . . ,T(n – 1) appears twice in the summation, and if n is even, each term T( n/2 + 1), T( n/2 + 2), . . . , T(n – 1) appears twice and the term T( n/2 ) appears once. In either case, the summation of the first line is bounded from above by the summation of the second line. The third line follows from the second since in the worst case T(n – 1) = O(n2), and thus the term can be absorbed by the term O(n).
We solve the recurrence by substitution. Assume that T(n) cn for some constant c that satisfies the initial conditions of the recurrence. Using this inductive hypothesis, we have
since we can pick c large enough so that c(n/4 + 1/2) dominates the O(n) term.
Thus, any order statistic, and in particular the median, can be determined on average in linear time.
Write an iterative version of RANDOMIZED-SELECT.
Suppose we use RANDOMIZED-SELECT to select the minimum element of the array A = 3, 2, 9, 0, 7, 5, 4, 8, 6, 1 . Describe a sequence of partitions that results in a worst-case performance of RANDOMIZED-SELECT.
Recall that in the presence of equal elements, the RANDOMIZED-PARTITION procedure partitions the subarray A[p . . r] into two nonempty subarrays A[p . . q] and A[q + 1 . . r] such that each element in A[p . . q] is less than or equal to every element in A[q + 1 . . r]. If equal elements are present, does the RANDOMIZED-SELECTprocedure work correctly?
10.3 Selection in worst-case linear time
We now examine a selection algorithm whose running time is O(n) in the worst case. Like RANDOMIZED-SELECT, the algorithm SELECT finds the desired element by recursively partitioning the input array. The idea behind the algorithm, however, is to guarantee a good split when the array is partitioned. SELECT uses the deterministic partitioning algorithm PARTITION from quicksort (see Section 8.1), modified to take the element to partition around as an input parameter.
Figure 10.1 Analysis of the algorithm SELECT. The n elements are represented by small circles, and each group occupies a column. The medians of the groups are whitened, and the median-of-medians x is labeled. Arrows are drawn from larger elements to smaller, from which it can be seen that 3 out of every group of 5 elements to the right of x are greater than x, and 3 out of every group of 5 elements to the left of x are less than x. The elements greater than x are shown on a shaded background.
The SELECT algorithm determines the ith smallest of an input array of n elements by executing the following steps.
1. Divide the n elements of the input array into n/5 groups of 5 elements each and at most one group made up of the remaining n mod 5 elements.
2. Find the median of each of the n/5 groups by insertion sorting the elements of each group (of which there are 5 at most) and taking its middle element. (If the group has an even number of elements, take the larger of the two medians.)
3. use SELECT recursively to find the median x of the n/5 medians found in step 2.
4. Partition the input array around the median-of-medians x using a modified version of PARTITION. Let k be the number of elements on the low side of the partition, so that n – k is the number of elements on the high side.
5. use SELECT recursively to find the ith smallest element on the low side if i k, or the (i – k)th smallest element on the high side if i > k.
To analyze the running time of SELECT, we first determine a lower bound on the number of elements that are greater than the partitioning element x. Figure 10.1 is helpful in visualizing this bookkeeping. At least half of the medians found in step 2 are greater than or equal to the median-of-medians x. Thus, at least half of the n/5 groups contribute 3 elements that are greater than x, except for the one group that has fewer than 5 elements if 5 does not divide n exactly, and the one group containing x itself. Discounting these two groups, it follows that the number of elements greater than x is at least
Similarly, the number of elements that are less than x is at least 3n/10 – 6. Thus, in the worst case, SELECT is called recursively on at most 7n/10 + 6 elements in step 5.
We can now develop a recurrence for the worst-case running time T(n) of the algorithm SELECT. Steps 1, 2, and 4 take O(n) time. (Step 2 consists of O(n) calls of insertion sort on sets of size O(1).) Step 3 takes time T( n/5 ), and step 5 takes time at most T(7n / 10 + 6), assuming that T is monotonically increasing. Note that 7n / 10 + 6 < n for n > 20 and that any input of 80 or fewer elements requires O(1) time. We can therefore obtain the recurrence
We show that the running time is linear by substitution. Assume that T(n) cn for some constant c and all n 80. Substituting this inductive hypothesis into the right-hand side of the recurrence yields
T(n) c n/5 + c(7n/10 + 6) + O(n)
cn/5 + c + 7cn/10 + 6c + O(n)
9cn/10 + 7c + O(n)
since we can pick c large enough so that c(n / 10 – 7) is larger than the function described by the O(n) term for all n > 80. The worst-case running time of SELECT is therefore linear.
As in a comparison sort (see Section 9.1), SELECT and RANDOMIZED-SELECT determine information about the relative order of elements only by comparing elements. Thus, the linear-time behavior is not a result of assumptions about the input, as was the case for the sorting algorithms in Chapter 9. Sorting requires (n 1g n) time in the comparison model, even on average (see Problem 9-1), and thus the method of sorting and indexing presented in the introduction to this chapter is asymptotically inefficient.