BOUNDS OF SORTING ALGORITHMS
A Project Report Submitted for the Course
MA698 Project I by Himadri Nayak (Roll No.:07212316)
to the DEPARTMENT OF MATHEMATICS INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI GUWAHATI - 781039, INDIA November 2008
CERTIFICATE This is to certify that the work contained in this report entitled “Bounds of Sorting Algorithms” submitted by Himadri Nayak (Roll No:07212316) to Indian Institute of Technology Guwahati towards the requirement of the course MA698 Project I has been carried out by him under my supervision.
Guwahati - 781 039
(Dr. Kalpesh Kapoor)
November 2008
Project Supervisor
ABSTRACT In the first part of this work we studied various comparison sort algorithms. We then focused on comparison trees, with whose help we could determine the lower bound of any comparison sort. In the next part we looked into two problems. With experimental data we surveyed the relation between the length of a sequence and the lengths of its sorted subsequences for which a small variation of merge sort gives a satisfactory running time. In the other problem we proved that the 'log' factor cannot be removed from the lower bound of the complexity.
Contents

1 Literature Survey
  1.1 Analysis of algorithms
      1.1.1 Primitive operations
      1.1.2 Asymptotic notation
  1.2 Some comparison sort algorithms
      1.2.1 Bubblesort
      1.2.2 Insertion sort
      1.2.3 Merge Sort
  1.3 Comparison tree
  1.4 Lower bound of any comparison sort

2 Our Work
  2.1 Introduction
  2.2 Problem Statement
  2.3 A look into the problems
      2.3.1 Some experiments with Problem 1
      2.3.2 Some questions on Problem 2
  2.4 Future work
List of Figures

1.1 How Merge-Sort Works
1.2 If the sequence is, say, {7, 15, 4}, then it follows the highlighted path of this tree
2.1 How Problem 1 behaves in the case of merge sort
2.2 Graph of merge and mymerge at X = 1.5
2.3 Graph of merge and mymerge at X = 1.2
2.4 Graph of merge and mymerge for X = 1.2 and comparison with other curves
List of Tables

1.1 How Bubble Sort Works
1.2 How Insertion Sort Works
2.1 Experimental data
Chapter 1 Literature Survey

Introduction

Nature tends towards disorder; we humans like everything to be in order. As social animals we cannot deny that keeping everything in order brings advantages. But who will do it? The work is often laborious, so after the invention of computers we handed this labor to them. Time, however, is the main issue: ultimately it is how we sort, rather than who sorts, that matters. Now, no matter how we sort things, we have to compare some common aspect of the things, and it is natural to associate those aspects with numbers. So every sorting problem, at the end of the day, boils down to sorting integers, or rather sorting a finite sequence of natural numbers. Some sorting algorithms, like radix sort, counting sort, and bucket sort, presume something about the input: the assumption may concern the bounds of the given integers or the distribution from which the integers are drawn. In this report we are not going to discuss them. We will deal only with those sorting algorithms which presume nothing about the input integers and make use only of the natural order between integers. This way of sorting is called COMPARISON SORT. There are many sorting algorithms of this genre, e.g. bubble sort, insertion sort, heap sort, merge sort.
1.1
Analysis of algorithms
There are two ways to analyze an algorithm: examine how much time it takes, or look at the space it requires to execute. Here we will discuss only the first.
1.1.1
Primitive operations
Without performing any experiments on a particular algorithm, one may analyze it by calculating the time required by the hardware for particular operations, counting how many of those operations the algorithm can execute at most, and multiplying the two. Though this process gives an accurate result, it is very complicated. So, instead, we perform our analysis directly on a high-level language or pseudo-code. We define a set of high-level primitive operations that are largely independent of the programming language used and can also be identified in the pseudo-code. Primitive operations include the following:

• Assigning a value to a variable
• Calling a method
• Performing an arithmetic operation
• Comparing two numbers
• Indexing into an array
• Following an object reference
• Returning from a method

A primitive operation corresponds to a low-level instruction with an execution time that depends on the hardware and software environment but is constant. Instead of trying to determine the specific execution time of each primitive operation, we will simply count how many primitive operations are executed, and use this number t as a high-level estimate of the running time of the algorithm.
1.1.2
Asymptotic notation
In general each step in pseudo-code, and each statement in a high-level language implementation, corresponds to a small number of primitive operations that does not depend on the input size. Thus we can perform a simplified analysis that estimates the number of primitive operations executed up to a constant factor, by counting the steps of the pseudo-code or the statements of the high-level language executed. The notations we use to describe the asymptotic running time of an algorithm are defined in terms of functions whose domains are the set of natural numbers N = {0, 1, 2, ...}. Such notations are convenient for describing the worst-case running-time function T(n), which is usually defined only on integer input sizes.

1. BIG 'OH': T(n) = O(f(n)) if there are constants c and n0 such that T(n) ≤ c f(n) when n ≥ n0.
2. OMEGA: T(n) = Ω(g(n)) if there are constants c and n0 such that T(n) ≥ c g(n) when n ≥ n0.
3. THETA: T(n) = Θ(h(n)) if and only if T(n) = O(h(n)) and T(n) = Ω(h(n)).
4. SMALL 'OH': T(n) = o(p(n)) if T(n) = O(p(n)) and T(n) ≠ Θ(p(n)).
1.2
Some comparison sort algorithms
Among the many comparison sorts we will discuss only bubble sort, insertion sort, and merge sort.
1.2.1
Bubblesort
The algorithm
In each pass this algorithm runs over the given array, one index fewer in every pass. Each pass bubbles the greatest element of the portion over which it runs out to the end of that portion (assuming, of course, that the motivation is to sort in ascending order).

Table 1.1: How Bubble Sort Works

            Array                     No. of comparisons
Original    34   8  64  51  32  21
Step 1       8  34  51  32  21  64    5
Step 2       8  34  32  21  51  64    4
Step 3       8  32  21  34  51  64    3
Step 4       8  21  32  34  51  64    2
Step 5       8  21  32  34  51  64    1
Algorithm 1. Bubble sort
BubbleSort(A){
    n=length(A);
    for i=1 to n-1{
        for j=n downto i+1{
            if A[j] < A[j-1]{
                Swap(A,j,j-1);  /* this subroutine swaps the two elements
                                   of the array in the j-th and (j-1)-th
                                   positions */
            }
        }
    }
}

Analysis of bubble sort
As the algorithm has nested 'for' loops and the second loop depends on the first one, the number of comparisons required is (n − 1) + (n − 2) + (n − 3) + ... + 3 + 2 + 1 = O(n^2).
1.2.2
Insertion sort
The algorithm
One of the simplest sorting algorithms is the insertion sort. Insertion sort consists of n − 1 passes. For pass p = 2 through n, insertion sort ensures that the elements in positions 1 through p are in sorted order. Insertion sort makes use of the fact that elements in positions 1 through p − 1 are already known to be in sorted order.

Table 1.2: How Insertion Sort Works

              Array                     Positions moved
Original      34   8  64  51  32  21
After p = 2    8  34  64  51  32  21    1
After p = 3    8  34  64  51  32  21    0
After p = 4    8  34  51  64  32  21    1
After p = 5    8  32  34  51  64  21    3
After p = 6    8  21  32  34  51  64    4
Algorithm 2. Insertion sort
InsertionSort(A){
    for j=2 to length(A){
        key=A[j];
        i=j-1;
        while i>0 and A[i]>key{
            A[i+1]=A[i];
            i=i-1;
        }
        A[i+1]=key;
    }
}

Analysis of Insertion sort
Because of the nested loops, each of which can take n iterations (where n is the length of the array A), insertion sort is O(n^2). Furthermore, this bound is tight, because input in reverse order actually achieves it. A precise calculation shows that the test in the while loop can be executed at most p times for each value of p. Summing over all p gives a total of

Σ_{p=2}^{n} p = 2 + 3 + 4 + ... + n = Θ(n^2)

operations. On the other hand, if the input is pre-sorted, the running time is O(n), because the test in the inner loop always fails immediately. Indeed, if the input is almost sorted, insertion sort will run quickly. Because of this wide variation, it is worth analyzing the average-case behavior of this algorithm. It turns out that the average case is Θ(n^2) for insertion sort.
1.2.3
Merge Sort
Algorithm
Merge sort can be described in a simple and compact way using recursion. Its algorithm is based on the divide-and-conquer method, which is very powerful when the sub-problems of the main problem do not overlap. We can visualize an execution of the merge-sort algorithm through a binary tree T, often called the merge-sort tree. Each node of T represents a recursive call of the merge-sort algorithm. Associate with each node v of T the sequence S that is processed by the call associated with v. The children of node v are associated with the recursive calls that process the subsequences S1 and S2 of S. The external nodes are associated with individual elements of S, corresponding to instances of the algorithm that make no further recursive calls.
Figure 1.1: How Merge-Sort Works
Algorithm 3. Merge function and Merge sort
Merge(A,p,q,r){
    n1=q-p+1;
    n2=r-q;
    Create arrays L and R of sizes n1+1 and n2+1;
    for i=1 to n1{
        L[i]=A[p+i-1];
    }
    for j=1 to n2{
        R[j]=A[q+j];
    }
    L[n1+1]=R[n2+1]=INFINITY;
    i=j=1;
    for k=p to r{
        if L[i]<=R[j]{
            A[k]=L[i];
            i=i+1;
        }
        else{
            A[k]=R[j];
            j=j+1;
        }
    }
}
Merge-Sort(A,p,r){
    if p<r{
        q=floor((p+r)/2);
        Merge-Sort(A,p,q);
        Merge-Sort(A,q+1,r);
        Merge(A,p,q,r);
    }
}
1.3
Comparison tree
In a comparison sort, only comparisons between the elements of the sequence are used to gather information about the ordering of the sequence. If we have two numbers, two types of comparison are enough to determine their order; without loss of generality assume that they are ≤ and >. In the case of insertion sort, the comparisons start from the first two elements of the sequence. After each comparison the relative order of the two participating numbers is determined, but to determine the exact position of the numbers in a sorted sequence, the relative order of every ordered pair must be known. Now if we consider the sequence {a1, a2, a3, ..., an}, the sorted sequence will be {aπ(1), aπ(2), aπ(3), ..., aπ(n)}, where π is a permutation on the set {1, 2, 3, ..., n}. Let us make a tree whose nodes are of the form p:q, determining the ordering between ap and aq. If ap ≤ aq, the comparison proceeds through the left child, and otherwise through the right child. If in this procedure we reach a stage where the relative orders of all ordered pairs are known, then we have reached a leaf node of the tree, which is of the form [π(1)π(2)π(3)...π(n)]. Let us denote it by π(1, 2, 3, ..., n). So the set of all leaf nodes can be expressed as {π(1, 2, 3, ..., n) | π ∈ P, the set of all permutations on {1, 2, 3, ..., n}}. Starting from any sequence at the root of the tree, we reach a leaf node and hence the sorted sequence. Now any comparison sort is of this type. Depending on the algorithm, the structure of the tree changes, and the worst-case complexity of the algorithm is reflected by the height of the tree. So the best algorithm is the one reflected by a complete binary tree.
Figure 1.2: If the sequence is, say, {7, 15, 4}, then it follows the highlighted path of this tree
1.4
Lower bound of any comparison sort
We have already seen that the best possible algorithm which deals only with comparisons will basically be a complete (or nearly complete) binary tree. Its complexity is the height of the tree, say h. Now, there are n! possible permutations of a sequence of length n, and the number of possible leaf nodes in the tree is at most 2^h. So,

2^h ≥ n!  ⇒  h ≥ log2(n!)

Now, n! = n(n−1)(n−2)···(n/2)(n/2 − 1)(n/2 − 2)···(3)(2)(1), and

(n/2)^(n/2) = (n/2)(n/2)···(n/2)   (n/2 times)

So clearly n! > (n/2)^(n/2) for all n, hence

log2(n!) > (n/2) log2(n/2)  ⇒  log2(n!) = Ω(n log(n/2))  ⇒  h = Ω(n log(n/2))

Hence we can say that no comparison sort algorithm can give us better complexity than n log n (from now onwards we use log n for log2 n).
Chapter 2 Our Work

2.1 Introduction

It was proved in the previous chapter that we cannot have better complexity than O(n log n) in any comparison sort that uses only comparisons and no other extra information about the sequence. But if some other information is given about the sequence then, who knows, we may get something very interesting. On this note we will state two problems.
2.2
Problem Statement
Problem 1. If it is given that the whole sequence of length n is made of several sorted subsequences of length k each, then can we remove the log factor from the complexity of sorting this sequence, if we are allowed to use only comparisons between elements of that sequence?

Problem 2. If it is given that the whole sequence of length n is made of several subsequences of equal length k, such that every element of a subsequence is less than every element of the next subsequence, then can we remove the log factor from the lower bound of the complexity of sorting this sequence, if we are allowed to use only comparisons between elements of that sequence?
2.3
A look into the problems
2.3.1
Some experiments with Problem 1
Scheme
Among the comparison sort algorithms attaining the lower-bound complexity (i.e. n log n), only merge sort uses the information that a particular part of the sequence is already sorted. In a simple merge sort the merge-sort function is called recursively until single elements are reached, and then the merge function starts merging the blocks from the bottom and ultimately sorts the sequence.
Figure 2.1: How Problem 1 behaves in the case of merge sort
Figure 2.1 shows how merge sort works. In the usual merge sort, shown in Figure 2.1, the merge operation has to start by taking two single elements of the sequence. The assumption at every stage of merging is that it takes two sorted subsequences; in the normal case this assumption only holds when the length of the subsequences is 1. But if it is given that the whole sequence is made of k-length sorted subsequences, then the merge operation can start from some higher level. In the third diagram of Figure 2.1 the given information is that the whole sequence is made of 4 sorted subsequences of length 4 each; that is why the merge operation can start 2 levels higher than normal.

Analysis
Let the sequence have length n, made of m subsequences of length k each, so n = mk. We know the merge operation is O(2n) on two n-length sorted sequences. So if we start by taking k-length subsequences two at a time, the total number of steps required for sorting is:

(m/2)O(2k) + (m/4)O(4k) + (m/8)O(8k) + ... + (m/m)O(mk)   (assuming m is of the form 2^i)
= O(mk) + O(mk) + ... + O(mk)   [O(log m) times]
= O(mk log m) = O(n log(n/k))

So we see that we gain nothing extra in terms of the asymptotic bound.
How is the 'k' factor helping our cause?
Although the complexity is O(n log(n/k)), the factor k may still serve our purpose of reducing the number of steps meaningfully while sorting the sequence using merge sort. We will now observe some experimental data. For convenience we take n of the form 2^P and k of the form 2^Q, where Q < P, so that m = 2^(P−Q). The following C functions will help us understand the algorithm in the two cases:

Code 1. merge function
void merge(int *a, long int f, long int m, long int l, long int n, long int *cn)
{
    int b[n];
    long int i=f, j=l, k=f;
    /* copy the first half forwards and the second half backwards,
       so the combined run can be merged from both ends */
    while (i<=m) b[k++]=a[i++];
    while (j>m)  b[k++]=a[j--];
    i=f; j=l; k=f;
    while (i<=j){
        if (b[i]<=b[j]){
            a[k++]=b[i++];
            *cn=*cn+1;
        }
        else{
            a[k++]=b[j--];
            *cn=*cn+1;
        }
    }
}

Code 2. merge-sort
void mergesort(int *a, long int f, long int l, long int n, long int *cnt)
{
    *cnt=*cnt+1;
    if (f<l){
        long int m=(f+l)/2;
        mergesort(a,f,m,n,cnt);
        mergesort(a,m+1,l,n,cnt);
        merge(a,f,m,l,n,cnt);
    }
}

Code 3. mymerge-sort (the recursion stops once a block of length at most k is reached, since such blocks are given as already sorted; the stopping condition l-f+1>k is reconstructed here, as the original text is truncated at this point)
void mymergesort(int *a, long int f, long int l, long int n, long int *cnt, long int k)
{
    *cnt=*cnt+1;
    if (l-f+1>k){
        long int m=(f+l)/2;
        mymergesort(a,f,m,n,cnt,k);
        mymergesort(a,m+1,l,n,cnt,k);
        merge(a,f,m,l,n,cnt);
    }
}

We have tested these functions on randomly generated sequences of lengths n of the form 2^P, with k of the form 2^⌊P/X⌋, where X > 1 ensures that P/X < P. We collected the results for P = 1, 2, 3, ..., 19 and X = 1.5 and 1.2.
Table 2.1: Experimental data

P        n        merge    mymerge (X=1.5)   mymerge (X=1.2)
1        2            5           5                 5
2        4           15           7                 7
3        8           39          11                11
4       16           95          39                19
5       32          223          71                35
6       64          511         135                67
7      128         1151         399               263
8      256         2559         783               519
9      512         5631        1551              1031
10    1024        12287        4127              2055
11    2048        26623        8223              4103
12    4096        57343       16415              8199
13    8192       122879       41023             24591
14   16384       262143       81983             49167
15   32768       557055      163909             98319
16   65536      1179647      393343            196623
17  131072      2490367      786559            393231
18  262144      5242879     1572991           1048607
19  524288     11010047     3670271           2097183
Figure 2.2: Graph of merge and mymerge at X=1.5
Figure 2.3: Graph of merge and mymerge at X=1.2
What 'X' actually means
Consider a sequence of n numbers where n is of the form 2^P. We are assuming the sequence is made of several sorted subsequences of length 2^⌊P/X⌋. Loosely speaking, then, every 2^⌊P/X⌋/2^P part of the sequence is sorted. In reality no one will complain, or rather everyone will be more than satisfied, if in a sequence of length 2000 every subsequence of length 10, starting from the first element, is sorted. So we take the ratio 10/2000, which is 1/200. Now, if

2^⌊P/X⌋ / 2^P ≈ 1/200
⇒ 2^(P − ⌊P/X⌋) ≈ 200
⇒ P(1 − 1/X) ≈ log(200)
⇒ 1 − 1/X ≈ (1/P) log(200)
⇒ X ≈ 1 / (1 − (1/P) log(200))
⇒ X ≈ 1 / (1 − 8/P) = P/(P − 8)   (as log(200) ≈ 8)

So if X = 1.2 then P = 48, which is a very big number as n = 2^P. Now, if we put z in place of 8 then we get X = P/(P − z). If we want to keep X = 1.2 then P = 6z. Again, if we take 100 in place of 2000 as the length of the sequence and keep that 10 fixed, then z = log(100/10) = log 10 ≈ 3. So P will then be 6 × 3 = 18, which is a good measure.
Figure 2.4: Graph of merge and mymerge for X=1.2 and comparison with other curves
2.3.2
Some questions on Problem 2
At first glance the second problem has an immediate solution: just sort each part of the sequence, and then the whole sequence is sorted. In this way, if n = k^2 and the sequence is divided into k subsequences such that each element of a subsequence is less than every element of the next subsequence, then sorting each of the k subsequences needs O(k log k) operations, and hence sorting the whole sequence needs O(k^2 log k) operations. But the question is: is this the only way to sort it? We will now prove that whatever the way of sorting using only comparisons, we cannot remove the log factor.

Proposition 2.3.1. Consider the height h of the comparison tree in Problem 2. If we express h in terms of the length of the sequence, there will be a logarithm factor which cannot be removed.

Proof. Let the length of the sequence be n = k^2, where k is the length of the subsequences, each of whose elements is less than every element of the next k-length subsequence. If we consider the comparison tree of this sorting problem, the highest possible number of leaf nodes is (√n!)^√n. The reason is that the leaf nodes represent nothing but the possible permutations of the numbers given to be sorted. According to the given condition, elements within each subsequence may permute, but an element of a particular subsequence cannot be swapped with any element of any other subsequence to form a permutation. Since √n! permutations are possible for each of the √n subsequences, the total number of possible permutations is (√n!)^√n = (k!)^k.
Again, as h is the height of the comparison tree, the number of leaf nodes is at most 2^h. Hence

2^h ≥ (k!)^k
⇒ h ≥ k log(k!)
⇒ h = Ω(k^2 log k),   as log(k!) = Ω(k log k)
⇒ h = Ω(n log(√n)) = Ω((1/2) n log n) = Ω(n log n)
2.4
Future work
In the future we want to see whether there is any relation between Problems 1 and 2. We will try to prove whether the 'log' factor is indispensable or not. If we can do this, then we will have a significant result.
Bibliography
[1] Michael T. Goodrich and Roberto Tamassia. Data Structures and Algorithms in Java. Second Edition. Wiley-India, 2007.
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. Second Edition. Prentice Hall of India (EEE), 2007.