Sort
To sort is to reorder a sequence according to a comparison function. The function should define a total order: antisymmetric, i.e. `a < b` and `b < a` cannot both be true, and transitive, i.e. `a < b` and `b < c` should imply `a < c`.
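As a concrete example, here is a comparator in C that satisfies these rules; a small sketch of mine ( `cmp_int` and `sorted_check` are illustrative names, not from the code below ). It uses the three-way `(x > y) - (x < y)` form, which also avoids the overflow of the tempting `return x - y;`:

```c
#include <stdlib.h>

// Three-way comparator: negative, zero, or positive.
// (x > y) - (x < y) avoids the overflow that "return x - y;"
// suffers for large-magnitude ints.
static int cmp_int(const void* a, const void* b) {
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);
}

// Returns 1 if qsort with this comparator yields a sorted array.
int sorted_check(void) {
    int v[] = {3, -1, 2, 2, 0};
    qsort(v, 5, sizeof v[0], cmp_int);
    for (int i = 1; i < 5; i++)
        if (v[i - 1] > v[i]) return 0;
    return 1;
}
```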
Insertion sort
The simplest of all. The idea:
for i in 1..n:
    invariant: array[0..i-1] is sorted
    CHOICE 1:
    for (j = i; j > 0 && array[j-1] > array[j]; j--) swap(array[j], array[j-1])
    CHOICE 2: fewer writes, shift elements instead of swapping
    // put array[i] in its place in array[0..i]
    compare_with = array[i]
    for (j = i; j > 0 && array[j-1] > compare_with; j--):
        array[j] = array[j-1]
    array[j] = compare_with

static inline void insertion_sort(int* array, int start, int end) {
    for (int i = start + 1; i <= end; i++) {
        int key = array[i];
        int j = i;
        for (; j > start && array[j - 1] > key; j--)
            array[j] = array[j - 1];
        array[j] = key;
    }
}

Quicksort
The idea:
- pick a pivot
- partition so that all elements less than the pivot end up on the left side and all elements greater than it on the right side
- recursively sort the left and right partitions
The simplest implementation is:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 10000

int array[SIZE];

void iota(int* array, int size, int start) {
    for (int i = 0; i < size; i++) {
        array[i] = start + i;
    }
}

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

void quicksort_internal(int* array, int start, int end) {
    if (start >= end) return;
    int pivot = array[end];             // pick a pivot
    int at = start;                     // all the elements < pivot must be here
    for (int i = start; i < end; i++) { // note the < since last is pivot
        if (array[i] < pivot) {
            swap(&array[i], &array[at]);
            at++;
        }
    }
    swap(&array[at], &array[end]); // put the pivot in its place
    quicksort_internal(array, start, at - 1);
    quicksort_internal(array, at + 1, end);
}

void quicksort(int* array, int start, int end) { // end is exclusive to mimic the standard library
    // shuffle(array, end - start);
    quicksort_internal(array, start, end - 1);
}

void profile(int* array, int size) {
    double total_time = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        shuffle(array, size);
        clock_t start = clock();
        quicksort(array, 0, size);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
}

int main() {
    iota(array, SIZE, 0);
    profile(array, SIZE);
    return 0;
}

For 1e4, 10 runs, randomized input
Average: 0.703100 ms (randomized input, I shuffle prior to sorting)
Average: 147.962700 ms (sorted input)
This clearly shows the worst case with sorted inputs. Fixes include either shuffling beforehand or using better pivot selection. The theoretically best pivot is the median. But it takes O(n) to find the median, which again brings us to the best-case sorting bound of O(n log n).
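As an aside, expected-O(n) selection is possible with quickselect: the same partition step, but recursing only into the side that contains the target rank. A worst-case-O(n) algorithm ( median of medians ) also exists, though its constants make it unattractive here. A minimal sketch, with helper names of my own:

```c
static void swap_qs(int* a, int* b) { int t = *a; *a = *b; *b = t; }

// Returns the k-th smallest element (0-based) of a[0..n-1],
// partially reordering the array in place. Same Lomuto-style
// partition as the sort, but only one side is kept each round,
// giving expected O(n) instead of O(n log n).
int quickselect(int* a, int n, int k) {
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int pivot = a[hi], at = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) { swap_qs(&a[i], &a[at]); at++; }
        swap_qs(&a[at], &a[hi]); // pivot is now in its final place
        if (k == at) return a[at];
        if (k < at) hi = at - 1;
        else lo = at + 1;
    }
    return a[lo];
}
```

The median is then `quickselect(a, n, n / 2)`.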
Before that, let’s compare shuffling vs a random pivot vs Sedgewick’s median of 3.
1e5, 10 runs
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 100000

int array[SIZE];

void iota(int* array, int size, int start) {
    for (int i = 0; i < size; i++) {
        array[i] = start + i;
    }
}

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

static inline int pivot_random(int* array, int start, int end) {
    return start + rand() % (end - start + 1);
}

// this sorts the first, middle, and last elements and returns the mid element index
static inline int pivot_median_of_three(int* array, int start, int end) {
    int mid = start + (end - start) / 2;
    if (array[start] > array[mid]) swap(&array[start], &array[mid]);
    if (array[start] > array[end]) swap(&array[start], &array[end]);
    if (array[mid] > array[end]) swap(&array[mid], &array[end]);
    return mid;
}

void quicksort(int* array, int start, int end, int (*pivot_func)(int*, int, int)) {
    if (start >= end) return;
    int pivot_index = pivot_func(array, start, end);
    swap(&array[pivot_index], &array[end]); // move chosen pivot to end
    int pivot = array[end];                 // pick the pivot
    int at = start;                         // all the elements < pivot must be here
    for (int i = start; i < end; i++) {     // note the < since last is pivot
        if (array[i] < pivot) {
            swap(&array[i], &array[at]);
            at++;
        }
    }
    swap(&array[at], &array[end]); // put the pivot in its place
    quicksort(array, start, at - 1, pivot_func);
    quicksort(array, at + 1, end, pivot_func);
}

static inline int best_median_for_sorted(int* array, int start, int end) {
    return start + (end - start) / 2;
}

void profile(int* array, int size, int (*pivot_func)(int*, int, int)) {
    double total_time = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        quicksort(array, 0, size - 1, pivot_func);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
}

int main() {
    iota(array, SIZE, 0);
    profile(array, SIZE, pivot_median_of_three);
    profile(array, SIZE, pivot_random);
    profile(array, SIZE, best_median_for_sorted);
    return 0;
}

For sorted inputs, 1e5, 10 runs
Average: 101.515800 ms (shuffle before sort, shuffle time included)
Average: 3.869900 ms (best pivot, median, not practical since I use known information about the input)
Average: 4.089400 ms (median of three, note that this still picks the perfect pivot for sorted inputs)
Average: 6.338800 ms (random pivot)
For randomized inputs, 1e5, 10 runs
Average: 9.208900 ms (median of three)
Average: 9.693100 ms (random pivot)
Worst case can still be O(n²), but it's clear that median of three is a good choice.
Considering only shuffled inputs now, what's the stack depth? With the current setup, it can be O(n) in the worst case. But the following is one of Sedgewick's optimizations that keeps the stack depth to a guaranteed O(log n).
- use a `while` loop to convert the tail recursion into iteration
- only recurse on the smaller partition, and loop on the larger one
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define max(a, b) ((a) > (b) ? (a) : (b))
#define SIZE 100000

int max_stack_depth = 0;
int array[SIZE];

void iota(int* array, int size, int start) {
    for (int i = 0; i < size; i++) {
        array[i] = start + i;
    }
}

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

// this sorts the first, middle, and last elements and returns the mid element index
static inline int pivot_median_of_three(int* array, int start, int end) {
    int mid = start + (end - start) / 2;
    if (array[start] > array[mid]) swap(&array[start], &array[mid]);
    if (array[start] > array[end]) swap(&array[start], &array[end]);
    if (array[mid] > array[end]) swap(&array[mid], &array[end]);
    return mid;
}

void quicksort(int* array, int start, int end, int current_stack_depth) {
    if (start >= end) return;
    current_stack_depth++;
    max_stack_depth = max(max_stack_depth, current_stack_depth);
    int pivot_index = pivot_median_of_three(array, start, end);
    swap(&array[pivot_index], &array[end]); // move chosen pivot to end
    int pivot = array[end];                 // pick the pivot
    int at = start;                         // all the elements < pivot must be here
    for (int i = start; i < end; i++) {     // note the < since last is pivot
        if (array[i] < pivot) {
            swap(&array[i], &array[at]);
            at++;
        }
    }
    swap(&array[at], &array[end]); // put the pivot in its place
    quicksort(array, start, at - 1, current_stack_depth);
    quicksort(array, at + 1, end, current_stack_depth);
}

void quicksort_tail_optimized(int* array, int start, int end, int current_stack_depth) {
    current_stack_depth++;
    max_stack_depth = max(max_stack_depth, current_stack_depth);
    while (start < end) {
        int pivot_index = pivot_median_of_three(array, start, end);
        swap(&array[pivot_index], &array[end]); // move chosen pivot to end
        int pivot = array[end];                 // pick the pivot
        int at = start;                         // all the elements < pivot must be here
        for (int i = start; i < end; i++) {     // note the < since last is pivot
            if (array[i] < pivot) {
                swap(&array[i], &array[at]);
                at++;
            }
        }
        swap(&array[at], &array[end]); // put the pivot in its place
        if (at - 1 - start < end - (at + 1)) {
            quicksort_tail_optimized(array, start, at - 1, current_stack_depth);
            start = at + 1;
        } else {
            quicksort_tail_optimized(array, at + 1, end, current_stack_depth);
            end = at - 1;
        }
    }
}

void profile(int* array, int size, void (*sort_func)(int*, int, int, int)) {
    double total_time = 0.0;
    double total_depth = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        shuffle(array, size);
        max_stack_depth = 0;
        clock_t start = clock();
        sort_func(array, 0, size - 1, 0);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_depth += max_stack_depth;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
    printf("Max stack depth: %d\n", max_stack_depth);
    printf("Average stack depth: %f\n", total_depth / runs);
}

int main() {
    iota(array, SIZE, 0);
    // so that different runs show a different stack depth
    srand((unsigned int)time(NULL));
    printf("No tail optimization:\n");
    profile(array, SIZE, quicksort);
    printf("\nWith tail optimization:\n");
    profile(array, SIZE, quicksort_tail_optimized);
    return 0;
}

1e5, 10 runs, randomized input
No tail optimization:
Average: 8.977500 ms
Max stack depth: 30
Average stack depth: 29.600000
With tail optimization:
Average: 8.981900 ms
Max stack depth: 12
Average stack depth: 12.500000
The runtime is almost the same, but you save a lot on stack depth.
Note that there's a double-evaluation problem with that max macro, but it's not relevant here.
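For completeness, this is the hazard: a function-like macro can evaluate the winning argument twice, so side effects fire twice. A tiny demo ( `bump` is a made-up helper ):

```c
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int bump_calls = 0;
static int bump(void) { bump_calls++; return 7; }

// MAX(bump(), 3) expands to ((bump()) > (3) ? (bump()) : (3)):
// bump() runs once in the comparison and once more to produce
// the result, so its side effect happens twice.
int double_eval_count(void) {
    bump_calls = 0;
    int m = MAX(bump(), 3);
    (void)m;
    return bump_calls;
}
```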
I haven't discussed any improved partitioning scheme yet, but that can be improved as well. So far I've used Lomuto partitioning ( I'd rather call it the "obvious" partitioning ). There's Hoare partitioning, which is basically a two-pointer scheme, but we can do better.
3-way partitioning ( the Dutch National Flag scheme ) handles repeated elements even more efficiently: instead of tracking just less-than and greater-than, we exclude all elements equal to the pivot from the subsequent recursive calls.
I’m using the non-tail optimized version to compare.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 100000

int array[SIZE];

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

static inline int pivot_median_of_three(int* array, int start, int end) {
    int mid = start + (end - start) / 2;
    if (array[start] > array[mid]) swap(&array[start], &array[mid]);
    if (array[start] > array[end]) swap(&array[start], &array[end]);
    if (array[mid] > array[end]) swap(&array[mid], &array[end]);
    return mid;
}

void quicksort_lomuto(int* array, int start, int end) {
    if (start >= end) return;
    int pivot_index = pivot_median_of_three(array, start, end);
    swap(&array[pivot_index], &array[end]); // move pivot to the end
    int pivot = array[end];
    int at = start;                     // all the elements < pivot must be here
    for (int i = start; i < end; i++) { // note the < since last is pivot
        if (array[i] < pivot) {
            swap(&array[i], &array[at]);
            at++;
        }
    }
    swap(&array[at], &array[end]); // put the pivot in its place
    quicksort_lomuto(array, start, at - 1);
    quicksort_lomuto(array, at + 1, end);
}

void quicksort_hoare(int* array, int start, int end) {
    if (start >= end) return;
    int pivot_index = pivot_median_of_three(array, start, end);
    int pivot = array[pivot_index];
    int i = start;
    int j = end;
    while (1) {
        while (array[i] < pivot) i++;
        while (array[j] > pivot) j--;
        if (i >= j) break;
        swap(&array[i], &array[j]);
        i++, j--;
    }
    quicksort_hoare(array, start, j);
    quicksort_hoare(array, j + 1, end);
}

void quicksort_3_way(int* array, int start, int end) {
    if (start >= end) return;
    int pivot_index = pivot_median_of_three(array, start, end);
    int pivot = array[pivot_index];
    int lt = start; // array[start..lt-1] < pivot
    int i = start;  // array[lt..i-1] == pivot & current index
    int gt = end;   // array[gt+1..end] > pivot
    /*
        the idea is to grow three regions as each element is processed:
        - if it's less than the pivot, it moves into the < region
        - if it's equal to the pivot, it stays where it is
        - if it's greater than the pivot, it moves into the > region
        once i passes gt, no element < pivot can remain in the == or > regions
    */
    while (i <= gt) { // once i passes gt, we are done
        if (array[i] < pivot) {
            swap(&array[lt], &array[i]);
            lt++;
            i++;
        } else if (array[i] == pivot) {
            i++;
        } else { // array[i] > pivot
            swap(&array[i], &array[gt]);
            gt--;
        }
    }
    quicksort_3_way(array, start, lt - 1);
    quicksort_3_way(array, gt + 1, end);
}

void profile(int* array, int size, void (*sort_func)(int*, int, int)) {
    double total_time = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        shuffle(array, size);
        clock_t start = clock();
        sort_func(array, 0, size - 1);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
}

int main() {
    for (int i = 0; i < SIZE; i++) {
        array[i] = i % 10; // duplicates
    }
    srand(time(NULL));
    profile(array, SIZE, quicksort_lomuto);
    profile(array, SIZE, quicksort_hoare);
    profile(array, SIZE, quicksort_3_way);
    return 0;
}

The inefficiency of the obvious way is painfully clear when you have lots of duplicates. For 1e5, 10 runs, randomized input
Average: 382.232000 ms (Lomuto partitioning)
Average: 4.349600 ms (Hoare partitioning)
Average: 1.700500 ms (3-way partitioning)
Can we still do better? Practical implementations use a hybrid: quicksort that switches to insertion sort for small partitions, since insertion sort is faster on small inputs. Why?
- no overhead of recursion
- better cache locality
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000000

int array[SIZE];

void iota(int* arr, int size, int start) {
    for (int i = 0; i < size; i++) {
        arr[i] = start + i;
    }
}

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

static inline int pivot_median_of_three(int* array, int start, int end) {
    int mid = start + (end - start) / 2;
    if (array[start] > array[mid]) swap(&array[start], &array[mid]);
    if (array[start] > array[end]) swap(&array[start], &array[end]);
    if (array[mid] > array[end]) swap(&array[mid], &array[end]);
    return mid;
}

static inline void insertion_sort(int* array, int start, int end) {
    for (int i = start + 1; i <= end; i++) {
        int key = array[i];
        int j = i;
        while (j > start && array[j - 1] > key) {
            array[j] = array[j - 1];
            j--;
        }
        array[j] = key;
    }
}

void quicksort(int* array, int start, int end, int use_insertion_sort) {
    while (start < end) {
        if (use_insertion_sort && end - start <= 50) {
            insertion_sort(array, start, end);
            return;
        }
        int pivot_index = pivot_median_of_three(array, start, end);
        int pivot = array[pivot_index];
        int lt = start;   // array[start..lt-1] < pivot
        int i = start;    // array[lt..i-1] == pivot & current index
        int gt = end;     // array[gt+1..end] > pivot
        while (i <= gt) { // once i passes gt, we are done
            if (array[i] < pivot) {
                swap(&array[lt], &array[i]);
                lt++;
                i++;
            } else if (array[i] == pivot) {
                i++;
            } else { // array[i] > pivot
                swap(&array[i], &array[gt]);
                gt--;
            }
        }
        // Tail optimization: recurse on smaller partition, iterate on larger
        if (lt - 1 - start < end - (gt + 1)) {
            quicksort(array, start, lt - 1, use_insertion_sort);
            start = gt + 1;
        } else {
            quicksort(array, gt + 1, end, use_insertion_sort);
            end = lt - 1;
        }
    }
}

void profile(int* array, int size, void (*sort_func)(int*, int, int, int), int use_insertion_sort) {
    double total_time = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        shuffle(array, size);
        clock_t start = clock();
        sort_func(array, 0, size - 1, use_insertion_sort);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
}

int main() {
    iota(array, SIZE, 0);
    srand(time(NULL));
    printf("No insertion sort:\n");
    profile(array, SIZE, quicksort, 0);
    printf("With insertion sort:\n");
    profile(array, SIZE, quicksort, 1);
    return 0;
}

For 1e6 ( note that we jumped an order of magnitude here ), 10 runs, randomized input, no duplicates ( a lot of duplicates actually speed it up a lot and make the difference negligible )
No insertion sort:
Average: 133.825300 ms
With insertion sort:
Average: 118.120900 ms
I picked 50 as the threshold for switching to insertion sort simply by trial and error. What is clear, though, is the speedup.
So the final overall algorithm, compared with the standard library's qsort, is:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000000

int array[SIZE];

static inline void swap(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

void shuffle(int* array, int size) {
    for (int i = size - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        swap(&array[i], &array[j]);
    }
}

//---------------------- Quicksort ---------------------------
static inline int pivot_median_of_three(int* array, int start, int end) {
    int mid = start + (end - start) / 2;
    if (array[start] > array[mid]) swap(&array[start], &array[mid]);
    if (array[start] > array[end]) swap(&array[start], &array[end]);
    if (array[mid] > array[end]) swap(&array[mid], &array[end]);
    return mid;
}

static inline void insertion_sort(int* array, int start, int end) {
    for (int i = start + 1; i <= end; i++) {
        int key = array[i];
        int j = i;
        while (j > start && array[j - 1] > key) {
            array[j] = array[j - 1];
            j--;
        }
        array[j] = key;
    }
}

void quicksort(int* array, int start, int end) {
    while (start < end) {
        if (end - start <= 50) {
            insertion_sort(array, start, end);
            return;
        }
        int pivot_index = pivot_median_of_three(array, start, end);
        int pivot = array[pivot_index];
        int lt = start;   // array[start..lt-1] < pivot
        int i = start;    // array[lt..i-1] == pivot & current index
        int gt = end;     // array[gt+1..end] > pivot
        while (i <= gt) { // once i passes gt, we are done
            if (array[i] < pivot) {
                swap(&array[lt], &array[i]);
                lt++;
                i++;
            } else if (array[i] == pivot) {
                i++;
            } else { // array[i] > pivot
                swap(&array[i], &array[gt]);
                gt--;
            }
        }
        // Tail optimization: recurse on smaller partition, iterate on larger
        if (lt - 1 - start < end - (gt + 1)) {
            quicksort(array, start, lt - 1);
            start = gt + 1;
        } else {
            quicksort(array, gt + 1, end);
            end = lt - 1;
        }
    }
}
// ---------------------- End -----------------------------

static inline int compare(const void* a, const void* b) {
    int int_a = *(const int*)a;
    int int_b = *(const int*)b;
    return (int_a > int_b) - (int_a < int_b);
}

void sort_library(int* array, int start, int end) {
    int size = end - start + 1;
    qsort(array + start, size, sizeof(int), compare);
}

void profile(int* array, int size, void (*sort_func)(int*, int, int)) {
    double total_time = 0.0;
    int runs = 10;
    for (int r = 0; r < runs; r++) {
        shuffle(array, size);
        clock_t start = clock();
        sort_func(array, 0, size - 1);
        clock_t end = clock();
        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        total_time += elapsed;
    }
    printf("Average: %f ms\n", (total_time / runs) * 1000);
}

int main() {
    srand(time(NULL));
    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }
    printf("For shuffled unique elements:\n");
    printf("Current quicksort:\n");
    profile(array, SIZE, quicksort);
    printf("Library qsort:\n");
    profile(array, SIZE, sort_library);
    for (int i = 0; i < SIZE; i++) {
        array[i] = rand() % 10;
    }
    printf("\nFor shuffled repeated elements:\n");
    printf("Current quicksort:\n");
    profile(array, SIZE, quicksort);
    printf("Library qsort:\n");
    profile(array, SIZE, sort_library);
    return 0;
}

For 1e6 elements, 10 runs
For shuffled unique elements:
Current quicksort:
Average: 122.086900 ms
Library qsort:
Average: 119.518000 ms
For shuffled repeated elements:
Current quicksort:
Average: 16.478000 ms
Library qsort:
Average: 65.625500 ms
The custom implementation is on par for the unique case but is significantly faster for the repeated-elements case. Overall it's a good implementation, measured against what the library has to offer.
I also did a comparison with C++'s std::sort and the results were similar. I suspect both end up behaving similarly internally, on gcc at least.
Now, with all that said and done, what if I compare this monstrous and overly fancy implementation against the simplest implementation I started with? Well, for unique elements, shuffled, 1e6, 10 runs ( same setup as above )
Average: 92.686600 ms
So what happened?
Well, you can say that the fancy implementation carries overhead of this and that, which is true, or you can answer that the simple implementation simply falls apart on inputs with many duplicates, which is true as well.
But was it all worth it? Have I betrayed myself for nothing? Which one do you think a clueless interviewer, or an online judge whose test cases are not worth a dime, would prefer?
This is the curse of knowledge. Sure, what I know after going through all this lets me write a custom sort, one that exploits expected properties of the data and is perhaps ( emphasis on perhaps ) faster than a generic one. But in the end it's not worth it at all, and this holds true for most things in life; you can always draw parallels.
Just because something took more effort to learn or implement, doesn’t mean it’s better.