This is a post about Big-O notation and measuring algorithmic complexity; topics usually taught to computer science undergraduates in their second to fourth semester. It’s intended for curious people outside the field, or new students. There are many posts on this subject, but this one is mine.
In computer science we often care about whether an algorithm is an efficient solution to a problem, or whether one algorithm is more efficient than another approach. One might be tempted to measure efficiency in terms of microseconds it takes a process to run, or perhaps number of assembly instructions needed. However, these metrics will vary widely depending on what language an algorithm is implemented in, what hardware it’s run on, what other software is running on the system competing for resources, and a host of other factors. We’d prefer to think more abstractly, and compare one strategy to another rather than their implementations. In particular, computer scientists often examine how an algorithm scales, or how quickly it slows down as inputs grow very large.
Let’s start with a trivial example: given a list of numbers, return their sum. Looks something like:
def sum(list): total = 0 for item in list total += item end return total end
Since we need to read the entire list, this algorithm scales linearly with the length of the list - make the list a hundred times longer, and it will take roughly a hundred times longer to get a sum. We write this formally as
O(n), meaning “scales linearly with
n, the size of the input.” We call this formal syntax “Big O notation,” where the ‘O’ stands for “order of approximation” (or in the original German, “Ordnung”).
Not all algorithms scale. If we were asked “return the third element in the list” then it wouldn’t matter whether the list is three elements long or three million elements long, we can get to the third element in a constant amount of time. This is written as
O(1), indicating no reliance on the input size.
Search algorithms give us our first example problem with divergent solutions. Given a stack of papers with names on them, tell me whether “Rohan” is in the stack. A trivial solution might look like:
def hasName(list) for name in list if name == "Rohan" return true end end return false end
This scales linearly with the length of the list, just like summing the elements. If the list is in an unknown order then we have no choice but to examine every element. However, if we know the list is in alphabetical order then we can do better. Start in the middle of the list - if the name is Rohan, we’re done. If we’re after Rohan alphabetically, then discard the second half of the list, and repeat on the first half. If we’re before Rohan alphabetically, then discard the first half of the list and repeat on the second. If we exhaust the list, then Rohan’s not in it. This approach is called a binary search, and visually looks like:
In code, a binary search looks something like:
def hasName(list) if( list.length == 0 ) return false end middle = list.length / 2 if( list[middle] == "Rohan" ) return true elsif( list[middle] > "Rohan" ) # Search left half return hasName(list.first(middle)) else # Search right half return hasName(list[middle .. list.length] end end
With every step in the algorithm we discard half the list, so we look at far fewer than all the elements. Our binary search still gets slower as the input list grows longer - if we double the length of the list we need one extra search step - so the algorithm scales logarithmically rather than linearly, denoted
We’ll end this section by looking at two sorting algorithms: insertion sort, and merge sort.
We want to sort a list, provided to us in random order. One simple approach is to build a new sorted list: one at a time, we take elements from the front of the main list, and find their correct position among the sorted list we’ve built so far. To find the correct position we just look at the value left of our new element, and check whether they should be swapped or not. Keep swapping left until the new element finds its correct position. This visually looks like:
One implementation might look like:
def insertionSort(list) for i in 0.upto(list.length-1) for j in (i-1).downto(0) if( list[j] > list[j+1] ) list[j], list[j+1] = list[j+1], list[j] else break # Done swapping, found the right spot! end end end return list end
Insertion sort is simple and easy to implement. If you were coming up with a sorting algorithm on the spot for something like sorting a deck of cards, you might invent something similar. So what’s the runtime?
In insertion sort, we walk the list from start to end, which is
O(n). For every new element we examine, however, we walk the list backwards from our current position to the start. This operation also scales linearly with the length of the list, and so is also
O(n). If we perform a backwards
O(n) walk for every step of the forwards
O(n) walk, that’s
O(n) * O(n) for a total of
O(n^2). Can we do better?
An alternative approach to sorting is to think of it as a divide-and-conquer problem. Split the list in half, and hand the first half to one underling and the second half to another underling, and instruct them each to sort their lists. Each underling does the same, splitting their lists in half and handing them to two further underlings. Eventually, an underling receives a list of length one, which is by definition already sorted. This splitting stage looks something like:
Now we want to merge our results upwards. Each underling hands their sorted list back up to their superiors, who now have two sorted sub-lists. The superior combines the two sorted lists by first making a new empty “merged” list that’s twice as long. For every position in the merged list, the superior compares the top element of each sorted sub-list, and moves the lower element to the merged list. This process looks like:
Once all elements from the two sub-lists have been combined into a merged list, the superior hands their newly sorted list upwards to their superior. We continue this process until we reach the top of the tree, at which point our work is done. This merge step looks like:
In code, the full algorithm might look something like:
# Combine two sorted lists def merge(left, right) merged =  while( left.length + right.length > 0 ) if( left.length == 0 ) # Left empty, take from right merged += right.shift(1) elsif( right.length == 0 ) # Right empty, take from left merged += left.shift(1) elsif( left < right ) # Top of left stack is less, take it merged += left.shift(1) else # Top of right stack is less, take it merged += right.shift(1) end end return merged end # Takes a single list, sub-divides it, sorts results def mergeSort(list) if( list.length <= 1 ) return list # Sorted already :) end middle = list.length / 2 left = list[0 .. middle-1] right = list[middle .. list.length-1] leftSorted = mergeSort(left) rightSorted = mergeSort(right) return merge(leftSorted, rightSorted) end
So what’s the runtime of merge sort? Well it takes
log n steps to divide the list in half down to one element. We do this division process for every element in the list. That gives us a runtime of
n * log n to break the list apart and create the full tree diagram.
Merging two sorted lists together scales linearly with the size of the lists, so the merge step is
O(n). We need to perform a merge each time we move up a “level” of the tree, and there are
log n levels to this tree. Therefore, the full merge process also scales with
O(n log n).
This gives us a total runtime of
O(n log n + n log n) or
O(2n log n) to create the tree and merge it back together. However, because we are concerned with how algorithms scale as the inputs become very large, we drop constants and all expressions but the dominant term - multiplying by 2 doesn’t mean much as
n approaches infinity - and simplify the run time to
O(n log n). That’s a lot better than insertion sort’s
Big O notation typically describes an “average” or “expected” performance and not a “best case” or “worst-case”. For example, if a list is in a thoroughly random order, then insertion sort will have a performance of
O(n^2). However, if the list is already sorted, or only one or two elements are out of place, then insertion sort’s best-case performance is
O(n). That is, insertion sort will walk the list forwards, and if no elements are out of place, there will be no need to walk the list backwards to find a new position for any elements. By contrast, merge sort will always split the list into a tree and merge the branches back together, so even when handed a completely sorted list, merge sort’s best-case performance is still
O(n log n).
Big O notation also does not describe memory complexity. The description of merge sort above creates a temporary
merged list during the merge step, meaning however long the input list is, merge sort needs at least twice as much memory space for its overhead. By contrast, insertion sort works “in place,” sorting the input list without creating a second list as a workspace. Many algorithms make a trade-off between time and space in this way.
Finally, Big O notation describes how an algorithm scales as
n gets very large. For small values of
n, insertion sort may outperform merge sort, because merge sort has some extra bookkeeping to allocate temporary space for merging and coordinate which minions are sorting which parts of the list.
In summary, Big O notation is a valuable tool for quickly comparing two algorithms, and can provide programmers with easy estimates as to which parts of a problem will be the most time-consuming. However, Big O notation is not the only metric that matters, and should not be treated as such.
All of the algorithms described above can be run in polynomial time. This means their scaling rate, or Big O value, can be upper-bounded by a polynomial of the form
O(n^k). For example, while merge sort scales with
O(n log n), and logarithms are not polynomials,
n log n is strictly less than
n^2, so merge sort is considered to run in polynomial time. By contrast, algorithms with runtimes like
O(n!) are not bounded by a polynomial, and perform abysmally slowly as
n grows large.
These definitions allow us to describe categories of algorithms. We describe algorithms that run in polynomial time as part of set P, and we typically describe P as a subset of NP - the algorithms where we can verify whether a solution is correct in polynomial time.
To illustrate the difference between running and verifying an algorithm, consider the graph coloring problem: given a particular map, and a set of three or more colors, can you color all the countries so that no two bordering countries have the same color? The known algorithms for this problem are tedious. Brute forcing all possible colorings scales with
O(k^n) for k-colors and n-countries, and the fastest known general algorithms run in
O(n * 2^n). However, given a colored-in map, it’s easy to look at each country and its neighbors and verify that none violate the coloring rules. At worst, verifying takes
O(n^2) time if all countries border most others, but more realistically
O(n) if we assume that each country only borders a small number of neighbors rather than a significant fraction of all countries.
Next, we have NP-Hard: these are the set of problems at least as hard as the most computationally intensive NP problems, but maybe harder - some NP-Hard problems cannot even have their solutions verified in polynomial time. When we describe a problem as NP-Hard we are often referring to this last property, even though the most challenging NP problems are also NP-Hard.
One example of an NP-Hard problem without polynomial verification is the Traveling Salesman: given a list of cities and distances between cities, find the shortest path that travels through every city exactly once, ending with a return to the original city. Trying all paths through cities scales with
O(n!). More clever dynamic programming solutions improve this to
O(n^2 2^n). But if someone claims to have run a traveling salesman algorithm, and hands you a path, how do you know it’s the shortest possible path? The only way to be certain is to solve the traveling salesman problem yourself, and determine whether your solution has the same length as the provided answer.
Finally, we have NP-Complete. These are the most challenging problems in NP, meaning:
Solutions to these algorithms can be verified in polynomial time
There is no known polynomial-time solution to these algorithms
Any problem in NP can be translated into an input to an NP-Complete problem in polynomial time, and the result of the NP-Complete algorithm can be translated back, again in polynomial time
Here’s a visualization of these problem classes:
Broad consensus in computer science is that the NP problem space is larger than the P problem space. That is, there are some problems that cannot be solved in polynomial time, but can be verified in polynomial time. However, no one has been able to definitively prove this, in large part because making formal arguments about such abstract questions is exceedingly difficult. There are many problems we do not know how to solve in polynomial time, but how do we prove there isn’t a faster, more clever solution that we haven’t thought of?
Therefore, a minority of computer scientists hold that P = NP, or in other words, all problems that can be verified in polynomial time can also be solved in polynomial time. This would make our set of problem classes look more like:
To prove that P equals NP, all someone would need to do is find a polynomial-time solution to any NP-Complete problem. Since we know all NP problems can be translated back and forth to NP-Complete problems in polynomial time, a fast solution to any of these most challenging problems would be a fast solution to every poly-verifiable algorithm. No such solution has been found.