Approximation algorithms are quite possibly my favorite topic in software. I find them remarkable and beautiful.

Here’s the motivation: On large data, many kinds of exact answers are expensive to calculate. At the same time, there are many, many circumstances where imperfect answers are acceptable. For instance, when deciding whether we need to execute a task (say, because it’s the first time we’ve seen it), we don’t need to be exactly right – we’ll still see a significant speed-up from avoiding most duplicative work. When incoming data have already been sampled, we know that pursuing an “exact top 3” is a fool’s errand – our answer will never be anything but approximate. When a human decision is the outcome we’re driving, we need to deliver a gestalt – it’s vanishingly rare that anyone cares about the exact numbers.

Approximation algorithms are gorgeous. My favorites rely on hashing to cleverly and cheaply get us answers:

- **Cardinality estimation** (e.g., HyperLogLog, CVM): We hash everything and we track the longest run of consecutive 0s that we see at the start of the resulting hashes. If the data produces only one or two 0s in a row as its longest run, that implies we’ve only seen a bit of distinct data. The longer the run of 0s, the more likely we are to have seen a lot of distinct data. See Alessandro Nadalin for more on HyperLogLog. (There are other approaches for this problem, too, like the just-introduced-two-years-ago CVM algorithm. The CVM algorithm tracks a randomly chosen subset of elements in finite memory, such that the cardinality of the elements in the buffer can be trivially converted into an estimate of the overall stream’s cardinality.)
- **Distance metric estimation** (e.g., locality-sensitive hashing/LSH, MinHash & SimHash): To each piece of data, we apply a set of hash functions. The hash functions are designed to put similar items in the same “bucket”. (In a machine learning context, we might hash on features, plus some random noise.) We approximate the distance between two samples as the extent to which the hash functions produce collisions for those two samples. See Tyler Neylon for more on LSH.
- **Set presence estimation** (e.g., Bloom & Cuckoo filters): When we observe an item, we turn on `k` bits in a bitmap. We figure out which `k` bits to turn on by hashing the item through `k` hash functions. Then to test whether a new item is in the set, we hash the new item and we check whether all corresponding `k` bits are already turned on. See Maciej Ceglowski for more on Bloom filters.
- **Frequency estimation** (e.g., count-min sketch): When we observe an item, we hash it `k` times. To store the results, rather than setting a set-presence-style boolean, instead we increment integer counts. Additionally, rather than using a set-presence-style single array, instead we maintain `k` distinct arrays. Then to estimate the frequency of an item, we retrieve all `k` integer counts that correspond to that item, and we return the minimum of those values. See Vivek Bansal for more on count-min sketch.
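To make the cardinality trick concrete, here’s a toy single-register sketch of the leading-zeros idea. (This is a deliberate simplification: real HyperLogLog splits the stream across many registers and combines them with a harmonic mean to tame the variance. The function names and the choice of SHA-256 as the hash are mine, not from any particular library.)

```python
import hashlib


def leading_zeros(h: int, bits: int = 32) -> int:
    """Count the leading zero bits in a `bits`-wide hash value."""
    if h == 0:
        return bits
    return bits - h.bit_length()


def estimate_cardinality(items) -> float:
    """Toy one-register estimate: if the longest leading-zero run seen is R,
    guess roughly 2 ** (R + 1) distinct items. A run of R zeros happens about
    once per 2 ** (R + 1) distinct hashes, which is what justifies the guess."""
    max_zeros = 0
    for item in items:
        # Take the first 4 bytes of SHA-256 as a 32-bit hash.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h))
    return 2.0 ** (max_zeros + 1)
```

A single register like this is extremely noisy (its answer is always a power of two); averaging many registers is where the real algorithms earn their accuracy.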
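For the distance-estimation family, MinHash is the easiest member to sketch: the probability that two signatures agree at any given position equals the Jaccard similarity of the underlying sets. A minimal version, with my own function names and seeded SHA-256 standing in for a proper hash family:

```python
import hashlib


def _hash(seed: int, x) -> int:
    """One member of a seeded hash family (SHA-256 here, for simplicity)."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")


def minhash_signature(items, num_hashes: int = 64) -> list[int]:
    """For each hash function, keep the minimum hash over the set's items."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the signatures collide."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With 64 hash functions the estimate is good to within a few percentage points; more hashes buy more precision at linear cost.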
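The set-presence idea fits in a few lines. A minimal Bloom filter sketch, assuming seeded SHA-256 as the `k` hash functions (class and method names are my own):

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, k: int = 3):
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions from k seeded hashes of the item.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item) -> bool:
        # False is always correct; True may occasionally be a false positive,
        # because unrelated items can have turned on all k of these bits.
        return all(self.bits[pos] for pos in self._positions(item))
```

The asymmetry is the whole point: a “no” is definitive, so we can safely skip work for items we’ve definitely never seen, and only a small fraction of “yes” answers cost us a redundant check.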
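And the count-min sketch is a small step from there: counters instead of bits, one row per hash function, and a min at query time. A minimal sketch (again with my own names and SHA-256 as the hash family; `width` and `depth` are arbitrary illustrative choices):

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int = 272, depth: int = 5):
        self.width = width
        self.depth = depth
        # One counter array per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row: int) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count: int = 1) -> None:
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += count

    def estimate(self, item) -> int:
        # Collisions can only inflate a counter, never deflate it, so the
        # minimum across rows is the least-contaminated (never-too-low) answer.
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))
```

Taking the minimum is why the structure only ever overcounts: an item’s true frequency is a guaranteed lower bound on every one of its `k` counters.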

It’s astounding to me that we have effective hash functions at all. What a wonder.

I’d love pointers to even more algorithms and probabilistic data structures that let us get most of the juice with almost no squeeze. Please share them!