Thinking About Probabilistic Filters
I've been a big fan of probabilistic filters such as Bloom filters and, even more so, Cuckoo filters for a while now. I've put them to great use in my personal projects, as well as in major data processing projects at work. They fall into a neat class of data structures that let you trade certainty for performance.
They are wildly useful for membership checks when you can't afford to load the entire set into memory and occasional false positives are acceptable. They make checks practical in situations where scanning the entire set every time would be prohibitively expensive.
Whenever I try to explain them to someone, they tend to get confused. So I wanted to put together an oversimplified explanation I can point friends and coworkers to when these questions arise.
This will not fully explain how probabilistic filters work on a technical level. It will arm you with a simplified way to think about them: a working mental model.
Definition
A probabilistic set-membership filter is, at its core, a memory-efficient way to check membership in a set. An item that has been added to the filter is guaranteed to always pass the filter. An item that was never added will probably fail the filter, but may sometimes pass.
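To make that contract concrete, here is a deliberately tiny toy filter in Python: one hash function and eight buckets. This is not a real Bloom or Cuckoo filter, and the item names are made up; it exists only to show the one-sided guarantee.

```python
import zlib

# Toy filter: one hash, eight buckets. Tiny on purpose so collisions are easy to see.
bits = [False] * 8

def add(item: str) -> None:
    bits[zlib.crc32(item.encode()) % 8] = True

def might_contain(item: str) -> bool:
    return bits[zlib.crc32(item.encode()) % 8]

for fruit in ["apple", "banana", "cherry"]:
    add(fruit)

# The guarantee: every item we added always passes.
print(all(might_contain(f) for f in ["apple", "banana", "cherry"]))  # True

# No guarantee the other way: items we never added can collide and pass too.
false_positives = [w for w in ["kiwi", "mango", "plum", "grape"] if might_contain(w)]
print(false_positives)  # may or may not be empty — any hits here are false positives
```

Note the asymmetry: a "no" from the filter is definitive, while a "yes" only means "maybe".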
Example
A good example of probabilistic filtering in action is Google Safe Browsing. Instead of sending your browser an enormous list of 5-billion+ known dangerous URLs, Google sends a compact probabilistic representation of that list. If a URL fails the filter, it is not in Google's list of known dangerous URLs. If it passes the filter, the browser verifies it with Google against their full database to eliminate false positives.
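That two-step pattern — cheap local filter first, expensive lookup only on a hit — can be sketched like this. The function names and the stubbed-out "server" lookup below are invented for illustration; this is not Google's actual API.

```python
def is_dangerous(url: str, passes_filter, confirm_with_server) -> bool:
    """Check a URL with a local filter first, and only hit the server on a filter pass."""
    if not passes_filter(url):
        # Failing the filter is definitive: the URL is not on the list. No network call.
        return False
    # Passing the filter only means "maybe": confirm against the full database.
    return confirm_with_server(url)

# Stubs standing in for the real filter and the real server round-trip.
known_bad = {"http://evil.example"}
passes = lambda url: url in known_bad or url == "http://unlucky.example"  # simulated false positive
confirm = lambda url: url in known_bad

print(is_dangerous("http://good.example", passes, confirm))     # False: filter miss, no lookup
print(is_dangerous("http://evil.example", passes, confirm))     # True: filter hit, confirmed
print(is_dangerous("http://unlucky.example", passes, confirm))  # False: false positive caught by the server
```

The win is that the overwhelming majority of URLs fail the filter locally and never trigger a network round-trip at all.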
The Mental Model
Let's start by imagining your data. The following shapes represent the individual items in our set.
And let's let this "wall" represent our filter.
Now imagine forcing our data shapes through the wall and the impression they would leave. This represents adding items to our filter. There is now a hole in the wall in the shape of our data. As we push each of our items through, we additively make the hole larger.
Imagine the wall is now locked from further changes. It becomes a filter that only certain shapes may pass through.
One can see how all three of our shapes would successfully "pass" the filter, while many other shapes would be blocked. One can also imagine shapes we never added that would nonetheless fit through. This is how false positives occur when using filters like these.
Digitization
So that's all fine and good for a wall in the physical world, but now we want to digitize this. Let's imagine our wall as a bitmap - a literal map of bits. A zero where our wall has been untouched, and a one wherever it was pierced.
To check if an object passes our filter, we simply check if its pattern fits into our filter's matrix.
Here we can see that if we digitize our diamond from earlier, its set bits fit cleanly into the marked bits of our filter, and thus it passes the filter.
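In code, the "shape fits into the hole" check is a bitwise subset test: a pattern passes if every one of its set bits is also set in the wall. A small sketch, where the shapes are arbitrary made-up bit patterns:

```python
# Each shape is an arbitrary 8-bit pattern; the wall is the OR of every added shape.
diamond = 0b00011000
square  = 0b00100100
star    = 0b01000010

wall = diamond | square | star  # adding items additively enlarges the "hole"

def passes(pattern: int) -> bool:
    # A shape passes only if all of its bits fall within bits already pierced in the wall.
    return pattern & wall == pattern

print(passes(diamond))     # True: every added shape passes
print(passes(0b10000001))  # False: this shape has bits outside the hole, so it is blocked
print(passes(0b00110000))  # True: never added, yet it fits the hole — a false positive
```

Note that the wall never records *which* shapes made the hole, only the union of their bits — which is exactly why unrelated shapes can slip through.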
In many real implementations the "shape" is produced by hashing the item several times, which selects positions in the bitmap to set. The bitmap would also be orders of magnitude larger than the one in our illustration.
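A minimal Bloom-style sketch of that idea: hash each item to k positions in an m-bit map, set those bits on add, and require all of them on lookup. The sizes and the double-hashing scheme here are illustrative choices, not a production implementation.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom-style filter: k hash-derived positions per item in an m-bit map."""

    def __init__(self, m: int = 64, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        # Derive k positions from a single digest (a common double-hashing trick).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos  # pierce the wall at each position

    def might_contain(self, item: str) -> bool:
        # All of the item's bits must already be set: the shape must fit the hole.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

f = TinyBloom()
f.add("apple")
f.add("banana")
print(f.might_contain("apple"))  # True: added items always pass
```

A lookup for an item that was never added will almost always fail here, but can pass if its k positions all happen to collide with bits set by other items — the digital analogue of an accidental shape fitting the hole.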
One can easily imagine how this could save a lot of memory. Instead of passing around an entire set with potentially millions or billions of members, you only need to pass the serialized bits of the filter.
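For a rough sense of scale, the standard Bloom-filter sizing formula m = -n·ln(p) / (ln 2)² gives the number of bits needed to hold n items at false-positive rate p. The numbers below (one million items, 1% false positives, ~64 bytes per URL) are illustrative assumptions, not measurements:

```python
import math

n = 1_000_000  # items in the set (illustrative)
p = 0.01       # acceptable false-positive rate: 1%

bits = -n * math.log(p) / (math.log(2) ** 2)  # standard Bloom sizing formula
filter_mb = bits / 8 / 1_000_000

# Compare with shipping the raw set, assuming ~64 bytes per URL (illustrative).
raw_mb = n * 64 / 1_000_000

print(f"filter: ~{filter_mb:.1f} MB, raw set: ~{raw_mb:.0f} MB")  # filter: ~1.2 MB, raw set: ~64 MB
```

Roughly a 50x reduction in this example, and the filter's size depends only on the item count and error rate, not on how long the URLs themselves are.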
As I said at the beginning, this is not exactly how they work. This is a drastic oversimplification, and the exact inner workings vary by filter type and configuration. This is just a mental model that can be helpful when thinking about them.