Thinking About Probabilistic Filters
I've been a big fan of probabilistic filters such as Bloom filters and, even more so, Cuckoo filters for a while now. I've put them to great use in my personal projects, as well as in major data processing projects at work. They fall into a neat class of data structures that let you trade certainty for performance.
They are wildly useful for membership checks when you can't afford to load the entire set into memory and occasional false positives are acceptable. They make checks practical in situations where scanning the entire set every time would be prohibitively expensive.
Whenever I try to explain them to someone, they tend to get confused. So I wanted to put together an oversimplified explanation I can point friends and coworkers to when these questions arise.
This will not fully explain how probabilistic filters work on a technical level. It will arm you with a simplified way to think about them: a working mental model.
Definition
A probabilistic set-membership filter is, at its core, a memory-efficient way to check membership in a set. An item that has been added to the filter is guaranteed to always pass the filter. An item that was never added will probably fail the filter, but may sometimes pass.
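To make that contract concrete, here is a deliberately tiny toy filter in Python: one hash function and eight buckets. This is not a real Bloom or Cuckoo filter, and the item names are made up; it exists only to show the one-sided guarantee.

```python
import zlib

# Toy filter: one hash, eight buckets. Tiny on purpose so collisions are easy to see.
bits = [False] * 8

def add(item: str) -> None:
    bits[zlib.crc32(item.encode()) % 8] = True

def might_contain(item: str) -> bool:
    return bits[zlib.crc32(item.encode()) % 8]

for fruit in ["apple", "banana", "cherry"]:
    add(fruit)

# The guarantee: every item we added always passes.
print(all(might_contain(f) for f in ["apple", "banana", "cherry"]))  # True

# No guarantee the other way: items we never added can collide and pass too.
false_positives = [w for w in ["kiwi", "mango", "plum", "grape"] if might_contain(w)]
print(false_positives)  # may or may not be empty — any hits here are false positives
```

Note the asymmetry: a "no" from the filter is definitive, while a "yes" only means "maybe".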
Example
A good example of probabilistic filtering in action is Google Safe Browsing. Instead of sending your browser an enormous list of 5-billion+ known dangerous URLs, Google sends a compact probabilistic representation of that list. If a URL fails the filter, it is not in Google's list of known dangerous URLs. If it passes the filter, the browser verifies it with Google against their full database to eliminate false positives.
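That two-step pattern — cheap local filter first, expensive lookup only on a hit — can be sketched like this. The function names and the stubbed-out "server" lookup below are invented for illustration; this is not Google's actual API.

```python
def is_dangerous(url: str, passes_filter, confirm_with_server) -> bool:
    """Check a URL with a local filter first, and only hit the server on a filter pass."""
    if not passes_filter(url):
        # Failing the filter is definitive: the URL is not on the list. No network call.
        return False
    # Passing the filter only means "maybe": confirm against the full database.
    return confirm_with_server(url)

# Stubs standing in for the real filter and the real server round-trip.
known_bad = {"http://evil.example"}
passes = lambda url: url in known_bad or url == "http://unlucky.example"  # simulated false positive
confirm = lambda url: url in known_bad

print(is_dangerous("http://good.example", passes, confirm))     # False: filter miss, no lookup
print(is_dangerous("http://evil.example", passes, confirm))     # True: filter hit, confirmed
print(is_dangerous("http://unlucky.example", passes, confirm))  # False: false positive caught by the server
```

The win is that the overwhelming majority of URLs fail the filter locally and never trigger a network round-trip at all.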
The Mental Model
Let's start by imagining your data. The following shapes represent the individual items in our set.
And let's let this "wall" represent our filter.
Now imagine forcing our data shapes through the wall and the impression they would leave. This represents adding items to our filter. There is now a hole in the wall in the shape of our data. As we push each of our items through, we additively make the hole larger.
Imagine the wall is now locked from further changes. It becomes a filter that only certain shapes may pass through.
One can see how all three of our shapes would successfully "pass" the filter, while many other shapes would be blocked. One can also imagine shapes we never added that would nonetheless fit through. This is how false positives occur when using filters like these.
Digitization
So that's all fine and good for a wall in the physical world, but now we want to digitize this. Let's imagine our wall as a bitmap - a literal map of bits. A zero where our wall has been untouched, and a one wherever it was pierced.
To check if an object passes our filter, we simply check if its pattern fits into our filter's matrix.
Here we can see that if we digitize our diamond from earlier, its set bits fit cleanly into the marked bits of our filter, and thus it passes the filter.
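In code, the "shape fits into the hole" check is a bitwise subset test: a pattern passes if every one of its set bits is also set in the wall. A small sketch, where the shapes are arbitrary made-up bit patterns:

```python
# Each shape is an arbitrary 8-bit pattern; the wall is the OR of every added shape.
diamond = 0b00011000
square  = 0b00100100
star    = 0b01000010

wall = diamond | square | star  # adding items additively enlarges the "hole"

def passes(pattern: int) -> bool:
    # A shape passes only if all of its bits fall within bits already pierced in the wall.
    return pattern & wall == pattern

print(passes(diamond))     # True: every added shape passes
print(passes(0b10000001))  # False: this shape has bits outside the hole, so it is blocked
print(passes(0b00110000))  # True: never added, yet it fits the hole — a false positive
```

Note that the wall never records *which* shapes made the hole, only the union of their bits — which is exactly why unrelated shapes can slip through.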
In many real implementations the "shape" is produced by hashing the item several times, which selects positions in the bitmap to set. The bitmap would also be orders of magnitude larger than the one in our illustration.
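A minimal Bloom-style sketch of that idea: hash each item to k positions in an m-bit map, set those bits on add, and require all of them on lookup. The sizes and the double-hashing scheme here are illustrative choices, not a production implementation.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom-style filter: k hash-derived positions per item in an m-bit map."""

    def __init__(self, m: int = 64, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        # Derive k positions from a single digest (a common double-hashing trick).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos  # pierce the wall at each position

    def might_contain(self, item: str) -> bool:
        # All of the item's bits must already be set: the shape must fit the hole.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

f = TinyBloom()
f.add("apple")
f.add("banana")
print(f.might_contain("apple"))  # True: added items always pass
```

A lookup for an item that was never added will almost always fail here, but can pass if its k positions all happen to collide with bits set by other items — the digital analogue of an accidental shape fitting the hole.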
One can easily imagine how this could save a lot of memory. Instead of passing around an entire set with potentially millions or billions of members, you only need to pass the serialized bits of the filter.
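For a rough sense of scale, the standard Bloom-filter sizing formula m = -n·ln(p) / (ln 2)² gives the number of bits needed to hold n items at false-positive rate p. The numbers below (one million items, 1% false positives, ~64 bytes per URL) are illustrative assumptions, not measurements:

```python
import math

n = 1_000_000  # items in the set (illustrative)
p = 0.01       # acceptable false-positive rate: 1%

bits = -n * math.log(p) / (math.log(2) ** 2)  # standard Bloom sizing formula
filter_mb = bits / 8 / 1_000_000

# Compare with shipping the raw set, assuming ~64 bytes per URL (illustrative).
raw_mb = n * 64 / 1_000_000

print(f"filter: ~{filter_mb:.1f} MB, raw set: ~{raw_mb:.0f} MB")  # filter: ~1.2 MB, raw set: ~64 MB
```

Roughly a 50x reduction in this example, and the filter's size depends only on the item count and error rate, not on how long the URLs themselves are.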
As I said at the beginning, this is not exactly how they work. This is a drastic oversimplification, and the exact inner workings vary by filter type and configuration. This is just a mental model that can be helpful when thinking about them.