Bloom Filter in Java with Examples
Last Updated: 16 Apr, 2020
Bloom filters answer set-membership queries, that is, whether an element is present in a set or not. The Bloom filter was introduced by Burton H. Bloom in his 1970 paper "Space/Time Trade-offs in Hash Coding with Allowable Errors". It is a probabilistic data structure built on hashing (similar in spirit to a HashTable).
When do we need a Bloom Filter?
Consider any of the following situations:
- Suppose we have a list of elements and want to check whether a given element is present.
- Suppose you are working on an email service and are implementing a sign-up endpoint that must check whether a given username is already taken.
- Suppose you are given a set of blacklisted IPs and want to check whether a given IP is blacklisted.
Can these problems be solved without a Bloom filter?
Let us try to solve them using a HashSet:
import java.util.HashSet;
import java.util.Set;

public class SetDemo {
    public static void main(String[] args)
    {
        // Store every blacklisted IP in full
        Set<String> blackListedIPs = new HashSet<>();
        blackListedIPs.add("192.170.0.1");
        blackListedIPs.add("75.245.10.1");
        blackListedIPs.add("10.125.22.20");

        // Present in the set: prints true
        System.out.println(blackListedIPs.contains("75.245.10.1"));
        // Never added: prints false
        System.out.println(blackListedIPs.contains("101.125.20.22"));
    }
}
Why do data structures like HashSet or HashTable fall short?
A HashSet or HashTable works well for a limited data set, but it may not be a good fit as the data set grows, because every element is stored in full. With a large data set, insertion takes a lot of time and the structure consumes a lot of memory.
Size of data set vs. insertion time for a HashSet-like data structure
------------------------------------------
| Number of UUIDs | Insertion time (ms)  |
------------------------------------------
| 10              | <1                   |
| 100             | 3                    |
| 1,000           | 58                   |
| 10,000          | 122                  |
| 100,000         | 836                  |
| 1,000,000       | 7395                 |
------------------------------------------
Size of data set vs. memory (JVM heap) for a HashSet-like data structure
------------------------------------------
| Number of UUIDs | JVM heap used (MB)   |
------------------------------------------
| 10              | <2                   |
| 100             | <2                   |
| 1,000           | 3                    |
| 10,000          | 9                    |
| 100,000         | 37                   |
| 1,000,000       | 264                  |
------------------------------------------
So it is clear that for a large data set a regular data structure such as a Set or HashTable is not feasible, and this is where Bloom filters come into the picture. For more details on how the two compare, refer to the article "Difference between Bloom filters and Hashtable".
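
The article does not say how the figures in the tables above were measured. The following is only a rough sketch of one way such numbers could be gathered; the class name HashSetInsertTiming and its helper methods are hypothetical, and heap readings taken this way are approximate because the garbage collector may run at any time.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class HashSetInsertTiming {
    // Inserts `count` random UUID strings into a fresh HashSet.
    static Set<String> insertRandomUuids(int count) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < count; i++) {
            set.add(UUID.randomUUID().toString());
        }
        return set;
    }

    // Rough estimate of the heap currently in use (not exact: GC may run at any time).
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        for (int count : new int[] {10, 100, 1_000, 10_000, 100_000}) {
            long heapBefore = usedHeapBytes();
            long start = System.nanoTime();
            Set<String> set = insertRandomUuids(count);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long heapDeltaKb = (usedHeapBytes() - heapBefore) / 1024;
            System.out.printf("%,d UUIDs: %d ms, ~%d KB heap, %d stored%n",
                    count, elapsedMs, heapDeltaKb, set.size());
        }
    }
}
```

Absolute numbers will differ from machine to machine and from run to run; only the growth trend is meaningful.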
How to solve these problems with the help of a Bloom filter?
Let’s take a bit array of size N (here 24) and initialize each bit to 0. Now choose some hash functions (you can use as many as you want; we use two hash functions for this illustration).

- Now pass the first IP to both hash functions. Each one deterministically maps the IP to an index in the bit array, as shown below:
hashFunction_1(192.170.0.1) : 2
hashFunction_2(192.170.0.1) : 6
Now go to indexes 2 and 6 and set those bits to 1.

- Now pass the second IP and follow the same steps.
hashFunction_1(75.245.10.1) : 4
hashFunction_2(75.245.10.1) : 10
Now go to indexes 4 and 10 and set those bits to 1.

- Similarly, pass the third IP to both hash functions, and suppose they produce the following output:
hashFunction_1(10.125.22.20) : 10
hashFunction_2(10.125.22.20) : 19
Now go to indexes 10 and 19 and set those bits to 1. Index 10 was already set by the previous entry, so only index 19 actually changes.
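
The three insertions above can be replayed with java.util.BitSet. This is only a minimal sketch: the indexes are hard-coded from the illustration rather than computed by real hash functions, and the class name BloomInsertWalkthrough and its methods are hypothetical.

```java
import java.util.BitSet;

class BloomInsertWalkthrough {
    // Bit array size used in the illustration.
    static final int SIZE = 24;

    // Marks every index that the hash functions produced for one element.
    static void add(BitSet bits, int... indexes) {
        for (int index : indexes) {
            bits.set(index); // setting a bit that is already 1 changes nothing
        }
    }

    // Replays the three insertions with the indexes assumed in the illustration.
    static BitSet build() {
        BitSet bits = new BitSet(SIZE);
        add(bits, 2, 6);   // 192.170.0.1
        add(bits, 4, 10);  // 75.245.10.1
        add(bits, 10, 19); // 10.125.22.20 (index 10 was already 1)
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(build()); // prints {2, 4, 6, 10, 19}
    }
}
```

Three elements were added with two hash functions each, yet only five bits are set, because index 10 is shared by two of the IPs.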

Now it is time to check whether an IP is present in the data set.
- Test input #1
Let’s say we want to check the IP 75.245.10.1. Pass this IP to the same two hash functions that were used when adding the inputs above.
hashFunction_1(75.245.10.1) : 4
hashFunction_2(75.245.10.1) : 10
Now go to those indexes and check the bits. If both index 4 and index 10 are set to 1, the IP 75.245.10.1 is reported as present in the set (with some chance of error, as discussed below); otherwise it is definitely not in the data set.

- Test input #2
Let’s say we want to check whether the IP 75.245.20.30 is present in the set. The process is the same: pass this IP to the same two hash functions that were used when adding the inputs above.
hashFunction_1(75.245.20.30) : 19
hashFunction_2(75.245.20.30) : 23
The bit at index 19 is 1 but the bit at index 23 is 0, so we can say that the IP 75.245.20.30 is not present in the set.
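
The add and check steps can be combined into a small class. The sketch below is an illustration, not a production implementation: the class name SimpleBloomFilter is hypothetical, and the two hash functions (Java's String.hashCode and a 32-bit FNV-1a hash) are choices made here for the example, so the indexes they produce will differ from the ones assumed in the walkthrough.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;

    SimpleBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // First hash: Java's built-in String hash, reduced to an index in [0, size).
    private int hash1(String value) {
        return Math.floorMod(value.hashCode(), size);
    }

    // Second hash: 32-bit FNV-1a over the UTF-8 bytes, reduced the same way.
    private int hash2(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return Math.floorMod(hash, size);
    }

    // Sets the bit at each index produced by the hash functions.
    void add(String value) {
        bits.set(hash1(value));
        bits.set(hash2(value));
    }

    // Returns false only if at least one of the element's bits is still 0.
    boolean mightContain(String value) {
        return bits.get(hash1(value)) && bits.get(hash2(value));
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(24);
        filter.add("192.170.0.1");
        filter.add("75.245.10.1");
        filter.add("10.125.22.20");
        // Always true, because this IP was added.
        System.out.println(filter.mightContain("75.245.10.1"));
        // False unless both of its bits happen to collide with added IPs.
        System.out.println(filter.mightContain("75.245.20.30"));
    }
}
```

Math.floorMod is used instead of % so that negative hash values still map to a valid index.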

Why is Bloom Filter a probabilistic data structure?
Let’s understand this with one more test. This time consider the IP 101.125.20.22 and check whether it is present in the set. Pass it to both hash functions, and suppose the results are as follows:
hashFunction_1(101.125.20.22) : 19
hashFunction_2(101.125.20.22) : 2
Now visit indexes 19 and 2. Both are set to 1, so the filter reports that the IP 101.125.20.22 is present in the set.

But the IP 101.125.20.22 was never added to the bit array above. This is known as a false positive:
Expected Output: No
Actual Output: Yes (False Positive)
In this case, indexes 2 and 19 were set to 1 by other inputs and not by the IP 101.125.20.22. This is called a collision, and it is why the structure is probabilistic: a "present" answer is likely correct but not guaranteed.
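
The three lookups from the walkthrough, including the false positive, can be replayed as follows. As before, this is a sketch with the indexes hard-coded from the illustration; the class name FalsePositiveReplay and its methods are hypothetical.

```java
import java.util.BitSet;

class FalsePositiveReplay {
    // Bit array state after the three IPs of the walkthrough were added.
    static BitSet filterAfterInserts() {
        BitSet bits = new BitSet(24);
        for (int index : new int[] {2, 6, 4, 10, 19}) {
            bits.set(index);
        }
        return bits;
    }

    // Reports "possibly present" only if every given index is set to 1.
    static boolean mightContain(BitSet bits, int... indexes) {
        for (int index : indexes) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = filterAfterInserts();
        System.out.println(mightContain(bits, 4, 10));  // 75.245.10.1: true (was added)
        System.out.println(mightContain(bits, 19, 23)); // 75.245.20.30: false (bit 23 is 0)
        System.out.println(mightContain(bits, 19, 2));  // 101.125.20.22: true (false positive)
    }
}
```

The last lookup returns true even though no insertion ever used the pair (19, 2): index 19 was set by 10.125.22.20 and index 2 by 192.170.0.1.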
What to expect from a Bloom filter?
- When a Bloom filter says an element is not present, it is definitely not present. If the element had been added, every bit at the indexes given by the hash functions would be 1; so if at least one of those bits is 0, the element was never added.
- But when a Bloom filter says an element is present, it is not 100% certain, because collisions may have caused all the bits at the indexes given by the hash functions to be set to 1 by other inputs.
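How to get more accurate results from a Bloom filter?
A Bloom filter can never be 100% accurate when it reports that an element is present; false positives can only be made rarer. This is achieved by using a larger bit array together with a suitable number of hash functions. More hash functions reduce the chance that all of an absent element's bits collide, but only up to a point: for a bit array of m bits holding n elements, the false-positive rate is lowest at about (m/n) * ln 2 hash functions, and going beyond that fills the array with 1s faster and makes the rate rise again.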
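
To see how the number of hash functions affects accuracy, the commonly used approximation p ≈ (1 - e^(-kn/m))^k can be evaluated for a fixed bit array of m bits and n inserted elements while the number of hash functions k varies. The sketch below assumes m = 10,000 and n = 1,000 purely for illustration; the class name FalsePositiveRate and its methods are hypothetical.

```java
class FalsePositiveRate {
    // Approximate false-positive probability for m bits, n elements and k hash functions.
    static double rate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-((double) k * n) / m), k);
    }

    // Tries k = 1..maxK and returns the k with the lowest approximate rate.
    static int bestK(long m, long n, int maxK) {
        int best = 1;
        for (int k = 2; k <= maxK; k++) {
            if (rate(m, n, k) < rate(m, n, best)) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long m = 10_000; // assumed number of bits
        long n = 1_000;  // assumed number of inserted elements
        for (int k = 1; k <= 12; k++) {
            System.out.printf("k = %2d  p = %.5f%n", k, rate(m, n, k));
        }
        System.out.println("Lowest rate at k = " + bestK(m, n, 15));
    }
}
```

Under these assumptions the rate falls until k = 7 and then starts to rise again, because each additional hash function also sets more bits per inserted element.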
Time and Space complexity of a Bloom filter
Suppose we have a data set of around 40 million elements and we use H hash functions. Then:
Time complexity: O(H) per insertion or lookup, where H is the number of hash functions used
Space: about 159 MB (for 40 million elements); the bit array has a fixed size and does not store the elements themselves
False-positive rate: about 1 mistake per 10 million lookups (for H = 23)
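
These figures can be cross-checked with the standard sizing formulas m = -n * ln(p) / (ln 2)^2 for the number of bits and k = (m/n) * ln 2 for the number of hash functions. The sketch below applies them to n = 40 million and p = 1 in 10 million; the class name BloomSizing and its methods are hypothetical, and the formulas are the textbook ones rather than anything stated in the article.

```java
class BloomSizing {
    // Number of bits needed for n elements at false-positive probability p.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions that minimizes the false-positive rate for m bits and n elements.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 40_000_000L; // elements
        double p = 1e-7;      // 1 false positive per 10 million lookups
        long m = optimalBits(n, p);
        double mib = m / 8.0 / (1024 * 1024);
        System.out.printf("bits = %d (about %.0f MiB), hash functions = %d%n",
                m, mib, optimalHashes(n, m));
    }
}
```

For these inputs the sketch arrives at roughly 160 MiB and 23 hash functions, in line with the figures quoted above.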
Implementing Bloom filter in Java using Guava Library:
We can use the Bloom filter implementation provided by the Guava library.
- Include the following Maven dependency:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>
- Use the following code to create and query the Bloom filter:
import java.nio.charset.Charset;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterDemo {
    public static void main(String[] args)
    {
        // Filter for String elements, sized for 10000 expected insertions
        BloomFilter<String> blackListedIps = BloomFilter.create(
            Funnels.stringFunnel(Charset.forName("UTF-8")),
            10000);
        blackListedIps.put("192.170.0.1");
        blackListedIps.put("75.245.10.1");
        blackListedIps.put("10.125.22.20");

        // Added above, so this always prints true
        System.out.println(blackListedIps.mightContain("75.245.10.1"));
        // Never added: prints false unless a false positive occurs
        System.out.println(blackListedIps.mightContain("101.125.20.22"));
    }
}
Output:
[Image: Bloom Filter Output]
Note: When no false-positive probability is specified, Guava uses a default of 3% (0.03), so mightContain may occasionally return true for an element that was never added.
- Reduce the false-positive probability
Pass the desired probability as an additional parameter when creating the Bloom filter:
BloomFilter<String> blackListedIps = BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 10000, 0.005);
Now the false-positive probability has been reduced from 0.03 to 0.005. But tweaking this parameter affects the size of the Bloom filter.
Effect of reducing the false positive probability:
Let’s analyze this effect with respect to the number of hash functions, the size of the bit array, time complexity and space complexity.
- Let’s look at the insertion time for different data set sizes.
--------------------------------------------------------------------------------
| Number of UUIDs | Set insertion time (ms) | Bloom filter insertion time (ms) |
--------------------------------------------------------------------------------
| 10              | <1                      | 71                               |
| 100             | 3                       | 17                               |
| 1,000           | 58                      | 84                               |
| 10,000          | 122                     | 272                              |
| 100,000         | 836                     | 556                              |
| 1,000,000       | 7395                    | 5173                             |
--------------------------------------------------------------------------------
- Now let’s have a look at memory (JVM heap).
------------------------------------------------------------------------------
| Number of UUIDs | Set JVM heap used (MB) | Bloom filter JVM heap used (MB) |
------------------------------------------------------------------------------
| 10              | <2                     | 0.01                            |
| 100             | <2                     | 0.01                            |
| 1,000           | 3                      | 0.01                            |
| 10,000          | 9                      | 0.02                            |
| 100,000         | 37                     | 0.1                             |
| 1,000,000       | 264                    | 0.9                             |
------------------------------------------------------------------------------
- Bit counts
-----------------------------------------------
| Suggested size of Bloom filter | Bit count  |
-----------------------------------------------
| 10                             | 40         |
| 100                            | 378        |
| 1,000                          | 3654       |
| 10,000                         | 36231      |
| 100,000                        | 361992     |
| 1,000,000                      | 3619846    |
-----------------------------------------------
- Number of hash functions used for various false positive probabilities:
--------------------------------------------------
| Suggested FPP of Bloom filter | Hash functions |
--------------------------------------------------
| 3%                            | 5              |
| 1%                            | 7              |
| 0.1%                          | 10             |
| 0.01%                         | 13             |
| 0.001%                        | 17             |
| 0.0001%                       | 20             |
--------------------------------------------------
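
The hash-function counts in the table above are consistent with the textbook rule k = round(-log2(p)), where p is the target false-positive probability. The sketch below evaluates that rule for the same probabilities; the class name HashCountTable is hypothetical, and the rule is the standard formula rather than a description of any library's internals.

```java
class HashCountTable {
    // Suggested number of hash functions for a target false-positive probability p.
    static int hashFunctionsFor(double p) {
        return Math.max(1, (int) Math.round(-Math.log(p) / Math.log(2)));
    }

    public static void main(String[] args) {
        double[] probabilities = {0.03, 0.01, 0.001, 0.0001, 0.00001, 0.000001};
        for (double p : probabilities) {
            System.out.println("FPP " + p + " -> " + hashFunctionsFor(p) + " hash functions");
        }
        // The counts printed are 5, 7, 10, 13, 17 and 20.
    }
}
```

Each time the target probability is halved, roughly one more hash function is needed.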
Conclusion:
Therefore, a Bloom filter is a good choice when we have to test membership in a large data set with low memory consumption and can tolerate occasional false positives. The more accurate we want the results to be, the larger the bit array and the number of hash functions have to be.