# Assignment 5: Bloom Filter

### (Submit at moodle.)

In this assignment we will implement a Bloom Filter which is a data structure for storing an unordered set using hash functions.

Suppose we are given a set S, drawn from a universe U and queries of the form "is x in S?", we define a Bloom filter as follows:

• Take an array A with n elements and take 3 hash functions h1, h2 and h3. Each hash function maps the universe U to [0,n-1].
• Suppose we start with S being the emptyset, then A[i] is set to 0 for all i.
• Now, suppose we insert an element e into the set S. We set A[h1(e)], A[h2(e)] and A[h3(e)] all to 1
• If any of these three positions was already set to 1 we just leave it at 1.

Membership Queries: Given a query "is x in S?" we look at A[h1(x)], A[h2(x)] and A[h3(x)]. If they are all 1 then we return "yes" as the answer, otherwise we return "no."

Note that it is possible that x is not in S but A[h1(x)], A[h2(x)] and A[h3(x)] are all set to 1 (due to some other elements being in S which have set those positions to 1). This phenomena is called a false positive i.e. the query returns a yes answer although the element is not in the set.

For this assignment you will implement a Bloom filter on the universe U = {0, 1, 2, ..., 99}. Your array should have size 100. You will choose any three hash functions of the from h(x) = ((ax + b) mod c) mod 100 where a, b and c are large positive integers. You can hardcode the hash functions, but your program should also give the TA the option of changing the hash functions at the beginning.

For your demo, you should do the following:

• Stage 1: Initialize your Bloom Filter by inserting any 20 elements at random from the universe U. Let us call this set S. Display S and the Bloom filter created. Every time your program is run a different randomly selected set of 20 elements should be inserted into an empty Bloom filter.
• Stage 2: Once the state of the Bloom filter has been displayed, the TA should be able to move the program forward with a keypress. After that the program should launch queries on each of the remaining 80 items. If any one of these returns a yes answer, the program should pause and find out why this false positive has been returned i.e. it should locate the hash function collisions that have caused this false positive to occur. More formally, if we get a yes answer for x although x is not in S, this means that A[h1(x)], A[h2(x)] and A[h3(x)] are all 1. Your program should return (at most) three elements u, v and w from S which hash to the positions h1(x), h2(x) and h3(x) through one of the three hash functions. Your program should display which location has been set to 1 by which element of S through which hash function.
• Stage 3: Finally the program should count the total number of false positives occurring in all 80 membership queries and display the false positive probability i.e. the fraction of false positives out of 80.

Assumption:

We will assume that there is no deletion for the purposes of this assignment i.e. only insertion and membership queries have to be implemented

### Notes:

• The honor code must be follow for all assignments
• The program should work on linux with java 1.6 (You could test it on one of the computer center machines or one of the CSE general computing lab machines.)