CS 111Project 3
Set
Spring 2020
I NSTRUCTIONS
Complete the project detailed in this document using valid, executable Java code. Submit your work (including only the .java
source code files indicated with in the specification below) to your lab instructor via eCampus no later than the deadline on the
course calendar. At the top of every file you submit, include a comment with your full name and the tracking code 2001-111-3
for instructor use. Your work is compared to the work of your peers and online sources to detect plagiarism.
L EARNING G OALS
The goal of this project is to reinforce key concepts of unordered data structures, hashing, and constant and logarithmic time
algorithms. It is also intended to reinforce the skills needed to implement a well-known but uncommon data structure.
S PECIFICATION
You are given an interface Set (the source code is at the end of this document) that specifies the protocols for a set of string
elements which supports only adding elements and querying them for containment but no other operations.
Do not change the Set interface because any submitted copies are discarded. The instructors use their own copies of that interface
as given. Your work is expected to integrate with the given interface without any conflicts.
S T E P 1: I M P L E M E NT A T I ON
First, write a
class NaiveListSet implements Set that encapsulates a java.util.ArrayList as its data
structure using naive linear search (without self-organizing techniques) to ensure that the set never contains duplicate elements.
Second, write a class BloomFilterSet implements Set that encapsulates a java.util.BitSet as its data structure using
the behavior of a Bloom filter without removal. Start with 𝑚 = 101 bits, and start with 𝑘 = 3 distinct hash functions as follows.
One hash function must return an integer computed directly from the string element using an original algorithm you design.
One hash function must return a call to the hashCode method of the string element.
One hash function must return a call to one of the java.security.MessageDigest algorithms on the string element.
Any additional hash functions you write later may use original algorithms you design or MessageDigest algorithms.
S T E P 2: T E S T I NG
Create a class Experiment which includes a main method that conducts the following experiment in order.
Instantiate a String[] of size 10,000 (the sample data) and fill it with randomly generated strings that are at least 30 characters
long, generated using java.util.UUID.randomUUID().toString() calls.
Instantiate one NaiveListSet and one BloomFilterSet (the test subjects) to be used throughout the rest of this step.
For each test subject, add the first half of the sample data (the first 5,000 strings) to the test subject as elements. Measure the
average time each add operation takes in nanoseconds using the difference between System.nanoTime() calls.
For each test subject, query both halves of the sample data (all 10,000 strings) for containment in the test subject. Measure the
average time each query operation takes in nanoseconds, as above. Also measure the error rate for the query operation by
checking each query to see if it returns the correct answer and finding the ratio of the number of incorrect answers to the
total number of queries.
Your NaiveListSet’s error rate must be 0% because it must always answer correctly.
Your BloomFilterSet’s error rate is generally higher than 0% because it returns false positives for nontrivial sets.
Run your main method several times until the results stabilize and seem consistent over time. You must do this because the Java
virtual machine tries to optimize how it runs your code, and you only get consistent results after running it multiple times.
S T E P 3: O P T I MI Z A T I ON
Based on your testing results and research on Bloom filters, adjust 𝑘 and 𝑚 (but not the constant 𝑛 = 5,000 entries) for your
BloomFilterSet until it runs add and query operations faster than your NaiveListSet with an error rate of 1% or less.
S T E P 4: D OC UM E NT A T I ON
Display your testing results (including the average add time for each test subject, the average query time for each test subject, and
the query error rate for each test subject) before the program terminates.
rev. April 6 – 1 / 2
Include comments in your source code explaining what you did to optimize your BloomFilterSet implementation.
G RADING R UBRIC
You are graded on 1) the correctness of your implementation of the requirements in the specification above and 2) how effectively
you optimize your Bloom filter and explain those optimizations. The lab instructor determines the exact grading rubric to assess
your performance and may provide unit tests for your convenience.
B ONUS O PPORTUNITY
For 50 bonus points, add a third set implementation to all four steps of the project as follows.
Write a class SelfOrganizingListSet implements Set that encapsulates a java.util.ArrayList as its data
structure using binary search and self-organizing in lexicographic order (that is, alphabetical order) to ensure that the set never
contains duplicate elements. You may reuse any suitable code you wrote for a lab assignment in this course and semester.
Your SelfOrganizingListSet’s error rate must be 0% because it must always answer correctly, and it must run add and query
operations faster (in logarithmic time) than your NaiveListSet which uses linear search (in linear time).
I NTERFACE
public interface Set {
/**
* Adds the element to the set, or ignores the element
* if it is already contained in the set.
*
* @param element The element to add
*/
void add(String element);
/**
* Queries whether an element is contained in the set.
*
* @param element The element to query for containment
* @return Whether the element is contained in the set
*/
boolean contains(String element);
}
2/2