Bloom Filters
An Introduction and Really Most Of It
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016
Adam Shook

Agenda
• Discuss what a set data structure is using math terms
• Discuss the concept of a Bloom filter
• Explore the mathematical magic behind Bloom filters

Set!
• A set is an unsorted data structure containing unique values
• Most common uses are:
  • Error-free set membership tests
  • Storing unique members of data (remove duplicates)
  • Iterating through data in no particular order
  • Other fun operations like unions, intersections, subsets, et cetera!
• Other sets support sorting and duplicate values
  • But we aren’t here to talk about those

Set Insertion
[Figure: inserting peter, lois, chris, peter, stewie, chris into a set leaves it holding stewie, lois, chris, peter]

Set Membership Test
[Figure: is_member(peter) against the set {stewie, lois, chris, peter}]

Set Membership Test
[Figure: is_member(adam) against the set {stewie, lois, chris, peter}]

Use Case!
• I’ve got a bunch of interesting keywords, A
• I’ve got a data set B
• I want to check if a record in B contains a word in A
• Make a new data set C for some cool data science

    for each record x in B
        for each word w in x
            if w in A
                emit x

Use Case, Solved!
• Stuff all the data in A into a set
• Get an A+ on your computer science project
• Impress the boss
• But what if A is stupid big?
(credit to mr. squarep)

Memory Footprint
• A contains 1 billion unique strings, average of 32 characters in length
  • 8 bits per character
  • 32 characters per string
  • 1 billion of them
  • 8 bits * 32 * 1,000,000,000 …
• Roughly 29.8 GB of raw storage required to hold these elements (counting 1 GB as 2^30 bytes)
  • + overhead
  • + even more if you are using Java
• For the sake of argument, let’s all agree that A doesn’t fit comfortably on a computer…
(credit to xkcd and paint)

Making a Set Smaller
• What two ‘features’ of a set can we relax to meet our requirements and have a reasonable memory footprint?
• Functionality
  • Only want set membership operations
• Accuracy
  • Don’t really need to be 100% accurate

Use Case, Revised!
• I’ve got a bunch of interesting keywords, A
• I’ve got a data set B
• I want to check if a record in B contains a word in A
• Make a new data set C for some cool data science
• I don’t really care if some stuff in C doesn’t contain words from A

    for each record x in B
        for each word w in x
            if w is likely in A with false positive rate p
                emit x

Let me paint you a story…
• We travel back to 1970…
• Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms
• Came up with a probabilistic data structure for set membership
• Useful for programs with expensive operations where the operation is often unnecessary
• A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses

Bloom Filter
• A space-efficient means to test if an element is a member of a set
• Elements can be added, but cannot be removed
• Storage cost for a single element is independent of the element size
• The members are not stored, so they cannot be retrieved
• There are no false negatives, but false positives are possible

How It’s Made – Training a Bloom Filter
Given
  An array of bits, bits, of size m, initialized to 0
  k hash functions
  n elements

    foreach element x in the n elements
        foreach hash function h in the k hash functions
            bits[h(x) % m] = 1

Training a Bloom filter is O(n) (treating k as a constant)

How It’s Made – Training a Bloom Filter
[Figure: training on peter, lois, chris flips bits in a 10-bit array from 0 to 1, giving 0 1 1 1 1 1 0 1 0 1]

How It’s Made – Membership Testing
Given
  A trained Bloom filter, bits, of size m
  The same k hash functions
  An element x

    foreach hash function h in the k hash functions
        if bits[h(x) % m] is 0
            return false
    return true

Testing a Bloom filter is O(1) (k hash evaluations, independent of n)

How It’s Made – Membership Testing
[Figure: testing peter against the bit array 0 1 1 1 1 1 0 1 0 1]

How It’s Made – Membership Testing
[Figure: testing adam against the bit array 0 1 1 1 1 1 0 1 0 1]

I know what you’re thinking

The Catch
• What we make up for in space, we give up in accuracy…
• I give you… the false positive!
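
The training and membership-test procedures can be sketched in Python. This is a minimal illustration, not code from the deck: the class name BloomFilter, the method names add and might_contain, and the trick of deriving k hash functions by salting SHA-256 with the function index are all assumptions made for the example, and m=10, k=2 are arbitrary small values.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits and k salted hash functions."""

    def __init__(self, m, k):
        self.m = m              # number of bits in the filter
        self.k = k              # number of hash functions
        self.bits = [0] * m     # bit array, initialized to 0

    def _positions(self, element):
        # Assumption: simulate k independent hash functions by prefixing
        # the element with the function index before hashing with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, element):
        # Training step: set the bit chosen by each hash function.
        for pos in self._positions(element):
            self.bits[pos] = 1

    def might_contain(self, element):
        # Any 0 bit proves the element was never added (no false negatives).
        # All 1 bits only means "probably present" (false positives happen).
        return all(self.bits[pos] for pos in self._positions(element))


bf = BloomFilter(m=10, k=2)
for name in ["peter", "lois", "chris"]:
    bf.add(name)

print(bf.might_contain("peter"))      # True: added elements always test positive
print(bf.might_contain("cleveland"))  # True here would be a false positive
```

Whether a name that was never added reports True depends on which bits the hash functions happen to select, so the sketch makes no claim about the second result. Note that nothing in the class stores the names themselves, which is why members cannot be retrieved or removed.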
[Figure: the false positive, testing cleveland against the bit array 0 1 1 1 1 1 0 1 0 1]
(credit to xkcd and paint)

Controlling the False Positive Rate
Given
  Approximate number of elements in A, n
  A willingness to tolerate a fraction p of false positives
  k is the optimal number of hash functions
We can approximate m
  m = -n * ln(p) / (ln(2))^2
If you want the full details, read the paper or Wikipedia

Back to our use case…
  n = 1,000,000,000
  p = 0.1
After dusting off the calculators…
  m = 4.792 x 10^9 bits or 0.558 GB
An improvement of 29.8/0.558 = 53.4!

And now that we have m…
We can use n and m to calculate k = m/n * ln(2)
But I haven’t heard of 3.32 hash functions so let’s call it 4

References
• Wikipedia
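
As a check on the sizing arithmetic in the use case, here is a short Python sketch. It assumes the standard approximation m = -n * ln(p) / (ln(2))^2 together with the deck's k = m/n * ln(2); the function name bloom_parameters and the constant BITS_PER_GB are hypothetical names chosen for the example, and 1 GB is counted as 2^30 bytes to match the 29.8 GB figure.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n elements and a tolerated false positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # bits in the filter
    k = (m / n) * math.log(2)                   # optimal, possibly fractional
    return m, k


n = 1_000_000_000       # elements in A
p = 0.1                 # tolerated false positive rate

m, k = bloom_parameters(n, p)

BITS_PER_GB = 8 * 2**30         # assumption: 1 GB counted as 2^30 bytes
raw_bits = 8 * 32 * n           # 8 bits per char * 32 chars * n strings

print(f"raw set:      {raw_bits / BITS_PER_GB:.1f} GB")
print(f"Bloom filter: {m / BITS_PER_GB:.3f} GB")
print(f"improvement:  {raw_bits / m:.1f}x")
print(f"k = {k:.2f}, rounded up to {math.ceil(k)}")
```

Under these assumptions the sketch gives about 4.79 x 10^9 bits (roughly 0.558 GB) for the filter and a k of about 3.32, in line with the figures in the use case. Rounding k up rather than down is the choice the deck makes.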