Sublinear time algorithms
Ronitt Rubinfeld
Computer Science and Artificial Intelligence
Laboratory (CSAIL)
Electrical Engineering and Computer Science
(EECS)
MIT
Massive data sets
• examples:
  – sales logs
  – scientific measurements
  – genome project
  – world-wide web
  – network traffic, clickstream patterns
• in many cases, the data barely fits in storage
• are traditional notions of an efficient
algorithm sufficient?
– i.e., is linear time good enough?
Some hope:
Don’t always need exact answers...
“In the ballpark” vs. “out of the
ballpark” tests
• Distinguish inputs that have a specific property from
those that are far from having the property
• Benefits:
  – May be the natural question to ask
  – May be just as good when the data is constantly changing
  – Gives a fast sanity check to rule out very “bad” inputs (e.g.,
    restaurant bills) or to decide when expensive processing
    is worth it
Settings of interest:
• Tons of data – not
enough time!
• Not enough data – need
to make a decision!
Example 1:
Properties of distributions
Trend change analysis
[Figure: transactions of 20-30 yr olds vs. transactions of 30-40 yr olds; trend change?]
Outbreak of diseases
• Do two diseases follow similar patterns?
• Are they correlated with income level or zip
code?
• Are they more prevalent near certain areas?
Is the lottery uniform?
• New Jersey Pick-k Lottery (k = 3, 4)
  – Pick k digits in order.
  – 10^k possible values.
• Data:
  – Pick 3 - 8522 results from 5/22/75 to 10/15/00
    • χ²-test gives 42% confidence
  – Pick 4 - 6544 results from 9/1/77 to 10/15/00
    • fewer results than possible outcomes
    • χ²-test gives no confidence
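To ground the χ²-test figures above, here is a minimal sketch of the classical goodness-of-fit computation (the draws list, the digit count k, and the use of scipy are illustrative assumptions, not the actual lottery data pipeline):

from collections import Counter
from scipy.stats import chisquare  # classical chi-squared goodness-of-fit test

def lottery_chi2(draws, k=3):
    """Check whether observed Pick-k draws look uniform over the 10**k outcomes.

    draws: list of k-digit outcome strings, e.g. ["042", "317", ...] (hypothetical input).
    Returns the chi-squared statistic and p-value against the uniform distribution.
    """
    domain = [format(i, "0{}d".format(k)) for i in range(10 ** k)]
    counts = Counter(draws)
    observed = [counts.get(v, 0) for v in domain]  # one cell per possible outcome
    return chisquare(observed)                     # expected counts default to uniform

With ~8,500 Pick-3 draws spread over 1,000 cells the statistic is at least computable, but with ~6,500 Pick-4 draws over 10,000 cells most cells are empty and the test gives essentially no information, matching the slide's observation.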
Information in neural spike trains
[Strong, Koberle, de Ruyter van Steveninck, Bialek ’98]
• Apply stimuli several times; each application gives a
  sample of the signal (spike train), which also depends
  on other unknown things
• Study the entropy of the (discretized) signal to see
  which neurons respond to stimuli
[Figure: neural signals over time]
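A minimal plug-in (empirical) entropy estimate over discretized spike words, only to make the quantity concrete; the word encoding and sample format are assumptions, not the cited paper's exact procedure:

import math
from collections import Counter

def empirical_entropy(words):
    """Plug-in entropy estimate (in bits) of a discretized spike-train signal.

    words: list of hashable symbols, e.g. binary 'spike words' such as "0101"
    obtained by binning a spike train into time slots (hypothetical encoding).
    """
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

With a large word alphabet and relatively few stimulus repetitions, this plug-in estimate is badly biased, which is exactly the large-domain, few-samples regime this talk targets.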
Global statistical properties:
• Decisions based on samples of
distribution
• Properties: similarities, correlations,
information content, distribution of
data,…
• Focus on large domains
Distributions with large domains:
• Right kind of sample data is usually a scarce
resource
• Standard algorithms from statistics (χ²-test,
plug-in estimates, naïve use of Chernoff
bounds, …) require
  – number of samples > domain size
  – for stores with 1,000,000 product types, need >
    1,000,000 samples to detect trend changes
• Our algorithms use only a sublinear number
of samples.
  – for our example, about 10,000 samples suffice
Our Analysis:
• For infrequent elements, analyze
coincidence statistics using techniques
from statistics
– Limited independence arguments
– Chebyshev bounds
• Use Chernoff bounds to analyze difference
on frequent elements
• Combine results using filtering techniques
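To make the coincidence (collision) statistics concrete, here is a standard collision-counting sketch for testing uniformity from samples; the threshold is an illustrative assumption, and this is not the full algorithm from the talk, which also handles closeness of two distributions and combines frequent and infrequent elements:

from collections import Counter

def looks_uniform(samples, domain_size, slack=1.25):
    """Collision-based sketch: a distribution that is far from uniform over a
    domain of size domain_size produces noticeably more pairwise collisions
    among its samples than the uniform distribution does.

    samples: list of observed elements; slack: illustrative threshold factor.
    Accepts when the collision rate is close to the uniform rate 1/domain_size.
    """
    m = len(samples)
    pairs = m * (m - 1) / 2
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return collisions / pairs <= slack / domain_size

Roughly on the order of sqrt(domain_size) samples already make the collision rate informative, far fewer than the domain size itself; Chebyshev-style bounds control the variance of the collision count.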
Example 2:
Pattern matching on Strings
• Are two strings similar or not? (number of
deletions/insertions to change one into the
other)
– Text
– Website content
– DNA sequences
ACTGCTGTACTGACT (length 15)
CATCTGTATTGAT   (length 13)
match size = 11
Pattern matching on Strings
• Previous algorithms using classical
techniques for computing edit distance on
strings of size n use at least n² time
– For strings of size 1000, this is 1,000,000
– Our method uses << 1000
– Our mathematical proofs show that you
cannot do much better
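For reference, here is the classical dynamic program for the insertion/deletion edit distance described above; this is the quadratic-time baseline, not the sublinear method from the talk:

def edit_distance(a, b):
    """Classical O(len(a)*len(b)) dynamic program for insert/delete edit
    distance: the number of character insertions and deletions needed to
    turn string a into string b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))              # turning "" into b[:j] costs j insertions
    for i in range(1, m + 1):
        cur = [i] + [0] * n                # turning a[:i] into "" costs i deletions
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1]       # characters match, no edit needed
            else:
                cur[j] = 1 + min(prev[j],      # delete a[i-1]
                                 cur[j - 1])   # insert b[j-1]
        prev = cur
    return prev[n]

print(edit_distance("ACTGCTGTACTGACT", "CATCTGTATTGAT"))  # 6, i.e. (15-11)+(13-11) for a match of size 11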
Our techniques:
• Can’t look at entire string…
• So sample according to a recursive fractal
distribution
• Clever use of approximate solutions to
subproblems yields result
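The recursive fractal distribution itself is not spelled out on the slide; purely as a toy illustration of sampling a string densely near an anchor and geometrically more sparsely farther away, reading only O(log n) of the n positions (the function name, parameters, and the specific distribution are assumptions, not the actual algorithm):

import random

def fractal_sample(n, per_scale=4, seed=0):
    """Toy sketch: pick a random anchor and sample a few offsets at each
    geometrically growing scale around it, so nearby positions are covered
    densely and distant ones sparsely. Purely illustrative."""
    rng = random.Random(seed)
    anchor = rng.randrange(n)
    positions = set()
    scale = 1
    while scale < n:
        for _ in range(per_scale):
            pos = anchor + rng.randrange(-scale, scale + 1)
            if 0 <= pos < n:
                positions.add(pos)
        scale *= 2
    return sorted(positions)   # about per_scale * log2(n) positions out of n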
Other examples:
• Testing properties of text files
  – Are there too many duplicates?
  – Is it in sorted order? (see the sketch after this list)
  – Do two files contain essentially the same set of
    names?
• Testing properties of graph representations
– High connectivity?
– Large groups of independent nodes?
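For the sorted-order question referenced above, here is a standard sublinear spot-check; the trial count is an illustrative constant and the sketch assumes distinct values for simplicity:

import bisect
import random

def probably_sorted(a, trials=50, seed=0):
    """Sublinear spot-check for sortedness: pick a random index, binary-search
    for that value as if the list were sorted, and reject if the search does
    not land back on that index. Reads O(trials * log n) positions instead of n.
    Always accepts a truly sorted list (with distinct values); lists that are
    far from sorted are rejected with high probability."""
    n = len(a)
    rng = random.Random(seed)
    for _ in range(trials):
        i = rng.randrange(n)
        if bisect.bisect_left(a, a[i]) != i:   # binary search went astray
            return False
    return True

The indices where the binary search succeeds always form an increasing subsequence, so if most random indices pass, the list is close to sorted; this is why a constant number of trials suffices for a fixed distance parameter.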
Conclusions
• sublinear time possible in many contexts
– new area, lots of techniques
• pervasive applicability
• Algorithms are usually simple, analysis is much
more involved
• savings factor of over 1000 for many problems
– what else can you compute in sublinear time?
– other applications...?