Sublinear time algorithms
Ronitt Rubinfeld
Computer Science and Artificial Intelligence
Laboratory (CSAIL)
Electrical Engineering and Computer Science
(EECS)
MIT
Massive data sets
• examples:
  – sales logs
  – scientific measurements
  – genome project
  – world-wide web
  – network traffic, clickstream patterns
• in many cases, the data barely fits in storage
• are traditional notions of an efficient
algorithm sufficient?
– i.e., is linear time good enough?
Some hope:
Don’t always need exact answers...
“In the ballpark” vs. “out of the
ballpark” tests
• Distinguish inputs that have a specific property from
those that are far from having the property
• Benefits:
  – May be the natural question to ask
  – May be just as good when the data is constantly changing
  – Gives a fast sanity check to rule out very “bad” inputs (e.g.,
    restaurant bills) or to decide when expensive processing
    is worth it
Settings of interest:
• Tons of data – not
enough time!
• Not enough data – need
to make a decision!
Example 1:
Properties of distributions
Trend change analysis
[Figure: transactions of 20-30 yr olds vs. transactions of 30-40 yr olds; trend change?]
Outbreak of diseases
• Do two diseases follow similar patterns?
• Are they correlated with income level or zip
code?
• Are they more prevalent near certain areas?
Is the lottery uniform?
• New Jersey Pick-k Lottery (k = 3, 4)
  – Pick k digits in order.
  – 10^k possible values.
• Data:
  – Pick 3 - 8522 results from 5/22/75 to 10/15/00
    • χ²-test gives 42% confidence
  – Pick 4 - 6544 results from 9/1/77 to 10/15/00
    • fewer results than possible outcomes
    • χ²-test gives no confidence
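To ground the χ²-test figures above, here is a minimal sketch of the classical goodness-of-fit computation (the draws list, the digit count k, and the use of scipy are illustrative assumptions, not the actual lottery data pipeline):

from collections import Counter
from scipy.stats import chisquare  # classical chi-squared goodness-of-fit test

def lottery_chi2(draws, k=3):
    """Check whether observed Pick-k draws look uniform over the 10**k outcomes.

    draws: list of k-digit outcome strings, e.g. ["042", "317", ...] (hypothetical input).
    Returns the chi-squared statistic and p-value against the uniform distribution.
    """
    domain = [format(i, "0{}d".format(k)) for i in range(10 ** k)]
    counts = Counter(draws)
    observed = [counts.get(v, 0) for v in domain]  # one cell per possible outcome
    return chisquare(observed)                     # expected counts default to uniform

With ~8,500 Pick-3 draws spread over 1,000 cells the statistic is at least computable, but with ~6,500 Pick-4 draws over 10,000 cells most cells are empty and the test gives essentially no information, matching the slide's observation.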
Information in neural spike trains
[Strong, Koberle, de Ruyter van Steveninck, Bialek ’98]
• Apply stimuli several times; each application gives a
  sample of the signal (spike train), which also depends
  on other unknown things
• Study the entropy of the (discretized) signal to see
  which neurons respond to stimuli
[Figure: neural signals over time]
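A minimal plug-in (empirical) entropy estimate over discretized spike words, only to make the quantity concrete; the word encoding and sample format are assumptions, not the cited paper's exact procedure:

import math
from collections import Counter

def empirical_entropy(words):
    """Plug-in entropy estimate (in bits) of a discretized spike-train signal.

    words: list of hashable symbols, e.g. binary 'spike words' such as "0101"
    obtained by binning a spike train into time slots (hypothetical encoding).
    """
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

With a large word alphabet and relatively few stimulus repetitions, this plug-in estimate is badly biased, which is exactly the large-domain, few-samples regime this talk targets.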
Global statistical properties:
• Decisions based on samples of
distribution
• Properties: similarities, correlations,
information content, distribution of
data,…
• Focus on large domains
Distributions with large domains:
• Right kind of sample data is usually a scarce
resource
• Standard algorithms from statistics (χ²-test,
plug-in estimates, naïve use of Chernoff
bounds, …) require
  – number of samples > domain size
  – for stores with 1,000,000 product types, need >
    1,000,000 samples to detect trend changes
• Our algorithms use only a sublinear number
of samples.
  – for our example, about 10,000 samples suffice
Our Analysis:
• For infrequent elements, analyze
coincidence statistics using techniques
from statistics
– Limited independence arguments
– Chebyshev bounds
• Use Chernoff bounds to analyze difference
on frequent elements
• Combine results using filtering techniques
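To make the coincidence (collision) statistics concrete, here is a standard collision-counting sketch for testing uniformity from samples; the threshold is an illustrative assumption, and this is not the full algorithm from the talk, which also handles closeness of two distributions and combines frequent and infrequent elements:

from collections import Counter

def looks_uniform(samples, domain_size, slack=1.25):
    """Collision-based sketch: a distribution that is far from uniform over a
    domain of size domain_size produces noticeably more pairwise collisions
    among its samples than the uniform distribution does.

    samples: list of observed elements; slack: illustrative threshold factor.
    Accepts when the collision rate is close to the uniform rate 1/domain_size.
    """
    m = len(samples)
    pairs = m * (m - 1) / 2
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return collisions / pairs <= slack / domain_size

Roughly on the order of sqrt(domain_size) samples already make the collision rate informative, far fewer than the domain size itself; Chebyshev-style bounds control the variance of the collision count.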
Example 2:
Pattern matching on Strings
• Are two strings similar or not? (number of
deletions/insertions to change one into the
other)
– Text
– Website content
– DNA sequences
ACTGCTGTACTGACT (length 15)
CATCTGTATTGAT   (length 13)
match size = 11
Pattern matching on Strings
• Previous algorithms using classical
techniques for computing edit distance on
strings of size n use at least n² time
– For strings of size 1000, this is 1,000,000
– Our method uses << 1000
– Our mathematical proofs show that you
cannot do much better
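For reference, here is the classical dynamic program for the insertion/deletion edit distance described above; this is the quadratic-time baseline, not the sublinear method from the talk:

def edit_distance(a, b):
    """Classical O(len(a)*len(b)) dynamic program for insert/delete edit
    distance: the number of character insertions and deletions needed to
    turn string a into string b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))              # turning "" into b[:j] costs j insertions
    for i in range(1, m + 1):
        cur = [i] + [0] * n                # turning a[:i] into "" costs i deletions
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1]       # characters match, no edit needed
            else:
                cur[j] = 1 + min(prev[j],      # delete a[i-1]
                                 cur[j - 1])   # insert b[j-1]
        prev = cur
    return prev[n]

print(edit_distance("ACTGCTGTACTGACT", "CATCTGTATTGAT"))  # 6, i.e. (15-11)+(13-11) for a match of size 11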
Our techniques:
• Can’t look at entire string…
• So sample according to a recursive fractal
distribution
• Clever use of approximate solutions to
subproblems yields result
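The recursive fractal distribution itself is not spelled out on the slide; purely as a toy illustration of sampling a string densely near an anchor and geometrically more sparsely farther away, reading only O(log n) of the n positions (the function name, parameters, and the specific distribution are assumptions, not the actual algorithm):

import random

def fractal_sample(n, per_scale=4, seed=0):
    """Toy sketch: pick a random anchor and sample a few offsets at each
    geometrically growing scale around it, so nearby positions are covered
    densely and distant ones sparsely. Purely illustrative."""
    rng = random.Random(seed)
    anchor = rng.randrange(n)
    positions = set()
    scale = 1
    while scale < n:
        for _ in range(per_scale):
            pos = anchor + rng.randrange(-scale, scale + 1)
            if 0 <= pos < n:
                positions.add(pos)
        scale *= 2
    return sorted(positions)   # about per_scale * log2(n) positions out of n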
Other examples:
• Testing properties of text files
  – Are there too many duplicates?
  – Is it in sorted order? (see the sketch after this list)
  – Do two files contain essentially the same set of
    names?
• Testing properties of graph representations
– High connectivity?
– Large groups of independent nodes?
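For the sorted-order question referenced above, here is a standard sublinear spot-check; the trial count is an illustrative constant and the sketch assumes distinct values for simplicity:

import bisect
import random

def probably_sorted(a, trials=50, seed=0):
    """Sublinear spot-check for sortedness: pick a random index, binary-search
    for that value as if the list were sorted, and reject if the search does
    not land back on that index. Reads O(trials * log n) positions instead of n.
    Always accepts a truly sorted list (with distinct values); lists that are
    far from sorted are rejected with high probability."""
    n = len(a)
    rng = random.Random(seed)
    for _ in range(trials):
        i = rng.randrange(n)
        if bisect.bisect_left(a, a[i]) != i:   # binary search went astray
            return False
    return True

The indices where the binary search succeeds always form an increasing subsequence, so if most random indices pass, the list is close to sorted; this is why a constant number of trials suffices for a fixed distance parameter.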
Conclusions
• sublinear time possible in many contexts
– new area, lots of techniques
• pervasive applicability
• Algorithms are usually simple, analysis is much
more involved
• savings factor of over 1000 for many problems
– what else can you compute in sublinear time?
– other applications...?