Online Learning: An Introduction
Travis Mandel
University of Washington
Overview of Machine Learning
• Supervised Learning
• Decision Trees, Naïve Bayes, Linear Regression, K-NN, etc.
• Unsupervised Learning
• K-means, Association Rule Learning, etc.
What else is there????
Overview of Machine Learning
• Supervised Learning: Labeled Training Data → ML Algorithm → Classifier / Regression function
• Unsupervised Learning: Unlabeled Training Data → ML Algorithm → Clusters / Rules
Is this the best way to learn?
Overview of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Active Learning
• Interactive Learning
• Online Learning
• Reinforcement Learning
Online Classification
• Observe one training instance
• ML algorithm issues a prediction
• Observes the label, receives loss/reward
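To make the loop concrete, here is a minimal sketch (not from the slides) of online classification with a tiny perceptron-style learner trained one instance at a time; the OnlinePerceptron class and the toy data stream are illustrative assumptions.

```python
# Minimal sketch of the online classification protocol (illustrative only).
# At each step: observe one instance, issue a prediction, then observe the
# label, suffer a 0/1 loss, and update the model on that single instance.

import random

class OnlinePerceptron:
    """A tiny linear classifier updated one instance at a time."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= 0 else -1

    def update(self, x, y):
        # Perceptron rule: only change weights on a mistake.
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

def stream(n_steps, n_features=3):
    """Toy data stream; labels come from a hidden linear rule."""
    hidden = [1.0, -2.0, 0.5]
    for _ in range(n_steps):
        x = [random.uniform(-1, 1) for _ in range(n_features)]
        y = 1 if sum(h * xi for h, xi in zip(hidden, x)) >= 0 else -1
        yield x, y

learner = OnlinePerceptron(n_features=3)
mistakes = 0
for x, y in stream(1000):
    y_hat = learner.predict(x)      # issue a prediction
    mistakes += int(y_hat != y)     # observe label, suffer 0/1 loss
    learner.update(x, y)            # learn from this single instance
print("mistakes:", mistakes)
```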
More generally… Online Learning!
• Observe observations
• ML algorithm selects an output
• Observe the loss/reward function, receive loss/reward
Typical Assumption: Statistical Feedback
• Instances and labels drawn from a fixed distribution D
Recall: Distribution

“drug”?   “nigeria”?   Spam?   Prob
N         N            N       0.5
N         N            Y       0.1
N         Y            N       0.05
N         Y            Y       0.1
Y         N            N       0.05
Y         N            Y       0.1
Y         Y            N       0.0
Y         Y            Y       0.1
Alternate Assumption: Adversarial Feedback
• Instances and labels drawn adversarially
• Spam detection, anomaly detection, etc.
• Change over time is a HUGE issue!
Change over time example
How can we measure performance?
• What we want to maximize: sum of rewards
• Why could this be a problem?
• Alternative: Discounted Sum of Rewards
$$r_0 + r_1 + r_2 + r_3 + \dots$$
$$r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \dots$$
Guaranteed to be a finite sum if $\gamma < 1$ and rewards are bounded ($r_t < c$)!
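A quick check of why the discount keeps the sum finite (my worked bound, not from the slides): if every reward is at most a constant c, the discounted sum is dominated by a geometric series.

$$\sum_{t=0}^{\infty} \gamma^{t} r_t \;\le\; c \sum_{t=0}^{\infty} \gamma^{t} \;=\; \frac{c}{1-\gamma},
\qquad \text{e.g. } \gamma = 0.9,\; r_t \le 1 \;\Rightarrow\; \text{sum} \le 10.$$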
One problem…
If, on a small percentage of runs, you never find the best option (incomplete learning), the discounted sum barely changes!
But intuitively it should matter a LOT!
Regret
• Sum of rewards compared to the BEST choice in hindsight!

$$\sum_{t=1}^{\infty} r_t^{*} \;-\; \sum_{t=1}^{\infty} r_t$$

• If training a linear model, compare to the optimal weights chosen in hindsight!
• See other examples in a second
• Can get good regret even if we perform very poorly in absolute terms
We really like theoretical guarantees!
• Anyone know why?
• Running online → uncertain future
• Hard to tell if you reached the best possible performance by looking at data
Recall…
• Observe observations
• ML algorithm selects an output
• Observe the loss/reward function, receive loss/reward
Experts Setting
• N experts; assume rewards in [0, 1]
• Observe expert recommendations
• ML algorithm selects an expert
• Observe the loss/reward for each expert; receive loss/reward
Regret in the Experts Setting
• Do as well as the best expert in hindsight!
• Hopefully you have (at least some) good experts!
“Choose an Expert?”
• Each expert could make a prediction (different ML models, classifiers, hypotheses, etc.)
• Each expert could tell us to perform an action (such as buy a stock)
• Experts can learn & change over time!
Simple strategies
• Pick the expert that’s done the best in the past?
• Pick the expert that’s done the best recently?
No! Adversarial!
• For ANY deterministic strategy…
• The adversary can know which expert we will choose, and give it low reward/high loss, but all others high reward/low loss!
• Every step we fall further behind the best expert → linear regret
Solution: Randomness
• What if the algorithm has access to some random bits?
• We can do better!
EG: Exponentiated Gradient

$$w_1 = (1, \dots, 1)$$
$$w_{t,i} = w_{t-1,i}\, e^{\eta (r_{t,i} - 1)}$$
$$\eta = \sqrt{\frac{2 \log N}{T}}$$

Probabilities are the normalized weights.
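A minimal sketch of the update above, assuming the full-information experts setting where every expert's reward is revealed each round; run_eg and the toy reward matrix are illustrative names, not from the slides.

```python
# Sketch of the EG / exponentiated-weights update from the slide.
# Weights start at 1, are multiplied by exp(eta * (reward - 1)) each round,
# and the expert to follow is sampled from the normalized weights.

import math
import random

def run_eg(reward_matrix):
    """reward_matrix[t][i] = reward of expert i at round t, each in [0, 1]."""
    T = len(reward_matrix)
    N = len(reward_matrix[0])
    eta = math.sqrt(2 * math.log(N) / T)
    w = [1.0] * N                              # w_1 = (1, ..., 1)
    total_reward = 0.0
    for t in range(T):
        z = sum(w)
        probs = [wi / z for wi in w]           # probabilities = normalized weights
        chosen = random.choices(range(N), weights=probs)[0]
        total_reward += reward_matrix[t][chosen]
        # Full-information feedback: every expert's reward is observed.
        w = [w[i] * math.exp(eta * (reward_matrix[t][i] - 1.0)) for i in range(N)]
    return total_reward

# Toy run: expert 1 is better on average, so EG should concentrate on it.
random.seed(0)
rewards = [[0.5 * random.random(), 0.5 + 0.5 * random.random()] for _ in range(1000)]
print(run_eg(rewards))
```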
EG: Regret Bound

$$O(\sqrt{2T \log N})$$
Problem: What if the feedback is a little different?
• In the real world, we usually don’t get to see what “would have” happened
• Shift from predicting the future to taking actions in an unknown world
• Simplest Reinforcement Learning problem
Multi-Armed Bandit (MAB) Problem
• N arms; assume statistical feedback; assume rewards in [0, 1]
• ML algorithm pulls an arm
• Observes the loss/reward for the chosen arm only; suffers that loss/reward
Regret in the MAB Setting
• Since we are stochastic, each arm has a true mean
• Do as well as if we had always pulled the arm with the highest mean
MABs have a huge number of applications!
• Internet Advertising
• Website Layout
• Clinical Trials
• Robotics
• Education
• Videogames
• Recommender Systems
• …
Unlike most of Machine Learning, this is about how to ACT, not just how to predict!
A surprisingly tough problem…
“The problem is a classic one; it was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.”
— Prof. Peter Whittle, Journal of the Royal Statistical Society, 1979
MAB algorithm #1: ε-first
• Phase 1: Explore arms uniformly at random
• Phase 2: Exploit the best arm so far
• Only 1 parameter: how long to wait!
• Problem? Not very efficient
• Makes statistics easy to run
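A minimal sketch of ε-first under the stated assumptions (rewards in [0, 1], statistical feedback); the pull(i) interface and the toy Bernoulli arms are placeholders I introduce for illustration.

```python
# Sketch of the epsilon-first strategy: explore uniformly for a fixed budget,
# then commit to the arm with the best empirical mean.

import random

def epsilon_first(pull, n_arms, explore_steps, total_steps):
    """`pull(i)` returns a reward in [0, 1] for arm i (placeholder interface)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(total_steps):
        if t < explore_steps:
            arm = random.randrange(n_arms)                        # Phase 1: explore
        else:
            means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: means[i])      # Phase 2: exploit
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# Toy bandit: arm 2 is best.
true_means = [0.2, 0.5, 0.7]
print(epsilon_first(lambda i: float(random.random() < true_means[i]), 3, 100, 1000))
```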
MAB algorithm #2: ε-greedy
• Flip a coin:
• With probability ε, explore uniformly
• With probability 1 − ε, exploit
• Only 1 parameter: ε
• Problem?
• Often used in robotics and reinforcement learning
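For comparison, a sketch of just the ε-greedy selection rule; it plugs into the same kind of pull/update loop as the ε-first sketch above, and the function name is my own.

```python
# Sketch of the epsilon-greedy selection rule: with probability epsilon pick a
# random arm, otherwise pick the arm with the best empirical mean so far.

import random

def epsilon_greedy_choice(counts, sums, epsilon):
    n_arms = len(counts)
    if random.random() < epsilon:
        return random.randrange(n_arms)                       # explore
    means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
    return max(range(n_arms), key=lambda i: means[i])         # exploit

# Usage: inside a pull/update loop like the epsilon-first sketch above,
# replace the arm-selection step with epsilon_greedy_choice(counts, sums, 0.1).
```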
Idea
• The problem is uncertainty…
• How to quantify uncertainty? Error bars

If an arm has been sampled n times, then with probability at least $1 - \delta$:

$$|\hat{\mu} - \mu| \;<\; \sqrt{\frac{\log(2/\delta)}{2n}}$$
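A quick numeric sanity check of this bound (my example, not from the slides):

$$n = 100,\ \delta = 0.05:\qquad
|\hat{\mu} - \mu| \;<\; \sqrt{\frac{\log(2/0.05)}{2 \cdot 100}}
\;=\; \sqrt{\frac{\log 40}{200}} \;\approx\; 0.14
\quad \text{with probability at least } 0.95.$$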
Given Error bars, how do we act?
• Optimism under uncertainty!
• Why? If bad, we will soon find out!
One last wrinkle
• How to set the confidence level δ?
• Decrease it over time

If an arm has been sampled n times, then with probability at least $1 - \delta$:

$$|\hat{\mu} - \mu| \;<\; \sqrt{\frac{\log(2/\delta)}{2n}}$$

Letting δ shrink polynomially in t (e.g. $\delta = 2/t^4$) turns the radius into $\sqrt{\frac{2\log t}{n}}$, the UCB bonus on the next slide.
Upper Confidence Bound (UCB)
1. Play each arm once
2. Play the arm i that maximizes:
$$\hat{\mu}_i + \sqrt{\frac{2\log(t)}{n_i}}$$
3. Repeat Step 2 forever
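A minimal sketch of the three steps above; as before, pull(i) is a placeholder interface and the Bernoulli arms at the bottom are a toy example rather than anything from the slides.

```python
# Sketch of UCB as described above: play each arm once, then repeatedly play
# the arm maximizing (empirical mean + sqrt(2 * log(t) / n_i)).

import math
import random

def ucb(pull, n_arms, total_steps):
    """`pull(i)` returns a reward in [0, 1] for arm i (placeholder interface)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm in range(n_arms):                         # Step 1: play each arm once
        sums[arm] += pull(arm)
        counts[arm] += 1
    total = sum(sums)
    for t in range(n_arms + 1, total_steps + 1):      # Steps 2-3: keep playing
        arm = max(range(n_arms),
                  key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        sums[arm] += r
        counts[arm] += 1
        total += r
    return total

true_means = [0.2, 0.5, 0.7]
print(ucb(lambda i: float(random.random() < true_means[i]), 3, 5000))
```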
UCB Regret Bound

Problem-independent:
$$O(\sqrt{NT \log T})$$

Problem-dependent:
$$O\!\left(\frac{\log T}{\min_i (\mu^* - \mu_i)}\right)$$

Can’t do much better!
A little history…
William R. Thompson (1933): Was the first to examine the MAB problem and proposed a method for solving it
1940s-50s: MAB problem studied intensively during WWII; Thompson ignored
1970s-1980s: “Optimal” solution (Gittins index) found, but it is intractable and incomplete. Thompson ignored.
2001: UCB proposed, gains widespread use due to simplicity and “optimal” bounds. Thompson still ignored.
2011: Empirical results show Thompson’s 1933 method beats UCB, but little interest since no guarantees.
2013: Optimal bounds finally shown for Thompson Sampling
Thompson’s method was fundamentally different!
Bayesian vs Frequentist
• Bayesians: You have a prior, probabilities are interpreted as beliefs, prefer probabilistic decisions
• Frequentists: No prior, probabilities are interpreted as facts about the world, prefer hard decisions (p < 0.05)
UCB is a frequentist technique! What if we are Bayesian?
Bayesian review: Bayes’ Rule

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})}$$

$$\underbrace{p(\theta \mid \text{data})}_{\text{Posterior}} \;\propto\; \underbrace{p(\text{data} \mid \theta)}_{\text{Likelihood}}\; \underbrace{p(\theta)}_{\text{Prior}}$$
Bernoulli Case
What if rewards come from the set {0, 1} instead of the range [0, 1]?
Then each pull is a coin flip with probability p → Bernoulli distribution!
To estimate p, we count up the numbers of ones and zeros.
Given the observed ones and zeros, how do we calculate the distribution of possible values of p?
Beta-Bernoulli Case
Beta(a, b) → given a 0’s and b 1’s, what is the distribution over means?
Prior → pseudocounts
Likelihood → observed counts
Posterior → pseudocounts + observed counts
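A small worked example of the pseudocount update (my own numbers), assuming the slide's convention that the first argument of Beta counts 0's and the second counts 1's, with p the probability of observing a 1:

$$\text{Prior } \mathrm{Beta}(1,1),\ \text{then observe } 1 \text{ zero and } 3 \text{ ones}
\;\Rightarrow\; \text{Posterior } \mathrm{Beta}(1+1,\ 1+3) = \mathrm{Beta}(2,4),
\qquad \mathbb{E}[p] = \tfrac{4}{6} = \tfrac{2}{3}.$$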
How does this help us?
Thompson Sampling:
1. Specify a prior (in the Beta case, often Beta(1,1))
2. Sample from each arm’s posterior distribution to get an estimated mean for each arm
3. Pull the arm with the highest mean
4. Repeat steps 2 & 3 forever
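A minimal sketch of these four steps for Bernoulli rewards, assuming the usual Beta parameterization (successes, failures) rather than the slide's (0's, 1's) ordering; pull(i) and the toy arms are placeholders.

```python
# Sketch of Thompson Sampling with a Beta(1, 1) prior per arm, for 0/1 rewards:
# sample a mean from each arm's posterior, pull the argmax, update the counts.

import random

def thompson_sampling(pull, n_arms, total_steps):
    """`pull(i)` returns a 0/1 reward for arm i (placeholder interface)."""
    successes = [1] * n_arms   # prior pseudocounts: Beta(1, 1)
    failures = [1] * n_arms
    total = 0.0
    for _ in range(total_steps):
        samples = [random.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        r = pull(arm)
        if r > 0.5:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total += r
    return total

true_means = [0.2, 0.5, 0.7]
print(thompson_sampling(lambda i: float(random.random() < true_means[i]), 3, 5000))
```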
Thompson Empirical Results
And shown to have optimal regret bounds just like (and in some cases a little better than) UCB!
Problems with standard MAB formulation
• What happens if adversarial? (studied elsewhere)
• What if arms are factored and share values?
• What happens if there is delay?
• What if we know a heuristic that we want to incorporate without harming guarantees?
• What if we have a prior dataset we want to use?
• What if we want to evaluate algorithms/parameters on previously collected data?
A look toward more complexity: Reinforcement Learning
• A sequence of actions instead of taking one action and resetting
• A rich space of observations which should influence our choice of action
• Much more powerful, but a much harder problem!
Summary
• Online Learning
• Adversarial vs Statistical Feedback
• Regret
• Experts Problem
• EG algorithm
• Multi-Armed Bandits
• ε-greedy, ε-first
• UCB
• Thompson Sampling
Questions?