Decision Analysis
Lecture 10
Tony Cox
My e-mail: [email protected]
Course web site: http://cox-associates.com/DA/
Agenda
• Problem set 8 solutions; Problem set 9
• Hypothesis testing: statistical decision theory view
• Updating normal distributions
• Quality control: Sequential hypothesis testing
• Adaptive decision-making
– Exploration vs. exploitation
– Upper confidence bound (UCB) algorithm
– Thompson sampling for adaptive Bayesian control
– Optimal stopping problems
• Influence diagrams and Bayesian networks
2
Recommended Readings
• Optimal learning. Powell and Frazier,
2008, pp 213, 216-219, 223-4,
https://pdfs.semanticscholar.org/42d8/34f981772af218022be071e739fd96882b12.pdf
• How can decision-making be improved?
Milkman et al., 2008
http://www.hbs.edu/faculty/Publication%20Files/08-102.pdf
• Simulation-optimization tutorial (Carson &
Maria, 1997) (Just skim this one)
https://pdfs.semanticscholar.org/e5d8/39642da3565864ee9c043a726ff538477dca.pdf
• Causal graphs (Elwert, 2013), pp. 245-250
https://www.wzb.eu/sites/default/files/u31/elwert_2013.pdf
3
Homework #8
(Due by 4:00 PM, April 4)
1. An investment yields a normally distributed return with mean $2000 and
standard deviation $1500. Find (a) Pr(loss) and (b) Pr(return > $4000).
2. If there are on average 3.6 chocolate chips per cookie, what is the probability
of finding (a) No chocolate chips; (b) Fewer than 5 chocolate chips; or (c) more
than 10 chocolate chips in a randomly selected cookie?
3. A strike lasts for a random amount of time, T, having an exponential distribution
with a mean of 10 days. What is the probability that the strike lasts (a) Less
than 1 day; (b) Less than 6 days; (c) Between 6 and 7 days; (d) Less than 7
days if it has lasted six days so far?
4. How would the answers to problem 3 change if T were uniformly distributed
between 0 and 10.5 days?
5. A production process for glass bottles creates an average of 1.1 bubbles per
bottle. Bottles with more than 2 bubbles are classified as non-conforming and
sent to recycling. Bubbles occur independently of each other. What is the
probability that a randomly chosen bottle is non-conforming?
4
Solution to HW8 problem 1
(Investment)
• Normal: If return has mean $2000 and
standard deviation $1500, find P(loss) and
P(return > $4000)?
a. pnorm(0, 2000, 1500) =
pnorm(-2000/1500, 0, 1) = 0.09121122;
b. 1 - pnorm(4000,2000,1500) =
1 - pnorm(2000/1500, 0, 1) = 0.09121122
5
Solution to HW8 problem 2
(chocolate chips)
• If there are on average 3.6 chocolate chips
per cookie, what is the probability of finding
(a) No chocolate chips; (b) < 5 chocolate
chips; or (c) > 10 chocolate chips in a
randomly selected cookie?
a. dpois(0, 3.6) = 0.02732372
b. ppois(4,3.6) = 0.7064384
c. 1-ppois(10,3.6) = 0.001271295
6
Solutions to HW8 problem 5
(bubbles)
• P(more than 2 bubbles | r = 1.1 bubbles
per bottle) = 1 - ppois(2, 1.1) =
0.09958372 ≈ 0.1
7
Solutions to HW8 problem 3
(exponential strike)
a. P(strike lasts < 1 day) = pexp(1, 0.1) = 1 - exp(-r*t) = 1 - exp(-0.1*1) = 0.09516258
– pexp(t, r) = P(T < t | r arrivals per unit time) =
P(T < t | 1/r mean time to arrival)
b. P(strike < 6 days) = pexp(6, 0.1)
=1 - exp(-0.1*6) = 0.451188
c. P(6 < T < 7) = pexp(7,0.1) - pexp(6,0.1)
= 1 - exp(-7*0.1) – [1 – exp(-6*0.1)]
= exp(-6*0.1) - exp(-7*0.1) = 0.05222633
8
Solutions to HW8 problem 3
(exponential strike)
d. P(T < 7 | T > 6) = P(6 < T < 7)/P(T > 6) (by
definition of conditional probability, P(A | B)
= P(A & B)/P(B), with A = {T < 7}, B = {T > 6})
= (pexp(7,0.1)-pexp(6,0.1))/(1-pexp(6,0.1))
= 0.09516258 (memoryless, so same as for
part a)
9
Solutions to HW8 problem 4
(uniform strike)
a. P(T < 1) = 1/10.5 = punif(1,0,10.5) =
0.0952381
b. P(T < 6) = 6/10.5 = punif(6,0,10.5) =
0.5714286
c. P(6 < T < 7) = (7 – 6)/10.5 = punif(7,0,10.5)
- punif(6,0,10.5)= 0.0952381
d. P(T < 7 | T > 6) = P(6 < T < 7)/P(T > 6) =
0.0952381 /(1 - 0.5714286) = 0.22222
e. Not memoryless: 0.22 > 0.0952
10
Homework #9, Problem 1
(Due by 4:00 PM, April 11)
• Starting from a uniform prior, U[0, 1], for
success probability, you observe 22 successes
in 30 trials.
• What is your Bayesian posterior probability that
the success probability is greater than 0.5?
11
Homework #9, Problem 2
(Due by 4:00 PM, April 11)
• In a manufacturing plant, it costs $10/day to stock 1 spare
part, $20/day to stock 2 spare parts, etc. ($10 per spare part
per day).
• There are 50 machines in the plant. Each machine breaks
with probability 0.004 per machine per day. (More than one
machine can fail on the same day.)
• If a spare part is available (in stock) when a machine breaks,
it can be repaired immediately, and no production is lost.
• If no spare part is available when a machine breaks, it is idle
until a new part can be delivered (1 day lag). $65 of
production is lost.
• How many spare parts should the plant manager keep in
stock to minimize expected loss?
12
Homework #9 discussion problem
for April 11 (uncollected/ungraded)
• Choice set: Take or Do Not Take
• Chance set (states): Sunshine or Rain
• P(Sunshine) = p = 0.6
• Utilities of act-state pairs:
– u(Take, Sunshine) = 80
– u(Take, Rain) = 80
– u(Do Not Take, Sunshine) = 100
– u(Do Not Take, Rain) = 0
13
Homework #9 discussion
problem (uncollected/ungraded)
1. If p = 0.6, find EU(Take) and EU(Don’t
Take) using Netica
– Goal is to see how Netica deals with
decisions and expected utilities
– May also try it via simulation
2. Update these EUs if a forecast (with error
probability 0.2) predicts rain
14
Hypothesis testing (Cont.)
15
Logic and vocabulary of
statistical hypothesis testing
• Formulate a null hypothesis to be tested, H0
– H0 is “What you are trying to reject”
– If true, H0 determines a probability distribution for
the test statistic (a function of the data)
• Choose α = significance level for test =
P(reject null hypothesis H0 | H0 is true)
• Decision rule: Reject H0 if and only if the
test statistic falls in a critical region of values
that are unlikely (p < α) if H0 is true.
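For concreteness, here is a minimal R sketch of such a test; the data (61 successes in 100 trials) and the null value p = 0.5 are invented for illustration, not taken from the course materials.
# Hypothetical example: one-sided test of H0: success probability = 0.5
binom.test(61, 100, p = 0.5, alternative = "greater")   # p-value = P(61 or more successes | H0), about 0.018
# Reject H0 at significance level alpha = 0.05 if the reported p-value is below alpha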
16
Hypothesis testing picture
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_test_1.htm
17
Interpretation of hypothesis test
• Either something unlikely has happened
(having probability p < α, where p = P(test
statistic has its observed or a more extreme
value | H0 is correct)), or H0 is not true.
• It is conventional to choose a significance
level of α = 0.05, but other values may be
chosen to minimize the sum of costs of
type 1 error (falsely reject H0) and type 2
error (falsely fail to reject H0).
18
Neyman-Pearson Lemma
• How to minimize Pr(type 2 error), given α?
• Answer: Reject H0 in favor of HA if and
only if P(data | HA)/P(data | H0) > k, for
some constant k
– The ratio LR = P(data | HA)/P(data | H0) is
called the likelihood ratio
– With independent samples, P(data | H) =
product of P(xi | H) values for all data points xi
– k is determined from α.
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_neyman_pearson.htm
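A minimal R sketch of the likelihood-ratio computation, assuming two simple hypotheses about a normal mean and simulated data; all values below are illustrative, not from the slides.
# H0: mean = 0 vs HA: mean = 1, known sd = 1 (illustrative simple hypotheses)
set.seed(1)
x <- rnorm(20, mean = 0.8, sd = 1)                   # simulated data
LR <- prod(dnorm(x, 1, 1)) / prod(dnorm(x, 0, 1))    # P(data | HA) / P(data | H0)
LR                                                   # reject H0 if LR exceeds the cutoff k implied by alpha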
19
Statistical decision theory:
Key ideas
• Statistical inference from data can be formulated
in terms of decision problems
• Point estimation: Minimize expected loss from
error, given a loss function
– Implies using posterior mean if loss function is
quadratic (mean squared error)
– Implies using posterior median if loss function is
absolute value of error
• Hypothesis testing: Minimize total expected loss
= loss from false positives + loss from false
negatives + sampling costs
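A quick simulation check of the point-estimation claims above; the "posterior" sample below is an arbitrary stand-in distribution chosen only for illustration.
post <- rgamma(100000, shape = 2, rate = 1)           # stand-in posterior draws
candidates <- c(mean = mean(post), median = median(post))
sapply(candidates, function(a) mean((post - a)^2))    # expected squared-error loss: smallest at the mean
sapply(candidates, function(a) mean(abs(post - a)))   # expected absolute loss: smallest at the median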
20
Updating normal distributions
21
Updating normal distributions
• Probability model: N(m, s²); pnorm(x, m, s)
• Initial uncertainty about input m is modeled
by a normal prior with parameters m0, s0
– Prior for m is N(m0, s0²), which has mean m0
• Observe data: x1 = sample mean of n1
independent observations
• Posterior uncertainty about m: N(m*, s*²),
m* = w·m0 + (1 - w)·x1, s* = sqrt(w·s0²)
• w = (s²/n1)/(s²/n1 + s0²) = 1/(1 + n1·s0²/s²)
22
Bayesian updating of normal
distributions (Cont.)
• Posterior uncertainty about m: N(m*, s*²),
m* = w·m0 + (1 - w)·x1, s* = sqrt(w·s0²)
• w = (s²/n1)/(s²/n1 + s0²) = 1/(1 + n1·s0²/s²)
• Let’s define an “equivalent sample size,” n0,
for the prior, as follows: s0² = s²/n0.
• Then w = n0/(n0 + n1), and the posterior is N(m*, s*²)
– m* = (n0·m0 + n1·x1)/(n0 + n1)
– s* = sqrt(s²/(n0 + n1))
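A small R sketch of the update above; the data standard deviation s is assumed known, and the numbers passed in at the end are made up for illustration.
update_normal <- function(m0, s0, xbar1, n1, s) {
  w <- (s^2/n1) / (s^2/n1 + s0^2)          # weight on the prior mean
  c(m_star = w*m0 + (1 - w)*xbar1,         # posterior mean
    s_star = sqrt(w*s0^2))                 # posterior standard deviation
}
update_normal(m0 = 3, s0 = 2, xbar1 = 4, n1 = 10, s = 1)   # illustrative values only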
23
Predictive distributions
• How to predict probabilities when the inputs
to probability models (p for binom, m and s
for pnorm, etc.) are uncertain?
• Answer 1: Find posterior by Bayesian
conditioning of prior on data.
• Answer 2: Use simulation to sample from
distribution of inputs. Calculate conditional
probabilities from model, given sampled
inputs. Average them to get final probability.
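A minimal sketch of Answer 2 in R, assuming (for illustration) the setup used on the next slide, m ~ N(3, 4) and X | m ~ N(m, 1), and an arbitrary event X > 4.
m <- rnorm(100000, 3, 2)            # sample the uncertain input from its distribution
mean(1 - pnorm(4, m, 1))            # average of the conditional probabilities P(X > 4 | m)
1 - pnorm(4, 3, sqrt(5))            # exact answer from the predictive N(3, 5), for comparison (about 0.33)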
24
Example: Predictive normal
distribution
• If the posterior distribution is N(m*, s*²), then
the predictive distribution is N(m*, s² + s*²)
• Mean is just posterior mean, m*
• Total uncertainty (variance) in prediction =
sum of variance around the (true but
uncertain) mean and variance of the mean
25
Example: Exact vs. simulated
predictive normal distributions
• Model: N(m, 1) with m ~ N(3, 4)
• Exact predictive dist.: N(m*, s² + s*²) = N(3, 5)
• Simulated predictive dist.: N(2.99, 5.077)
> m = y = NULL; m = rnorm(10000, 3, 2); mean(m); sd(m)^2   # sample means from the prior N(3, 4)
> for (j in 1:10000) y[j] = rnorm(1, m[j], 1)              # one observation per sampled mean
> mean(y); sd(y)^2                                         # simulated predictive mean and variance
[1] 3.000202
[1] 4.043804
[1] 2.993081
[1] 5.077026
26
Simulation: The main idea
• To quantify Pr(outcome), create a model for
Pr(outcome | inputs) and Pr(inputs).
– Pr(input) = joint probability distribution of inputs
• Sample values from Pr(input)
– Use R’s rdist functions (rnorm, rpois, rbinom, …)
• Create indicator variable for outcome
– 1 if it occurs on a run, else 0
• Mean value of indicator variable = Pr(outcome)
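A minimal sketch of the indicator-variable idea; the demand model and the threshold below are assumptions invented for this example.
n <- 100000
demand <- rpois(n, 4) + rpois(n, 5)     # sample inputs: two independent Poisson demands (assumed)
indicator <- as.numeric(demand > 10)    # 1 if the outcome of interest occurs on a run, else 0
mean(indicator)                         # estimated Pr(outcome)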
27
Bayesian inference via simulation:
Mary revisited
• Pr(test is positive | disease) = 0.95
• Pr(test is negative | no disease) = 0.90
• Pr(disease) = 0.03
• Find P(disease | test is positive)
• Answer from Bayes’ Rule: 0.2270916
• Answer by simulation (see the R code below); displayed beliefs:
Disease_state: Yes 22.7, No 77.3
Test_result: Positive 100, Negative 0
# Initialize variables
disease_status = test_result = test_result_if_disease = test_result_if_no_disease = NULL
n = 100000
# Simulate disease state and test outcomes
disease_status = rbinom(n, 1, 0.03)
test_result_if_disease = rbinom(n, 1, 0.95)
test_result_if_no_disease = rbinom(n, 1, 0.10)
test_result = disease_status*test_result_if_disease + (1 - disease_status)*test_result_if_no_disease
# Calculate and report desired conditional probability
sum(disease_status*test_result)/sum(test_result)
[1] 0.2263892
28
Wrap-up on probability models
• Highly useful for estimating probabilities in
many standard situations
– Pr(0 arrivals in h hours) if mean arrival rate is
known
– Conservative estimates for proportions
• Useful for showing uncertainty about
probabilities using Bayes’ Rule
– Beta posterior distribution for proportions
29
Binomial models for statistical
quality control decisions:
Sequential and adaptive
hypothesis-testing
30
Quality control decisions
• Observe data, decide what to do
– Intervene in process, accept or reject lot
• P-chart for quality control of process
– For attributes (pass/fail, conform/not conform)
• Lot acceptance sampling
– Accept or reject lot based on sample
• Adaptive sampling
– Sequential probability ratio test (SPRT)
31
“Rule of 3”: Using the binomial
model to bound probabilities
• If no failures are observed in N binomial
trials, then how large might the failure
probability be?
• Answer: At most 3/N
– 95% upper confidence limit
– Derivation: If the failure probability is p, then the
probability of 0 failures in N trials is (1 - p)^N.
– (1 - p)^N > 0.05 ⇒ 1 - p > 0.05^(1/N) ⇒ ln(1 - p) >
-2.9957/N ⇒ -p > -3/N ⇒ p < 3/N
(using the approximation log(1 − x) ≈ −x for small x)
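A one-line numerical check of the approximation in R (the sample sizes are arbitrary).
N <- c(10, 30, 100, 300)
rbind(exact_95pct_UCL = 1 - 0.05^(1/N), rule_of_3 = 3/N)   # exact upper limit vs. the 3/N approximation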
32
P-chart: Pay attention if process
exceeds upper control limit (UCL)
http://www.centerspace.net/blog/nmath-stats/nmath-stats-tutorial/statistical-quality-control-charts/
Decision analysis: Set UCL to minimize average cost of
type 1 (false reject) and type 2 (false accept) errors
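The slide proposes setting the UCL by cost minimization; for comparison, the conventional 3-sigma p-chart limit can be computed directly in R. The long-run proportion and subgroup size below are assumed values.
pbar <- 0.02; n <- 200                   # assumed long-run proportion non-conforming and subgroup size
pbar + 3*sqrt(pbar*(1 - pbar)/n)         # conventional 3-sigma upper control limit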
33
Lot acceptance sampling
(by attributes, i.e., pass/fail inspections)
• Take sample of size n
• Count non-conforming (fail) items
• Accept if number is below threshold; reject
if it is above
• Optimize choice of n and threshold to
minimize expected total costs
– Total cost = cost of sampling + cost of
erroneous decisions
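A rough sketch of that optimization in R; every number below (defect rates, prior probability of a bad lot, and costs) is an assumption invented for illustration.
plan_cost <- function(n, c, p_good = 0.01, p_bad = 0.08, pr_bad = 0.2,
                      cost_sample = 1, cost_false_reject = 200, cost_false_accept = 500) {
  p_reject_good <- 1 - pbinom(c, n, p_good)   # type 1 error: reject a good lot
  p_accept_bad  <- pbinom(c, n, p_bad)        # type 2 error: accept a bad lot
  n*cost_sample + (1 - pr_bad)*p_reject_good*cost_false_reject +
    pr_bad*p_accept_bad*cost_false_accept     # expected total cost
}
plan_cost(n = 20, c = 1); plan_cost(n = 50, c = 2)   # compare two candidate (n, threshold) plans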
34
Lot acceptance sampling:
Inputs and outputs
http://www.minitab.com/en-US/training/tutorials/accessing-the-power.aspx?id=1688
35
Zero-based acceptance sampling
plan calculator
Squeglia Zero-Based Acceptance Sampling Plan Calculator
Enter your process parameters:
– Batch/lot size (N): the number of items in the batch (lot)
– AQL: the Acceptable Quality Level (if no AQL is contractually specified, an AQL of 1.0% is suggested)
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator
36
Zero-based acceptance sampling
plan calculator
Squeglia Zero-Based Acceptance Sampling Plan (Results)
For a lot of 91 to 150 items, and AQL= 10.0% , the Squeglia zero-based
acceptance sampling plan is:
Sample 5 items.
If the number of non-conforming items is:
– 0: accept the lot
– 1 or more: reject the lot
This plan is based on DCMA (Defense Contract Management Agency)
recommendations.
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator
37
Multi-stage lot acceptance sampling
• Take sample of size n
• Count non-conforming (fail) items
• Accept if number is below threshold 1;
reject if it is above threshold 2; sample
again if between the thresholds
– For single-sample decisions, thresholds 1 and
2 are the same
• Optimize choice of n and thresholds to
minimize expected total costs
38
Decision rules for adaptive binomial
sampling: Sequential probability ratio test
(SPRT)
Intuition: The expected slope
of the cumulative-defects
line is the average
proportion of defectives.
This is just the probability of
defective (non-conforming)
items in a binomial sample.
Simulation-optimization (or
math) can identify optimal
slopes and intercepts to
minimize expected total cost
(of sampling + type 1 and
type 2 errors)
http://www.stattools.net/SeqSPRT_Exp.php
39
Generalizations of SPRT
• Main ideas apply to many other (non-binomial) problems
• SPRT decision rule: Use data to compute the likelihood
ratio LRt = P(ct | HA)/P(ct | H0).
• If LRt > (1 – β)/α, then stop and reject H0
• If LRt < β/(1 – α), then stop and accept H0
• Else continue sampling
– ct = number of adverse events by time t
– H0 = null hypothesis (process has acceptably small defect rate); HA
= alternative hypothesis
– α = false rejection rate for H0 (type 1 error rate)
– β = false acceptance rate for H0 (type 2 error rate)
http://www.tandfonline.com/doi/pdf/10.1080/07474946.2011.539924?noFrame=true
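A minimal SPRT sketch for a Bernoulli defect rate; the hypothesized rates and error rates below are illustrative assumptions, not values from the cited papers.
sprt <- function(x, p0 = 0.01, pA = 0.05, alpha = 0.05, beta = 0.10) {
  A <- (1 - beta)/alpha; B <- beta/(1 - alpha)            # stopping boundaries
  LR <- 1
  for (t in seq_along(x)) {
    LR <- LR * dbinom(x[t], 1, pA)/dbinom(x[t], 1, p0)    # update likelihood ratio with each inspected item
    if (LR >= A) return(list(decision = "reject H0", items_inspected = t))
    if (LR <= B) return(list(decision = "accept H0", items_inspected = t))
  }
  list(decision = "continue sampling", items_inspected = length(x))
}
set.seed(2)
sprt(rbinom(500, 1, 0.05))   # items drawn from the alternative: H0 is typically rejected after a small sample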
40
Implementing the SPRT
• Optimal slopes and intercepts to achieve different
combinations of type 1 and type 2 errors are tabulated.
Example application:
Testing for mean time to
failure (MTTF) of
electronic components
41
Decision rules for adaptive binomial
sampling: Sequential probability ratio test
(SPRT)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
42
Application: SPRT for deaths from
hospital heart operations
http://www.bmj.com/content/328/7436/375?ijkey=144017772645bb38936abd6f209cd96bfd1930c3&keytype2=tf_ipsecsha&linkType=ABST&journalCode=bmj&resid=328/7436/375
43
SPRT can greatly reduce sample sizes (e.g.,
from hundreds to 5, for construction defects)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
44
Nonlinear boundaries and truncated
stopping rules can refine the basic idea
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
45
Wrap-up on SPRT
• Sequential and adaptive sampling can reduce
total decision costs (costs of sampling + costs of
error)
• Computationally sophisticated (and challenging)
algorithms have been developed to
approximately optimize decision boundaries for
statistical decision rules
• Adaptive approaches are especially valuable for
decisions in uncertain and changing
environments.
46
Multi-arm bandits and adaptive
learning
47
Multi-arm bandit (MAB) decision problem:
Comparing uncertain reward distributions
• Multi-arm bandit (MAB) decision problem: On
each turn, can select any of k actions
– Context-dependent bandit: Get to see a “context”
(signal) x before making decision
• Receive random reward with (initially unknown)
distribution that depends on the selected action
• Goal: Maximize sum (or discounted sum) of
rewards; minimize regret (= expected difference
between best (if distributions were known) and
actually received cumulative rewards)
http://jmlr.org/proceedings/papers/v32/gopalan14.pdf Gopalan et al., 2014
https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/
48
MAB applications
• Clinical trials: Compare old drug to new.
Which has higher success rate?
• Web advertising, A/B testing: Which version of a
web ad maximizes click-through, purchases, etc.?
• Public policies: Which policy best
achieves its goals?
– Use evidence from early adopter locations to
inform subsequent choices
49
Upper confidence bound (UCB1)
algorithm for solving MAB
• Try each action once.
• For each action a, record average reward m(a)
obtained from it so far and how many times it has
been tried, n(a).
• Let N = Σa n(a) = total number of actions so far.
• Choose next the action with the greatest upper
confidence bound (UCB): m(a) + sqrt(2*log(N)/n(a))
– Implements “Optimism in the face of uncertainty” principle
– UCB for a decreases quickly with n(a), increases slowly with N
– Achieves theoretical optimum: logarithmic growth in regret
• Same average increase in first 10 plays as in next 90, then next 900, and so on
– Requires updating n(a) and m(a) every round (not batch updating)
Auer et al., 2002 http://homes.dsi.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
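A compact simulation sketch of UCB1 for a Bernoulli bandit; the true success rates are assumptions used only to generate rewards.
p_true <- c(0.2, 0.5, 0.8)                  # unknown to the algorithm; assumed here to simulate rewards
k <- length(p_true)
n <- rep(0, k); m <- rep(0, k)              # play counts and average rewards per action
for (a in 1:k) { n[a] <- 1; m[a] <- rbinom(1, 1, p_true[a]) }   # try each action once
for (t in (k + 1):2000) {
  ucb <- m + sqrt(2*log(sum(n))/n)          # upper confidence bounds
  a <- which.max(ucb)                       # optimism in the face of uncertainty
  r <- rbinom(1, 1, p_true[a])
  m[a] <- (m[a]*n[a] + r)/(n[a] + 1)        # update running average reward
  n[a] <- n[a] + 1
}
n                                           # most plays should go to the best arm (the third)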
50
Thompson sampling and adaptive
Bayesian control: Bernoulli trials
• Basic idea: Choose each of the k actions
according to the probability that it is best
• Estimate the probability via Bayes’ rule
– It is the mean of the posterior distribution
– Use beta conjugate prior updating for
“Bernoulli bandit” (0-1 reward, fail/succeed)
S = success
F = failure
http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
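A matching sketch of Thompson sampling for the Bernoulli bandit with Beta(1, 1) priors; as before, the true success rates are assumptions used only to simulate rewards.
p_true <- c(0.2, 0.5, 0.8); k <- length(p_true)
succ <- rep(0, k); fail <- rep(0, k)        # the slide's S (successes) and F (failures) per arm
for (t in 1:2000) {
  theta <- rbeta(k, succ + 1, fail + 1)     # one draw from each arm's beta posterior
  a <- which.max(theta)                     # each arm is played with the probability that it is best
  r <- rbinom(1, 1, p_true[a])
  succ[a] <- succ[a] + r; fail[a] <- fail[a] + 1 - r   # beta conjugate updating
}
succ + fail                                 # plays per arm: the best arm should dominate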
51