Decision Analysis
Lecture 11
Tony Cox
My e-mail: [email protected]
Course web site: http://cox-associates.com/DA/
Agenda
• Recommended readings
• Problem set 9 solutions
• Problem set 10
• Where are we going?
• Wrap-up on multi-arm bandit (MAB)
– Thompson sampling
• Wrap-up on adaptive decisions
• Optimal stopping and optimal replacement
• Bayesian networks, CART, partial dependence plots
• Influence diagrams
2
Recommended Readings
• Skim Bell (1988), pages 1416-1418, on
single-attribute utility theory
http://www.people.hbs.edu/dbell/one%20switch.pdf
• Skim Abbas (2010), pages 62-67 and 74-77, on multi-attribute utility theory (MAUT)
http://create.usc.edu/sites/default/files/publications/tutorialmau.pdf
3
Homework #9, Problem 1
Updating a uniform prior
• Starting from a uniform prior, U[0, 1], for
success probability, you observe 22 successes
in 30 trials.
• What is your Bayesian posterior probability that
the success probability is greater than 0.5?
• Bayesian updating: P(p > 0.5 | x = 22, n = 30) =
1 - pbeta(0.5, 23, 9) = 0.994663
• Uses pbeta with updated shape parameters x + 1 = 23 and n - x + 1 = 9, based on x = 22 successes in n = 30 trials
• Posterior is Beta(x + 1, n - x + 1) = Beta(23, 9). (See the R check below.)
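A minimal R check of this update (same uniform Beta(1, 1) prior as above):
x = 22; n = 30
1 - pbeta(0.5, x + 1, n - x + 1)   # P(p > 0.5 | data), approximately 0.9947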
4
Homework #9, Problem 2:
Spare parts
• In a manufacturing plant, it costs $10/day to stock 1 spare
part, $20/day to stock 2 spare parts, etc. ($10 per spare part
per day).
• There are 50 machines in the plant. Each machine breaks
with probability 0.004 per machine per day. (More than one
machine can fail on the same day.)
• If a spare part is available (in stock) when a machine breaks,
it can be repaired immediately, and no production is lost.
• If no spare part is available when a machine breaks, it is idle
until a new part can be delivered (1 day lag). $65 of
production is lost.
• How many spare parts should the plant manager keep in
stock to minimize expected loss?
5
Spare parts solution
• Number of failures in a day has binomial
distribution with n = 50 machines and p =
0.004 per machine per day.
– Mean is np= 50*0.004 = 0.2 failures per day,
so we expect about one breakdown every 5
days
• The cost of keeping 1 spare is $10/day
• The cost per machine failure is $65 if no
spare is available, else 0
6
Spare parts (cont.)
• With 0 spares, the average loss per day
due to failures is ($65 per unrepaired
failure)*(0.2 failures/day) = $13/day
• With 1 spare, average loss/day ≈ $10/day
+ $65*(1 - pbinom(1, 50, 0.004)) = $11.12
– Exact cost < $10 + 65*dbinom(2,50,0.004) +
2*65*dbinom(3,50,0.004) + 3*65*dbinom(4,50,0.004) +
sum(c(4:50))*65*dbinom(5,50,0.004) = $11.34 (the last term bounds the tail; see the exact R calculation below)
• With 2 spares, average loss > $20/day
• So, optimal decision is: Stock 1 spare
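A minimal R sketch of the exact expected-cost comparison (the helper name expected_cost is ours; costs and failure probabilities are from the problem statement):
n = 50; p = 0.004
expected_cost = function(s) {               # expected daily cost of stocking s spares
  k = 0:n
  shortage = pmax(k - s, 0)                 # failures that find no spare in stock
  10*s + 65*sum(shortage*dbinom(k, n, p))
}
round(sapply(0:2, expected_cost), 2)        # approximately 13.00, 11.20, 20.07 -> stock 1 spare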
7
Homework #9 discussion problem
for April 11 (uncollected/ungraded)
• Choice set: Take or Do Not Take
• Chance set (states): Sunshine or Rain
• P(Sunshine) = p = 0.6
• Utilities of act-state pairs:
– u(Take, Sunshine) = 80
– u(Take, Rain) = 80
– u(Do Not Take, Sunshine) = 100
– u(Do Not Take, Rain) = 0
8
Homework #9 discussion
problem (uncollected/ungraded)
1. If p = 0.6, find EU (Take) and EU(Don’t
Take) using Netica
– Goal is to see how Netica deals with
decisions and expected utilities
– May also try it via simulation
2. Update these EUs if a forecast (with error
probability 0.2) predicts rain
9
The Umbrella Problem
• Choice set: Take or Do Not Take
• Chance set (states): Sunshine or Rain
• P(Sunshine) = p = 0.6
• Utilities of act-state pairs:
– u(Take, Sunshine) = 80
– u(Take, Rain) = 80
– u(Do Not Take, Sunshine) = 100
– u(Do Not Take, Rain) = 0
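An exact (non-simulation) R check, using p = 0.6 and the forecast error probability 0.2 from the discussion problem:
p = 0.6; err = 0.2
p*80 + (1 - p)*80                              # EU(Take) = 80
p*100 + (1 - p)*0                              # EU(Do Not Take) = 60
psun_fr = p*err/(p*err + (1 - p)*(1 - err))    # P(Sunshine | rain forecast) = 0.2727
psun_fr*100                                    # EU(Do Not Take | rain forecast) = 27.27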
10
Netica solution without forecast known
Weather: Sunshine 60.0, Rain 40.0
forecast: Rain predicted 44.0, Sunshine predicted 56.0
Take_Umbrella (expected utilities): Take 80.0000, Do Not Take 60.0000
Consequence (utility node)
11
Netica solution with forecast of rain
Weather: Sunshine 27.3, Rain 72.7
forecast: Rain predicted 100, Sunshine predicted 0
Take_Umbrella (expected utilities): Take 80.0000, Do Not Take 27.2727
Consequence (utility node)
• Netica automatically updates EU of different decisions, given observations.
• Influence diagrams are just as easy to create and solve as Bayesian networks
12
Solving by simulation in R:
Part A (no forecast)
# Initialize variables
weather = utility = NULL; p = 0.6; n = 100000;
# simulate weather state
weather = rbinom(n, 1, p);
# calculate and report simulated probability of sun
psun = mean(weather); psun
# calculate and report utilities of each action
EUtake = psun*80+(1- psun)*80;
EUleave = psun*100+(1- psun)*0;
EUtake; EUleave
13
Solving by simulation in R:
Part A (no forecast)
> # Initialize variables
> weather = utility = NULL; p = 0.6; n = 100000;
> # simulate weather state
> weather = rbinom(n, 1, p);
> # calculate and report simulated probability of sun
> psun = mean(weather); psun
[1] 0.59519
> # calculate and report utilities of each action
> EUtake = psun*80+(1- psun)*80;
> EUleave = psun*100+(1- psun)*0;
> EUtake; EUleave
[1] 80
[1] 59.519
14
Solving by simulation in R
Part B (with forecast)
# Initialize variables
weather = forecastsun = utility = NULL; p = 0.6; n = 100000;
# simulate weather states and forecasts
weather = rbinom(n, 1, p); forecast_sun_if_rain = rbinom(n, 1, 0.2);
forecast_sun_if_sun = rbinom(n,1, 0.8); forecastsun =
weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain;
forecastRain=1-forecastsun
# calculate and report desired conditional probability
psunifForecastRain =
sum(weather*forecastRain)/sum(forecastRain); psunifForecastRain
# calculate and report utilities of each action
EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80;
EUleave = psunifForecastRain*100+(1- psunifForecastRain)*0;
EUtake; EUleave
15
Solving by simulation in R
Part B (with forecast)
> # Initialize variables
> weather = forecastsun = utility = NULL; p = 0.6; n = 100000;
> # simulate weather states and forecasts
> weather = rbinom(n, 1, p); forecast_sun_if_rain = rbinom(n, 1, 0.2);
forecast_sun_if_sun = rbinom(n,1, 0.8); forecastsun =
weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain;
forecastRain=1-forecastsun
> # calculate and report desired conditional probability
> psunifForecastRain = sum(weather*forecastRain)/sum(forecastRain);
psunifForecastRain
[1] 0.2764975
> # calculate and report utilities of each action
> EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80; EUleave =
psunifForecastRain*100+(1- psunifForecastRain)*0; EUtake; EUleave
[1] 80
[1] 27.64975
16
Homework #10 (optional)
(Due by 4:00 PM, April 18 if you want it graded)
• A deep sea oil drilling platform has a normally distributed lifetime (until
failure) with a mean of 30 years and a standard deviation of 4 years.
• While it is operating, the platform produces oil worth $10M per year.
• Voluntarily stopping operations and closing down the platform costs
$0.
• Having the platform fail while still in use leads to involuntary closure
and a cost of $50M.
• At what age should the platform be voluntarily shut down (if it has not
yet failed)?
– Hint: Continue until marginal benefit < expected marginal cost of
continuing
– Hint: Use a hazard function calculator for normally distributed
lifetimes, e.g., http://reliabilityanalyticstoolkit.appspot.com/normal_distribution
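One way to act on these hints in R (a sketch of the comparison only, not the full solution; hazard() is a helper defined here, not part of the assignment):
# normal-lifetime hazard h(t) = f(t)/S(t); compare the $10M marginal benefit of one more
# year of operation with $50M * h(t), the expected marginal cost of risking failure that year
hazard = function(t, mu = 30, sigma = 4) dnorm(t, mu, sigma)/(1 - pnorm(t, mu, sigma))
age = c(20, 25, 30, 35)
cbind(age, expected_marginal_cost = 50*hazard(age), marginal_benefit = 10)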
17
Where are we going?
• Student observation 1: “Here is my solution for homework
9. I have to say the course is really getting deeper and
deeper. And I'm trying to understand all the content.”
• Student observation 2: If I could totally understand and
use a few of these techniques, it might be more useful
than seeing more. (Exploration vs. exploitation trade-off.)
• Goals:
– Be able to use key techniques (e.g., computing EMV with known
probabilities and values) that are relatively simple and useful.
– Be aware of more advanced methods and “big ideas” that may
prove useful; know what to look for or ask for as a manager
18
Top-level ideas
• Many valuable decision problems can be solved using
the philosophy of simulation-optimization: Try different
decisions, evaluate their probable consequences,
choose the one with best (EMV or EU-maximizing)
probability distribution of consequences
• Both the probability modeling (“simulation”) and the
search for a best decision or decision rule can become
very technical
• For business applications, understanding what problems
can be solved and how to formulate them is the most
important part. (The rest is “just software” or using
appropriate expertise)
19
MAB Thompson sampling
(cont.)
20
Thompson sampling and adaptive
Bayesian control: Bernoulli trials
• Basic idea: Choose each of the k actions
according to the probability that it is best
• Estimate the probability via Bayes’ rule
– It is the mean of the posterior distribution
– Use beta conjugate prior updating for “Bernoulli
bandit” (0-1 reward, fail/succeed)
– Sample from posterior for each arm, 1… k; choose
the one with highest sample value. Update & repeat.
S = success
F = failure
http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
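A minimal R sketch of this loop for a Bernoulli bandit (the three arm success probabilities below are hypothetical, chosen only for illustration; S and F mirror the legend above):
set.seed(1)
true_p = c(0.3, 0.5, 0.6)                  # hypothetical arm success probabilities
S = rep(0, 3); F = rep(0, 3)               # S = success counts, F = failure counts (Beta(1,1) priors)
for (i in 1:1000) {
  draws = rbeta(3, S + 1, F + 1)           # one posterior sample per arm
  a = which.max(draws)                     # play the arm with the largest sampled value
  r = rbinom(1, 1, true_p[a])              # observe a 0/1 reward
  if (r == 1) S[a] = S[a] + 1 else F[a] = F[a] + 1
}
S + F                                      # pulls per arm; most should go to the best arm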
21
Thompson sampling: General
stochastic (random) rewards
• Second idea: Generalize to arbitrary
reward distribution (normalized to the
interval [0, 1]) by considering a trial a
“success” with probability equal to its
reward
http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
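A one-line R illustration of this trick (the reward value is hypothetical):
reward = 0.7                      # observed reward, already rescaled to [0, 1]
r = rbinom(1, 1, reward)          # count the trial as a "success" with probability equal to the reward
# r then updates the Beta posterior exactly as in the Bernoulli case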
22
Thompson sampling with
complex online actions
• Main idea: Embed simulation-optimization in the Thompson sampling loop
– Θ = state space, S
– Sample the states
• Applications: Job scheduling (assigning jobs to machines); web advertising with reward depending on the sets of ads shown
– Y = observation, h = reward, X = random variable depending on θ
• Updating posteriors can be
done efficiently using a
sampling-based approach
(particle filtering)
Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf
23
Comparing methods
• In simulation experiments,
Thompson sampling works
well with batch updating, even
with slowly or occasionally
changing rewards and other
realistic complexities.
• Beats UCB1 in many but not
all comparisons
• More practical than UCB1 for
batch updating because it
keeps experimenting (trying
actions with some
randomness) between
updates.
http://engineering.richrelevance.com/recommendations-thompson-sampling/
24
MAB variations
• Contextual bandits
– See signal before acting
– Constrained contextual bandits: Actions constrained
• Adversarial bandits
– Adaptive adversaries
– Bubeck and Slivkins, 2012: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/COLT12_BS.pdf
• Restless bandits: Probabilities change
• Gittins index maximizes expected discounted
reward, not easy to compute
• Correlated bandits
25
Wrap-up on MAB problems
• Adaptive Bayesian learning works well in
simple environments, including many of
practical interest
• The resulting rules are *much* simpler to
implement than previous methods (e.g.,
Gittins index policies)
• Sampling-based approaches (Thompson sampling, particle filtering, etc.) promote computationally practical “online learning”
26
Wrap-up on adaptive learning
• No need for a causal model
– Learn act-consequence probabilities and
optimal decision rules directly
• Assumes a stationary (or slowly changing)
decision environment, known choice set,
immediate feedback (reward) following
action
• Works very well when these assumptions
are met: low-regret learning is possible
27
Optimal stopping
28
Optimal stopping decision problems
• Suppose that a decision-maker (d.m.) faces a
random sequence of opportunities
• How long to wait for best one?
• When to stop and commit to a final choice?
• Examples: Selling a house, hiring a new
employee, accepting a job offer, replacing a
component, shuttering an aging facility, taking a
parking spot, etc.
• Other optimal stopping problems: Least-cost
policies for replacing aging components
29
Hazard functions: Conditional rate
of failure given survival so far
• Let T = length of life for a component (or person,
or time until first occurrence of an event, etc.)
– T is a random variable with cdf F(t) = Pr(T < t) and
survival function S(t) = 1 – F(t) = Pr(T > t)
– The pdf for T is then f(t) = F’(t) = dF(t)/dt
• The hazard function for T is defined as:
• h(t) = lim (dt → 0) Pr(t < T < t + dt | T > t)/dt
• h(t) = f(t)/S(t) = f(t)/[1 – F(t)]
– Interpretation: “instantaneous failure rate”
• h(t)·dt ≈ Pr(failure occurs in next dt | survival until t)
– In discrete time, dt = 1, no limit is taken
30
Using hazard functions to guide decisions
• The shape of the hazard function can often guide decisions, e.g.:
– If h(t) is increasing, then the optimal time to stop is when h(t) reaches a certain threshold (see the R sketch below)
– If h(t) is decreasing, then the best decision is either don’t start or else continue until failure occurs
– Normal distribution hazard function calculator: http://reliabilityanalyticstoolkit.appspot.com/normal_distribution
– SPRT and other calculators: http://reliabilityanalyticstoolkit.appspot.com/
www.wolfram.com/mathematica/new-in-9/enhanced-probability-and-statistics/define-a-distribution-given-its-hazard-function.html
https://www.ncss.com/software/ncss/survival-analysis-in-ncss/
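A minimal R sketch of the increasing-hazard threshold rule (the Normal(20, 5) lifetime and the 0.1 threshold are hypothetical illustration values):
mu = 20; sigma = 5; threshold = 0.1
hazard = function(t) dnorm(t, mu, sigma)/pnorm(t, mu, sigma, lower.tail = FALSE)
uniroot(function(t) hazard(t) - threshold, c(5, 35))$root   # first age at which h(t) crosses the threshold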
31
Example: optimal age replacement
• The lifetime T of a component is a random variable
with known distribution
• Suppose it costs $10 to replace the component before it fails and $50 to replace it if it fails.
• When should the component be voluntarily replaced
(if not failed yet)?
• Answer can be calculated by minimizing expected
average cost per cycle (or equating marginal benefit
to marginal cost for continuing), but calculations are
detailed and soon get tedious
• Alternative: Google “optimal replacement age
calculator”
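A hedged R sketch of the “minimize expected average cost per cycle” calculation (the Weibull(3, 10) lifetime is hypothetical; the $10 planned / $50 failure costs are from the example above):
cost_rate = function(a, cp = 10, cf = 50, shape = 3, scale = 10) {
  S = function(t) pweibull(t, shape, scale, lower.tail = FALSE)   # survival function
  cycle_cost = cp*S(a) + cf*(1 - S(a))      # expected cost incurred per replacement cycle
  cycle_cost/integrate(S, 0, a)$value       # divide by expected cycle length
}
ages = seq(1, 20, 0.1)
ages[which.min(sapply(ages, cost_rate))]    # replacement age minimizing long-run cost per unit time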
32
Optimal age replacement calculator
http://www.reliawiki.org/index.php/Optimum_Replacement_Time_Example
33
Optimal selling of an asset
• If offers arrive sequentially from a known distribution and
costs of waiting are known, then an optimal decision
boundary (blue) can be constructed to maximize EMV
[Figure: W(t) = price series (red), S(t) = maximum price so far; sell when the red line first hits the blue decision boundary.]
http://file.scirp.org/Html/9-1040163_25151.htm
34
Optimal stopping: Variations
• Offers arrive sequentially from an unknown distribution
– Bayesian updating provides solutions
• Time pressure: Must sell by deadline, or fixed number of
offers
• With or without being able to go back to previous offers
[Figure: sell when the blue line first hits the green decision boundary.]
http://file.scirp.org/Html/9-1040163_25151.htm
35
Wrap-up on optimal stopping
and statistical decision theory
• Many valuable decision problems can be solved using
the philosophy of simulation-optimization:
– Try different decisions, evaluate their probable consequences
– Choose the one with best (EMV or EU-maximizing) probability
distribution of consequences
• Finding a best decision or decision rule can become very
technical
– Use appropriate software or on-line calculators
• For business applications, understanding how to
formulate decision problems and solve them with
software can create high value in practice
36
Introduction to descriptive analytics using information-based algorithms (BN learning, CART trees, randomForest, partial dependence plots)
37
Descriptive analytics:
What’s going on?
• What is the current situation?
– Attribution: How much harm/loss/opportunity cost
is being caused by X?
• Causes are often unobserved or uncertain
• What has changed recently? (Why?)
– Example: More extreme event reports caused by
real change or by media?
– Change-point analysis (CPA) algorithms
• What should we worry about?
– How is this year’s season shaping up?
38
Air pollution example in CAT*
• Load data in Excel, click
Excel to R to send it to R
– Los Angeles air basin
– 1461 days, 2007-2010
(Lopiano et al., 2015,
thanks to Stan Young for
data)
– PM2.5 data from CARB
– Elderly mortality
(“AllCause75”) from CA
Department of Health
– Daily min and max temps &
max relative humidity from
ORNL and EPA
Risk question: Does PM2.5 exposure increase elderly
mortality risk? If so, by how much? (P(death | PM2.5) = ?)
Causal Analytics Toolkit, http://cox-associates.com/downloads/
39
Classification Trees
• A powerful, popular method for data mining,
machine learning, and CPT estimation
– In R, partytree, ctree, rpart, and other algorithms provide
CART (Classification and Regression Tree) algorithms
– Can download applet from:
www.cs.ubc.ca/labs/lci/CIspace/dTree/
• Basic idea: “Always ask the most informative
question next” for reducing uncertainty (conditional
entropy) about the dependent variable.
• Partitions a set of cases into groups or clusters
(the leaf nodes) with similar conditional
probabilities for dependent variable.
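A minimal rpart example on R's built-in iris data (any data set with a categorical outcome would do):
library(rpart)
fit = rpart(Species ~ ., data = iris, method = "class")
print(fit)               # each split is the most informative question found greedily
# plot(fit); text(fit)   # optional: draw the tree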
40
Air pollution-mortality example:
Classification tree descriptive analytics
• tmin, tmax, month, year,
MAXRH are identified as
potential predictors of
AllCause75 (elderly mortality)
• PM2.5 does not appear in this
tree
– AllCause75 = elderly mortality
count per day is conditionally
independent of PM2.5 in this
analysis, given the other
variables in the tree
– Making year and month into
categorical variables changes
the tree but not this
conclusion.
41
How a CART tree works: Concept
• Basic idea: Always ask the most informative question next, given answers so far.
– Questions are represented by splits in the tree
– Leaf nodes show conditional means (or conditional distributions) of the dependent variable
– Internal nodes show the significance level for the split: how significant the differences between conditional distributions are
– Can handle continuous, categorical, ordered categorical, and binary variables, and missing values
• Reduces prediction error for the dependent variable
• Stop this “recursive partitioning” when further questions (splits in the tree) do not significantly improve prediction.
– Classification & Regression Tree (CART) algorithm
– Which variables are informative in predicting the dependent variable can depend on what other predictors are included.
• Taking out month makes PM2.5 informative.
• Some refinements (see the randomForest example below):
– Grow a large tree and prune back to minimize cross-validation error
– Fit multiple trees to random subsets of data and let them vote for best splits (“bagging”)
– Over-train on mis-predicted cases (“boosting”)
– Average predictions from many trees (“RandomForest” ensemble prediction)
– Join prediction “patches” together smoothly (MARS)
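A short illustration of the ensemble refinement, using the randomForest package on the same built-in iris data (the choice of data set is ours):
library(randomForest)
rf = randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                            # out-of-bag error estimate
partialPlot(rf, iris, Petal.Length)  # partial dependence of the prediction on one predictor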
42
How to read a CART tree
• Tips of the tree (“leaf nodes”) give the conditional expected value (for continuous variables) or conditional distributions (for discrete variables) of the dependent variable, given the value ranges of the variables along the path from the top of the tree (the “root node”) to the leaf node.
– The dependent variable is called “y” by default in the tree
– Branches (“splits”) show the value ranges being conditioned on
• The tips of the tree also show how many cases in the data set are described by the combination of values leading to that leaf node.
– These are the “n” values at the leaf nodes
• Example:
– n = 92 days had tmin < 43 degrees and PM2.5 < 14 µg/m3.
– An average of y = 141.696 elderly people per day died on days with this description.
43
Creating a CART tree in CAT
1. Select dependent variable first, by clicking on a column heading in the Data sheet
2. Select predictor columns using Ctrl + click
– Can select in any order
3. Click on desired analysis (Tree)
4. Output appears on new tab
5. (If unsure what analyses to do after columns selected, click on Analyze.)
44
Classification/Decision Tree
Algorithms in Practice
• Different decision tree (= classification tree)
algorithms use different splitting, stopping,
and pruning criteria.
• Examples of tree algorithms:
– CHAID: chi-square automatic interaction detection
– CART: Classification and regression trees, allows
continuous as well as discrete variables
– MARS: Multivariate adaptive regression splines, smooths the relations at tree-tips so they fit together smoothly
– KnowledgeSeeker allows multiple splits, variable types
– ID3, C5.0, etc.
45
Recursively identify parents (or
potential parents) for each node.
[Figure: classification tree for the target “Q1_1 = 5?” (overall breakdown 40.2% / 59.8%, n = 1424). The tree splits first on Q2jjj “all questions answered ok”, then on Q3_b “SA offered to demonstrate”, and then (twice) on Q8_b “easy to use and deal with”; each leaf shows its percentage breakdown and number of cases.]
46
Causal graph structure # 1
[Figure: causal graph linking Q3h “reviewed check list”, Net1_3 “Helped me choose products”, Q7_1 “End-to-end experience”, NET1_5 “Wait time ok”, Q3f “offered to set up on-line billing”, Q3b “offered to demonstrate”, Q2uu “acknowledged when entered store”, and Q2jjj “questions answered”.]
Classification-tree Tests of CI
• In a classification tree, the dependent
variable (root node) is conditionally
independent of all variables not in the
tree, given the variables that are in the
tree (at least as far as the tree-growing
heuristic can discover).
• Starting with a childless node (output
node), we can recursively seek direct
parents of all nodes.
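One way to sketch this recursive parent search in R (the data frame df, outcome y, and helper tree_parents are hypothetical; rpart stands in for whatever tree-growing algorithm is used):
library(rpart)
tree_parents = function(target, candidates, data) {
  fit = rpart(reformulate(candidates, response = target), data = data)
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")   # variables actually used in splits
}
# parents = tree_parents("y", setdiff(names(df), "y"), df)                        # candidate direct parents of y
# lapply(parents, function(p) tree_parents(p, setdiff(names(df), c("y", p)), df)) # then their parents, and so on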
48