Data analysis and uncertainty
Outline
• Random Variables
• Estimation
• Sampling
Introduction
• Reasons for Uncertainty
  – Prediction
    • Making a prediction about tomorrow based on data we have today
  – Sampling
    • The data may be a sample from a population, and we do not know how our sample differs from other samples (or from the population)
  – Missing or unknown values
    • We need to guess these values
    • Example: censored data
Introduction
• Dealing with Uncertainty
  – Probability
  – Fuzzy logic
• Probability Theory vs. Probability Calculus
  – Probability Theory
    • The mapping from the real world to a mathematical representation
  – Probability Calculus
    • Based on well-defined and generally accepted axioms
    • The aim is to explore the consequences of those axioms
Introduction
• Frequentist (probability is objective)
  – The probability of an event is defined as the limiting proportion of times that the event would occur in identical situations
  – Examples
    • The proportion of times a head comes up when the same coin is tossed repeatedly
    • Assessing the probability that a supermarket customer will buy a certain item (using similar customers)
Introduction
• Bayesian (subjective probability)
  – Explicit characterization of all uncertainty, including any parameters estimated from the data
  – Probability is an individual's degree of belief that a given event will occur
• Frequentist vs. Bayesian
  – Toss a coin 10 times and get 7 heads
  – The frequentist estimate is P(head) = 7/10
  – A Bayesian starts from a prior belief, e.g. P(head) = 0.5, and combines that prior with the data to estimate the probability (see the sketch below)
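A minimal sketch of the contrast, assuming a symmetric Beta(5,5) prior (an illustrative choice; the slides only say the prior belief is 0.5):

```python
# Frequentist vs. Bayesian estimates for 7 heads in 10 tosses.
# The Beta(5,5) prior (centered on 0.5) is an illustrative assumption.
heads, n = 7, 10
alpha, beta = 5, 5                                 # prior pseudo-counts

freq_est = heads / n                               # MLE: 0.7
bayes_est = (heads + alpha) / (n + alpha + beta)   # posterior mean: 12/20 = 0.6

print(f"frequentist estimate:    {freq_est:.2f}")
print(f"Bayesian posterior mean: {bayes_est:.2f}")
```

The Bayesian answer is pulled from 0.7 toward the prior belief of 0.5; with more data the two estimates converge.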
Random variable
• A mapping from a property of objects to a variable that can take a set of possible values, via a process that appears to the observer to have an element of unpredictability
• Examples
  – Coin toss (the domain is the set {heads, tails})
  – Number of times a coin has to be tossed to get a head (the domain is the positive integers)
  – A student's score (the domain is the set of integers from 0 to 100)
Properties of single random variable
• X is random variable and x is its value
• Domain is finite:
– probability mass function p(x)
• Domain is real line:
– probability density function f(x)
• Expectation of X
  – Discrete case: E[X] = \sum_i x_i \, p(x_i)
  – Continuous case: E[X] = \int x \, f(x) \, dx
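A quick numeric illustration of the discrete formula (a fair six-sided die; not an example from the slides):

```python
# Expectation of a discrete random variable: E[X] = sum_i x_i * p(x_i).
values = [1, 2, 3, 4, 5, 6]     # a fair six-sided die
probs = [1 / 6] * 6

expectation = sum(x * p for x, p in zip(values, probs))
print(expectation)              # 3.5
```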
Multivariate random variable
• Set of several random variables
• For a p-dimensional vector x = {x_1, ..., x_p}
• The joint mass function is
  p(X_1 = x_1, \ldots, X_p = x_p) = p(x_1, \ldots, x_p)
The joint mass function
• For example
– Roll two fair dice; X is the first die's result and Y is the second's
– Then p(X = 3, Y = 3) = 1/6 × 1/6 = 1/36
The joint mass function
          X=1    X=2    X=3    X=4    X=5    X=6   | P_Y(y)
  Y=1     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=2     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=3     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=4     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=5     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=6     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  P_X(x)  0.17   0.17   0.17   0.17   0.17   0.17  |  1.00

Each marginal entry is 6/36 ≈ 0.17, and the joint probabilities sum to 1.00.
Marginal probability mass function
• The marginal probability mass functions of X and Y are
  f_X(x) = p(X = x) = \sum_y p_{XY}(x, y)
  f_Y(y) = p(Y = y) = \sum_x p_{XY}(x, y)
Continuous
• The marginal probability density functions of X and Y are
  f_X(x) = \int f_{XY}(x, y) \, dy
  f_Y(y) = \int f_{XY}(x, y) \, dx
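In the discrete case the marginals are just row and column sums of the joint table. A minimal sketch using the two-dice table above:

```python
import numpy as np

# Joint pmf of two fair dice: a 6x6 table with every entry 1/36.
joint = np.full((6, 6), 1 / 36)

# Marginals: sum out the other variable (rows are Y, columns are X).
p_x = joint.sum(axis=0)      # sum over y for each x
p_y = joint.sum(axis=1)      # sum over x for each y

print(p_x)                   # each entry 1/6, the 0.17 in the table
print(p_x.sum(), p_y.sum())  # both 1.0
```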
Conditional probability
• The density of a single variable (or a subset of the complete set of variables) given (or "conditioned on") particular values of the other variables
• The conditional density of X given some value of Y is denoted f(x|y) and defined as
  f(x \mid y) = \frac{f(x, y)}{f(y)}
Conditional probability
• For example
  – A student's score is chosen uniformly at random
  – The sample space is S = {0, 1, ..., 100}
  – What is the probability that the student fails (scores below 60)?
    P(F) = \frac{|F|}{|S|} = \frac{60}{101}
  – Given that the student's score is even (including 0), what is the probability that the student fails?
    |E| = 51, \quad |E \cap F| = 30
    P(F \mid E) = \frac{P(E \cap F)}{P(E)} = \frac{30/101}{51/101} = \frac{30}{51}
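A minimal sketch that checks these counts by enumeration (F and E as defined above):

```python
from fractions import Fraction

# Scores 0..100, all equally likely. F = fail (score < 60), E = even score.
scores = range(101)
F = [s for s in scores if s < 60]            # 60 scores
E = [s for s in scores if s % 2 == 0]        # 51 scores
EF = [s for s in F if s % 2 == 0]            # 30 scores

print(Fraction(len(F), len(scores)))         # P(F)   = 60/101
print(Fraction(len(EF), len(E)))             # P(F|E) = 30/51
```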
Supermarket data
Conditional independence
• A generic problem in data mining is finding relationships between variables
  – Is purchasing item A likely to be related to purchasing item B?
• Variables are independent if there is no relationship; otherwise they are dependent
• Independent if p(x, y) = p(x) p(y), in which case
  p(A \mid B) = \frac{p(A \cap B)}{p(B)} = \frac{p(A)\,p(B)}{p(B)} = p(A)
  p(B \mid A) = \frac{p(A \cap B)}{p(A)} = \frac{p(A)\,p(B)}{p(A)} = p(B)
Conditional Independence: More than 2 variables
• X is conditionally independent of Y given Z if, for all values of X, Y, Z, we have
  p(x, y \mid z) = p(x \mid z)\, p(y \mid z)
Conditional Independence: More than 2 variables
• Example (continuing the student-score example)
  – P(F) = 60/101 and P(E) = 51/101, but P(E ∩ F) = 30/101 ≠ P(E) P(F)
  – So E and F are dependent
  – Let B be the event that the score is not 100; then
    • P(F | B) = 60/100
    • P(E | B) = 50/100 = 1/2
    • P(E ∩ F | B) = 30/100 = 60/100 × 1/2
    • Given B, E and F are conditionally independent
Conditional Independence: More than 2 variables
• Example (continued)
  – Let C be the event that the score is 100; then
    • P(F | C) = 0
    • P(E | C) = 1
    • P(E ∩ F | C) = 0 = 1 × 0
    • Given C, E and F are conditionally independent
  – Now we can recover P(E ∩ F) by conditioning on the score (a numeric check appears in the sketch below):
    p(E \cap F) = p(E \cap F \mid C)\, p(C) + p(E \cap F \mid B)\, p(B)
                = 0 \times \frac{1}{101} + \frac{1}{2} \times \frac{60}{100} \times \frac{100}{101}
                = \frac{30}{101}
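A minimal enumeration that verifies the conditional independence claimed above:

```python
from fractions import Fraction

scores = range(101)
fail = lambda s: s < 60
even = lambda s: s % 2 == 0
B = lambda s: s != 100                    # condition: score is not 100

def cond_prob(event, given):
    """P(event | given) under a uniform distribution on 0..100."""
    g = [s for s in scores if given(s)]
    return Fraction(sum(1 for s in g if event(s)), len(g))

p_f = cond_prob(fail, B)                                   # 60/100
p_e = cond_prob(even, B)                                   # 1/2
p_ef = cond_prob(lambda s: fail(s) and even(s), B)         # 30/100

print(p_ef == p_f * p_e)                  # True: F and E independent given B
```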
Conditional Independence
• Conditional independence does not imply marginal independence:
  p(x, y \mid z) = p(x \mid z)\, p(y \mid z)  does not imply  p(x, y) = p(x)\, p(y)
• Note that X and Y may be unconditionally independent but conditionally dependent given Z
On assuming independence
• Independence is a strong assumption that is frequently violated in practice
• But it simplifies modeling
  – Fewer parameters
  – More understandable models
Dependence and Correlation
• Covariance measures how X and Y vary together
  – Large and positive if large X is associated with large Y, and small X with small Y
  – Negative if large X is associated with small Y
• Two variables may be dependent but not linearly correlated (see the sketch below)
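A standard illustration of the last point (not from the slides): Y = X² is fully determined by X, yet for a symmetric X the linear correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # symmetric around 0
y = x ** 2                       # completely dependent on x

print(np.corrcoef(x, y)[0, 1])   # close to 0: no *linear* correlation
```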
Correlation and Causation
• Two variables may be highly correlated without a causal relationship between them
  – Yellow-stained fingers and lung cancer may be correlated, but they are causally linked only through a third variable: smoking
  – Human reaction time and earned income are negatively correlated
    • This does not mean one causes the other
    • A third variable, age, is causally related to both
Samples and Statistical inference
• Samples can be used to model the data
• If the goal is to detect small deviations in the data, the sample size will affect the result
Dual Role of Probability and Statistics in Data Analysis
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Estimation
• In inference we want to make statements about the entire population from which the sample was drawn
• Two important methods for estimating the parameters of a model:
  – Maximum Likelihood Estimation
  – Bayesian Estimation
Desirable properties of estimators
• Let \hat{\theta} be an estimate of the parameter \theta
• Two measures of the quality of the estimator \hat{\theta}:
  – Bias: the difference between the expected and true values
    Bias(\hat{\theta}) = E[\hat{\theta}] - \theta
  – Variance of the estimate
    Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]
Mean squared error
• The mean of the squared difference between the value of the estimator and the true value of the parameter:
  E[(\hat{\theta} - \theta)^2]
• The mean squared error can be partitioned as the sum of the squared bias and the variance:
  E[(\hat{\theta} - \theta)^2] = Bias^2(\hat{\theta}) + Var(\hat{\theta})
Mean squared error
E[(\hat{\theta} - \theta)^2]
  = E[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2]
  = E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta}] - \theta)^2 + 2\, E[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)]
  = Var(\hat{\theta}) + Bias^2(\hat{\theta}) + 2\,(E[\hat{\theta}] - \theta)\, E[\hat{\theta} - E[\hat{\theta}]]
  = Var(\hat{\theta}) + Bias^2(\hat{\theta})
since E[\hat{\theta} - E[\hat{\theta}]] = E[\hat{\theta}] - E[\hat{\theta}] = 0.
Identities used:
  (a - b + b - c)^2 = (a - b)^2 + (b - c)^2 + 2(a - b)(b - c)
  E[E[X]] = E[X]
  E[c] = c where c is a constant
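A minimal simulation of the decomposition, using a deliberately biased estimator of a normal mean (the shrinkage factor 0.9 and the true mean 2.0 are illustrative assumptions):

```python
import numpy as np

# Check MSE = Bias^2 + Var for the estimator est = 0.9 * (sample mean).
rng = np.random.default_rng(1)
theta, n, trials = 2.0, 20, 100_000

samples = rng.normal(theta, 1.0, size=(trials, n))
est = 0.9 * samples.mean(axis=1)         # biased estimator

mse = np.mean((est - theta) ** 2)
bias = est.mean() - theta
var = est.var()

print(mse, bias ** 2 + var)              # agree up to simulation noise
```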
Maximum Likelihood Estimation
• The most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of \theta:
  L(\theta \mid D) = L(\theta \mid x(1), \ldots, x(n)) = p(x(1), \ldots, x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)
• The value of \theta for which the data has the highest probability is the MLE
Example of MLE for Binomial
• Customers either purchase or do not purchase milk
  – We want an estimate of the proportion purchasing
• Binomial with unknown parameter \theta
• Samples x(1), ..., x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
  L(\theta \mid x(1), \ldots, x(1000)) = \prod_i \theta^{x(i)} (1 - \theta)^{1 - x(i)} = \theta^{r} (1 - \theta)^{1000 - r}
Log-likelihood Function
• Maximizing the log-likelihood is easier and gives the same maximizer:
  l(\theta) = \log L(\theta) = r \log\theta + (1000 - r) \log(1 - \theta)
• Differentiate and set equal to zero:
  \frac{dl(\theta)}{d\theta} = \frac{r}{\theta} - \frac{1000 - r}{1 - \theta} = 0
  \Rightarrow r(1 - \theta) = (1000 - r)\,\theta
  \Rightarrow r - r\theta = 1000\theta - r\theta
  \Rightarrow \hat{\theta} = \frac{r}{1000}
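A minimal numeric check that maximizing the log-likelihood reproduces the closed form r/n (the scipy optimizer is one convenient choice, not the slides' method):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize l(theta) = r*log(theta) + (n - r)*log(1 - theta).
r, n = 700, 1000
neg_log_lik = lambda t: -(r * np.log(t) + (n - r) * np.log(1 - t))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, r / n)    # both ~0.7
```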
Example of MLE for Binomial
• r milk purchases out of n customers
• \theta is the probability that milk is purchased by a random customer
• For three data sets:
  – r = 7, n = 10
  – r = 70, n = 100
  – r = 700, n = 1000
• The uncertainty becomes smaller as n increases
Example of MLE for Binomial
(figure: likelihood functions for the three data sets above)
Likelihood under Normal Distribution
• Variance known and equal to 1, mean \theta unknown
• Normal density:
  f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - \theta)^2 / 2}
• Likelihood function:
  L(\theta \mid x(1), \ldots, x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\!\left(-\tfrac{1}{2}(x(i) - \theta)^2\right)
                                    = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2\right)
Log-likelihood function
l(\theta \mid x(1), \ldots, x(n)) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2
• To find the MLE, set the derivative dl/d\theta to zero:
  \frac{dl(\theta)}{d\theta} = \sum_{i=1}^{n}(x(i) - \theta) = 0
  \Rightarrow \sum_{i=1}^{n} x(i) - n\theta = 0
  \Rightarrow \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x(i)
Likelihood under Normal Distribution
• \theta is the estimated mean
• For two randomly generated data sets:
  – 20 data points
  – 200 data points
Likelihood under Normal Distribution
(figure: likelihood functions for the 20- and 200-point data sets)
Sufficient statistic
• A quantity s(D) is a sufficient statistic for \theta if the likelihood l(\theta) depends on the data only through s(D)
• No other statistic calculated from the same sample provides any additional information about the value of the parameter
Interval estimate
• A point estimate does not convey the uncertainty associated with it
• An interval estimate provides a confidence interval
Likelihood under Normal Distribution
Normal distribution:
  f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \theta)^2}{2\sigma^2}\right)
Likelihood function:
  L(\theta, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2\right)
Log-likelihood function:
  l(\theta, \sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2
Mean
\frac{\partial l}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \theta) = 0
\Rightarrow \sum_{i=1}^{n}(x_i - \theta) = 0
\Rightarrow \sum_{i=1}^{n} x_i - n\theta = 0
\Rightarrow \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i
Variance
\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \theta)^2 = 0
\Rightarrow \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \theta)^2
\Rightarrow \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\theta})^2
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Bayesian approach
• Frequentist approach
  – The parameters of the population are fixed but unknown
  – The data are a random sample
  – The intrinsic variability lies in the data D = {x(1), \ldots, x(n)}
• Bayesian approach
  – The data are known
  – The parameters \theta are random variables
  – \theta has a distribution of values
  – p(\theta) reflects our degree of belief about where the true parameters \theta may lie
Bayesian estimation
• The prior is updated by Bayes' rule:
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
• This leads to a distribution rather than a single value
  – A single value can be obtained from the mean or the mode
Bayesian estimation
• p(D) is a constant independent of \theta
• For a given data set D and a particular model (model = distributions for the prior and the likelihood):
  p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
• If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g. a normal with large variance)
Binomial example
• A single binary variable X; we wish to estimate \theta = p(X = 1)
• The prior for a parameter in [0, 1] is the Beta distribution:
  p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
  where \alpha > 0 and \beta > 0 are the two parameters of this model:
  Beta(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
Binomial example
• Likelihood function:
  L(\theta \mid D) = \theta^{r}(1 - \theta)^{n - r}
• Combining likelihood and prior:
  p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^{r}(1 - \theta)^{n - r}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{r + \alpha - 1}(1 - \theta)^{n - r + \beta - 1}
• We get another Beta distribution
  – With parameters r + \alpha and n - r + \beta (see the sketch below)
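A minimal sketch of the conjugate update with illustrative numbers:

```python
# Beta-Binomial update: prior Beta(alpha, beta) plus r successes in n
# trials gives posterior Beta(r + alpha, n - r + beta).
alpha, beta = 5, 5                 # illustrative prior, as on the next slide
r, n = 7, 10

post_alpha = r + alpha             # 12
post_beta = n - r + beta           # 8
post_mean = post_alpha / (post_alpha + post_beta)

print(post_mean)                   # 0.6: between the MLE 0.7 and prior mean 0.5
```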
Beta(5,5) and Beta(145,145)
(figure panels: Beta(5,5) and Beta(45,50) densities)
Advantages of Bayesian approach
• Retains full knowledge of all problem uncertainty
• Calculates the full posterior distribution of \theta
• Natural updating of the distribution as new data arrive:
  p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)
Predictive distribution
• In the equation that updates the prior to the posterior,
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
• The denominator is called the predictive distribution of D
• Useful for model checking
  – If the observed data have only a small probability under it, the model is unlikely to be correct
Normal distribution example
• Suppose x comes from a normal distribution with unknown mean \theta and known variance \alpha
• The prior distribution for \theta is \theta \sim N(\theta_0, \alpha_0)
Normal distribution example
p ( | x)  p ( x |  ) p ( )
1
1
1
1
2

exp( 
(x  ) )
exp( 
(   0 ) 2 )
2
2 0
2
20
1 2 1 1
 x
x 2 0
 exp(   (  )   (  ) 

)
2
0 
 0  2 2 0
2
1
let 1  ( 0  1) 1 1  1 (
0 x
 )
0 
1 2
1
 exp(   / 1  1 / 1 )  exp(  (  1 ) 2 / 1 )
2
2
1
1
p ( | x) 
exp(  (  1 ) 2 / 1 )
2
21
Jeffreys prior
• A reference prior
• Fisher information:
  I(\theta \mid x) = -E\!\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right]
• Jeffreys prior:
  p(\theta) \propto \sqrt{I(\theta \mid x)}
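A standard worked example (not on the slides): for a single Bernoulli observation with p(x \mid \theta) = \theta^{x}(1 - \theta)^{1 - x},

\frac{\partial^2 \log L}{\partial \theta^2} = -\frac{x}{\theta^2} - \frac{1 - x}{(1 - \theta)^2}

I(\theta \mid x) = -E\!\left[\frac{\partial^2 \log L}{\partial \theta^2}\right] = \frac{\theta}{\theta^2} + \frac{1 - \theta}{(1 - \theta)^2} = \frac{1}{\theta(1 - \theta)}

p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}

which is the Beta(1/2, 1/2) distribution.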
Conjugate priors
• p(θ) is a conjugate prior for p(x| θ) if the
posterior distribution p(θ|x) is in the same
family as the prior p(θ)
– Beta prior → Beta posterior
– Normal prior → Normal posterior
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Sampling in Data Mining
• The data set was usually not collected with the statistical analysis in mind
  – "Experimental design" in statistics is concerned with optimal ways of collecting data
  – Data miners usually cannot control the data collection process
  – The data may be ideally suited to the purposes for which they were collected, but not adequate for their data mining uses
Sampling in Data Mining
• Two ways in which samples arise:
  – The database is a sample of the population
  – The database contains every case, but the analysis is based on a sample
    • Not appropriate when we want to find unusual records
Why sampling
• Draw a sample from the database that allows us to construct a model reflecting the structure of the data in the database
  – Efficiency: quicker and easier
• The sample must be representative of the entire database
Systematic sampling
• Tries to ensure representativeness
  – e.g. taking one out of every two records
• Can lead to problems when there are regularities in the database
  – e.g. a data set where the records are married couples: every second record may be the spouse
Random Sampling
• Avoids regularities
• EPSEM sampling (equal probability of selection method)
  – Each record has the same probability of being chosen
Variance of Mean of Random Sample
• If the variance of a population of size N is \sigma^2, the variance of the mean of a simple random sample of size n drawn without replacement is
  \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)
  – Usually N >> n, so the second term is small, and the variance decreases as the sample size increases
Example
• 2000 points, population mean = 0.0108
• Random samples of size n = 10, 100, and 1000, each repeated 200 times
Example
(figure: distributions of the sample mean for n = 10, 100, and 1000)
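A minimal sketch reproducing this experiment (the normal population is an assumption; the slides only give the population size and mean):

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(size=2000)          # stand-in for the 2000-point data

for n in (10, 100, 1000):
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(200)]
    # variance of the sample mean ~ (sigma^2 / n) * (1 - n / N)
    print(n, np.var(means))
```

The printed variances fall roughly as 1/n, with the extra (1 - n/N) shrinkage visible at n = 1000.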
Stratified Random Sampling
• Split the population into non-overlapping subpopulations, or strata
• Advantages
  – Enables making statements about each of the subpopulations separately
• For example, one of the credit card companies we work with categorizes transactions into 26 categories: supermarket, gas station, and so on
Mean of Stratified Sample
• The total size of the population is N
• The k-th stratum has N_k elements in it
• n_k of them are chosen for the sample from this stratum
• The sample mean within the k-th stratum is \bar{x}_k
• Estimate of the population mean (see the sketch below):
  \sum_k \frac{N_k\, \bar{x}_k}{N}
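A minimal sketch of the stratified estimator with illustrative numbers:

```python
import numpy as np

# Weight each stratum's sample mean by its population share N_k / N.
N_k = np.array([6000, 3000, 1000])      # stratum sizes (illustrative)
xbar_k = np.array([2.0, 5.0, 10.0])     # sample means within each stratum

estimate = (N_k * xbar_k).sum() / N_k.sum()
print(estimate)                         # (12000 + 15000 + 10000) / 10000 = 3.7
```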
Cluster Sampling
• Every cluster contains many elements
• A simple random sample over elements is not appropriate
• Select clusters, not elements