Statistics for
Data Miners: Part I (continued)
S.T. Balke
Probability = Relative Frequency
Typical Distribution for a Discrete
Variable
Binomial Distribution (n=10, p=.50)
[Figure: bar chart of Probability (0 to 0.3) versus No. of Successes (0 to 10).]
Typical Distribution for a
Continuous Variable
Normal (Gaussian) Distribution
Probability Density Function
Histogram Fit by Gaussian Curve
[Figure: histogram of Probability Density f(x) (0 to 120) versus Observation x (0.82 to 0.85), fit by a Gaussian curve; dx = width of a bar.]
The Normal Distribution
(Also termed the “Gaussian Distribution”)
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
Note: f(x)dx is the probability of observing a value of x between
x and x+dx. Note the statement on page 87 of the text re: dx
canceling for the Bayesian method.
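The statement that f(x)dx approximates a probability can be checked numerically. A minimal sketch, using assumed values for μ, σ, x, and dx (none are from the slides) and the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

# Hypothetical parameters chosen for illustration:
mu, sigma = 0.835, 0.005
dist = NormalDist(mu=mu, sigma=sigma)

# f(x)*dx approximates the probability of observing x in [x, x+dx)
x, dx = 0.84, 0.0001
approx = dist.pdf(x) * dx
# Exact probability from the CDF, for comparison
exact = dist.cdf(x + dx) - dist.cdf(x)
print(approx, exact)
```

For small dx the two values agree closely; the approximation degrades as dx grows relative to σ.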
Selecting One Normal Distribution
The Normal Distribution can fit data with any mean and any standard deviation... which one shall we focus on?
We need to focus on just one, both for tables and for theoretical developments.
Need for the Standard Normal
Distribution
• The mean, μ, and standard deviation, σ, depend upon the data: a wide variety of values are possible
• To generalize about data we need:
– to define a standard curve and
– a method of converting any Normal curve to
the standard Normal curve
The Standard Normal
Distribution
=0
=1
The Standard Normal
Distribution
f(z) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right)
P.D.F. of z
Standard Normal Curve
[Figure: f(z) (0 to 0.5) plotted against z (−6 to 6).]
Transforming Normal to
Standard Normal Distributions
• Observations x_i are transformed to z_i:
z_i = \frac{x_i - \mu}{\sigma}
This allows us to go from f(x) versus x to f(z) versus z.
Areas under f(z) versus z are tabulated.
The Use of
Standard Normal Curves
Statistical Tables
• Convert x to z
• Use tables of area of curve segments
between different z values on the standard
normal curve to define probabilities
Z Table
http://www.statsoft.com/textbook/stathome.html
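The transform-then-look-up procedure can also be done in code; the standard library's `NormalDist` plays the role of the z table. A sketch with assumed values for μ, σ, and x:

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mu = 0, sigma = 1

# Illustrative values (assumed, not from the slides):
mu, sigma = 0.835, 0.005
x = 0.84

z = (x - mu) / sigma   # convert x to z
p_below = std.cdf(z)   # area under f(z) to the left of z (the table lookup)
print(z, p_below)
```

Areas between two z values are differences of CDF values, exactly as with the printed tables.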
Emphasis on Mean Values
• We are really not interested in individual
observations as much as we are in the mean
value.
• Now we have f(x) versus x where x is the
value of observations.
• We need to deal in xbar, the sample mean,
instead of individual x values.
Introduction to Inferential
Statistics
• Inferential statistics refers to methods for
making generalizations about populations
on the basis of data from samples
Sample Quantities
Mean
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\bar{x} is an estimate of μ.
Standard Deviation
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
s^2 is an estimate of σ^2.
Note: These quantities can be computed for any distribution, Normal or otherwise.
Population and Sample Measures
Parameters:
Mean of the Population: μ
Standard Deviation of the Population: σ
Variance of the Population: σ²
Statistics (sample estimates of the parameters):
Sample estimate of μ: x̄
Sample estimate of σ: s
Population and Samples
Population
Sample 1: x11, x12, x13, …, x1n
Sample 2: x21, x22, x23, …, x2n
Sample 3: x31, x32, x33, …, x3n
Sample…
(n observations per sample)
P.D.F. of the Sample Means
PDF for the Sample Means (n=5)
[Figure: Probability Density (0 to 250) versus x̄ (0.82 to 0.85).]
Note: The std. dev. of this distribution is σ_x̄.
Types of Estimators
• Point estimator - gives a single value as an
estimate of the parameter of interest
• Interval estimator - specifies a range of
values of the parameter and our confidence
that the parameter value is in that range
Point Estimators
• Unbiased estimator: as the number of observations, n, in the sample increases, the average value of the estimator approaches the value of the population parameter.
Interval Estimators
• P(lower limit < parameter < upper limit) = 1 − α
• lower limit and upper limit = confidence limits
• upper limit − lower limit = confidence interval
• 1 − α = confidence level; degree of confidence; confidence coefficient
Comments on the Need to
Transform to z for C.I. of Means
P(low<<high)=1-
• We have a point estimate of , xbar.
• Now the interval estimate consists of a
lower and an upper bound around our point
estimate of the population mean:
  x  boundaries
Confidence Interval for a
Population Mean
P(low<<high)=1-
If f(xbar) versus xbar is a Normal distribution and if we can
define z as we did before, then:
low =xbar-z/2xbar
high =xbar+z/2xbar
A Standard Distribution for f(xbar)
versus xbar
• Previously we transformed f(x) versus x to
f(z) versus z
• We can still use f(z) versus z as our standard
distribution.
• Now we need to transform f(xbar) versus
xbar to f(z) versus z.
P.D.F. of the Sample Means
PDF for the Sample Means (n=5)
[Figure: Probability Density (0 to 250) versus x̄ (0.82 to 0.85).]
Note: The std. dev. of this distribution is σ_x̄.
P.D.F. of z
Standard Normal Curve
[Figure: f(z) (0 to 0.5) plotted against z (−6 to 6).]
Transforming Normal to
Standard Normal Distributions
• This time the sample means, x̄, are transformed to z:
z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}
Note that now we use the mean and standard deviation of the p.d.f. of x̄.
The Normal Distribution Family
[Figure: three panels.
(1) Probability Density f(x) versus Observation x (0.815 to 0.855): the distribution of individual observations.
(2) PDF for the Sample Means (n=5): Probability Density versus x̄ (0.82 to 0.85).
(3) Standard Normal Curve: f(z) versus z (−6 to 6).
Both distributions transform to the standard normal via
z_i = (x_i − μ)/σ and z = (x̄ − μ)/σ_x̄.]
Remaining Questions
• When can we assume that f(x̄) versus x̄ is a Normal Distribution?
– when f(x) versus x is a Normal Distribution
– but what if f(x) versus x is not a Normal Distribution?
• How can we calculate μ and σ for the f(x̄) versus x̄ distribution?
The Answer to Both Questions
The Central Limit Theorem
If x is distributed with mean μ and standard deviation σ, then the sample mean (x̄) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation σ/√n as n is increased.

Note that the distribution of x is not necessarily Normal.

Every member of the population must have an equally likely chance of becoming a member of your sample.

Note: The standard deviation depends upon n, the number of replicate observations in each sample.

Note: n, the number of replicates per sample, should be at least thirty.
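The theorem is easy to illustrate with a small simulation. This sketch (assumptions: a uniform population on [0, 1), 2000 samples of n = 30 each, a fixed random seed) checks that the standard deviation of the sample means approaches σ/√n even though the population is not Normal:

```python
import math
import random
import statistics

random.seed(1)

# A decidedly non-normal population: uniform on [0, 1),
# which has mu = 0.5 and sigma = 1/sqrt(12) ≈ 0.2887.
mu, sigma = 0.5, 1 / math.sqrt(12)
n = 30            # observations per sample (at least thirty, per the slide)
samples = 2000    # number of sample means to collect

xbars = [statistics.mean(random.random() for _ in range(n)) for _ in range(samples)]

print(statistics.mean(xbars))    # approaches mu = 0.5
print(statistics.stdev(xbars))   # approaches sigma / sqrt(n)
print(sigma / math.sqrt(n))      # theoretical value, about 0.0527
```

A histogram of `xbars` would also look bell-shaped, which is the other half of the theorem's claim.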
Calculating a
Confidence Interval
Assume: n ≥ 30, σ known
μ = x̄ ± z_{α/2} σ_x̄
where σ_x̄ = σ/√n
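The interval above can be computed directly, with `NormalDist.inv_cdf` supplying z_{α/2} instead of a table. The values of x̄, σ, and n below are assumed, for illustration only:

```python
import math
from statistics import NormalDist

# Assumed inputs for illustration:
xbar = 0.834    # sample mean
sigma = 0.005   # population standard deviation (known)
n = 36          # sample size (>= 30)
alpha = 0.05    # for a 95% confidence interval

z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96
half_width = z * sigma / math.sqrt(n)
low, high = xbar - half_width, xbar + half_width
print(low, high)
```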
Effect of 1 − α
Standard Normal Curve
[Figure: f(z) versus z (−6 to 6); the central area 1 − α lies between −z_{α/2} and +z_{α/2}, with area α/2 in each tail.]
Understanding What is a 95%
Confidence Interval
• If we compute values of the confidence interval with many different random samples from the same population, then in about 95% of those samples the 95% c.i. so calculated would include the value of the population mean, μ.
• Note that μ is a constant.
• The c.i. vary because they are each based on a sample.
Summary
• The Binomial Distribution
• Histograms and p.d.f.'s
• Area segments and the normal distribution
• The standard normal distribution
• p.d.f. of the sample means
• est. of mean = point est. + interval est.
• The Central Limit Theorem
Improving the Estimate of the
Mean
• Reduce the confidence interval.
• Variables to examine: 1 − α, n, σ
μ = x̄ ± z_{α/2} σ/√n
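The effect of n on the interval follows directly from the √n in the denominator: quadrupling n halves the interval. A short sketch (σ and the n values are assumed, for illustration):

```python
import math
from statistics import NormalDist

sigma = 0.005                      # assumed known population std. dev.
z = NormalDist().inv_cdf(0.975)    # z_{alpha/2} for a 95% interval

# Quadrupling n halves the confidence interval width 2*z*sigma/sqrt(n)
widths = {}
for n in (25, 100, 400):
    widths[n] = 2 * z * sigma / math.sqrt(n)
    print(n, widths[n])
```

Note the diminishing returns: each halving of the width costs four times as many observations.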
Effect of 1 − α
Standard Normal Curve
[Figure: f(z) versus z (−6 to 6); the central area 1 − α lies between −z_{α/2} and +z_{α/2}, with area α/2 in each tail.]
Effect of n
Effect of Number of Replicates on the Breadth of the P.D.F. of x̄
(the sampling distribution of x̄)
[Figure: Probability Density (0 to 600) versus x̄ (0.82 to 0.85) for n = 1, n = 5, and n = 30; the distribution narrows as n increases.]
Effect of σ
Effect of Sigma on the Width of the Sampling Distribution of x̄
[Figure: Probability Density (0 to 600) versus x̄ (0.82 to 0.85) for σ = 7.30 × 10⁻⁴, σ = 1.79 × 10⁻³, and σ = 4.00 × 10⁻³, with n = 1 for all three distributions; the distribution narrows as σ decreases.]
Understanding the Question
• If we are asked to estimate the value of the
population mean then we provide:
– the point estimate + the interval estimate of the
mean
• If we are asked to estimate the noise in the
experimental technique then we provide:
– the point estimate + the interval estimate of the
standard deviation (something not reviewed
yet)
Complication for Small Samples
• For small samples (n < 30), if the observations, x, follow a Normal distribution, and if σ must be approximated by s, then the sample means, x̄, tend to follow a "Student's t" distribution rather than a Normal distribution.
• So, we must use t instead of z.
Confidence Intervals for Small
Samples (n<30)
• Assume the x_i follow a Normal distribution.
• Assume σ is unknown.
• Use t and s instead of z and σ:
μ = x̄ ± t_{α/2, n−1} s/√n
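A sketch of the small-sample calculation. The data values are assumed, and because the standard library has no t quantile function, the critical value t_{0.025, 9} = 2.262 is taken from a t-table rather than computed:

```python
import math
from statistics import mean, stdev

# Illustrative small sample (n = 10 < 30), sigma unknown:
x = [0.831, 0.829, 0.836, 0.840, 0.834, 0.832, 0.838, 0.830, 0.835, 0.833]

n = len(x)
xbar = mean(x)
s = stdev(x)   # s approximates the unknown sigma

# t_{alpha/2, n-1} for a 95% interval with 9 degrees of freedom,
# read from a t-table (assumed input, not computed here):
t = 2.262

half_width = t * s / math.sqrt(n)
print(xbar - half_width, xbar + half_width)
```

Swapping in z = 1.96 here would give a slightly narrower, overconfident interval; the heavier tails of the t distribution widen it to compensate for estimating σ with s.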
Large Samples:
Estimation of C.I. for μ
Calculation of Confidence Intervals for Large Samples (n > 29)

Sigma Unknown:
– x Normally Distributed: μ = x̄ ± t_{α/2, n−1} s/√n
– x Not Normally Distributed: μ = x̄ ± z_{α/2} s/√n

Sigma Known:
– x Normally Distributed: μ = x̄ ± z_{α/2} σ/√n
– x Not Normally Distributed: μ = x̄ ± z_{α/2} σ/√n
Small Samples:
Estimation of C.I. for μ
Calculation of Confidence Intervals for Small Samples (n < 30)

Sigma Unknown:
– x Normally Distributed: μ = x̄ ± t_{α/2, n−1} s/√n
– x Not Normally Distributed: No Soln

Sigma Known:
– x Normally Distributed: μ = x̄ ± z_{α/2} σ/√n
– x Not Normally Distributed: No Soln
Return to a Data Mining Problem
• Predicting Classifier Performance…..
Predicting Classifier Performance
(Page 123)
• y = 750 successes (symbol: S in text)
• n = 1000 trials (symbol: N in text)
• f = y/n = 0.750 success rate for the training set
• What will be the success rate for other data?
• What is the error in the estimate of f as 0.750?
• From statistics we can calculate that we are 80% confident that the confidence interval 0.732 to 0.767 will contain the true success rate for any data.
The Binomial Distribution
• The probability of y successes in n trials is:
g(y) = b(n,p) = \frac{n!}{y!\,(n-y)!}\, p^{y} (1-p)^{n-y}
The total probability of having any number of successes is the sum of all the g(y), which is unity.
The probability of having any number of successes up to a certain value y′ is the sum of g(y) up to that value of y.
See page 178 regarding quantifying the value of a rule.
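The formula and both summation facts can be checked directly with `math.comb`. A minimal sketch for the n = 10, p = 0.5 case from the earlier histogram (the function name `binom_pmf` is ours, not from the text):

```python
from math import comb

def binom_pmf(y, n, p):
    """g(y): probability of exactly y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 10, 0.5
pmf = [binom_pmf(y, n, p) for y in range(n + 1)]
print(pmf[5])         # the most likely count: 252/1024, about 0.246
print(sum(pmf))       # all probabilities sum to unity
print(sum(pmf[:4]))   # cumulative probability of y <= 3
```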
Shape Changes for the Binomial
Distribution
If np > 5 when p ≤ 0.5, or if n(1−p) ≥ 5 when p ≥ 0.5, the Normal Distribution becomes a good approximation to the Binomial distribution:
N(np, (np(1−p))^{0.5}) = N(μ, σ)
Confidence Intervals for p
z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}, where f(z) versus z is N(0,1)

z = \frac{y - np}{\sqrt{np(1-p)}} is approximately N(0,1)
Calculating a
Confidence Interval
Recall, for large samples:
μ = x̄ ± z_{α/2} σ_x̄
So, now we could say:
np = y ± z_{α/2} √(np(1−p))
But, we want the limits for p, not np.
Focus on p instead of np
P\left(-z_{\alpha/2} \le \frac{(y/n) - p}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right) = 1 - \alpha

p = \frac{y}{n} \pm z_{\alpha/2} \frac{\sqrt{np(1-p)}}{n}

p = \frac{y}{n} \pm z_{\alpha/2} \sqrt{p(1-p)/n}

but now p is on both sides of the equation!
Focus on p instead of np
Let's return to z:
z = \frac{y - np}{\sqrt{np(1-p)}} = \frac{(y/n) - p}{\sqrt{p(1-p)/n}} is approximately N(0,1)

and now, solve for p:

p = \frac{f + \frac{z^2}{2n} \pm z\sqrt{\frac{f}{n} - \frac{f^2}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}

where f = (y/n) = observed success rate.
Two values of p are obtained: the upper and lower limits.
Predicting Classifier Performance
(Page 123)
• y = 750 successes (symbol: S in text)
• n = 1000 trials (symbol: N in text)
• f = y/n = 0.750
• If 1 − α = 0.80 (80% confidence = c in text)
• From z table: z = 1.28
• Interval from Eqn: 0.732, 0.767
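The quadratic solution for p can be checked numerically against this example. The sketch below (the function name `classifier_ci` is ours) plugs the slide's values f = 0.750, n = 1000, z = 1.28 into the formula:

```python
import math

def classifier_ci(f, n, z):
    """Confidence limits for the true success rate p, given the
    observed success rate f = y/n and the z value for the chosen
    confidence level (the quadratic-in-p solution for p)."""
    center = f + z**2 / (2 * n)
    spread = z * math.sqrt(f / n - f**2 / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - spread) / denom, (center + spread) / denom

# Classifier example: 750 successes in 1000 trials, 80% confidence (z = 1.28)
low, high = classifier_ci(f=0.750, n=1000, z=1.28)
print(round(low, 3), round(high, 3))  # 0.732 0.767
```

This reproduces the quoted interval, so the two roots of the quadratic really are the lower and upper limits.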
Using the z Tables
for the Binomial Distribution
P(y = k) = \int_{k-0.5}^{k+0.5} f(y)\,dy

P(y = k) = \Phi\!\left(\frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right) - \Phi\!\left(\frac{k - 0.5 - np}{\sqrt{np(1-p)}}\right)

where Φ(z) is the value obtained from the z table.
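The continuity-corrected approximation can be compared against the exact binomial probability. A sketch (the example n = 100, p = 0.5, k = 50 is assumed, chosen to satisfy the np > 5 rule; `NormalDist().cdf` plays the role of Φ):

```python
import math
from statistics import NormalDist

def binom_pmf_normal(k, n, p):
    """P(y = k) via the normal approximation,
    using the 0.5 continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = NormalDist()
    return z.cdf((k + 0.5 - mu) / sigma) - z.cdf((k - 0.5 - mu) / sigma)

# Compare with the exact binomial probability:
n, p, k = 100, 0.5, 50
exact = math.comb(n, k) * p**k * (1 - p)**(n - k)
approx = binom_pmf_normal(k, n, p)
print(exact, approx)
```

For these values the two probabilities agree to about four decimal places.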
Summary
• The Binomial Distribution
• Histograms and p.d.f.'s
• Area segments and the normal distribution
• The standard normal distribution
• p.d.f. of the sample means
• est. of mean = point est. + interval est.
• The Central Limit Theorem
• Confidence Intervals
In Two Weeks
• Hypothesis Testing:
How do we know if we can accept a batch of
material from a few replicate analyses of a
sample?
Are the error rates obtained from two data mining
methods really different?