Statistics for Data Miners: Part I (continued)
S.T. Balke

Probability = Relative Frequency

Typical Distribution for a Discrete Variable
[Figure: Binomial Distribution (n=10, p=.50); probability versus number of successes (0 to 10).]

Typical Distribution for a Continuous Variable
[Figure: Normal (Gaussian) Distribution; probability density function versus observation x (0.82 to 0.85).]

Histogram Fit by Gaussian Curve
[Figure: histogram of observations with a fitted Gaussian curve; probability density f(x) versus observation x; dx = width of a bar.]

The Normal Distribution (also termed the “Gaussian Distribution”)
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Note: f(x)dx is the probability of observing a value of x between x and x+dx. Note the statement on page 87 of the text re: dx canceling for the Bayesian method.

Selecting One Normal Distribution
The Normal Distribution can fit data with any mean and any standard deviation. Which one shall we focus on? We do need to focus on just one, for tables and for theoretical developments.

Need for the Standard Normal Distribution
• The mean, μ, and standard deviation, σ, depend upon the data; a wide variety of values are possible.
• To generalize about data we need:
– to define a standard curve and
– a method of converting any Normal curve to the standard Normal curve

The Standard Normal Distribution
μ = 0, σ = 1
$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$
[Figure: P.D.F. of z, Standard Normal Curve; f(z) versus z (-6 to 6).]

Transforming Normal to Standard Normal Distributions
• Observations x_i are transformed to z_i: $z_i = \frac{x_i - \mu}{\sigma}$
This allows us to go from f(x) versus x to f(z) versus z. Areas under f(z) versus z are tabulated.
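As a rough illustration of the transformation above, the sketch below converts two x limits to z and reads the area between them from the standard normal curve. The values of mu, sigma and the two limits are hypothetical (they are not taken from the slides' data), and Python's `statistics.NormalDist` stands in for a printed z table.

```python
from statistics import NormalDist

# Hypothetical population parameters, chosen only for illustration.
mu = 0.835
sigma = 0.004


def to_z(x, mu, sigma):
    """Transform an observation x to the standard normal scale: z = (x - mu) / sigma."""
    return (x - mu) / sigma


# The standard normal distribution (mean 0, standard deviation 1) plays the role of the z table.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of observing x between two limits, found as an area under f(z).
x_low, x_high = 0.831, 0.839
z_low = to_z(x_low, mu, sigma)
z_high = to_z(x_high, mu, sigma)
prob = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"z_low={z_low:.2f}, z_high={z_high:.2f}, P={prob:.4f}")
```

With these assumed numbers the limits sit one standard deviation either side of the mean, so the area is the familiar value of roughly 0.68.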
The Use of Standard Normal Curves
Statistical Tables
• Convert x to z
• Use tables of area of curve segments between different z values on the standard normal curve to define probabilities
Z Table: http://www.statsoft.com/textbook/stathome.html

Emphasis on Mean Values
• We are really not interested in individual observations as much as we are in the mean value.
• Now we have f(x) versus x where x is the value of observations.
• We need to deal in xbar, the sample mean, instead of individual x values.

Introduction to Inferential Statistics
• Inferential statistics refers to methods for making generalizations about populations on the basis of data from samples.

Sample Quantities
Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ ($\bar{x}$ is an estimate of μ)
Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ (s is an estimate of σ)
Note: These quantities can be for any distribution, Normal or otherwise.

Population and Sample Measures
Parameters:
– Mean of the Population: μ
– Standard Deviation of the Population: σ
– Variance of the Population: σ²
Statistics (sample estimates of the parameters):
– Sample estimate of μ: xbar
– Sample estimate of σ: s

Population and Samples
[Figure: a population from which samples are drawn; Sample 1 = x11, x12, x13, ..., x1n; Sample 2 = x21, x22, x23, ..., x2n; Sample 3 = x31, x32, x33, ..., x3n; and so on, with n observations per sample.]

P.D.F. of the Sample Means
[Figure: PDF for the Sample Means (n=5); probability density versus xbar (0.82 to 0.85). Note: the std. dev. of this distribution is $\sigma_{\bar{x}}$.]

Types of Estimators
• Point estimator - gives a single value as an estimate of the parameter of interest
• Interval estimator - specifies a range of values of the parameter and our confidence that the parameter value is in that range

Point Estimators
• Unbiased estimator: as the number of observations, n, increases for the sample, the average value of the estimator approaches the value of the population parameter.
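The sample mean and sample standard deviation defined under Sample Quantities can be computed directly from their formulas. The sketch below does so for a hypothetical sample of five replicate observations (made-up numbers) and compares the result with Python's `statistics.mean` and `statistics.stdev`, which also use the n-1 divisor.

```python
import math
import statistics

# Hypothetical replicate observations, made up for illustration.
sample = [0.834, 0.836, 0.833, 0.838, 0.835]

n = len(sample)

# Sample mean: sum of the observations divided by n.
xbar = sum(sample) / n

# Sample standard deviation: note the n - 1 divisor.
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

print(f"xbar={xbar:.4f}, s={s:.6f}")
print(f"library: mean={statistics.mean(sample):.4f}, stdev={statistics.stdev(sample):.6f}")
```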
Interval Estimators
• P(lower limit < parameter < upper limit) = 1-α
• lower limit and upper limit = confidence limits
• upper limit - lower limit = confidence interval
• 1-α = confidence level; degree of confidence; confidence coefficient

Comments on the Need to Transform to z for C.I. of Means
P(low < μ < high) = 1-α
• We have a point estimate of μ, xbar.
• Now the interval estimate consists of a lower and an upper bound around our point estimate of the population mean: xbar ± boundaries

Confidence Interval for a Population Mean
P(low < μ < high) = 1-α
If f(xbar) versus xbar is a Normal distribution and if we can define z as we did before, then:
low = $\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}}$
high = $\bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}$

A Standard Distribution for f(xbar) versus xbar
• Previously we transformed f(x) versus x to f(z) versus z.
• We can still use f(z) versus z as our standard distribution.
• Now we need to transform f(xbar) versus xbar to f(z) versus z.
[Figures: PDF for the Sample Means (n=5), probability density versus xbar (0.82 to 0.85), with std. dev. $\sigma_{\bar{x}}$; P.D.F. of z, Standard Normal Curve, f(z) versus z (-6 to 6).]

Transforming Normal to Standard Normal Distributions
• This time the sample means, xbar, are transformed to z: $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
Note that now we use xbar and the sigma for the p.d.f. of xbar, $\sigma_{\bar{x}}$.

The Normal Distribution Family
[Figure: three panels: probability density f(x) versus observation x; PDF for the Sample Means (n=5) versus xbar; Standard Normal Curve, f(z) versus z. The first two map to the third through $z_i = \frac{x_i - \mu}{\sigma}$ and $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$.]

Remaining Questions
• When can we assume that f(xbar) versus xbar is a Normal Distribution?
– when f(x) versus x is a Normal Distribution
– but what if f(x) versus x is not a Normal Distribution?
• How can we calculate μ and σ for the f(xbar) versus xbar distribution?
The Answer to Both Questions
The Central Limit Theorem

The Central Limit Theorem
If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased.
Note that the distribution of x is not necessarily Normal.
Every member of the population must have an equally likely chance of becoming a member of your sample.
Note: The standard deviation depends upon n, the number of replicate observations in each sample.
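The theorem can be checked informally by simulation. The sketch below is only an illustration under assumed settings: the population is taken to be exponential with rate 1 (so μ = 1 and σ = 1, and clearly not Normal), each sample has n = 30 observations, and 20000 samples are drawn with a fixed random seed. None of these choices come from the slides.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1, which has mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n = 30               # observations per sample
num_samples = 20000  # number of samples drawn from the population

# Draw many random samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = sigma / math.sqrt(n)  # the value the theorem predicts

print(f"mean of sample means: {mean_of_means:.3f} (theorem: {mu:.3f})")
print(f"std. dev. of sample means: {sd_of_means:.3f} (theorem: {predicted_sd:.3f})")
```

The simulated mean and standard deviation of xbar should land close to μ and σ/√n, even though the individual observations are far from Normal.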
Note: n, the number of replicates per sample, should be at least thirty.

Calculating a Confidence Interval
Assume: n ≥ 30, σ known
$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

Effect of 1-α
[Figure: Standard Normal Curve, f(z) versus z (-6 to 6); central area 1-α between $-z_{\alpha/2}$ and $z_{\alpha/2}$, with area α/2 in each tail.]

Understanding What is a 95% Confidence Interval
• If we compute values of the confidence interval with many different random samples from the same population, then in about 95% of those samples, the value of the 95% c.i. so calculated would include the value of the population mean, μ.
• Note that μ is a constant.
• The c.i. vary because they are each based on a sample.

Summary
• The Binomial Distribution
• Histograms and p.d.f.’s
• Area segments and the normal distribution
• The standard normal distribution
• p.d.f. of the sample means
• est. of mean = point est. + interval est.
• The Central Limit Theorem

Improving the Estimate of the Mean
• Reduce the confidence interval.
• Variables to examine: 1-α, n, σ
$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$

Effect of 1-α
[Figure: Standard Normal Curve, f(z) versus z; central area 1-α between $-z_{\alpha/2}$ and $z_{\alpha/2}$, with area α/2 in each tail.]

Effect of n
[Figure: Effect of Number of Replicates on the Breadth of the P.D.F. of xbar (the sampling distribution of xbar); probability density versus xbar (0.82 to 0.85) for n=1, n=5 and n=30.]

Effect of σ
[Figure: Effect of Sigma on the Width of the Sampling Distribution of xbar; probability density versus xbar (0.82 to 0.85) for sigma = 7.30 x 10^-4, 1.79 x 10^-3 and 4.00 x 10^-3; n=1 for all three distributions.]

Understanding the Question
• If we are asked to estimate the value of the population mean then we provide:
– the point estimate + the interval estimate of the mean
• If we are asked to estimate the noise in the experimental technique then we provide:
– the point estimate + the interval estimate of the standard deviation (something not reviewed yet)

Complication for Small Samples
• For small samples (n<30), if the observations, x, follow a Normal distribution, and if σ must be approximated by s, then the standardized sample means, $(\bar{x} - \mu)/(s/\sqrt{n})$, follow a “Student’s t” distribution with n-1 degrees of freedom rather than a Normal distribution.
• So, we must use t instead of z.

Confidence Intervals for Small Samples (n<30)
• Assume the x_i follow a Normal distribution.
• Assume σ is unknown.
• Use t and s instead of z and σ: $\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$

Large Samples: Estimation of C.I. for μ
Calculation of Confidence Intervals for Large Samples (n>29)
• Sigma Unknown
– x Normally Distributed: $\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$
– x Not Normally Distributed: $\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$
• Sigma Known
– x Normally Distributed: $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
– x Not Normally Distributed: $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$

Small Samples: Estimation of C.I. for μ
Calculation of Confidence Intervals for Small Samples (n<30)
• Sigma Unknown
– x Normally Distributed: $\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$
– x Not Normally Distributed: No Soln
• Sigma Known
– x Normally Distributed: $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
– x Not Normally Distributed: No Soln

Return to a Data Mining Problem
• Predicting Classifier Performance
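Before moving on, the σ-known interval from the slides above, xbar ± z(α/2)·σ/√n, can be sketched in a few lines. The function name `z_confidence_interval` and the numbers passed to it are hypothetical, and the sketch covers only the z-based cases; the t-based cases need a Student's t quantile, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Return (low, high) for xbar +/- z_{alpha/2} * sigma / sqrt(n), sigma known."""
    alpha = 1.0 - confidence
    # z_{alpha/2}: the z value that leaves area alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical sample mean and known sigma, for illustration only.
xbar, sigma = 0.835, 0.004

# Effect of n: quadrupling n halves the width of the interval.
for n in (30, 120):
    low, high = z_confidence_interval(xbar, sigma, n, confidence=0.95)
    print(f"n={n}: {low:.5f} to {high:.5f} (width {high - low:.5f})")

# Effect of 1 - alpha: a lower confidence level gives a narrower interval.
low80, high80 = z_confidence_interval(xbar, sigma, 30, confidence=0.80)
print(f"80% interval, n=30: {low80:.5f} to {high80:.5f}")
```

This mirrors the three variables listed under Improving the Estimate of the Mean: the interval narrows when 1-α is lowered, when n is raised, or when σ is smaller.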
Predicting Classifier Performance (Page 123)
• y=750 successes (symbol: S in text)
• n=1000 trials (symbol: N in text)
• f=y/n=0.750 success rate for the training set
• What will be the success rate for other data?
• What is the error in the estimate of f as 0.750?
• From statistics we can calculate that we are 80% confident that the confidence interval 0.732 to 0.767 will contain the true success rate for any data.

The Binomial Distribution
• The probability of y successes in n trials is:
$g(y) = b(n,p) = \frac{n!}{y!\,(n-y)!}\,p^{y}(1-p)^{n-y}$
The total probability of having any number of successes is the sum of all the g(y), which is unity. The probability of having any number of successes up to a certain value y' is the sum of g(y) up to that value of y. See page 178 regarding quantifying the value of a rule.

Shape Changes for the Binomial Distribution
If np>5 when p≤0.5, or if n(1-p)≥5 when p≥0.5, the Normal Distribution becomes a good approximation to the Binomial distribution:
$N\left(np,\ [np(1-p)]^{0.5}\right) = N(\mu, \sigma)$

Confidence Intervals for p
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where f(z) versus z is N(0,1)
$z = \frac{y - np}{\sqrt{np(1-p)}}$ is approximately N(0,1)

Calculating a Confidence Interval
Recall, for large samples: $\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$
So, now we could say: $np = y \pm z_{\alpha/2}\sqrt{np(1-p)}$
But we want the limits for p, not np.

Focus on p instead of np
$P\left(-z_{\alpha/2} < \frac{(y/n) - p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}\right) = 1-\alpha$
$p = \frac{y}{n} \pm z_{\alpha/2}\,\frac{\sqrt{np(1-p)}}{n}$
$p = \frac{y}{n} \pm z_{\alpha/2}\sqrt{p(1-p)/n}$
but now p is on both sides of the equation!

Focus on p instead of np
Let’s return to z:
$z = \frac{y - np}{\sqrt{np(1-p)}}$, so $z = \frac{(y/n) - p}{\sqrt{p(1-p)/n}}$ is approximately N(0,1)
and now, solve for p:
$p = \frac{f + \frac{z^2}{2n} \pm z\sqrt{\frac{f}{n} - \frac{f^2}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$
Two values of p are obtained: the upper and lower limits.
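The solution for p can be evaluated numerically as a check. The sketch below codes the two roots with f = y/n as the observed success rate and applies them to the classifier example (y = 750, n = 1000, 80% confidence). The function name `success_rate_interval` is hypothetical, and z is taken from `statistics.NormalDist` rather than a printed table, so it is 1.2816 instead of a rounded 1.28. This form of interval is commonly known as the Wilson score interval.

```python
import math
from statistics import NormalDist


def success_rate_interval(y, n, confidence=0.80):
    """Lower and upper limits for p, from solving z = (f - p) / sqrt(p(1-p)/n) for p."""
    f = y / n  # observed success rate
    # Two-sided z value for the requested confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    centre = f + z**2 / (2 * n)
    spread = z * math.sqrt(f / n - f**2 / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - spread) / denom, (centre + spread) / denom


low, high = success_rate_interval(750, 1000, confidence=0.80)
print(f"{low:.3f} to {high:.3f}")  # prints "0.732 to 0.767"
```

The two limits are not symmetric about f, since the centre of the interval is pulled slightly toward 0.5.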
Here f = y/n is the observed success rate.

Predicting Classifier Performance (Page 123)
• y=750 successes (symbol: S in text)
• n=1000 trials (symbol: N in text)
• f=y/n=0.750
• If 1-α=0.80 (80% confidence = c in text)
• From z table: z=1.28
• Interval from Eqn: 0.732, 0.767

Using the z Tables for the Binomial Distribution
$P(y = k) \approx \int_{k-0.5}^{k+0.5} f(y)\,dy$
$P(y = k) \approx \Phi\left(\frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - 0.5 - np}{\sqrt{np(1-p)}}\right)$
where Φ(z) is the value obtained from the z table.

Summary
• The Binomial Distribution
• Histograms and p.d.f.’s
• Area segments and the normal distribution
• The standard normal distribution
• p.d.f. of the sample means
• est. of mean = point est. + interval est.
• The Central Limit Theorem
• Confidence Intervals

In Two Weeks
• Hypothesis Testing: How do we know if we can accept a batch of material from a few replicate analyses of a sample? Are the error rates obtained from two data mining methods really different?