Statistics for Data Miners: Part I
S.T. Balke

Statistics
Statistics is concerned with how to collect and analyze data in the presence of variability.
• Variability = random error + systematic error
• Precision: reproducibility; associated with random error.
• Accuracy: deviation from the truth; associated with systematic error.

Accuracy and Reproducibility
• The error rate is an expression of the accuracy of a data mining method.
• Our estimate of the error rate is based upon the data that we have.
• We want to say that the error rate will be the same when other data of the same type are used.
• However, all data have some random error.
• Thus, our estimate of the error rate obtained from data is affected by the presence of random error.
• Our estimate has some uncertainty.
• Statistics can quantify that uncertainty and can tell us how to decrease it.

Statistics
• Descriptive Statistics
• Inferential Statistics
• Probability

Statistics Lectures
• Part I: The Basics: S.T. Balke
• Part II: The Analysis of Count Data: S. Sayad
• Part III: Hypothesis Testing for Numeric Values: S.T. Balke

Statistics in Data Mining
• Data visualization and cleaning
• Rule and tree construction
• Basis for Bayesian approaches
• Assessment of competing data mining methods
• Assessing the significance of an error rate
• Fitting equations to data
• Reducing the dimensions of the problem

Part I: The Basics
• Discrete Distributions: the Binomial Distribution
• Continuous Distributions:
  – Histograms
  – Distributions
  – Measures of Distributions
• The Normal Distribution
• The Central Limit Theorem
• Confidence Intervals

Data Types
• Categorical (Nominal): labels
• Numerical:
  – Discrete: integers
  – Continuous: numeric non-integers
[Slide annotation: Initial Focus]

Random Variables
• Quantities that cannot be predicted with certainty
• If only distinct values are possible: discrete random variable
• If any value in a continuum is possible: continuous random variable

Distributions

Probability: The Relative Frequency Concept
• If an experiment is repeated n times and event A is observed b times, then for large n: P(A) = b/n
• Or: P(A) = (number of times A is observed) / (total number of observations)
• Simply put: Probability = Relative Frequency

Distributions
• Portray what happens when the same experiment is repeated a number of times.
• When you see a distribution think: REPRODUCIBILITY

Typical Distribution for a Discrete Variable
[Figure: Binomial Distribution (n=10, p=0.50); x-axis: No. of Successes (0 to 10); y-axis: Probability]

The Binomial Distribution
• Consider a random experiment where one of two mutually exclusive and exhaustive outcomes can occur (success and failure, heads and tails, etc.).
• Repeat the experiment n times.
• Outcomes are mutually independent.
• The probability, p, of success is the same in each trial.

The Binomial Distribution
• The probability of y successes in n trials, for the binomial distribution b(n, p), is:

$$g(y) = \frac{n!}{y!\,(n-y)!}\; p^{y}\,(1-p)^{n-y}$$

The total probability of having any number of successes is the sum of all the g(y), which is unity.
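The binomial formula can be checked numerically. The sketch below is illustrative only: the function names `binomial_pmf` and `binomial_cdf` are not from the slides, and the values n=10, p=0.5 are taken from the first binomial figure. It evaluates g(y) for every y and confirms that the probabilities sum to unity.

```python
from math import comb

def binomial_pmf(y, n, p):
    """g(y): probability of exactly y successes in n independent trials."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def binomial_cdf(y_max, n, p):
    """Probability of at most y_max successes: the sum of g(y) for y = 0..y_max."""
    return sum(binomial_pmf(y, n, p) for y in range(y_max + 1))

# Values from the first binomial figure in the slides.
n, p = 10, 0.5
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]
total = sum(pmf)

for y, g in enumerate(pmf):
    print(f"y={y:2d}  g(y)={g:.4f}")
print(f"sum of all g(y) = {total:.6f}")
```

With p=0.5 the distribution is symmetric about y=5; changing p to 0.3 or 0.8 shifts the peak, as in the other binomial figures.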
The probability of having any number of successes up to a certain value y' is the sum of g(y) up to that value of y.
See page 178 regarding quantifying the value of a rule.

[Figure: Binomial Distribution (n=10, p=0.30); x-axis: No. of Successes; y-axis: Probability]
[Figure: Binomial Distribution (n=10, p=0.80); x-axis: No. of Successes; y-axis: Probability]
[Figure: Binomial Distribution (n=25, p=0.80); x-axis: No. of Successes; y-axis: Probability]

Typical Distribution for a Continuous Variable
Normal (Gaussian) Distribution

Example
• In 1798, Henry Cavendish estimated the density of the earth by using a torsion balance.

Cavendish Experiment

$$\rho_{earth} = \frac{m_{earth}}{\frac{4}{3}\pi r_{earth}^{3}}, \qquad F_g = \frac{G\, m_1 m_2}{d^{2}}$$

Cavendish Experiment: Sources of Error
• torsional strength of wire
• air currents
• body mass of experimenter
• placement of masses
• measurement of distances
• measurement of angle
• contribution of damping device
• radius from Eratosthenes, 200 BC

Density of the Earth [g/cm3]
5.5   5.57  5.42  5.61  5.53  5.47  4.88  5.62  5.63  4.07
5.29  5.34  5.26  5.44  5.46  5.55  5.34  5.3   5.36  5.79
5.75  5.29  5.1   5.86  5.58  5.27  5.85  5.65  5.39
Note: water = 1 g/cc, granite = 2.7 g/cc

Density Data in Ascending Order
Rank  Density    Rank  Density    Rank  Density    Rank  Density
 1    4.07        8    5.30       15    5.46       22    5.61
 2    4.88        9    5.34       16    5.47       23    5.62
 3    5.10       10    5.34       17    5.50       24    5.63
 4    5.26       11    5.36       18    5.53       25    5.65
 5    5.27       12    5.39       19    5.55       26    5.75
 6    5.29       13    5.42       20    5.57       27    5.79
 7    5.29       14    5.44       21    5.58       28    5.85
                                                   29    5.86
Accepted Value: 5.50 g/mL
Iron: 7.85 g/mL
Nickel: 8.90 g/mL

Basic Histogram Calculations
Density (bin upper limit)  Freq.  Rel. Freq.  Rel. Freq./Width  Cum. Rel. Freq.
4                            0    0           0                 0
4.25                         1    0.034483    0.137931          0.034483
4.5                          0    0           0                 0.034483
5                            1    0.034483    0.137931          0.068966
5.25                         1    0.034483    0.137931          0.103448
5.5                         14    0.482759    1.931034          0.586207
5.75                         9    0.310345    1.241379          0.896552
6                            3    0.103448    0.413793          1
more                         0    0           0                 1
Total                       29    1           4
Note: every Rel. Freq./Width entry uses a width of 0.25, including the bin from 4.5 to 5, whose actual width is 0.5.

Histogram
[Figure: Frequency versus Bin]
Freq. = 14 for earth density values between 5.25 and 5.5.

Histograms
• Height of bars = frequency
• The frequency obtained depends upon the total number of observations
• We would like to remove that dependency!

Histogram: Relative Frequency versus Density
[Figure: Relative Frequency versus Density]
Rel. Freq. = 0.483 for earth density values between 5.25 and 5.5.

Histograms
• The eye reacts to area more than to height of a bar (important if class sizes are different!)
• We want the area of a bar to be the relative frequency

Histogram: Relative Frequency/Width versus Density
[Figure: Relative Frequency/Width versus Density]
Rel. Freq. = area of bar = 1.931 x 0.25 = 0.483 for earth density values between 5.25 and 5.5.

Histogram: Differential and Cumulative Histograms of Density
[Figure: left axis: Relative Frequency/Width; right axis: Cumulative Relative Frequency; x-axis: Density]

When will you see a histogram in this course?
• Data Visualization http://stat.skku.ac.kr/~myhuh/software/DAVIS/DAVIS.htm
• Even more important: Probability Density Functions and Probability Distributions are both related to histograms!

Differential Probability Density Distribution
• Picture a histogram with the area of each bar equal to the relative frequency
• Assume that the histogram represents a very large number of observations
• Reduce the width of the bars until each has width dx and height f(x)
• The area of a bar is then f(x) dx

Probability Density Distribution
[Figure: histogram of Probability Density f(x) versus Observation (x); dx = width of a bar]

Probability Density Function
[Figure: histogram fit by Gaussian curve; Probability Density f(x) versus Observation (x), x from about 0.82 to 0.85; dx = width of a bar]

Probability Density Function: The Normal Distribution
The Normal Distribution (also termed the "Gaussian Distribution")

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]$$

Note: f(x)dx is the probability of observing a value of x between x and x+dx.
Note the statement on page 87 of the text regarding dx canceling for the Bayesian method.
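The path from histogram to probability density function can be traced in code. The sketch below is a minimal illustration, not the author's spreadsheet. It recomputes the histogram columns from the 29 Cavendish density values, then evaluates the Normal density and checks by a simple midpoint sum that the total area under f(x) is 1. Assumptions are labeled in the comments: the lower edge 3.75 of the first bin is invented so that the bin has a width, each bin is divided by its own width (so the bin from 4.5 to 5, width 0.5, gets 0.069 where the slides' table lists 0.138), and mu = 5.42 and sigma = 0.34 are illustrative values close to the sample mean and standard deviation.

```python
from math import exp, pi, sqrt

# Cavendish's 29 density values (g/cm^3), as listed in the slides.
density = [
    5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
    5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
    5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39,
]

# Bin upper limits from the slides. The first edge (3.75) is an assumption,
# added only so that the bin ending at 4 has a width.
edges = [3.75, 4.0, 4.25, 4.5, 5.0, 5.25, 5.5, 5.75, 6.0]

n = len(density)
rows = []          # (upper limit, freq, rel. freq, rel. freq / width, cum. rel. freq)
cumulative = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    freq = sum(1 for x in density if lo < x <= hi)
    rel = freq / n
    cumulative += rel
    rows.append((hi, freq, rel, rel / (hi - lo), cumulative))

for hi, freq, rel, dens, cum in rows:
    print(f"{hi:5.2f}  {freq:3d}  {rel:.6f}  {dens:.6f}  {cum:.6f}")

# Area of each bar = (rel. freq / width) * width, so the bar areas sum to 1.
bar_area = sum(dens * (hi - lo)
               for (hi, _, _, dens, _), lo in zip(rows, edges[:-1]))

def normal_pdf(x, mu, sigma):
    """Normal (Gaussian) probability density f(x)."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Illustrative parameters (assumed), integrated over mu +/- 8 sigma.
mu, sigma = 5.42, 0.34
steps = 10_000
dx = 16 * sigma / steps
area = sum(normal_pdf(mu - 8 * sigma + (i + 0.5) * dx, mu, sigma) * dx
           for i in range(steps))
print(f"sum of bar areas = {bar_area:.6f}, area under f(x) = {area:.6f}")
```

Both totals equal 1 for the same reason: each bar area is a relative frequency, and f(x) dx is its limit as the bars shrink to width dx.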
The Normal Distribution: Areas
Referring to the x axis:
• Area from μ-σ to μ+σ is 0.6826
• Area from μ-2σ to μ+2σ is 0.9544
• Area from μ-3σ to μ+3σ is 0.9974
• Area from μ-1.96σ to μ+1.96σ is 0.9500
• Total area under the curve = 1.0000

Excel: Descriptive Statistics
Mean                       5.42
Standard Error             0.0629
Median                     5.46
Mode                       5.29
Standard Deviation         0.3388
Sample Variance            0.1148
Kurtosis                   8.487
Skewness                  -2.329
Range                      1.79
Minimum                    4.07
Maximum                    5.86
Sum                      157.17
Count                     29
Largest(1)                 5.86
Smallest(1)                4.07
Confidence Level(95.0%)    0.1289
Accepted Value: 5.50 g/mL

Measures of Location
The Mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The Median: the value of x_i at rank (n+1)/2 in an array ordered from lowest to highest. About 50% of the ordered density values observed fall below the median.

Comments on the Mean versus the Median
Rank   x
1      5
2      7
3      9
4     10
5     14
6     16
7     17
8     50
rank of median = 4.5
median = 12
mean = 16

Measures of Location (Con.)
The Mode: the value of x_i at the peak of the histogram (the most frequent value as defined by the mid-point of the bar corresponding to the peak).
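The measures of location and the Excel summary can be reproduced with Python's standard `statistics` module. This is a sketch for checking the numbers, not the spreadsheet used in the lecture. The variable names are illustrative, and `multimode` is used because the data contain two equally frequent values (5.29 and 5.34), of which the Excel output lists only 5.29.

```python
import statistics as st
from math import sqrt

# Cavendish's 29 density values (g/cm^3), as listed in the slides.
density = [
    5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
    5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
    5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39,
]

n = len(density)
total = sum(density)
mean = total / n

# Median by the (n+1)/2 rank rule (n is odd here, so the rank is an integer).
ordered = sorted(density)
median_by_rank = ordered[(n + 1) // 2 - 1]
median = st.median(density)

modes = st.multimode(density)        # all most-frequent values
stdev = st.stdev(density)            # sample standard deviation (n - 1 divisor)
variance = st.variance(density)      # sample variance
std_error = stdev / sqrt(n)          # standard error of the mean
data_range = max(density) - min(density)

print(f"count={n}  sum={total:.2f}  mean={mean:.2f}  median={median:.2f}")
print(f"modes={modes}  stdev={stdev:.4f}  variance={variance:.4f}")
print(f"standard error={std_error:.4f}  range={data_range:.2f}")
```

The mean (5.42) sits below the median (5.46) because the two low readings, 4.07 and 4.88, pull the mean down while leaving the median almost unaffected.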
Measures of Dispersion
Range: highest value of x_i - lowest value of x_i
Variance:

$$s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}$$

Standard Deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}}$$

Comments on Standard Deviation
x      x-xbar    (x-xbar)^2
35     -18.67    348.44
47      -6.67     44.44
48      -5.67     32.11
50      -3.67     13.44
51      -2.67      7.11
53      -0.67      0.44
54       0.33      0.11
70      16.33    266.78
75      21.33    455.11
xbar = 53.67
sum of (x-xbar) = 0.00
sum of (x-xbar)^2 = 1168.00
variance = 1168.00/8 = 146.00
stdev = 12.08

Quartiles and Quantiles
Rank of the First Quartile: i = 0.25n + 0.5 (Value of First Quartile = Q1)
Rank of the Third Quartile: i = 0.75n + 0.5 (Value of Third Quartile = Q3)
Rank of the bth Quantile: i = bn + 0.5
Value of the Inter Quartile Range = Q3 - Q1

Use of Quartiles and Quantiles
• Box Plots
• Defining Cumulative Distributions

Summary to this Point:
• Discrete variables
  – Binomial Distribution
• Continuous variables:
  – Probability is the same as relative frequency
  – Relative frequency can be expressed as a histogram
  – The fit of a "narrow bar" histogram where relative frequency has been replaced by probability is a probability density function
  – The most famous p.d.f. is the Normal Curve (or Gaussian Distribution)

Need for the Standard Normal Distribution
• The mean, μ, and standard deviation, σ, depend upon the data, so a wide variety of values are possible
• To generalize about data we need:
  – to define a standard curve and
  – a method of converting any Normal curve to the standard Normal curve

The Standard Normal Distribution
μ = 0, σ = 1

Transforming Normal to Standard Normal Distributions
• Observations x_i are transformed to z_i:

$$z_i = \frac{x_i - \mu}{\sigma}$$

The Standard Normal Distribution

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{z^{2}}{2}\right]$$

The Use of Standard Normal Curves
Statistical Tables
• Convert x to z
• Use tables of area of curve segments between different z values on the standard normal curve to define probabilities

Z Table
http://www.statsoft.com/textbook/stathome.html

P.D.F. of z
[Figure: Standard Normal Curve; f(z) versus z, for z from -6 to 6]
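The table lookup described above can be mimicked with the error function from Python's `math` module. This is a sketch under stated assumptions: the function names are illustrative, and the example values mu = 5.42 and sigma = 0.34 are rounded from the density data's summary statistics. It converts x to z and computes the area under the standard Normal curve between two z values.

```python
from math import erf, exp, pi, sqrt

def to_z(x, mu, sigma):
    """Transform an observation x to the standard Normal variable z."""
    return (x - mu) / sigma

def std_normal_pdf(z):
    """f(z) for the standard Normal curve (mu = 0, sigma = 1)."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def std_normal_cdf(z):
    """Area under the standard Normal curve from -infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_between(z_lo, z_hi):
    """Area under the standard Normal curve between z_lo and z_hi."""
    return std_normal_cdf(z_hi) - std_normal_cdf(z_lo)

# Areas that a z table would give for symmetric intervals about zero.
for k in (1, 1.96, 2, 3):
    print(f"area from -{k} to +{k}: {area_between(-k, k):.4f}")

# Example conversion with assumed (rounded) parameters for the density data.
mu, sigma = 5.42, 0.34
z = to_z(5.50, mu, sigma)
print(f"x=5.50 -> z={z:.3f}, area below = {std_normal_cdf(z):.4f}")
```

Printed tables round each half-area to four decimals, so areas read from a table can differ from the computed values in the last digit.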