STATISTICS
Summarizing, Visualizing and Understanding Data

I. Populations, Variables, and Data

Populations and Samples
To a statistician, the population is the set or collection under investigation.
Individual members of the population are not usually of interest. Rather, investigators try to infer, with some degree of confidence, the general features of the population.

Examples
Students currently enrolled at a certain university.
Registered voters in a certain Congressional district.
The population of largemouth bass in a certain lake.
The population of all decay times of a radioactive isotope.

Statistical Inference
Statistical inference: drawing conclusions about a population from observations on a smaller subset of the population, and quantifying the reliability of those conclusions.
Sample: the subset observed.

Variables and Data
A population variable is a descriptive number or label associated with each member of a population.
The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population.
Values of variables that have been recorded for a population or a sample from a population constitute data.

Types of Data
Nominal variables are variables whose values are labels.
Ordinal variables are variables whose values have a natural order.
Interval variables have values represented by numbers referring to a scale of measurement.
Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.

Guess the Type
Age
Questionnaire responses: 1 = "strongly agree", 2 = "agree", ..., 5 = "strongly disagree"
Letter grades
Reading comprehension scores
Gender
Zip codes
Molecular velocities

II. Summarizing Data

Location Measures (Measures of Central Tendency)
A location measure, or measure of central tendency, for a variable is a single value or number that is taken as representing all the values of the variable.
Different location measures are appropriate for different types of data.

The Mean
For interval or ratio variables $x$.
$N$ = number of individuals in the sample or population.
$x_i$ = value of $x$ for the $i$th individual.
$\bar{x} = \frac{1}{N}(x_1 + x_2 + \cdots + x_N)$
The mean of a population variable is denoted by $\mu$ (the Greek letter mu).

The Mean with Repeated Values
Distinct values of $x$: $x_1, x_2, \ldots, x_M$
$n_j$ = frequency of occurrence of $x_j$
$\bar{x} = \frac{1}{N}(n_1 x_1 + n_2 x_2 + \cdots + n_M x_M)$

The Mean with Repeated Values (continued)
Relative frequencies: $f_j = \frac{n_j}{N}$
$\bar{x} = f_1 x_1 + f_2 x_2 + \cdots + f_M x_M$

Example
$x_j$:  -2   1   3   4   6
$n_j$:   2   1   3   5   3
Here $N = 14$ and $\bar{x} = \frac{1}{14}(2(-2) + 1(1) + 3(3) + 5(4) + 3(6)) = \frac{44}{14} \approx 3.14$.

The Median
Informally, the "middle" value when all the values are arranged in order.
A number $m$ is a median of $x$ if at least half the individuals $i$ in the population have $x_i \le m$ and at least half of them have $x_i \ge m$.

The Median: Example 1
x: -2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions)
median(x) = 2.2

The Median: Example 2
x: -2.0, 1.5, 3.1, 3.1, 3.1
median(x) = 3.1

The Median: Example 3
x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1
median(x) = any number in [3.1, 5.7]
By convention, for an even number of individuals choose the midpoint between the smallest and largest medians, e.g., $m = \frac{3.1 + 5.7}{2} = 4.4$.

Example
Change 7.1 to 71 in Example 3. What happens to the mean and the median?
The mean changes from 3.55 to 14.2.
The median does not change.
The median is much less sensitive to outliers (which may be mistakes in recording data).

The Median for Ordered Categories
Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Count:  8   5   10  18  18  15  14  6   4   1   1   0
N = 100. The median grade is B-.

The Mode
The data value with the greatest frequency.
Not useful for interval or ordinal data if recorded with precision.
The only useful location measure for strictly nominal data.

Example
Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Count:  8   5   10  18  18  15  14  6   4   1   1   0
The modes are B and B-.

Cumulative Frequencies and Percentiles
$x$ is an interval or ratio variable.
Ordered distinct values: $x_1 < x_2 < \cdots < x_M$
Relative frequencies: $f_1, f_2, \ldots, f_M$

Cumulative Frequencies and Percentiles (continued)
Cumulative frequencies:
$N_1 = n_1$
$N_2 = n_1 + n_2$
$N_3 = n_1 + n_2 + n_3$
...
$N_M = n_1 + n_2 + \cdots + n_M$
Cumulative relative frequencies:
$F_1 = f_1$
$F_2 = f_1 + f_2$
$F_3 = f_1 + f_2 + f_3$
...
$F_M = f_1 + f_2 + \cdots + f_M$

The Weather Person's Prediction Errors
$x_j$:   -2      1      3      4      6
$n_j$:    2      1      3      5      3
$N_j$:    2      3      6     11     14
$f_j$:  .1429  .0714  .2143  .3571  .2143
$F_j$:  .1429  .2143  .4286  .7857  1.000

Exercise
From the table above, what fraction of the data is less than 1? What fraction is greater than 3? What fraction is greater than or equal to 3?

Percentiles
$x$: an interval or ratio variable.
A number $a$ is a $p$th percentile of $x$ if at least $p$% of the values of $x$ are less than or equal to $a$ and at least $(100 - p)$% of the values of $x$ are greater than or equal to $a$.
The 25th percentile is called the first quartile of $x$ and the 75th percentile is the third quartile of $x$. The 50th percentile is the second quartile or median.

Example
For the weather person's errors, the 25th percentile is 3. The 50th percentile and third quartile are both 4.

Measures of Variability
Statisticians are not only interested in describing the values of a variable by a single measure of location. They also want to describe how much the values of the variable are dispersed about that location.

Population Variance and Standard Deviation
$x$: an interval or ratio variable.
$N$ = number of individuals in the population.
Variance of $x$: $\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$
Standard deviation of $x$: $\sigma = \sqrt{\sigma^2}$

Sample Variance and Standard Deviation
$n$: the number of individuals in a sample from a population.
Sample variance: $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2}$

Alternative Formulas for the Variance
Using frequencies: $\sigma^2 = \frac{n_1 (x_1 - \mu)^2 + n_2 (x_2 - \mu)^2 + \cdots + n_M (x_M - \mu)^2}{N}$
Using relative frequencies: $\sigma^2 = f_1 (x_1 - \mu)^2 + f_2 (x_2 - \mu)^2 + \cdots + f_M (x_M - \mu)^2$

The Interquartile Range
$Q_1$, $Q_3$: 1st and 3rd quartiles, respectively.
Interquartile range: $IQR = Q_3 - Q_1$
Not influenced by a few extremely large or small observations (outliers).

The Range
The difference between the largest data value and the smallest.
The range of sample values is not a reliable indicator of the range of a population variable.

III. Graphical Methods

Pie Charts (Circle Graphs)
Sources: AT&T (1961), The World's Telephones; R: A language and environment for statistical computing, the R core development team.

Bar Charts (Bar Graphs)

Pros and Cons
A bar chart has a scale of measurement, so it gives more precise information.
A pie chart gives a more vivid impression of relative proportions, e.g., it is obvious at a glance that N. America had more than half the telephones in the world.

Stemplots (Stem and Leaf Diagrams)
Grades of 50 students on a test:
Stem | Leaves                   Cumulative frequency
   4 | 7                         1
   5 | 448889                    7
   6 | 34789                    12
   7 | 012234455666888889999    33
   8 | 0022234457799            46
   9 | 0457                     50

Find the Median
With 50 grades, the median lies between the 25th and 26th ordered values. In the stemplot above these are the 13th and 14th leaves on stem 7, and both are 78.
Median = 78

Exercise
Find the quartiles from the stemplot above.
The 1st quartile is 70 and the 3rd quartile is 82.
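
The stemplot calculations above can be checked with a short script. This is only a sketch: the helper name `percentile` is hypothetical, and taking the midpoint of the smallest and largest valid percentiles is an assumption that extends the median convention stated earlier to all percentiles.

```python
import math

# Grades reconstructed from the stemplot (stem = tens digit, leaf = ones digit).
stemplot = {
    4: "7",
    5: "448889",
    6: "34789",
    7: "012234455666888889999",
    8: "0022234457799",
    9: "0457",
}
grades = sorted(
    10 * stem + int(leaf) for stem, leaves in stemplot.items() for leaf in leaves
)


def percentile(sorted_values, p):
    """Return a p-th percentile using the definition in the text.

    A number a qualifies if at least p% of the values are <= a and at least
    (100 - p)% are >= a. When a whole interval qualifies, the midpoint is
    returned (assumed convention, borrowed from the median).
    """
    n = len(sorted_values)
    k = p * n / 100
    lo_idx = max(math.ceil(k), 1)        # smallest qualifying order statistic
    hi_idx = min(math.floor(k) + 1, n)   # largest qualifying order statistic
    return (sorted_values[lo_idx - 1] + sorted_values[hi_idx - 1]) / 2


q1 = percentile(grades, 25)
median = percentile(grades, 50)
q3 = percentile(grades, 75)
iqr = q3 - q1
print(len(grades), q1, median, q3, iqr)
```

The interquartile range follows directly as the difference of the two quartiles.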

Boxplots (Box and Whisker Diagrams)

Elements of a Boxplot
Figure labels: largest outlier, box, whisker, quartiles, median.

Boxplot Shows Distribution Skewed to the Left

Histograms
For interval or ratio data.
Data is grouped into class intervals.
Superficially like a bar chart.

Frequency Histogram
Height = bin frequency.
Figure label: class interval (bin).
Source: R: A language and environment for statistical computing, the R core development team.

Probability Histogram
Area of bar = relative bin frequency.
E.g., $.011 \times 25 = .275$

Ogives (Cumulative Frequency Polygons)
Related to probability histograms.
Ogives are examples of cumulative distribution functions.
Probability histograms are examples of density functions.

Example Ogive

Relationship Between Probability Histogram and Ogive
The height of the ogive is the cumulative area under the histogram.

Estimating Percentiles from Ogives
The horizontal line has height .75.
The vertical line intersects the horizontal axis at 60.
The estimated 3rd quartile is 60.
The true 3rd quartile is 62.

Scatterplots (Scatter Diagrams)
Used for jointly observed interval or ratio variables.
Example: Heights and weights of individuals.
Example: State per capita spending on secondary education and state crime rate.
Example: Wind speed and ozone concentration.

Example Scatterplot
Figure label: centroid.

Fitting a Line
The relationship between variables $x$ and $y$ is approximately linear.
Approximately, $y = a + bx$.
Find $a$ and $b$ so that the data come closest to satisfying the equation.
Least squares: a formal mathematical technique to be shown later.

Line Fitted by Least Squares

IV. Sampling

Why Sample?
Because the population is too large to observe all its members.
The population may be partly inaccessible.
The population may even be hypothetical.

Statistical Inference
Drawing conclusions about the population based on observations of a sample.
Reliability of inferences must be quantifiable.
Random sampling allows probability statements about the accuracy of inferences.

Sampling With Replacement
The population has $N$ members.
$n$ population members are chosen sequentially.
Once chosen, a member of the population may be chosen again.
At each stage, all members of the population are equally likely to be chosen.
Random experiment with $N^n$ possible equally likely outcomes.

Sampling With Replacement (continued)
$x$ is a population variable.
$X_1$ = value of $x$ for the 1st sampled individual, $X_2$ = value of $x$ for the 2nd sampled individual, etc.
Each $X_i$ is a random variable.
The random variables $X_1, X_2, \ldots, X_n$ are independent.
The sequence $X_1, X_2, \ldots, X_n$ is a random sample of values of $x$, or a random sample from the distribution of $x$.

Sampling Without Replacement
The population has $N$ individuals.
$n$ members are chosen sequentially.
Once chosen, an individual may not be chosen again.
At each stage, all of the remaining members are equally likely to be chosen next.
Random experiment with $N(N - 1) \cdots (N - n + 1)$ possible equally likely outcomes.

Sampling Without Replacement (continued)
Sample without replacement.
Ignore the order of the sequence of individuals in the sample.
Random experiment whose outcomes are subsets of size $n$.
The experiment has $C_{N,n} = \frac{N!}{n!(N - n)!}$ possible equally likely outcomes.
This is the common meaning of "random sample of size $n$".

Random Number Generators
Calculators and spreadsheet programs can generate pseudorandom sequences.
Press the random number key of your calculator several times.
This simulates a random sample with replacement from the set of numbers between 0 and 1 (to high precision).

Generating a Sample with Replacement
Number the individuals from 1 to $N$.
Generate a pseudorandom number $R$.
Include individual $i$ in the sample if $i - 1 \le RN < i$.
Repeat $n$ times.
Individuals may be included more than once.

Exercise
Suppose you have 30 students in your class. Use the procedure just described to obtain a sample of size 10 (a) with replacement, (b) without replacement.

V. Estimation

The Sample Mean and Standard Deviation
$X_1, X_2, \ldots, X_n$ is a random sample from the distribution of a population variable $x$.
The sample mean is $\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$
The sample variance is $S^2 = \frac{1}{n - 1}[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2]$

The Sample Mean and Standard Deviation (continued)
The sample standard deviation is $S = \sqrt{S^2}$
The sample mean, variance and standard deviation are all random variables because they depend on the outcome of the random sampling experiment.

Estimators
The sample mean, variance, and standard deviation have distributions derived from the distribution of values of the population variable $x$.
They are estimators of the population mean $\mu$, the population variance $\sigma^2$, and the population standard deviation $\sigma$ of $x$.

Unbiased Estimators
The theoretical expected values of the sample mean and sample variance are equal to their population counterparts, i.e., $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$.
$\bar{X}$ and $S^2$ are said to be unbiased estimators of $\mu$ and $\sigma^2$, respectively.
$S$ is biased: $E(S) < \sigma$.

The Distribution of the Random Variable $\bar{X}$
The mean of $\bar{X}$ is $\mu$, the same as the mean of the population variable $x$.
The standard deviation of $\bar{X}$ is $\sigma / \sqrt{n}$.
These are the theoretical mean and standard deviation.

Density Functions
A density function is a nonnegative function such that the total area between the graph of the function and the horizontal axis is 1.
A probability histogram is a density function.
Other density functions are limits of histograms as the number of data elements grows without bound.

The Standard Normal Density Function

Percentiles of the Standard Normal Distribution
$z_\alpha$ is the $100(1 - \alpha)$th percentile of the distribution.

Symmetry About the Vertical Axis

Probabilities Related to the Standard Normal Distribution

Other Normal Distributions
Let $Z$ be a random variable with the standard normal distribution.
The mean of $Z$ is 0 and the standard deviation of $Z$ is 1.
Let $\mu$ and $\sigma$ be any numbers, $\sigma > 0$.
Let $Y = \sigma Z + \mu$.
$Y$ has the normal distribution with mean $\mu$ and standard deviation $\sigma$.
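
The percentile $z_\alpha$ and the transformation $Y = \sigma Z + \mu$ can be illustrated with Python's standard-library `statistics.NormalDist`. This is a sketch only: the helper name `z_upper` is hypothetical, and the values $\mu = 1$, $\sigma = 1.5$ are chosen purely for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def z_upper(alpha):
    """z_alpha: the 100(1 - alpha)th percentile of the standard normal."""
    return std_normal.inv_cdf(1.0 - alpha)


z_025 = z_upper(0.025)  # roughly 1.96
z_05 = z_upper(0.05)    # roughly 1.645

# Symmetry about the vertical axis: P[Z <= -z] equals P[Z >= z].
lower_tail = std_normal.cdf(-z_025)
upper_tail = 1.0 - std_normal.cdf(z_025)

# Y = sigma * Z + mu is normal with mean mu and standard deviation sigma,
# so each percentile of Y is sigma * (percentile of Z) + mu.
mu, sigma = 1.0, 1.5  # illustrative values (assumption)
y_dist = NormalDist(mu, sigma)
y_975_from_z = sigma * z_025 + mu
y_975_direct = y_dist.inv_cdf(0.975)

print(round(z_025, 3), round(z_05, 3), round(y_975_from_z, 3))
```

Both routes to the 97.5th percentile of $Y$ agree, which is the content of the statement that $Y$ is normal with mean $\mu$ and standard deviation $\sigma$.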

Other Normal Distributions
Example: $\mu = 1$ and $\sigma = 1.5$

Standardizing: The Inverse Operation
Let $Y$ be normally distributed with mean $\mu$ and standard deviation $\sigma$.
Let $Z = \frac{Y - \mu}{\sigma}$. This is the z-score of $Y$.
Then $Z$ has the standard normal distribution and
$P[a \le Y \le b] = P\left[\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right]$

The Central Limit Theorem
Let $\bar{X}$ be the sample average of a random sample of $n$ values of a population variable $x$.
The population variable $x$ has mean $\mu$ and standard deviation $\sigma$.
Standardize $\bar{X}$ by subtracting its mean and dividing by its standard deviation:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}$

The Central Limit Theorem (continued)
Get Ready for the Central Limit Theorem!

The Central Limit Theorem (continued)
The Central Limit Theorem: As the sample size $n$ grows without bound, the distribution of $Z$ approaches the standard normal distribution.
This is true no matter what the distribution of values of the population variable $x$.

Another Statement of the CLT
For sufficiently large sample sizes $n$ and for all numbers $a$ and $b$,
$P[a \le \bar{X} \le b] \approx P\left[\frac{\sqrt{n}(a - \mu)}{\sigma} \le Z \le \frac{\sqrt{n}(b - \mu)}{\sigma}\right]$
In almost all applications, $n \ge 50$ is large enough.

The CLT in Action
Sample $n = 30$ values from the population variable COUNTS whose distribution is tabulated below. Calculate the sample average.
Repeat this 500 times and construct a histogram of the z-scores of the 500 sample averages.
Note: The distribution of COUNTS is very far from normal.

Distribution of COUNTS
$x_j$:   0    1    2    3    4    5    6
$f_j$:  .36  .33  .19  .08  .02  .01  .01

Result: 500 Averages of 30 Samples from COUNTS

Estimating a Population Mean
The sample mean $\bar{X}$ is an unbiased estimator of the population mean $\mu$.
For "large" sample sizes $n$, $\bar{X}$ has approximately a normal distribution with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.
For large $n$, the sample mean is an accurate estimator of the population mean with high probability.

Example
Suppose $\sigma = 2$ and we want $\bar{X}$ to estimate $\mu$ with an error no greater than 0.05.
Assume $\bar{X}$ is exactly normally distributed. Standardize.
$P[|\bar{X} - \mu| \le 0.05] = P[|Z| \le 0.025\sqrt{n}]$

Probabilities of 1-place Accuracy, $\sigma = 2$

Confidence Intervals for the Population Mean: Review of $z_{\alpha/2}$

$100(1 - \alpha)$% Confidence Interval
By the CLT,
$1 - \alpha \approx P\left[-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z_{\alpha/2}\right]$
Rearranging the inequalities,
$1 - \alpha \approx P\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$

A Difficulty
$\sigma$ is probably unknown, so the confidence interval $\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ can't be used.
What to do?

Enhanced Central Limit Theorem
Define the modified z-score for $\bar{X}$ as
$Z = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)}{S}$
As $n$ grows without bound, the distribution of $Z$ approaches the standard normal distribution.

A More Useful Confidence Interval
By the enhanced CLT,
$1 - \alpha \approx P\left[\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right]$
An approximate $100(1 - \alpha)$% confidence interval is $\bar{X} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}$

Example
$n = 50$ from COUNTS ($\mu = 1.14$)
$\bar{X} = 1.32$, $S = 1.39$, $1 - \alpha = .95$
$\bar{X} \pm z_{\alpha/2}\frac{S}{\sqrt{n}} = 1.32 \pm 1.96 \cdot \frac{1.39}{\sqrt{50}} = 1.32 \pm 0.39$
95% confidence interval: (0.93, 1.71)
Don't say $.95 = P[0.93 < \mu < 1.71]$: $\mu$ is a fixed number, not a random variable, so the 95% describes the sampling procedure rather than this particular interval.

Confidence Intervals for Proportions
$x$ is a population variable with only two values, 0 and 1.
It is a numerical code for two mutually exclusive categories, e.g., "male" and "female", or "approves" and "disapproves".
$p$ = relative frequency of $x = 1$.
$\mu = p$; $\sigma^2 = p(1 - p)$

Confidence Intervals for Proportions (continued)
Sample $n$ values of $x$, with replacement.
The result is a sequence of 1s and 0s.
The sample mean is the relative frequency of 1s in the sample, e.g., the relative frequency of females in the sample of individuals.
Denote the sample mean by $\hat{p}$ since it is an estimator of $p$.

Confidence Intervals for Proportions (continued)
By the enhanced CLT, $Z = \frac{\sqrt{n}(\hat{p} - p)}{\sqrt{\hat{p}(1 - \hat{p})}}$ is approximately standard normal.
An approximate $100(1 - \alpha)$% confidence interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Example
A public opinion research organization polled 1000 randomly selected state residents. Of these, 413 said they would vote for a 1¢ sales tax increase dedicated to funding higher education. Find a 90% confidence interval for the proportion of all voters who would vote for such a proposal.

Solution
$n = 1000$
$\hat{p} = \frac{413}{1000} = 0.413$
$1 - \alpha = .90$; $z_{\alpha/2} = z_{.05} = 1.645$
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.413 \pm 1.645\sqrt{\frac{0.413 \times 0.587}{1000}} = 0.413 \pm 0.026$
90% confidence interval: (0.387, 0.439)

Linear Regression and Correlation
$x$ and $y$ are jointly observed numeric variables, i.e., defined for the same population or arising from the same experiment.
We have observations for $n$ individuals or outcomes.
Data: $(x_1, y_1), \ldots, (x_n, y_n)$

Examples
(An observational study) Let $x$ be the height and $y$ the weight of individuals from a human population.
(A designed experiment) Let $x$ be the amount of fertilizer applied to a plot of cotton seedlings and let $y$ be the weight of raw cotton harvested at maturity.

Data on Fertilizer and Cotton Yield
$x$:  2    2    2    4    4    4    6    6    6    8    8    8
$y$:  2.3  2.2  2.2  2.5  2.9  2.7  3.4  2.7  3.4  3.5  3.4  3.3

Scatterplot of Fertilizer vs. Yield

Assumptions of Linear Regression
There is a population or distribution of values of $y$ for any particular value of $x$.
There are unknown constants $a$ and $b$ so that for any particular value of $x$, the mean of all the corresponding values of $y$ is $\mu_y = a + bx$.
The standard deviation of the values of $y$ corresponding to a value of $x$ is the same for all values of $x$.

The Method of Least Squares
Estimate $a$ and $b$ by choosing them to minimize the sum of squared differences between the observed values $y_i$ and their putative expected values $a + bx_i$.
In symbols, minimize
$(y_1 - a - bx_1)^2 + (y_2 - a - bx_2)^2 + \cdots + (y_n - a - bx_n)^2$

The Least Squares Estimates
Let $\bar{x}$ and $\bar{y}$ be the means of the observed x's and y's.
Let $s_x^2$ be the sample variance of the x's.
The covariance between the x's and the y's is
$s_{xy} = \frac{1}{n - 1}[(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})]$
The least squares estimate of the slope is $b = \frac{s_{xy}}{s_x^2}$
The least squares estimate of the intercept is $a = \bar{y} - b\bar{x}$

Least Squares Line for Cotton Yield

Correlation
The correlation between the x's and y's is $r = \frac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of the x's and y's.
$r$ is related to the slope $b$ of the least squares regression line by $r = b\frac{s_x}{s_y}$
$r$ is always between -1 and 1.
$r$ measures how nearly linear the relationship between $x$ and $y$ is.
If $r = 0$, then $x$ and $y$ are uncorrelated.

Examples
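
As one worked case, the least squares and correlation formulas from this section can be applied to the fertilizer and cotton yield data tabulated earlier. The script below is a minimal sketch that computes everything directly from the definitions; the variable names are illustrative, and the printed values are this script's output, not figures taken from the slides.

```python
from math import sqrt

# Fertilizer (x) and cotton yield (y) data from the table in the text.
x = [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
y = [2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variances and covariance, all with divisor n - 1.
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Least squares estimates of slope and intercept.
b = cov_xy / var_x
a = y_bar - b * x_bar

# Correlation, computed two equivalent ways.
r = cov_xy / (sqrt(var_x) * sqrt(var_y))
r_from_slope = b * sqrt(var_x) / sqrt(var_y)

print(f"slope b = {b:.4f}, intercept a = {a:.4f}, r = {r:.4f}")
```

For these data the slope is about 0.198 and the intercept about 1.883, with $r$ close to 0.91, so the relationship is fairly close to linear over the observed range of fertilizer amounts.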