Chapter 3: Descriptive Statistics / Describing Distributions with Numbers

A. Common Measures of Location – these measures give you a sense of where a data point falls relative to the other data points.
Note: a statistic calculated from a sample is called a sample statistic. The corresponding quantity calculated from a population is called a population parameter.
1. Mean – this is the most common measure. It is simply the average, and the first of three measures of centrality.
   a. Sample mean: x̄ = Σxi / n = (x1 + x2 + … + xn) / n
   b. Population mean: μ = Σxi / N
   Note: n = total number in the sample; N = total number in the population.
2. Median (Md) – the middle value when the data are arranged in ascending order; the second measure of center.
   - If the data have an odd number of points, there is a true middle value.
   - If the data have an even number of points, take the average of the two middle points.
   Example: Given 3, 6, 7, 8, 8, 10, there is an even number of points, so we take (x3 + x4) / 2 = (7 + 8) / 2 = 7.5.
3. Mode (Mo) – the value that occurs with the greatest frequency; the last measure of centrality. If more than one value takes on the greatest frequency, we say the data are bi-, tri-, or multi-modal.
4. Percentiles/Quartiles – tell us how the data are spread over a 100-percent interval from smallest to largest. How to calculate the pth percentile:
   (a) Arrange the data in ascending order.
   (b) Compute the index i = (p/100)·n, where p = the percentile of interest and n = the number of observations.
   (c) If i is not an integer, round up; the value at that position is the pth percentile. If i is an integer, the pth percentile is the average of the ith and (i+1)th values.
   a. For the lower quartile (Q1, or 25th percentile):
      i. Sort all observations in ascending order.
      ii. Compute the position L1 = 0.25·N, where N is the total number of observations.
      iii. If L1 is a whole number, the lower quartile is midway between the L1-th value and the next one.
      iv.
If L1 is not a whole number, round it up to the nearest integer; the value at that position is the lower quartile.
   b. For the upper quartile (Q3, or 75th percentile):
      i. Sort all observations in ascending order.
      ii. Compute the position L3 = 0.75·N, where N is the total number of observations.
      iii. If L3 is a whole number, the upper quartile is midway between the L3-th value and the next one.
      iv. If L3 is not a whole number, round it up to the nearest integer; the value at that position is the upper quartile.
   Example: 61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100
   Given this test-score data, if we want the 40th percentile we obtain it as follows:
   i = (40/100)·16 = 0.4·16 = 6.4, which rounds up to 7. So the 7th value is the 40th percentile; this corresponds to 74.
5. Quartiles – these divide the data into 4 equal parts.
   Q1 = first quartile, or the 25th percentile
   Q2 = second quartile, or the 50th percentile
   Q3 = third quartile, or the 75th percentile
   The quartiles are calculated just as shown above.
6. Five-number summary – a way to describe the data with 5 important values: the minimum, the maximum, and the 3 quartiles above.
   [Graph 1: Test scores from before – box plot of the test scores in percent, axis running from 60 to 100.]
7. Weighted Mean
   a. Weighted mean: μ = Σ wi·xi / Σ wi; this is used when the data points take on different levels of importance, or weights.
   Example: Suppose we want the average cost per pound given the following data:
   Purchase   Cost/lb   # lbs
   1          3         1200
   2          4         500
   3          2         800
   μ = Σ wi·xi / Σ wi = (1200·3 + 500·4 + 800·2) / (1200 + 500 + 800) = 7200 / 2500 = 2.88
   So the average cost per pound is $2.88.
   b. Sample mean if the data are grouped: x̄ = Σ fi·Mi / Σ fi = Σ fi·Mi / n
      fi = the frequency for class i
      Mi = the midpoint for class i
      n = sample size; this always equals the total frequency over all classes.
   Note: the sample variance for grouped data is s² = Σ fi·(Mi – x̄)² / (n – 1)

B. Measures of Variability – these give us an idea of how dispersed the data are.
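The percentile index rule and the weighted-mean formula above can be sketched in a few lines of Python. This is a minimal illustration using the worked examples from this section; the helper names `percentile` and `weighted_mean` are my own, not from any particular library:

```python
import math

def percentile(data, p):
    """p-th percentile via the index rule in the notes:
    i = (p/100)*n; if i is not a whole number, round up;
    if it is, average the i-th and (i+1)-th ordered values."""
    xs = sorted(data)
    n = len(xs)
    i = (p / 100) * n
    if i != int(i):                      # not a whole number: round up
        return xs[math.ceil(i) - 1]      # 1-based position -> 0-based index
    i = int(i)
    return (xs[i - 1] + xs[i]) / 2       # whole number: average i-th and (i+1)-th

def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100]
print(percentile(scores, 40))    # 74   (index 6.4 rounds up to position 7)
print(percentile(scores, 25))    # 70.0 (Q1: average of 4th and 5th values)
print(percentile(scores, 75))    # 89.0 (Q3: average of 12th and 13th values)

# Cost-per-pound example: the pounds purchased are the weights
print(round(weighted_mean([3, 4, 2], [1200, 500, 800]), 2))   # 2.88
```

Note that library routines such as NumPy's `percentile` interpolate between order statistics by default, so they will not always match this textbook index rule exactly.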
The data can be tightly bunched or not; these measures tell us what scale the data are on and where the majority of the data points lie.
1. Range – the largest value minus the smallest value; gives you a sense of the difference between the extreme points.
2. Interquartile Range – IQR = Q3 – Q1; gives the range of the middle 50% of the data.
   - Rule for outliers: a point that lies farther than 1.5·IQR below Q1 or above Q3 is generally called an outlier.
3. Variance – measures how different each data point is from the mean. It gives us an overall idea of how far the data are from the mean value.
   a. Population variance: σ² = Σ (xi – μ)² / N
   b. Sample variance: s² = Σ (xi – x̄)² / (n – 1)
      Alternative formula: s² = [Σ xi² – (Σxi)²/n] / (n – 1)
   Notes:
   i. We divide by n – 1 for the sample variance because it has been found to give a better estimator of the population variance than dividing by n.
   ii. We must make sure to square the differences. If we did not, the sum of all the differences would simply add to 0; i.e. Σ (xi – x̄) = 0.
   iii. Variances cannot really be compared to one another, since different variables are on different scales. The variance of ages in a class is in much smaller units than the variance of incomes, which would be in thousands.
4. Standard Deviation – a better measure of spread because it gets us back to the units of the variable of interest.
   a. Population standard deviation: σ = √σ²
   b. Sample standard deviation: s = √s²
5. Coefficient of Variation – tells how the standard deviation relates to the mean in terms of magnitude. It gives us a way to compare spread across different distributions.
   Coefficient of Variation = (standard deviation / mean) · 100
   The smaller the value, the more compact the data; the larger the value, the more dispersed the data.
6. Skewness
   For skewness we look at where the mean is relative to the median and where the tail of the data is pointing.
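The variance, standard deviation, and coefficient-of-variation formulas above can be sketched the same way. This is a minimal illustration on a small made-up sample; the data and helper names are mine, chosen only so the arithmetic is easy to check by hand:

```python
import math

def sample_variance(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1); dividing by n - 1
    gives a better estimator of the population variance."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """s = sqrt(s^2): spread expressed back in the variable's own units."""
    return math.sqrt(sample_variance(xs))

def coeff_of_variation(xs):
    """(s / mean) * 100: lets us compare spread across different scales."""
    return sample_std(xs) / (sum(xs) / len(xs)) * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample with mean 5
print(sample_variance(data))                 # 32/7, about 4.571
print(round(sample_std(data), 3))            # 2.138
print(round(coeff_of_variation(data), 1))    # 42.8

# The alternative (shortcut) formula gives the same answer:
n = len(data)
alt = (sum(x ** 2 for x in data) - sum(data) ** 2 / n) / (n - 1)
print(alt)                                   # 32/7, about 4.571
```

The last print confirms that the definitional and shortcut forms of s² agree, which is a useful sanity check when computing by hand.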
So there are two cases. The data can be skewed to the right, as in the case below: the mean lies to the right of the median, and the tail of the distribution points in the rightward direction.
   [Figure: right-skewed distribution – relative frequency curve with the median to the left of the mean.]
   If the mean is to the left of the median, the data are said to be left skewed. An example can be seen below: the tail of the distribution points to the left.
   [Figure: left-skewed distribution – relative frequency curve with the mean to the left of the median.]
   If the mean and the median are in the same place, then the data are evenly proportioned and there is said to be symmetry. This is what we mean when we say a distribution is normal or bell-shaped. See chapter 6 for more information.
7. Empirical Rule and Chebyshev's Theorem
   a. Empirical Rule – gives the percentage of data that lies within 1, 2, and 3 standard deviations of the mean if we have a normal distribution. See chapter 6.
   b. Chebyshev's Theorem – if we have a population with mean μ and standard deviation σ, then for any value k > 1 (k need not be an integer), at least 100(1 – 1/k²)% of the data points lie in the interval μ ± k·σ.
   Example: Suppose μ = 5 and σ = 1. For k = 1.5, 100(1 – 1/1.5²) ≈ 55.6%. So this tells us that at least 55.6% of the data lies between 5 ± 1.5, i.e. between 3.5 and 6.5. This is a very conservative bound, because there could in fact be much more.

C. Covariance / Correlation / Line of Best Fit (OLS)
1. Cross-tabulations – a method of showing the relationship between the data for two variables simultaneously.
   - Very good at showing the relationship between two variables.
   - Can be used with quantitative variables, qualitative variables, or both.
   - You may use frequency, percent frequency, or relative frequency distributions.
2. Scatter diagram/plot – the typical graphical representation of the relationship between two quantitative variables. It can show a positive, negative, or no relationship.
   [Graphs 1–4: scatter plots illustrating positive, negative, and no relationship.]
   a. Explanatory variable – may explain or influence the response variable. It is generally called the independent variable.
   b.
Response variable – measures the outcome of a study. This is often called the dependent variable.
   Example: Suppose we want to focus on GPA. GPA would be the response variable, and all the variables we think would influence GPA would be the explanatory variables, such as:
   - study time
   - ACT/SAT score
   - age
   Note: we can just as easily choose a different variable of interest as the response variable; this would most likely change the variables we use to explain it.
   c. Positive relationship – one variable moves with the other in the same direction: as one variable goes up, so does the other.
   d. Negative relationship – one variable moves with the other in the opposite direction: as one variable goes up, the other goes down.
   e. Outlier – an observation that falls well outside the overall pattern of the relationship between two variables.
   f. Strength – how closely the data follow a certain pattern. If the data closely follow a specific pattern, the relationship is said to be strong; if not, it is said to be weak.
3. Covariance – measures the linear association between two variables. Mathematically:
   a. Sample covariance: Sxy = (1/(n–1)) Σ (xi – x̄)(yi – ȳ) for i = 1, …, n
   b. Population covariance: σxy = (1/N) Σ (xi – μx)(yi – μy) for i = 1, …, N
   Example: Suppose we have the following data sample:
   x: 2, 6, 7, 4, 3
   y: 15, 14, 12, 17, 12
   We first need the mean of each variable. We find that x̄ = 4.4, ȳ = 14, and n = 5.
   Using these facts we can now find the covariance from the formula above:
   Sxy = (1/4)·[(2 – 4.4)(15 – 14) + (6 – 4.4)(14 – 14) + (7 – 4.4)(12 – 14) + (4 – 4.4)(17 – 14) + (3 – 4.4)(12 – 14)]
       = (1/4)·[–2.4 + 0 – 5.2 – 1.2 + 2.8] = (1/4)·(–6) = –1.5
   Note: the sign of the covariance tells you whether the relationship is positive or negative. A value very close to zero implies no linear relationship.
4.
Correlation coefficient – measures the relationship between two variables, and also measures its magnitude.
   a. Sample correlation: rxy = (1/(n–1)) Σ ((xi – x̄)/Sx)·((yi – ȳ)/Sy) = Sxy / (Sx·Sy)
   b. Population correlation: ρxy = (1/N) Σ ((xi – μx)/σx)·((yi – μy)/σy) = σxy / (σx·σy)
   Note: the correlation always lies between –1 and 1. The closer it is to either of those values, the stronger the relationship; the closer to 0, the weaker the relationship between the two variables.
   Example: Let's use the same data as before. To calculate the correlation we need the covariance and the standard deviations of both the x and y variables:
   Sx = √[((2 – 4.4)² + (6 – 4.4)² + (7 – 4.4)² + (4 – 4.4)² + (3 – 4.4)²) / 4] = √[(5.76 + 2.56 + 6.76 + 0.16 + 1.96) / 4] = √(17.2/4) = √4.3 ≈ 2.07
   Sy = √[((15 – 14)² + (14 – 14)² + (12 – 14)² + (17 – 14)² + (12 – 14)²) / 4] = √[(1 + 0 + 4 + 9 + 4) / 4] = √(18/4) = √4.5 ≈ 2.12
   rxy = Sxy / (Sx·Sy) = –1.5 / ((2.07)(2.12)) ≈ –0.34
   This tells us there is a moderate negative relationship between the x and y data. If we graph the data we can see this.
   [Graph 3: scatter plot of the x and y data, showing a downward-sloping pattern.]
5. Least Squares Method / Regression Line
   - The least squares method is what is used to determine the best-fit line; it is not simply drawn in by eye.
   - Essentially, the data are plotted as a scatter diagram, and the line that minimizes the difference between the actual data points and the line drawn is the best-fit line.
   a. Least squares criterion – the mathematical statement of the idea above:
      min Σ (yi – ŷi)², where yi = the actual value of the dependent variable for the ith observation and ŷi = the estimated value of the dependent variable for the ith observation.
   - We find the b0 and b1 that minimize this sum.
   b. Slope and y-intercept of the estimated regression equation:
      i.
b1 = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)², or equivalently b1 = Sxy / Sx²
      ii. b0 = ȳ – b1·x̄
   - We use the above equations to estimate the regression equation and obtain a relationship between x and y.
   Note: we are assuming a linear relationship with this estimation, but in other estimation techniques this need not be the case. For our purposes we will always assume a linear form.
   c. Graphical example of the above analysis:
   [Figure: scatter of y against x with the line of best fit ŷ drawn through it, showing three distances for a given data point:
   (ŷ – yi) – the difference between the estimated line and the data point;
   (yi – ȳ) – the difference between the data point and the average;
   (ŷ – ȳ) – the distance from the estimated line to the mean.]
   **The important thing to take away is that the regression line, or "line of best fit," minimizes the distance from each actual data point to the line.
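The covariance, correlation, and least-squares formulas from this section can all be checked with a short script, using the x and y sample from the covariance example. This is plain Python with no libraries assumed; the variable names are my own:

```python
import math

xs = [2, 6, 7, 4, 3]          # the x data from the covariance example
ys = [15, 14, 12, 17, 12]     # the y data
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n        # 4.4 and 14

# Sample covariance: Sxy = (1/(n-1)) * sum((x_i - xbar)(y_i - ybar))
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

# Sample standard deviations of x and y
s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

# Correlation coefficient and least-squares line
r = s_xy / (s_x * s_y)
b1 = s_xy / s_x ** 2          # slope: Sxy / Sx^2
b0 = ybar - b1 * xbar         # intercept: ybar - b1 * xbar

print(round(s_xy, 6))         # -1.5
print(round(r, 3))            # -0.341 (a moderate negative relationship)
print(round(b1, 3))           # -0.349
print(round(b0, 3))           # 15.535
```

So for this sample the estimated regression equation is ŷ ≈ 15.535 – 0.349·x: as x rises by one unit, the predicted y falls by about 0.35.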