SCIENTIFIC INQUIRY AND ANALYSIS
UNIT 2: STATISTICAL DATA ANALYSIS

OBJECTIVES: The student will be able to:
• Create a frequency table from a set of data. (CCCS.HSS.ID.A.1)
• Compute, interpret, and analyze the measures of central tendency (mean, median, and mode) of a set of data. (CCCS.HSS.ID.A.2)
• Compute measures of spread (variance, standard deviation, quartiles, and interquartile range). (CCCS.HSS.ID.2)
• Graph one variable by hand (histogram, boxplot). (CCCS.HSS.ID.A.1)
• Identify outliers informally and recognize their effect on a set of data. (CCCS.HSS.ID.A.3)
• Define the characteristics of the Normal distribution by examining a histogram. (CCCS.HSS.ID.4)
• Explain how a histogram, which is a discrete probability distribution, is related to the Normal distribution curve, a continuous probability distribution. (CCCS.HSS.ID.A.4)
• Determine if a given set of data is approximately Normal using the empirical rule (68 - 95 - 99.7 rule). (CCCS.HSS.ID.A.4)
• Estimate areas under the Normal curve using the empirical rule.
• Graph two variables by hand (scatterplot). (CCCS.HSS.ID.B.6)
• Describe a scatterplot in terms of form, direction, strength, and the presence of outliers. (CCCS.HSS.ID.B.6)
• Find equations of lines of best fit by fitting a line by hand and using technology (TI-84 regression function and/or Excel). (CCCS.HSS.ID.B.6.A)
• Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data. (CCCS.HSS.ID.C.7)
• Compute the correlation coefficient using technology (TI-84 or Excel) and interpret it in the context of the data.
(CCCS.HSS.ID.C.8)
• Informally assess the fit of a function by plotting and analyzing residuals. (CCCS.HSS.ID.B.6.B)
• Make predictions based upon analysis of data. (5.2.12.A.3)
• Distinguish between correlation and causation. (CCCS.HSS.ID.C.9)

• Statistics – a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions from the data.

• Measures of Central Tendency – methods that describe an entire sample or population with a single number known as an average (mode, median, and mean).

• Mode – the value that occurs most frequently in the data.
  – Example 1: What is the mode of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
  – Example 2: What is the mode of the following data: (2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5)?
  – The mode is not a stable average, but it gives the most common value in a distribution if that is the information desired.
  – A data set can have more than one mode.

• Median – the central value of an ordered distribution of data.
  – If there is an odd number of data values, it is the center value.
  – If there is an even number of data values, there are two center values, therefore: Median = (sum of the two middle values) / 2
  – Example 1: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10)?
  – Example 2: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10, 120)?
  – The median is a more stable average than the mode, but it does not indicate the range of values above or below it.
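
The mode and median examples above can be checked with a short script. This is only a sketch, assuming Python 3.8 or later (for `statistics.multimode`); the variable names are illustrative and not part of the original slides.

```python
from statistics import median, multimode

# Data sets copied from the mode and median examples above.
mode_example_1 = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
mode_example_2 = [2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5]
median_example_1 = [62, 3, 5, 28, 67, 33, 22, 2, 10]
median_example_2 = [62, 3, 5, 28, 67, 33, 22, 2, 10, 120]

# multimode returns every value tied for the highest frequency,
# so a data set with more than one mode yields a longer list.
modes_1 = multimode(mode_example_1)
modes_2 = multimode(mode_example_2)

# median orders the data itself; for an even count it averages
# the two middle values, as described on the slide.
median_1 = median(median_example_1)
median_2 = median(median_example_2)

print("Mode, example 1:", modes_1)
print("Mode(s), example 2:", modes_2)
print("Median, example 1:", median_1)
print("Median, example 2:", median_2)
```

Using `multimode` rather than `mode` makes ties visible, which matters for the second mode example.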

• Mean – adds all values of a distribution of data and divides by the number of values.
  – Sample mean: $\bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all values}}{\text{number of values}}$
  – Population mean: $\mu = \frac{\sum x}{N}$ (N = the number of values in the population)
  – Trimmed Mean: removes the highest and lowest values of a group of data before taking the mean. The typical trim amounts are 5% or 10%.
  – 5% Trimmed Mean: take 5% of the number of data points, round the result to a whole number, remove that many values from both the top and the bottom, and then take the average of what remains.
  – Example: Given the following data, take the 5% trimmed mean: 34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100.
    • 5% of 16 values is 0.8; round up to 1 and remove the top and bottom scores.
    • Remove 34 and 100; the remaining 14 values add up to 1181, and 1181 / 14 = 84.4%.
    • If no trimming is done, the mean is 82.2%.

• Measures of Variation – describe the spread of the data.

• Range – the difference between the largest and smallest values of a distribution.
  – Example 1: What is the range of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
  – The range fails to tell how much the values vary from one another.

• Sample Standard Deviation – a measurement that gives you a better idea of how the data entries differ from the mean.
  – Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
  – Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
  – x = a value in the distribution
  – $\bar{x}$ = the sample mean value of the distribution
  – n = the total number of values in a sample distribution

• Population Standard Deviation – this is the same as the sample standard deviation, except that it includes the complete population that you are studying, not just a sample set.
  – NOTE: the symbol is different and you divide by the size of the whole population (N).
  – Population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
  – x = a value in the distribution
  – $\mu$ = the population mean value of the distribution
  – N = the total number of values in the population

• Standard Deviation
  – Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).

    x      $x - \bar{x}$        $(x - \bar{x})^2$
    1      1 – 6.5 = -5.5       30.3
    2      ____                 ____
    7      ____                 ____
    9      ____                 ____
    10     ____                 ____
    10     ____                 ____

    $\sum (x - \bar{x})^2$ = ____
    Mean = $\bar{x}$ = ____     $s^2$ = ____     s = ____

• Standard Deviation – the following is an alternate way to calculate the sample standard deviation.
  – $s = \sqrt{\frac{SS_x}{n - 1}}$ where $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
  – Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.

    x      $x^2$
    1      1
    2      4
    7      ____
    9      ____
    10     ____
    10     ____

    $\sum x$ = ____     $\sum x^2$ = ____     $SS_x$ = ____     s = ____

• Coefficient of Variation – the standard deviation indicates the spread of the data around the mean value; the coefficient of variation (CV) expresses that spread as a percentage of the mean.
  – CV for a sample: $CV = \frac{s}{\bar{x}} \times 100$
  – CV for a population: $CV = \frac{\sigma}{\mu} \times 100$
  – s = sample standard deviation
  – $\bar{x}$ = the sample mean value of the distribution
  – $\sigma$ = population standard deviation
  – $\mu$ = the population mean value of the distribution

• Histograms
  – Sometimes it is difficult to see how data is distributed by just looking at the numbers. To see how data is distributed, a histogram is used.
  – A histogram is a type of bar graph, except that all of the bars touch and the width of each bar is meaningful: it represents a range of data values.
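
The two sample standard deviation formulas and the coefficient of variation from the preceding slides can be compared numerically. The following is a minimal sketch using the example values (1, 2, 7, 9, 10, 10); the function names are hypothetical and chosen only for illustration.

```python
from math import sqrt

def sample_std_definition(values):
    """Sample standard deviation from the squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    return sqrt(sum_sq_dev / (n - 1))

def sample_std_shortcut(values):
    """Sample standard deviation from SSx = sum(x^2) - (sum x)^2 / n."""
    n = len(values)
    ss_x = sum(x * x for x in values) - sum(values) ** 2 / n
    return sqrt(ss_x / (n - 1))

def coefficient_of_variation(values):
    """Sample CV: standard deviation as a percentage of the mean."""
    mean = sum(values) / len(values)
    return sample_std_definition(values) / mean * 100

data = [1, 2, 7, 9, 10, 10]
s_definition = sample_std_definition(data)
s_shortcut = sample_std_shortcut(data)
cv_percent = coefficient_of_variation(data)

print(f"s (definition) = {s_definition:.2f}")
print(f"s (shortcut)   = {s_shortcut:.2f}")
print(f"CV             = {cv_percent:.1f}%")
```

Both functions should agree up to floating-point error, because the shortcut is an algebraic rearrangement of the definition rather than a different statistic.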

• Histograms
  [Figure: histogram titled "Probability Test". Horizontal axis: Test Scores, in classes 59.5 – 65.5, 65.5 – 71.5, 71.5 – 77.5, 77.5 – 83.5, 83.5 – 89.5, 89.5 – 95.5, 95.5 – 101.5. Vertical axis: # of students, 0 to 10.]

• Histograms Procedure
  1. Decide how many classes (bars) you want. It will be given by the problem.
  2. To figure out the width of the bars, divide the range by the # of bars and then round up to the next whole number. (NOTE: Always round up, even if the decimal part is less than 5, i.e. 5.41 rounds to 6.)
     $\text{Bar Width} = \frac{\text{highest value} - \text{lowest value}}{\text{\# of bars}}$
  3. Add the bar width to the lowest value to get the range of the first bar, then add the bar width to the end of that range to get the range of the next bar. Keep going until you have all of your bar ranges. (i.e. if the lowest value was 60 and your bar width was 6, then the first bar would be 60 – 66, the second bar would be 66 – 72, etc.)
     A problem occurs if a data point is exactly 66, as in the example, because it sits on the edge of two bars. To avoid this problem, a boundary is calculated for the bars.
  4. Calculate the boundaries of each bar:
     a. Find the interval of the data. Is the data given down to whole numbers, tenths, hundredths, etc.? (Note: the data will always have the same interval.)
     b. Take the interval and divide by 2. This is the boundary adjustment. (i.e. whole numbers means intervals of 1, so 1 / 2 = 0.5)
     c. For each bar range calculated in step 3, subtract the boundary adjustment from both the lower and the upper limit. These will be your new bar ranges, or boundaries. (i.e. 60 – 0.5 = 59.5 and 66 – 0.5 = 65.5; first bar 59.5 – 65.5)
  5. Calculate the midpoint of each bar:
     a. Take the upper and lower limit of a bar, add them together, and divide by 2. This will be the midpoint. (i.e. (59.5 + 65.5) / 2 = 62.5)
        $\text{bar midpoint} = \frac{\text{bar upper limit} + \text{bar lower limit}}{2}$
     b. Do this for all of the rest of the bars.
     c. The midpoint is sometimes used instead of the boundaries to graph the bars.
  6. Construct a frequency table by using tally marks.
     Classes: 59.5 – 65.5, 65.5 – 71.5, 71.5 – 77.5, 77.5 – 83.5, 83.5 – 89.5, 89.5 – 95.5, 95.5 – 101.5
     Tally marks (in class order): || | || | || |||| || |||| ||||
  7. Graph the frequency table using a bar graph arrangement.

• Draw the histogram for the following data. Put it into 5 classes. The data is the number of passing touchdowns for the top 20 rated quarterbacks in the 2011 season.

    45  46  39  31  41  15  29  29  17  21
    27  16  13  18  21  18   9  20  13  20

• Histograms
  – If the midpoint of each class is plotted, the points can be connected with straight line segments.
  – This straight-line graph of the midpoints is known as a Frequency Polygon.

• Histograms
  – What did the histogram indicate?
  – Histograms can be used as a means of predicting outcomes or probabilities. These are known as probability distributions.
  – One of the famous probability distributions is the normal distribution, also known as the normal curve or bell curve.

• Normal Distribution
  – The graph to the right is an example of a normal distribution. Not only does it indicate the results of the scores, but it can also be used for probability or predictions.

• Normal Distribution Properties
  – The curve is bell shaped with the highest point at the mean value.
  – It is symmetrical about a vertical line through the mean value.
  – The curve approaches the horizontal axis but never touches it.
  – The transition points (where the curve changes from cupping down to cupping up) occur at (mean + standard deviation) and (mean – standard deviation).

• Empirical Rule
  – For a normal distribution the following can be said about the data:
    • 68.2% of the data will lie within 1 standard deviation on either side of the mean.
    • 95.4% of the data will lie within 2 standard deviations on either side of the mean.
    • 99.7% of the data will lie within 3 standard deviations on either side of the mean.

• Normal Distribution Properties – share of the data in each band on one side of the mean:
  • Between the mean and 1$\sigma$: 34.1%
  • Between 1$\sigma$ and 2$\sigma$: 13.6%
  • Between 2$\sigma$ and 3$\sigma$: 2.15%
  • Beyond 3$\sigma$: 0.15%
  • These percentages are used to indicate probabilities.

• Example: Assume the heights of college women are normally distributed, with a mean of 65 inches and a standard deviation of 2.5 inches.
  – What % of women are taller than 65 inches? (Or: if one woman is selected, what is the probability that she is taller than 65 inches?)
  – Shorter than 65 inches?
  – Between 62.5 and 67.5 inches?
  – Between 60 and 70 inches?

• Percentiles
  – Sometimes it is more important to see the relative position of a piece of data than its exact value.
  – Percentile refers to where a data point lies relative to the other data in the distribution. A data point at the nth percentile means n% of the data falls at or below that point and (100 – n)% falls at or above that point.
  – Example: You scored in the 85th percentile, therefore 85% of the people who took the test scored at or below you while 15% scored at or above you. Note: this does NOT mean you scored 85% on the test.

• Percentiles
  – The median is a type of percentile. It is the middle data point in the distribution, therefore it is at the 50th percentile.
  – A special type of percentile known as the quartile is also used to evaluate the position of data.
  – Quartiles split data into fourths.
  – The 1st quartile (Q1) is the 25th percentile, the 2nd quartile (Q2) is the median, and the 3rd quartile (Q3) is the 75th percentile.

• Quartiles: Q1, Q2, Q3
  – Interquartile Range (IQR) = Q3 – Q1

• Quartiles
  – Procedure to compute quartiles:
    1. Order the data from smallest to largest.
    2. Find the median; this is the 2nd quartile, Q2.
    3. The first quartile Q1 is then the median of the lower half of the data. It is the median of the data falling below Q2, not including Q2.
    4. The third quartile Q3 is then the median of the upper half of the data. It is the median of the data falling above Q2, not including Q2.

• Quartiles
  – Example (even # of data): Find Q1, Q2, Q3, and the IQR for the following data: (3, 4, 9, 13, 20, 24)
    1. Find Q2. Find the median of all of the data. There is no center data point, so take the mean of the two center data points: (9 + 13) / 2 = 11.
    2. Find Q1. Find the median of the first half of the data, not including Q2. Q1 = 4
    3. Find Q3. Find the median of the second half of the data, not including Q2. Q3 = 20
    4. IQR = Q3 – Q1 = 20 – 4 = 16.

• Quartiles
  – Example: A study of ice cream bars was done. Twenty-seven bars tested were rated as tasting "fair." The cost per bar is listed below. Find the quartiles and the IQR.

    0.99  1.07  1.00  0.50  0.37  1.03  1.07  1.07  0.97
    0.63  0.33  0.50  0.97  1.08  0.47  0.84  1.23  0.25
    0.50  0.40  0.33  0.35  0.17  0.38  0.20  0.18  0.16

• Quartiles
  – Q1, Q2, Q3, the highest value, and the lowest value of a set of data together are known as a Five-Number Summary.
  – In order to graphically represent the five-number summary, a Box-and-Whisker Plot will be used.
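
The quartile procedure above (median of each half, excluding Q2) can be sketched in a few lines. This is an illustrative sketch rather than a standard library routine: the helper names `quartiles` and `five_number_summary` are hypothetical, and other quartile conventions (such as those in spreadsheets or calculators) may give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # skip the middle value when n is odd
    return median(lower), q2, median(upper)

def five_number_summary(data):
    """Return (lowest value, Q1, Q2, Q3, highest value)."""
    q1, q2, q3 = quartiles(data)
    return min(data), q1, q2, q3, max(data)

# Even-count example from the slides.
small_q1, small_q2, small_q3 = quartiles([3, 4, 9, 13, 20, 24])

# Ice cream bar costs from the slides (27 values).
ice_cream = [0.99, 1.07, 1.00, 0.50, 0.37, 1.03, 1.07, 1.07, 0.97,
             0.63, 0.33, 0.50, 0.97, 1.08, 0.47, 0.84, 1.23, 0.25,
             0.50, 0.40, 0.33, 0.35, 0.17, 0.38, 0.20, 0.18, 0.16]
ice_q1, ice_q2, ice_q3 = quartiles(ice_cream)
ice_iqr = ice_q3 - ice_q1
ice_summary = five_number_summary(ice_cream)

print("Small example:", small_q1, small_q2, small_q3)
print("Ice cream quartiles:", ice_q1, ice_q2, ice_q3)
print("Ice cream IQR:", round(ice_iqr, 2))
print("Five-number summary:", ice_summary)
```

The slice `ordered[half + n % 2:]` is what implements "not including Q2": with an odd number of values the middle value is dropped from both halves, and with an even number the data simply splits in two.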

• Quartiles
  – Box-and-Whisker Plot (shown vertically, but it can be drawn horizontally as well)
  [Figure: vertical box-and-whisker plot labeled, from top to bottom: Highest Value, Q3, Q2, Q1, Lowest Value.]

• Quartiles
  – Procedure to make a Box-and-Whisker Plot:
    • Draw a vertical scale to include the lowest and highest data values.
    • To the right of the scale, draw a box from Q1 to Q3.
    • Include a solid line through the box at the median level.
    • Draw solid lines called whiskers from Q1 to the lowest value and from Q3 to the highest value.
  – EXAMPLE: Go back to the ice cream problem and create a box-and-whisker plot.

• Outliers
  – Sometimes a few data values can skew the average of a set of data.
  – When a data value lies more than 1.5 times the difference between the 1st and 3rd quartiles (the IQR) below Q1 or above Q3, then it may be considered an outlier.
  – Outliers are sometimes removed from the data so that they do not skew the results.

• Scatter Plots
  – Remember from last unit that data can be plotted as a series of x and y points known as a scatter plot.
  – We estimated a line of best fit. In doing this, we were finding a linear correlation that exists between the x and y points.
  – We shall analyze the data of a scatter plot more closely in the next couple of slides.

    Time (seconds)    Position (meters)
    0.7               3.8
    1.8               3.2
    2.6               2.8
    3.4               2.2
    3.8               1.8
    4.1               1.4
    4.9               0.8
    6.0               0.2
    6.5               0

• Scatter Plots
  – The y-distance that a data point is away from the line of best fit is known as a Residual.
  – The optimal line of best fit occurs when the sum of the squares of all of the residual values is the smallest. This is known as finding the line of best fit through the Least Squares method.
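
The least-squares criterion can be illustrated with the time and position data above. The sketch below computes the line from the standard closed-form least-squares solution (slope = SSxy / SSx, intercept = mean of y minus slope times mean of x) and then checks that nudging the slope or intercept only increases the sum of squared residuals. The function names and the nudge sizes are assumptions made for illustration.

```python
# Time (seconds) and position (meters) from the scatter plot table.
times = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]
positions = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]

def sum_squared_residuals(m, b, xs, ys):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    m = ss_xy / ss_x
    b = sum(ys) / n - m * sum(xs) / n
    return m, b

m, b = least_squares_line(times, positions)
best_sse = sum_squared_residuals(m, b, times, positions)

# Any other line should give a larger sum of squared residuals.
nudges = [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged_sse = [sum_squared_residuals(m + dm, b + db, times, positions)
              for dm, db in nudges]

print(f"y = {m:.4f}x + {b:.3f}")
print(f"sum of squared residuals = {best_sse:.4f}")
print("nudged lines:", [round(sse, 4) for sse in nudged_sse])
```

Comparing against a handful of nudged lines does not prove optimality, but it makes the "smallest sum of squares" idea concrete before the formulas are worked by hand.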

• Least Squares Method
  – Recall that the equation of a line is in the format: $y = mx + b$
  – This method will allow us to find the optimal slope (m) and the y-intercept (b) based on the data.
  – We will use a similar method here as we did for calculating standard deviation.

• Least Squares Method: $y = mx + b$
  – To find the slope m, the following equation is used:
    $m = \frac{SS_{xy}}{SS_x}$ where $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
  – To find the y-intercept b, the following equation is used:
    $b = \bar{y} - m\bar{x}$ where $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x

    x: Time (seconds)    y: Position (meters)    $x^2$    $xy$
    0.7                  3.8                     ____     ____
    1.8                  3.2                     ____     ____
    2.6                  2.8                     ____     ____
    3.4                  2.2                     ____     ____
    3.8                  1.8                     ____     ____
    4.1                  1.4                     ____     ____
    4.9                  0.8                     ____     ____
    6.0                  0.2                     ____     ____
    6.5                  0                       ____     ____

    $\sum x$ = ____    $\sum y$ = ____    $\sum x^2$ = ____    $\sum xy$ = ____
    $\bar{x}$ = ____    $\bar{y}$ = ____

• Example:
  1. From the example on the previous page find the slope:
     $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ = ____
     $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$ = ____
     $m = \frac{SS_{xy}}{SS_x}$ = ____
  2. From the example on the previous page find the y-intercept:
     $b = \bar{y} - m\bar{x}$ = ____
  3. Write the equation for the line of least squares:
     $y = mx + b$

  [Graph 1: Movement of a Car. Scatter plot of Position (meters) against Time (seconds) with the least-squares line y = -0.6974x + 4.419, R² = 0.9886.]

• Measuring the Spread of Data
  – There are three methods for measuring the spread of the data around the line of least squares:
    • Standard Error of Estimate
    • Coefficient of Correlation
    • Coefficient of Determination

• Standard Error of Estimate
  – In order to do this, we look at how far each y data point is from the least squares line.
  – This method will calculate a value that is representative of the spread of all of the data.
  – We will use values that were already calculated in figuring out the least squares line.

• Standard Error of Estimate
  $S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}}$
  – Why would it be n – 2? (In other words, why does n have to be > 2?)
  – Use the same method as before to find m, $SS_{xy}$, and $SS_x$; $SS_y$ is found the same way:
    $SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$

    x: Time (seconds)    y: Position (meters)    $y^2$
    0.7                  3.8                     ____
    1.8                  3.2                     ____
    2.6                  2.8                     ____
    3.4                  2.2                     ____
    3.8                  1.8                     ____
    4.1                  1.4                     ____
    4.9                  0.8                     ____
    6.0                  0.2                     ____
    6.5                  0                       ____

    $\sum x$ = ____    $\sum y$ = ____    $\sum y^2$ = ____
    Previously Calculated Data: $SS_{xy}$ = ____    m = ____

• Example:
  1. From the example on the previous page find the following:
     $SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$ = ____
  2. From the above calculation and the previously calculated data, find the standard error of estimate:
     $S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}}$ = ____

• Linear Correlation Coefficient, r
  – So far, we have been able to figure the line of best fit by using the line of least squares (which is also known as the "least squares regression line of y on x").
  – We then wanted to determine the quality of our line by using the standard error of estimate.
  – The problem with the standard error of estimate is that it has units of y; therefore, when looking at two different sets of data, you cannot say that one graph is better than the other because the units may skew the result.
  – The linear correlation coefficient helps to alleviate this problem by calculating a number that is unitless and therefore independent of the units.

• Linear Correlation Coefficient, r
  $r = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$
  – The value of r:

    r                    Indication
    0                    There is no linear relationship between the data points.
    1 or -1              There is a perfect linear relationship between the x and y data points; all points lie on the least-squares line.
    Between 0 and 1      The x and y data points have a positive correlation (+ slope).
    Between 0 and -1     The x and y data points have a negative correlation (- slope).

    x: Time (seconds)    y: Position (meters)
    0.7                  3.8
    1.8                  3.2
    2.6                  2.8
    3.4                  2.2
    3.8                  1.8
    4.1                  1.4
    4.9                  0.8
    6.0                  0.2
    6.5                  0

    Previously Calculated Data: $SS_{xy}$ = ____    $SS_x$ = ____    $SS_y$ = ____

• Example:
  1. From the example on the previous page find the following:
     $r = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$ = ____
  2. What does the value of r indicate about the correlation of the data points?

• Coefficient of Determination, r²
  – Another way of looking at the quality of your data is to look at how far away some y-data point (y) is from the mean of the y-data ($\bar{y}$). This is simply the deviation, $y - \bar{y}$.
  – The deviation is made up of two parts:
    • The first part indicates how far away the least squares line ($y_p$) is from the mean of the y-data ($\bar{y}$). This is simply $y_p - \bar{y}$, and it is known as the explained portion of the deviation.
    • The second part indicates how far away a particular y-data point (y) is from the least squares line ($y_p$). This is simply $y - y_p$, and it is known as the unexplained portion of the deviation.

• Coefficient of Determination, r²
  – Recall that when the deviation is squared we get the variance, or variation. Based on the explanation above, the variation has two parts: the explained variation and the unexplained variation.
  – The Coefficient of Determination is the ratio of the explained variation to the total variation and is simply calculated by taking the Correlation Coefficient (r) and squaring it.

• Coefficient of Determination, r²
  – So what does r² indicate?
  – Change r² into a %.
  – The % indicates what % of the variation of the y data is explained by the variation of the x data if we use the least squares line.
  – 100% − r² indicates what % of the variation of the y data is due to random chance or to some variable other than x that may influence y.

• Example:
  1. From the previous example find the Coefficient of Determination, r²:
     $r^2 = \left(\frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}\right)^2$ = ____
  2. What does the value of r² indicate about the explained and unexplained portions of the variation?

• Correlation vs Causation
  – Correlation refers to one variable changing as another variable changes.
  – Causation refers to one variable changing because of another variable changing. (Cause & Effect)
  – Just because there is a correlation between two variables does not mean there is causation.

DO NOW / HW Unit 2-1 Check
• Have out your homework and do the following: Find the mode, median, mean, and standard deviation.

    60%   86%   94%   100%
    63%   89%   94%   100%
    66%   89%   94%   100%
    74%   91%   94%   100%
    74%   91%   97%   100%
    77%   94%   97%   100%
    100%

HW Assignment 2-1 Check
• 10, 12, 14, 18, 36, 37, pg. 449 – 50

    10. 8.33, 9, 9
    12. 85.625, 85.5, 91
    14. 2.77, 2.9, 2.9
    18. 14, 4
    36. $233,071.43, $142,000, none
    37. $645,000, $213,242.66

DO NOW / HW Unit 2-2 Check
• Have out your homework and do the following: Make a histogram of the following data in 7 classes. These were the top 32 quarterback ratings in the NFL in 2012.

    108.0   99.1   90.7   87.4   83.3   81.2   77.4   72.6
    105.8   98.7   90.5   87.2   82.6   79.8   76.5   72.2
    102.4   97.0   88.6   86.2   81.6   79.1   76.1   66.9
    100.0   96.3   87.7   85.3   81.3   78.1   74.0   66.7

DO NOW / HW Unit 2-2 Check

    RANGE: 41.3
    CLASSES: 7
    BAR WIDTH: 6
    BAR STARTING POINT: 66.7
    UPPER BAR RANGES: 72.7, 78.7, 84.7, 90.7, 96.7, 102.7, 108.7
    INTERVAL: 0.1
    BOUNDARY ADJUSTMENT: 0.05
    BOUNDARIES STARTING POINT: 66.65
    BOUNDARY RANGES: 72.65, 78.65, 84.65, 90.65, 96.65, 102.65, 108.65

    Boundaries           # OF QB'S
    66.65 - 72.65        4
    72.65 - 78.65        5
    78.65 - 84.65        7
    84.65 - 90.65        7
    90.65 - 96.65        2
    96.65 - 102.65       5
    102.65 - 108.65      2

HW Assignment 2-2

EXPERIMENTAL DESIGN
• Standard Deviation
  – Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).

    x      $x - \bar{x}$        $(x - \bar{x})^2$
    1      1 – 6.5 = -5.5       30.3
    2      2 – 6.5 = -4.5       20.3
    7      7 – 6.5 = 0.5        0.3
    9      9 – 6.5 = 2.5        6.3
    10     10 – 6.5 = 3.5       12.3
    10     10 – 6.5 = 3.5       12.3

    Mean = 39 / 6 = 6.5
    $\sum (x - \bar{x})^2$ = 81.8
    $s^2$ = 81.8 / 5 = 16.4
    s = 4.05

    (The squared deviations above are rounded to one decimal place. With unrounded values the sum is 81.5, $s^2$ = 16.3, and s = 4.04, which agrees with the alternate method.)

• Standard Deviation
  – Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.

    x      $x^2$
    1      1
    2      4
    7      49
    9      81
    10     100
    10     100

    $\sum x$ = 39
    $\sum x^2$ = 335
    $SS_x$ = 335 – 39² / 6 = 81.5
    s = 4.04
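
The Unit 2-2 histogram check above can be reproduced by following the histogram procedure step by step. This is a minimal sketch under stated assumptions: the function name `histogram_classes` is hypothetical, the data interval (0.1) is passed in by hand rather than detected, and `math.ceil` is used for the "always round up" rule, which leaves an already-whole width unchanged.

```python
import math

# Top 32 NFL quarterback ratings from the Unit 2-2 check.
ratings = [108.0, 99.1, 90.7, 87.4, 83.3, 81.2, 77.4, 72.6,
           105.8, 98.7, 90.5, 87.2, 82.6, 79.8, 76.5, 72.2,
           102.4, 97.0, 88.6, 86.2, 81.6, 79.1, 76.1, 66.9,
           100.0, 96.3, 87.7, 85.3, 81.3, 78.1, 74.0, 66.7]

def histogram_classes(data, n_classes, interval):
    """Return (bar width, class boundaries, frequency per class)."""
    low, high = min(data), max(data)
    # Step 2: divide the range by the number of bars and round up.
    width = math.ceil((high - low) / n_classes)
    # Step 4: shift every limit down by half the data interval.
    adjustment = interval / 2
    boundaries = [round(low - adjustment + i * width, 10)
                  for i in range(n_classes + 1)]
    # Step 6: count how many values fall inside each pair of boundaries.
    counts = [0] * n_classes
    for value in data:
        for i in range(n_classes):
            if boundaries[i] < value < boundaries[i + 1]:
                counts[i] += 1
                break
    return width, boundaries, counts

bar_width, class_boundaries, frequencies = histogram_classes(ratings, 7, 0.1)

print("Bar width:", bar_width)
for i, count in enumerate(frequencies):
    lower, upper = class_boundaries[i], class_boundaries[i + 1]
    print(f"{lower:.2f} - {upper:.2f}: {count}")
```

Because the boundaries sit halfway between possible data values, no rating can land exactly on a boundary, which is why the strict comparisons in the counting loop are safe here.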