Warm up
The following graphs show foot sizes of gongshowhockey.com users.
What shape are the distributions?
Calculate the mean, median and mode for one.
[Figure: two bar graphs, FreqSep5 (vertical axis 0 to 450) and FreqApr1 (vertical axis 0 to 80), each over foot sizes 8, 9, 10, 11, 12, 13, 13+]

Measures of Spread
Chapter 3.3 – Tools for Analyzing Data
I can: calculate and interpret measures of spread
MSIP/Home Learning: p. 168 #2b, 3b, 4, 6, 7, 10

What is spread?
Spread tells you how widely the data are dispersed.
The histograms have identical mean and median, but the spread is different.
[Figure: two histograms, Count against data and Count against sp]

Why worry about spread?
Spread indicates how closely the values cluster around the middle value.
Less spread means you have greater confidence that values will fall within a particular range.

Vocabulary
Spread and dispersion refer to the same thing.
1) range = max – min
A quartile is one of three numerical values that divide a group of numbers into 4 equal parts.
2) The Interquartile Range (IQR) is the difference between the first and third quartiles: IQR = Q3 – Q1

Quartiles Example
26 28 34 36 38 38 40 41 41 44 45 46 51 54 55
range = 55 – 26 = 29
Q2 = 41 (median)
Q1 = 36 (median of lower half of data)
Q3 = 46 (median of upper half of data)
IQR = Q3 – Q1 = 46 – 36 = 10 (contains the middle 50% of the data)
If a quartile occurs between 2 values, it is calculated as the average of the two values.

Quartiles Example
26 28 34 36 38 38 40 41 44 45 46 51 54 55
range = 55 – 26 = 29
Q2 = 40.5 (median, the average of 40 and 41)
Q1 = 36 (median of lower half of data)
Q3 = 46 (median of upper half of data)
IQR = Q3 – Q1 = 46 – 36 = 10 (contains the middle 50% of the data)

A More Useful Measure of Spread
Range is a very basic measure of spread.
Interquartile range is a somewhat useful measure of spread.
Standard deviation is more useful.
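As a check on the two quartiles examples, here is a minimal Python sketch of the median-of-halves method described in the slides. The function names `quartiles` and `spread_summary` are illustrative only and do not come from the textbook. Fathom, Excel and some calculators use other quartile conventions and may give slightly different values for Q1 and Q3.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values. With an odd number of values the
    median itself is left out of both halves.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # Skip the middle value when n is odd, keep everything when n is even.
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)


def spread_summary(data):
    """Return the range, the three quartiles and the IQR as a dict."""
    q1, q2, q3 = quartiles(data)
    return {
        "range": max(data) - min(data),
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "IQR": q3 - q1,
    }


# The two data sets from the quartiles examples (15 and 14 values).
odd_set = [26, 28, 34, 36, 38, 38, 40, 41, 41, 44, 45, 46, 51, 54, 55]
even_set = [26, 28, 34, 36, 38, 38, 40, 41, 44, 45, 46, 51, 54, 55]

print(spread_summary(odd_set))   # range 29, Q1 36, Q2 41, Q3 46, IQR 10
print(spread_summary(even_set))  # range 29, Q1 36, Q2 40.5, Q3 46, IQR 10
```

Both sets give the same range and IQR; only the median changes, because the even-sized set has no single middle value.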
To calculate the standard deviation, we need to find the mean and the deviation for each data point.
Mean is easy, as we have done that before.
Deviation is the difference between a particular point and the mean.

Deviation
[Figure: number line showing 12, 24, 36, 48, 60, 72, 84, with the deviations for 24 and 84 marked]
The mean of these numbers is 48.
Deviation = (data) – (mean)
The deviation for 24 is 24 – 48 = -24
The deviation for 84 is 84 – 48 = 36

Standard Deviation
Deviation is the difference between the piece of data you are examining and the mean (it can be negative).
Variance is a measure of spread found by averaging the squares of the deviations calculated for each piece of data.
Taking the square root of the variance gives the standard deviation.
Standard deviation is a very important and useful measure of spread.

Example of Standard Deviation
26 28 34 36
mean = (26 + 28 + 34 + 36) / 4 = 31
σ² = [(26 – 31)² + (28 – 31)² + (34 – 31)² + (36 – 31)²] / 4
σ² = (25 + 9 + 9 + 25) / 4
σ² = 17
σ = √17 ≈ 4.1

Measures of Spread - Recap
Measures of spread are numbers indicating how spread out / consistent data is.
Smaller measure of spread = more consistent data
1) Range = (max) – (min)
2) Interquartile Range: IQR = Q3 – Q1, where Q1 = first half median and Q3 = second half median
3) Standard Deviation
Find the mean (average)
Find the deviations (data – mean)
Square all the deviations and average them; this is the variance, σ²
Take the square root to get the standard deviation, σ

Standard Deviation
σ² (lower case sigma squared) is used to represent variance
σ is used to represent standard deviation
σ is commonly used to measure the spread of data, with larger values of σ indicating greater spread
We are using a population standard deviation:
$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$

Standard Deviation with Grouped Data
Hours of TV   2   3   4   5
Frequency     2   6   6   2
grouped mean = (2×2 + 3×6 + 4×6 + 5×2) / 16 = 3.5
deviations:
2: 2 – 3.5 = -1.5
3: 3 – 3.5 = -0.5
4: 4 – 3.5 = 0.5
5: 5 – 3.5 = 1.5
σ² = [2(-1.5)² + 6(-0.5)² + 6(0.5)² + 2(1.5)²] / 16
σ² = 0.75
σ = √0.75 ≈ 0.9
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$

MSIP / Homework
Read through the examples on pages 164-167
Complete p. 
168 #2b, 3b, 4, 6, 7, 10
You are responsible for knowing how to do simple examples by hand (~6 pieces of data)
We will use technology (Fathom/Excel) to calculate larger examples
Have a look at your calculator and see if you have this feature (Σσn and Σσn-1)

Normal Distribution
Chapter 3.4 – Tools for Analyzing Data
Learning goal: Determine the % of data within intervals of a Normal Distribution
MSIP / Home Learning: p. 176 #1, 3b, 6, 8-10

Histograms
Histograms may be skewed...
[Figure: a right-skewed histogram and a left-skewed histogram]

Histograms
... or symmetrical
[Figure: symmetrical histogram of Collection 1, Count against a]

Normal?
A normal distribution creates a histogram that is symmetrical and bell-shaped, and is used quite a bit in statistical analyses.
Also called a Gaussian Distribution.
It is symmetrical, with equal mean, median and mode that fall on the line of symmetry of the curve.

A Real Example
The heights of 600 randomly chosen Canadian students from the “Census at School” data set
The data approximates a normal distribution
[Figure: histogram "600 Student Heights", Density against Heightcm, with an overlaid normalDensity curve]

The 68-95-99.7% Rule
The area under the curve is 1 (i.e. it represents 100% of the population surveyed)
Approx 68% of the data falls within 1 standard deviation of the mean
Approx 95% of the data falls within 2 standard deviations of the mean
Approx 99.7% of the data falls within 3 standard deviations of the mean
http://davidmlane.com/hyperstat/A25329.html

Distribution of Data
$X \sim N(\bar{x}, \sigma^2)$
[Figure: normal curve marked at x̄ – 3σ, x̄ – 2σ, x̄ – 1σ, x̄, x̄ + 1σ, x̄ + 2σ, x̄ + 3σ]
68% of the data lies within 1σ of the mean, 95% within 2σ and 99.7% within 3σ
34% in each band between x̄ and x̄ ± 1σ
13.5% in each band between 1σ and 2σ from the mean
2.35% in each band between 2σ and 3σ from the mean
0.15% in each tail beyond 3σ

Normal Distribution Notation
$X \sim N(\bar{x}, \sigma^2)$
The notation above is used to describe the Normal distribution, where x̄ is the mean and σ² is the variance (square of the standard deviation)
e.g. X ~ N(70, 8²) describes a Normal distribution with mean 70 and standard deviation 8 (our class at midterm?)
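The two ingredients of the notation, a mean and a standard deviation, can be computed and then turned into 68-95-99.7 intervals with a short Python sketch. This is an illustration under stated assumptions, not the textbook's method: the function names `population_std_dev`, `empirical_rule_intervals` and `share_within` are made up here, and the data are the small examples from Chapter 3.3 plus the N(70, 8²) class example. `NormalDist` comes from Python's standard `statistics` module (Python 3.8 or later).

```python
from math import sqrt
from statistics import NormalDist


def population_std_dev(values, freqs=None):
    """Return (mean, sigma) for a population, optionally with frequencies.

    Follows the hand method: find the mean, find each deviation,
    average the squared deviations (variance), take the square root.
    """
    if freqs is None:
        freqs = [1] * len(values)  # ungrouped data: every value counts once
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return mean, sqrt(variance)


def empirical_rule_intervals(mean, sd):
    """Return the 1, 2 and 3 standard deviation intervals around the mean."""
    labels = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in labels.items()}


def share_within(k):
    """Exact share of a Normal distribution within k standard deviations."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)


# Four-value example: mean 31, variance 17.
print(population_std_dev([26, 28, 34, 36]))

# Grouped TV-hours example: mean 3.5, variance 0.75.
print(population_std_dev([2, 3, 4, 5], freqs=[2, 6, 6, 2]))

# X ~ N(70, 8²): 62 to 78, 54 to 86, 46 to 94.
print(empirical_rule_intervals(70, 8))

# The rule's percentages are rounded: about 0.6827, 0.9545 and 0.9973.
print([round(share_within(k), 4) for k in (1, 2, 3)])
```

The last line shows why the rule is stated as "approx": the exact shares are slightly different from 68%, 95% and 99.7%.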
An example
Suppose the time before burnout for an LED averages 120 months with a standard deviation of 10 months and is approximately Normally distributed.
What is the length of time a user might expect an LED to last with 68% confidence? With 95% confidence?
So X ~ N(120, 10²)

An example cont’d
68% of the data will be within 1 standard deviation of the mean
This means that 68% of the bulbs will last between 120 – 10 months and 120 + 10 months
So 68% of the bulbs will last 110 - 130 months
95% of the data will be within 2 standard deviations of the mean
This means that 95% of the bulbs will last between 120 – 2×10 months and 120 + 2×10 months
So 95% of the bulbs will last 100 - 140 months

Example continued…
Suppose you wanted to know how long 99.7% of the bulbs will last?
This is the area covering 3 standard deviations on either side of the mean
This means that 99.7% of the bulbs will last between 120 – 3×10 months and 120 + 3×10 months
So 99.7% of the bulbs will last 90 - 150 months
This assumes that all the bulbs are produced to the same standard

Example continued…
[Figure: normal curve marked at 90, 100, 120, 140 and 150 months, with the 95% and 99.7% regions and the 34%, 13.5% and 2.35% bands on each side of the mean]

Percentage of data between two values
The area under any normal curve is 1
The percent of data that lies between two values in a normal distribution is equivalent to the area under the normal curve between these values
See examples 2 and 3 on page 175

Why is the Normal distribution so important?
Many psychological and educational variables are distributed approximately normally: height, reading ability, memory, IQ, etc.
Normal distributions are statistically easy to work with
All kinds of statistical tests are based on it
Lane (2003)

Exercises
Complete p. 176 #1, 3b, 6, 8-10
http://onlinestatbook.com/

References
Lane, D. (2003). What's so important about the normal distribution? Retrieved October 5, 2004 from http://davidmlane.com/hyperstat/normal_distribution.html
Wikipedia (2004). Online Encyclopedia. Retrieved September 1, 2004 from http://en.wikipedia.org/wiki/Main_Page