Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Standard Deviation as a Ruler and the Normal Model Normal curve, z-scores and calculating probabilities The Shape and Distribution of Data Data sets consisting of physical measurements (heights, weights, lengths of bones, heart rates, blood pressures and so on) for adults of the same species and sex tend to follow a similar pattern. The pattern is that most individuals are clumped around the average, with numbers decreasing the farther values are from the average in either direction. 30 Sample Distribution of Men's Height in U.S. Sample Size = 200 25 Sample mean ( x)= 69.65 Sample standard deviation(s) = 3.36 Frequency 20 15 10 5 0 60.00 62.00 64.00 66.00 68.00 70.00 72.00 74.00 76.00 78.00 80.00 82.00 Men's Height - Sample Size n = 200 We use a theoretical distribution, instead of the sample distribution. This allows us to answer certain questions using only the mean and standard deviation of the data Example: The distribution of hemoglobin measures Hemoglobin: Substance in the blood that carries oxygen Swedish Men: Aged 35 - 50 Events that effects Hemoglobin levels Event Effect on hemoglobin Has a gene for producing a lot of hemoglobin Increase Just returned from a trip to the mountains Increase Has been eating a poor diet lately Decrease Recent illness Decrease If we assume that everyone has a 50:50 chance of each event and that each event leads to an increase or decrease by about the same amount, the histogram of hemoglobin levels will have a symmetric, normal shape. Distribution of car insurance damage claims The measurements follow a right skewed distribution. Majority of claims were below $5,000, but there were occasionally a few extremely high claims. Normal Distribution (or Normal Curve) 1. Naturally occurring distribution: Many populations of measurements follow approximately a normal curve: • Physical measurements within a homogeneous population – heights of male adults. • Standard academic tests given to a large group – SAT scores. Normal Distribution (or Normal Curve) Proportion of population of measurements falling in a certain range = area under curve over that range. Knowing only the mean and standard deviation, we can use the assumption of normality to answer certain questions. Example: The probability that a British man will be between 65 and 70 inches. The Standard Deviation as a Ruler The trick in comparing very different-looking values is to use standard deviations as our rulers. The standard deviation tells us how the whole collection of values varies, so it’s a natural ruler for comparing an individual to a group. As the most common measure of variation, the standard deviation plays a crucial role in how we look at data. Characteristics of a Normal Curve Symmetric Empirical and Bell-Shaped Rule For any normal curve, approximately … 68% of the values fall within 1 standard deviation of the mean in either direction 95% of the values fall within 2 standard deviations of the mean in either direction 99.7% of the values fall within 3 standard deviations of the mean in either direction Empirical Rule for Any Normal Curve 68% -1sd 95% +1sd -2 sd +2 sd 99.7% -3 sd +3 sd Standardizing with z-scores We compare individual data values to their mean, relative to their standard deviation using the following formula: x x z s We call the resulting values standardized values, denoted as z. They can also be called z-scores. Standardizing with z-scores Standardized values have no units. z-scores measure the distance of each data value from the mean in standard deviations. A negative z-score tells us that the data value is below the mean, while a positive z-score tells us that the data value is above the mean. Benefits of Standardizing Standardized values have been converted from their original units to the standard statistical unit of standard deviations from the mean. Thus, we can compare values that are measured on different scales, with different units, or from different populations. Standardized Score A “standardized score” is simply the number of standard deviations an individual falls above or below the mean for the whole group. Example Male heights have a mean of 70 inches and a standard deviation of 3 inches. Female heights have a mean of 65 inches and a standard deviation of 2 ½ inches. Thus, a man who is 73 inches tall has a standardized score of 1. Thought Question: What is the standardized score corresponding to your own height? Standardizing with z-scores Standardized Score (or z-score) is : observed value – mean standard deviation Note: The term z-score relates only to standard normal curves Example: Male heights have a mean of 70 inches and a Standard deviation of 3 inches. If my height is 71.5 inches, then my standardized score is 71.5 – 70/3 = 0.50 More on z-scores Standardizing data into z-scores shifts the data by subtracting the mean and rescales the values by dividing by their standard deviation. Standardizing into z-scores does not change the shape of the distribution. Standardizing into z-scores changes the center by making the mean 0. Standardizing into z-scores changes the spread by making the standard deviation 1. When is a z-score BIG? A z-score gives us an indication of how unusual a value is because it tells us how far it is from the mean. A data value that sits right at the mean, has a z-score equal to 0. A z-score of 1 means the data value is 1 standard deviation above the mean. A z-score of –1 means the data value is 1 standard deviation below the mean. Describing Populations with Theoretical Normal Models There is a Normal model for every possible combination of mean and standard deviation. We write N(μ,σ) to represent a Normal model with a mean of μ and a standard deviation of σ. We use Greek letters because this mean and standard deviation are not numerical summaries of the data. They are part of the model. They don’t come from the data. They are numbers that we choose to help specify the model. Such numbers are called parameters of the model. Describing Populations with Theoretical Normal Models When we standardize using theoretical normal model N(μ,σ), we still call the standardized value a z-score, and we write z y Heights of Adult Women in US N(μ = 65, σ=2.5) z What y is z-score for height of 68 inches? Using the Normal Model Once we have standardized, we need only one model: The N(0,1) model is called the standard Normal model (or the standard Normal distribution). Be careful—don’t use a Normal model for just any data set, since standardizing does not change the shape of the distribution. The Standard Normal Curve 1. The standard normal curve has a mean of 0 and standard deviation of 1. 2. We convert our data values to the their corresponding values on the standard normal curve. 3. This enables us to get exact probabilities of certain events. Using the Normal Model When we use the Normal model, we are assuming the population distribution is Normal. We cannot check this assumption in practice, so we check the following condition: Nearly Normal Condition:The shape of the sample data’s distribution is unimodal and symmetric. This condition can be checked with a histogram or a Normal probability plot. NHANES of 1976-1980 The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. Heights of adults, ages 18-24 women mean: 65.0 inches standard deviation: 2.5 inches men mean: 70.0 inches standard deviation: 2.8 inches NHANES of 1976-1980 Empirical Rule Women: mean: 65.0 inches, standard deviation: 2.5 inches 68% are between 62.5 and 67.5 inches [mean 1 std dev = 65.0 2.5] 95% are between 60.0 and 70.0 inches 99.7% are between 57.5 and 72.5 inches Thought Question Men: mean: 70.0 inches, standard deviation: 2.8 inches 95% are between A. [64.4, 75.6 ] B. [67.2,72.8] C. [61.6, 78.4 ] Using the Mean, Standard Deviation and the Normal Model With the Mean and Standard Deviation of the Normal Distribution we can determine: 1. At what percentile a given individual falls, if you know their value? Example: What proportion (or percentage) of men are less than 72.8 inches tall? 2. What proportion of individuals fall into any range of values? Example: What proportion (or percentage) of men are between 68 inches tall and 74 inches tall? 3. What value corresponds to a given percentile? Example: What height value is the 10th percentile for men ages 18 to 24? NHANES of 1976-1980 At what percentile a given individual falls, if you know their value? Q. What proportion (or percentage) of men are less than 72.8 inches tall? 68% (using the Empirical Rule) 1. 32% / 2 = ? 16% -1 standard deviations: height values: +1 70 72.8 (height values) -3 -2 -1 0 1 2 3 61.6 64.4 67.2 70 72.8 75.6 78.4 NHANES of 1976-1980 Q. What proportion (or percentage) of men are less than 68 inches tall? Note: Here we can’t use the Empirical Rule ? standard deviations: height values: 68 70 (height values) -3 -2 -1 0 1 2 3 61.6 64.4 67.2 70 72.8 75.6 78.4 NHANES: Calculate Standardized Score Q. What proportion of men are less than 68 inches tall? How many standard deviations is 68 from 70? standardized score = (observed value minus mean) / (std dev) = (68 70) / 2.8 = 0.71 The value 68 is 0.71 standard deviations below the mean 70. NHANES of 1976-1980 What proportion (or percentage) of men are less than 68 inches tall? ? 68 70 (height values) -0.71 0 (standardized values) What is the probability of a standardized-score less than -0.71? NHANES of 1976-1980 What proportion (or percentage) of men are less than 68 inches tall? 24% 68 70 (height values) -0.71 0 (standardized values) The proportion (or percentage) of men less than 68 inches tall is 0.24 (or 24%) Question What proportion of mean are 74 inches or shorter? (hint: the standardized score for 74 inches is 1.43) NHANES of 1976-1980 2. What proportion (or percentage) of individuals fall into any range of values? Q. What proportion (or percentage) of men are between 68 inches tall and 74 inches tall? First, draw and label a normal curve. standard deviations: height values: -3 -2 -1 0 1 2 3 61.6 64.4 67.2 70 72.8 75.6 78.4 NHANES of 1976-1980 Shade on the graph the range of heights between 68 and 74 inches. ? (height values) (standardized values) 68 70 74 -0.71 0 1.43 NHANES of 1976-1980 24% of the men are less than 68 inches tall and 92.4% of men are less than 74 inches tall What percentage are between 68 inches tall and 74 inches tall? 92.4% ? 24% (height values) 68 70 74 (standardized values) -0.71 0 1.43 NHANES of 1976-1980 Q. What proportion (or percentage) of men are between 68 and 74 inches tall? 92.4% 92.4% – 24% = 68.4% 24% (height values) (standardized values) 68 70 74 -0.71 0 1.43 The proportion (or percentage) of men between 68 inches tall and 74 inches tall is 0.684 (or 68.4%) NHANES of 1976-1980 3. What value corresponds to a given percentile? Q. What height value is the 10th percentile for men ages 18 to 24? 10% ? 70 (height values) Look up the closest percentile in the table. Find the corresponding standardized score. The value you seek is that many standard deviations from the mean. Normal Tables NHANES of 1976-1980 What height value is the 10th percentile for men ages 18 to 24? 10% ? -1.28 70 (height values) 0 (standardized values) NHANES of 1976-1980 Observed Value for a Standardized Score What height value is the 10th percentile for men ages 18 to 24? observed value = mean plus [(standardized score) (std dev)] = 70 + [(1.28 ) (2.8)] = 70 + (3.58) = 66.36 The value 66.36 is approximately the 10th percentile of the population. Summary 1. At what percentile a given individual falls, if you know their value? 2. What proportion of individuals fall into any range of values? 3. If possible, use Empirical Rule Otherwise calculate standardized score Look up normal tables to find percentile Calculate standardized score for lowest and highest value for range Look up normal tables to find percentile below both values Subtract the larger percentile from the smaller percentile What value corresponds to a given percentile? Look up the closest percentile in the table Find the corresponding standardized score The value you seek is that many standard deviations from the mean How to check data for Normality When you actually have your own data, you must check to see whether a Normal model is reasonable. Looking at a histogram of the data is a good way to check that the underlying distribution is roughly unimodal and symmetric. How to check data for Normality A more specialized graphical display that can help you decide whether a Normal model is appropriate is the Normal probability plot. If the distribution of the data is roughly Normal, the Normal probability plot approximates a diagonal straight line. Deviations from a straight line indicate that the distribution is not Normal. How to check data for Normality Nearly Normal data have a histogram and a Normal probability plot that look somewhat like this example: How to check data for Normality A skewed distribution might have a histogram and Normal probability plot like this: SAS - PROC UNIVARIATE title "Demonstrating MIDPOINT= Histogram Option"; proc univariate data=example.Blood_Pressure; id Subj; var SBP; histogram / normal midpoints=100 to 170 by 5; probplot / normal(mu=est sigma=est); run; PROC UNIVARIATE - HISTOGRAM SAS – Normal Probaility Plot Heart Rate Example Sex HR Sex HR Sex HR Sex HR Sex HR Sex HR Sex HR F 55 M 66 F 70 M 73 F 77 M 79 F 82 M 57 F 67 F 70 M 73 F 77 M 79 M 82 M 59 F 67 M 70 M 73 F 77 M 79 F 83 F 61 F 68 M 70 M 73 M 77 F 80 M 83 M 61 F 68 F 71 F 74 M 77 F 80 M 83 M 62 F 68 F 71 F 74 F 78 M 80 F 84 M 62 M 68 M 71 F 74 F 78 F 81 F 84 F 63 F 69 M 71 M 74 F 78 F 81 M 85 F 64 M 69 F 72 F 75 F 78 F 81 F 86 M 64 M 69 M 72 F 75 M 78 M 81 F 86 M 64 M 69 F 73 M 75 M 78 F 82 M 89 M 66 F 70 M 73 M 76 M 79 F 82 M 89 For the heart rate data for 84 adults: Mean HR = 74.0 bpm SD = 7.5 bpm Mean 1SD = 74.0 7.5 = 66.5-81.5 bpm 25 Mean 2SD = 74.0 15.0 = 59.0-89.0 bpm Frequency 20 15 10 Mean 3SD = 74.0 22.5 = 51.5-96.5 bpm 5 0 // 55 60 65 70 75 80 Heart Rate (BPM) 85 90 Heart Rate Example HR Data: 57/84 (67.9%) subjects are between mean ± 1SD 82/84 (97.6%) are between mean ± 2SD 84/84 (100%) are between mean ± 3SD 100 +3 SD 95 +2 SD 90 Heart rate (bpm) 85 + 1SD 80 Mean 75 70 -1 SD 65 -2 SD 60 55 -3 SD 50 45 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 Subject number Question