Class 3 Lecture: Descriptive Statistics 2

Sociology 5811: Lecture 3: Measures of Central Tendency and Dispersion
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission

Announcements
• First math problem set will be handed out in Lab on Monday…
• Due September 20

Today’s Class:
• The Mean (and relevant mathematical notation)
• Measures of Dispersion

Review: Variables / Notation
• Each column of a dataset is considered a variable
• We’ll refer to a column generically as “Y”

Person    # Guns owned
1         0
2         3
3         0
4         1
5         1

• The “# Guns owned” column is the variable “Y”
• Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.

Equation of Mean: Notation
• Each case can be identified by a subscript
• $Y_i$ represents the “ith” case of variable Y
• i goes from 1 to N
• $Y_1$ = value of Y for first case in spreadsheet
• $Y_2$ = value for second case, etc.
• $Y_N$ = value for last case

Person    # Guns owned (Y)
1         $Y_1$ = 0
2         $Y_2$ = 3
3         $Y_3$ = 0
4         $Y_4$ = 1
5         $Y_5$ = 1

Calculating the Mean
• Equation: $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$
• 1. Mean of variable Y represented by Y with a line on top – called “Y-bar”
• 2. Equals sign means equals: “is calculated by the following…”
• 3. N refers to the total number of cases for which there is data
• Summation (Σ) – will be explained next…

Equation of Mean: Summation
• Sigma (Σ): Summation
– Indicates that you should add up a series of numbers
• $\sum_{i=1}^{N} Y_i$
– The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.
– The thing on the right is the item to be added repeatedly

Equation of Mean: Summation
• $\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5$
• 1. Start with bottom: i = 1.
– The first number to add is Y-sub-1
• 2. Then, allow i to increase by 1
– The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)
• 3. Keep adding numbers until i = N
– In this case N=5, so stop at 5

Equation of the Mean: Example 2
• Can you calculate the mean for gun ownership?

Person    # Guns owned (Y)
1         $Y_1$ = 0
2         $Y_2$ = 3
3         $Y_3$ = 0
4         $Y_4$ = 1
5         $Y_5$ = 1

• Equation: $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$
• Answer: $\bar{Y} = \frac{1}{5}(0 + 3 + 0 + 1 + 1) = \frac{1}{5} \cdot 5 = 1$

Properties of the Mean
• The mean takes into account the value of every case to determine what is “typical”
– In contrast to the mode & median
– Probably the most commonly used measure of “central tendency”
• But, it is often good to look at median & mode also!
• Disadvantages
– Every case influences outcome… even unusual ones
– Extreme cases affect results a lot
– The mean doesn’t give you any information on the shape of the distribution
• Cases could be very spread out, or very tightly clustered

The Mean and Extreme Values
• Extreme values affect the mean a lot:

Case    Num CDs    Num CDs 2
1       20         20
2       40         40
3       0          0
4       70         1000
Mean    32.5       265

• Changing this one case (case 4, from 70 to 1000) really affects the mean a lot

Example 1
• And, very different groups can have nearly the same mean:
[Histogram: Number of CDs (Group 1). Mean = 101, Std. Dev = 21.72, N = 23]

Example 2
[Histogram: Number of CDs (Group 2). Mean = 100.0, Std. Dev = 67.62, N = 23]

Example 3
[Histogram: Number of CDs (Group 3). Mean = 104, Std. Dev = 102.15, N = 23]

Interpreting Dispersion
• Question: What are possible social interpretations of the different distributions (all with roughly the same mean)?
• Example 1: Individuals cluster around 100
• Example 2: Individuals distributed sporadically over range 0-200
• Example 3: Individuals in two groups – near zero and near 200

Measures of Dispersion
• Remember: Goal is to understand your variable…
• Center of the distribution is only part of the story
• Important issue:
• How “spread out” are the cases around the mean?
– How “dispersed”, “varied” are your cases?
– Are most cases like the “typical” case? Or not?

Measures of Dispersion
• Some measures of dispersion:
• 1. Range
– Also related: Minimum and Maximum
• 2. Average Absolute Deviation
• 3. Variance
• 4. Standard deviation

Minimum and Maximum
• Minimum: the lowest value of a variable represented in your data
• Maximum: the highest value of a variable represented in your data
• Example: In previous histograms about number of CDs owned, the minimum was 0, the maximum was 200.

The Range
• The Range is calculated as the maximum minus the minimum
– In case of CD ownership, 200 - 0 = 200
• Advantage:
– Easy
• Disadvantage:
– 1. Easily influenced by extreme values… may not be representative
– 2. Doesn’t tell you anything about the middle cases

The Idea of Deviation
• Deviation: How much a particular case differs from the mean of all cases
• Deviation of zero indicates the case has the same value as the mean of all cases
– Positive deviation: case has higher value than mean
– Negative deviation: case has lower value than mean
• Extreme positive/negative deviations indicate cases further from the mean.

Deviation of a Case
• Formula: $d_i = Y_i - \bar{Y}$
• Literally, it is the distance from the mean (Y-bar)

Deviation Example

Case    Num CDs    Deviation from mean (32.5)
1       20         -12.5
2       40         7.5
3       0          -32.5
4       70         37.5

Turning the Deviation into a Useful Measure of Dispersion
• Idea #1: Add it all up
– The sum of deviation for all cases: $\sum_{i=1}^{N} d_i$
• What is sum of the following? -12.5, 7.5, -32.5, 37.5
• Problem: Sum of deviation is always zero
– Because mean is the exact center of all cases
– Cases equally deviate positively and negatively
– Conclusion: You can’t measure dispersion this way

Turning the Deviation into a Useful Measure of Dispersion
• Idea #2: Sum up “absolute value” of deviation
– Absolute value makes negative values positive
– Designated by vertical bars: $\sum_{i=1}^{N} |d_i|$
• What is sum? -12.5, 7.5, -32.5, 37.5
• Answer: 90
– These 4 cases deviate by a total of 90 CDs from the mean
• Problem: Sum of Absolute Deviation grows larger if you have more cases…
– Doesn’t allow comparison across samples

Turning the Deviation into a Useful Measure of Dispersion
• Idea #3: The Average Absolute Deviation
– Calculate the sum, divide by total N of cases
– Gives the average size of a case’s deviation
• Formula: $AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$

Turning the Deviation into a Useful Measure of Dispersion
• Digression: Here we have used the mean to determine “typical” size of case deviations
– Originally, I introduced the mean as a way to analyze actual case values (e.g. # of CDs owned)
– Now: Instead of looking at typical case values, we want to know what sort of deviation is typical
• In other words a statistic, the mean, is being used to analyze another statistic – a deviation
– This is a general principle that we will use often: statistics can help us understand our raw data and also further summarize our statistical calculations!

Average Absolute Deviation
• Example: Total Deviation = 90, N=4
– What is the Average Absolute Deviation?
– Answer: 90 / 4 = 22.5
• Advantages
– Very intuitive interpretation:
• Tells you how much cases differ from the mean, on average
• Disadvantages
– Has non-ideal properties, according to statisticians

Turning the Deviation into a Useful Measure of Dispersion
• Idea #4: Square the deviation to avoid problem of negative values
– Sum of “squared” deviation
– Divide by “N-1” (instead of N) to get the average
• Result: The “variance”: $s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$

Calculating the Variance 1

Case    Num CDs (Y)
1       20
2       40
3       0
4       70

Calculating the Variance 2

Case    Num CDs (Y)    Mean (Y-bar)
1       20             32.5
2       40             32.5
3       0              32.5
4       70             32.5

Calculating the Variance 3

Case    Num CDs (Y)    Mean (Y-bar)    Deviation (d)
1       20             32.5            -12.5
2       40             32.5            7.5
3       0              32.5            -32.5
4       70             32.5            37.5

Calculating the Variance 4

Case    Num CDs (Y)    Mean (Y-bar)    Deviation (d)    Squared Deviation ($d^2$)
1       20             32.5            -12.5            156.25
2       40             32.5            7.5              56.25
3       0              32.5            -32.5            1056.25
4       70             32.5            37.5             1406.25

Calculating the Variance 5
• Variance = Average of “squared deviation”
– Average = mean = sum up, divide by N
– In this case, use N-1
• Sum of 156.25 + 56.25 + 1056.25 + 1406.25 = 2675
• Divide by N-1
– N-1 = 4-1 = 3
• Compute variance:
• 2675 / 3 = 891.7 = variance = $s^2$

The Variance
• Properties of the variance
– Zero if all points cluster exactly on the mean
– Increases the further points lie from the mean
– Comparable across samples of different size
• Advantages
– 1. Provides a good measure of dispersion
– 2. Better mathematical characteristics than the AAD
• Disadvantages:
– 1. Not as easy to interpret as AAD
– 2. Values get large, due to “squaring”

Turning the Deviation into a Useful Measure of Dispersion
• Idea #5: Take square root of Variance to shrink it back down
• Result: Standard Deviation
– Denoted by lower-case s
– Most commonly used measure of dispersion
• Formula: $s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$

Calculating the Standard Deviation
• Simply take the square root of the variance
• Example:
– Variance = 891.7
– Square root of 891.7 = 29.9
• Properties:
– Similar to Variance
– Zero for perfectly concentrated distribution
– Grows larger if cases are spread further from the mean
– Comparable across different sample sizes

Example 1: s = 21.72
[Histogram: Number of CDs (Group 1). Mean = 101, Std. Dev = 21.72, N = 23]

Example 2: s = 67.62
[Histogram: Number of CDs (Group 2). Mean = 100.0, Std. Dev = 67.62, N = 23]

Example 3: s = 102.15
[Histogram: Number of CDs (Group 3). Mean = 104, Std. Dev = 102.15, N = 23]

Thinking About Dispersion
• Suppose we observe that the standard deviation of wealth is greater in the U.S. than in Sweden…
– What can we conclude about the two countries?
• Guess which group has a higher standard deviation for income: Men or Women? Why?
• The standard deviation of a stock’s price is sometimes considered a measure of “risk”. Why?
• Suppose we polled people on two political issues and the S.D. was much higher for one
• What are some possible interpretations?
• What are some other examples where the deviation would provide useful information?
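
The calculations in this lecture (mean, deviation, Average Absolute Deviation, variance, standard deviation) can be checked with a short script. This is only a minimal sketch in Python, which the lecture itself does not use; the function names and the `cds` list are illustrative choices, and the data are the four CD-ownership cases (20, 40, 0, 70).

```python
import math


def mean(values):
    """Y-bar: add up every case and divide by N."""
    return sum(values) / len(values)


def deviations(values):
    """d_i = Y_i - Y-bar for each case."""
    y_bar = mean(values)
    return [y - y_bar for y in values]


def average_absolute_deviation(values):
    """Sum of |d_i| divided by N."""
    return sum(abs(d) for d in deviations(values)) / len(values)


def variance(values):
    """Sum of squared deviations divided by N - 1."""
    return sum(d ** 2 for d in deviations(values)) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical variable name; the values are the four CD cases.
cds = [20, 40, 0, 70]

print(mean(cds))                          # 32.5
print(deviations(cds))                    # [-12.5, 7.5, -32.5, 37.5]
print(sum(deviations(cds)))               # 0.0
print(average_absolute_deviation(cds))    # 22.5
print(round(variance(cds), 1))            # 891.7
print(round(standard_deviation(cds), 1))  # 29.9
```

Note that squaring -12.5 gives 156.25, so the squared deviations sum to 2675 and the variance is 2675 / 3, or about 891.7. Calling `mean([20, 40, 0, 1000])` returns 265.0, which reproduces the extreme-value example.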

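The claim under Idea #1 that the deviations always sum to zero follows from the definition of the mean. A short derivation, using only the fact that $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ implies $\sum_{i=1}^{N} Y_i = N\bar{Y}$:

```latex
\begin{align*}
\sum_{i=1}^{N} d_i
  &= \sum_{i=1}^{N} (Y_i - \bar{Y}) \\
  &= \sum_{i=1}^{N} Y_i - N\bar{Y} \\
  &= N\bar{Y} - N\bar{Y} = 0
\end{align*}
```

This holds for any dataset, which is why the later ideas take absolute values or squares before summing.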