4/7/12

[Figure: Empirical Loop (Research Design, Hypothesis, Collect Data, Descriptive Statistics, Inferential Statistics)]

Lecture Flow Chart: Data; Frequency Distributions (classes 5-7 ft and 3-5 ft, frequencies 16 and 13); Graphs; Measures of Central Tendency ("On average, these students are 5.2 ft tall.")

Chapter 3: Describing Data with Averages
1. Quantitative Data
   A. Mode
   B. Median
   C. Mean
   D. Which Average?
2. Qualitative & Rank Data
   A. Nominal vs. Ordinal

A. Mode
Mode: The value of the most frequent observation

Table 3.1: Terms in years of the last 20 U.S. Presidents
4 4 4 8 4 8 2 6 4 12 8 8 2 6 5 3 4 8 4 8

[Figure 3.1, panel B: Bimodal]

B. Median
Median: The value that is greater than or equal to 50% of the observations

Numbers of Siblings (as collected): 2 5 2 3 0
Numbers of Siblings (ordered): 0 2 2 3 5
Middle: (5+1)/2 = 3

How to Calculate the Median
1. Order observations from least to most.
2. Find the middle position by adding one to the total number of observations and dividing by two.
3. If the middle position is a whole number, then the value at the middle position is the median.
4. If the middle position is NOT a whole number, then add the number immediately above and the number immediately below the middle position and divide by two. The result is the median.
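The mode definition and the four median steps above can be sketched in Python. This is a minimal illustration, not code from the lecture: the function names `modes` and `median_by_position` are made up here, and returning every tied value as a list is an assumption chosen so that bimodal data are handled.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def median_by_position(values):
    """Median via the middle-position rule from the lecture steps."""
    ordered = sorted(values)              # step 1: order least to most
    middle = (len(ordered) + 1) / 2       # step 2: 1-based middle position
    if middle.is_integer():               # step 3: whole-number position
        return ordered[int(middle) - 1]
    below = ordered[int(middle) - 1]      # step 4: value just below the middle
    above = ordered[int(middle)]          #         and the value just above it
    return (below + above) / 2


siblings = [2, 5, 2, 3, 0]
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]
print("siblings:", modes(siblings), median_by_position(siblings))
print("terms:", modes(terms), median_by_position(terms))
```

Both functions assume a non-empty list; an empty list raises an error.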
Example: 0 2 2 3 5 6
Middle: (6+1)/2 = 3.5
Median = (2+3)/2 = 2.5

Compute the mode and median of the following sets of data
A) 2, 2, 8
B) 2, 3, 5, 5
C) 20.3, 22.7, 21.4

Answers
A) 2, 2, 8: Mode=2, Median=2
B) 2, 3, 5, 5: Mode=5, Median=4
C) 20.3, 21.4, 22.7: Mode=?, Median=21.4

C. Mean
Mean: What people usually think of as the "average."
\[ \bar{x} = \frac{\sum_i x_i}{n} \]

Numbers of Siblings: 0 2 2 3 5
0+2+2+3+5 = 12, n = 5, 12/5 = 2.4

Mean: The "balance point" of a sample. [Figure 3.3]

Compute the mean of the following sets of data
A) 2, 2, 8: Mean=4
B) 2, 2, 800: Mean=268

Compute the mode, median, and mean of the following sets of data
A) 2, 2, 8: Mode=2, Median=2, Mean=4
B) 2, 2, 800: Mode=2, Median=2, Mean=268

D. Which Average?
Symmetric Unimodal Distributions: Mode=Median=Mean
Bimodal/Multimodal Distributions: report the modes
+/- Skewed Unimodal Distributions

Self-Defense: Politics
Proponents of the Bush administration's 2003 tax cut proclaimed that, on average, families would receive a tax cut of $1000. Opponents of the cut countered that more than half of all families would receive a tax cut of less than $100.
Source: Best, J. (2004) More Damned Lies and Statistics

In 1984, the University of Virginia announced that its department of rhetoric and communications graduates' MEAN STARTING SALARY was $55,000. This was highly skewed by the salary of one graduate, NBA center Ralph Sampson.
Source: Gonick, L. & Woollcott, S. (1993) The Cartoon Guide to Statistics
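The "Which Average?" point, that a single extreme value moves the mean but not the mode or median, can be checked with Python's standard `statistics` module. The helper name `summarize` is made up for this sketch; the data are the two sets from the exercise above.

```python
from statistics import mean, median, mode


def summarize(data):
    """Return the three averages discussed in the lecture for one data set."""
    return {"mode": mode(data), "median": median(data), "mean": mean(data)}


# Same data except for one extreme observation (8 versus 800).
for data in ([2, 2, 8], [2, 2, 800]):
    print(data, summarize(data))
```

Only the mean changes between the two sets, which is the same effect as the tax-cut and starting-salary examples.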
2. Qualitative & Rank Data
A. Nominal vs. Ordinal
Nominal Data: Can only use mode.
Ordinal Data: Can use mode or median.
[Table 3.5]

Exercise: 20 college students were surveyed to determine where they would most like to spend their spring vacation: Daytona Beach (DB), San Diego (SD), South Padre Island (SP), Lake Havasu (LH) or Other (O). The results were as follows:

DB DB SD LH DB
SD SP LH DB O
O SP SD DB LH
DB SD DB O DB

Find the Mode and (if possible) the Median.
Frequencies: DB 8, SD 4, LH 3, SP 2, O 3

Lecture Flow Chart: Data; Frequency Distributions; Graphs; Measures of Central Tendency ("On average, these students are 5.2 ft tall."); Measures of Dispersion ("The standard deviation of COGS 14 student height is 1.1 ft.")

Chapter 4: Describing Variability (Measures of Dispersion)
I. Quantitative Data
   A. Range
   B. Variance
   C. Standard Deviation
      1. Sample, Population, or Estimate of Population
   D. Interquartile Range
   E. New Graphs
II. Nominal & Ordinal Data
   A. Entropy

A. Range
Range: The maximum value observed minus the minimum value observed

Gambling Results: -$3 -$1 0 $2 $2
max-min = $2-(-$3) = $2+$3 = $5

Problems with Range
• Its value is derived from only two observations, which means that it won't replicate very well.
• Its value depends on the size of the sample. Bigger samples will tend to have bigger ranges.

B. Variance
Estimated Population Variance (i.e. estimated from your sample): \(s^2\)
\[ s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} \]
The mean:
\[ \bar{x} = \frac{\sum_i x_i}{n} \]
n = # of observations in the sample
Note: Textbook calls \(s^2\) the "variance for sample"
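The range and the estimated population variance defined above can be worked through on the gambling results. This is a sketch under the slide's definitions; the function names are made up for illustration, and dollar amounts are written as plain numbers.

```python
def value_range(values):
    """Range: maximum observed value minus minimum observed value."""
    return max(values) - min(values)


def estimated_population_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


gambling = [-3, -1, 0, 2, 2]  # gambling results in dollars
print("range:", value_range(gambling))
print("s^2:", estimated_population_variance(gambling))
```

The variance needs at least two observations, since the divisor is n - 1.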
Problems with Variance (the estimated population variance \(s^2\))
• Its value is in "units squared."

Gambling Results: -$3 -$1 0 $2 $2
\[ s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1}, \qquad \bar{x} = \frac{\sum_i x_i}{n} \]
\(s^2\) = 4.5 dollars squared

C. Standard Deviation
Estimated Population Standard Deviation (i.e. estimated from your sample): s
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}}, \qquad \bar{x} = \frac{\sum_i x_i}{n} \]
n = # of observations in the sample
Note: Textbook calls this the "standard deviation for sample"

Sample Standard Deviation: s
Definition Formula:
\[ s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}} \]
Computation Formula:
\[ s = \sqrt{\frac{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}{n^2}} \]

Standard Deviation Not Enough For Perverted Statistician
http://www.theonion.com/content/index/3625

Standard Deviation:
• An intuitive interpretation: the standard deviation is APPROXIMATELY the average distance of the observations from the mean.
• For many frequency distributions, a majority (approx. 68% for normal distributions) of all observations are within ± one standard deviation of the mean.
• For many frequency distributions, a small minority (approx. 5% for normal distributions) of all observations are beyond ± two standard deviations of the mean.
(pg. 86)
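The standard deviation formulas above can be compared numerically. The sketch below assumes the reading in which the sample standard deviation divides by n (in both its definition and computation forms) and the estimated population standard deviation divides by n - 1. The function names are made up for illustration.

```python
import math


def sample_sd_definition(values):
    """Definition formula: square root of the mean squared deviation (divide by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)


def sample_sd_computation(values):
    """Computation formula: sqrt((n * sum(x^2) - (sum x)^2) / n^2)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_sq = sum(x * x for x in values)
    return math.sqrt((n * sum_x_sq - sum_x ** 2) / n ** 2)


def estimated_population_sd(values):
    """Square root of the estimated population variance (divide by n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


gambling = [-3, -1, 0, 2, 2]  # gambling results in dollars
print("definition formula: ", round(sample_sd_definition(gambling), 3))
print("computation formula:", round(sample_sd_computation(gambling), 3))
print("estimated pop. sd:  ", round(estimated_population_sd(gambling), 3))
```

The first two results agree, and the n - 1 version is always somewhat larger for the same data.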
1. Sample, Population, or Estimate of Population

[Figure: Empirical Loop (Research Design, Hypothesis, Collect Data, Descriptive Statistics, Inferential Statistics), with Sample and Population labeled]

Mean:
Estimate of Population Parameter from a Sample: \( \bar{x} = \frac{\sum_i x_i}{n} \)
Population Parameter (for discrete data): \( \mu = \frac{\sum_i x_i}{N} \)

Variance:
Estimate of Population Parameter from a Sample: \( s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} \)
Population Parameter (for discrete data): \( \sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N} \)

Standard Deviation:
Estimate of Population Parameter from a Sample: \( s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} \)
Population Parameter (for discrete data): \( \sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}} \)

Exercise: Scores on a Quiz (pts.): 1 -2 4 4 3
Compute the range of the sample and estimate the population variance and standard deviation from the sample.
Range = 4-(-2) = 4+2 = 6 pts.
Mean = 10/5 = 2 pts.
Estimate of Pop. Var. = 26/4 = 6.5 pts.^2
Estimate of Pop. Std. = sqrt(6.5) = 2.5 pts.

Outliers: very extreme observations

Estimate of Pop. Std.
Gambling Results: -$3 -$1 $0 $1 $2 $2, \( \hat{\sigma} = s = 1.9 \)
Luckier Gambling Results: -$3 -$1 $0 $1 $2 $200, \( \hat{\sigma} = s = 81.7 \)

D. Interquartile Range: IQR
1. Sort the data.
2. Split the data in half. If you have an odd number of data points, throw out the median.
3. The median of the lower half is the lower quartile. The median of the upper half is the upper quartile.
4. IQR = upper quartile - lower quartile

Gambling Results: -$3 -$1 0 $1 $2 $2, IQR = $2-(-$1) = $2+$1 = $3
Luckier Gambling Results: -$3 -$1 0 $1 $2 $200, IQR = $2-(-$1) = $2+$1 = $3

Approx. Interquartile Range: IQR (via upper & lower hinge)
NOTE: The lecture algorithm above is NOT exactly the same as the IQR algorithm in the textbook.
Compute the interquartile range of these scores.
USE THE IQR ALGORITHM FROM LECTURE; it is not exactly the same as the IQR algorithm in the textbook (pg. 92).
1. Sort the data.
2. Split the data in half. If you have an odd number of data points, throw out the median.
3. The median of the lower half is the lower quartile. The median of the upper half is the upper quartile.
4. IQR = upper quartile - lower quartile

Scores on a Quiz (pts): 1 -2 4 4 3 -4 7 0 -1 5
IQR = 4-(-1) = 4+1 = 5 pts.

Mean, Median, Standard Deviation, Interquartile Range

E. New Graphs
Comparing Multiple Data Sets: Frequency Polygons

Comparing Multiple Data Sets: Boxplot ("refined" boxplot to be precise)
• Outliers: positive outliers > UQ + 1.5*IQR, negative outliers < LQ - 1.5*IQR
• Maximum value that is not an outlier
• Upper Quartile (UQ)
• Median
• Lower Quartile (LQ)
• Minimum value that is not an outlier

Comparing Multiple Data Sets (if data are normally distributed)

II. Nominal & Ordinal Data
A. Entropy
Frequency Distribution Entropy (in bits): H
\[ H = -\sum_i f(x_i) \log_2 f(x_i) \]
f(x) is the relative frequency of value x
Example: frequencies 4 and 6 give relative frequencies 0.4 and 0.6.

Minimum Entropy:
Frequencies 10 and 0: relative frequencies 1 and 0, 0 bits
Frequencies 0 and 9912: relative frequencies 0 and 1, 0 bits
(When one outcome is much more probable, entropy (uncertainty) is lower.)

Maximum Entropy (for two values):
Frequencies 5 and 5: relative frequencies 0.5 and 0.5, 1 bit
(When any outcome is equally probable, entropy (uncertainty) is at its highest.)

Maximum Entropy (for four values):
Frequencies 5, 5, 5, 5: relative frequencies 0.25, 0.25, 0.25, 0.25, 2 bits

Maximum Entropy: \(H_{max}\)
Given the # of alternatives, how big can entropy get?
\[ H_{max} = -\log_2\left(\frac{1}{k}\right) = \log_2 k \]
Total # of Classes = k

Compute the sample relative entropy of the following data.
Fr So Fr Fr Jr Jr So Fr
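The entropy formulas above can be sketched in a few lines of Python. The function names are made up for illustration. Treating a zero-frequency class as contributing 0 bits is the usual convention and is assumed here, and relative entropy is taken to be the ratio H / H_max that the lecture calls J.

```python
import math


def entropy_bits(frequencies):
    """H = -sum f(x) * log2 f(x), where f(x) is the relative frequency of each class."""
    total = sum(frequencies)
    h = 0.0
    for count in frequencies:
        if count > 0:                # assumed convention: an empty class adds 0 bits
            f = count / total
            h -= f * math.log2(f)
    return h


def max_entropy_bits(k):
    """H_max = log2(k) for k classes."""
    return math.log2(k)


def relative_entropy(frequencies):
    """J = H / H_max; list every class, including those with frequency 0."""
    return entropy_bits(frequencies) / max_entropy_bits(len(frequencies))


for freqs in ([4, 6], [10, 0], [5, 5], [5, 5, 5, 5]):
    print(freqs, round(entropy_bits(freqs), 3), "bits of a possible",
          max_entropy_bits(len(freqs)))
```

Because k counts all possible classes, a class that was never observed still has to appear in the list as 0. With a single class, H_max is 0 and the ratio is undefined.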
Relative Entropy: J
\[ J = \frac{H}{H_{max}} \]

Answer (College Year data):
Entropy = .5+.5+.5+0 = 1.5 bits
Max Entropy = 2 bits
Relative Entropy = 1.5/2 = .75

Converting log10 into log2:
\[ \log_2 x = \frac{\log_{10} x}{\log_{10} 2} \]
For example:
\[ \log_2 0.5 = \frac{\log_{10} 0.5}{\log_{10} 2} = \frac{-.301}{.301} = -1 \]
Base of log doesn't matter for relative entropy.

This week's homework:
• In addition to Quiz 3!
• Posted on WebCT
• Due Monday 10/18 at start of lecture!
• If you show your work, we can give you partial credit.
• Graphs can be hand drawn but need to be legible.
• It's good to check answers with others, but do your own work.

Free Online Software:
R: http://www.r-project.org/
• Examples of how to use R will be posted on the course calendar.
• Go to labs for help
Mean, Median, Estimated Population Standard Deviation: http://www.physics.csbsju.edu/stats/cstats_NROW_form.html
Google spreadsheets: http://docs.google.com
Open Office (Software): http://www.openoffice.org/

Lecture Flow Chart: Data; Measures of Central Tendency; Measures of Dispersion; Percentages ("I spent 33% of yesterday asleep.")

Percentages
Percentage = (# of instances of a class) / (total # of instances)
Example: 7 female students / 10 students = 70%
The denominator (total # of instances) is "The Base."

Percentages
A used car salesman originally lists a car on his lot for $1000. He puts a sale sticker on the car, announcing that he's cut 20% off the list price. When you go to look at the car, he tells you that because he likes you so much, he'll give you a further 10% discount. What's the price he's offering?
Answer:
$1000*20% = $200
$800*10% = $80
Total Cut = $280 (28% cut), so the price offered is $1000-$280 = $720. The second discount is taken on a smaller base ($800), which is why the total is 28% rather than 30%.

Smoker Statehood (n=1000 smokers)
1. 20% of the smokers are Californians?
2. New Yorkers are twice as likely to smoke as Californians?

Correlations
• Are two variables related?
• Let's quantify it!

*Caution: All of the following correlations are fictional. Any resemblance to real correlations, living or dead, is entirely coincidental.

[Figures; axis labels: Coffee consumption (cups per week), Hydration level (healthy = 0), Proportion of obstacles avoided on driving test]
Relationship can be strong and negative

[Figures; axis labels: Yearly feet of snow in Arctic, Kilopirates (thousands of pirates), Hours of studying per week; group label: UCSD students]
RANGE RESTRICTION can lower a correlation
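The range-restriction effect can be illustrated with a small sketch. Two assumptions are made here. First, the slides do not name a coefficient, so Pearson's r is used as the measure of correlation. Second, the data are synthetic and invented for this illustration (like the lecture's own fictional correlations), with hypothetical variable names `hours` and `scores`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (an assumed choice of measure)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Synthetic data: a linear trend plus a fixed +3/-3 wobble.
hours = list(range(21))
scores = [h + (3 if h % 2 == 0 else -3) for h in hours]

r_full = pearson_r(hours, scores)

# Restrict the range of x to a narrow band and recompute.
band = [(h, s) for h, s in zip(hours, scores) if 8 <= h <= 12]
r_restricted = pearson_r([h for h, _ in band], [s for _, s in band])

print("full range r:      ", round(r_full, 3))
print("restricted range r:", round(r_restricted, 3))
```

The wobble is the same size everywhere, so once the spread in x is narrowed it makes up a larger share of the variation and r drops.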