HSS2381A – Stats and Stuff
The Normal Curve, part 1

No class on Thursday!

Interdisciplinary Journal of Health Sciences
• WANTED: Seeking applicants for the 2011-2012 editorial team
• Students in both the English and French HSS streams are encouraged to apply.
• Send an email expressing your interest in the position to [email protected], with your resume attached.
• Successful candidates will be invited to a panel interview.
• Deadline to apply: Wednesday, September 28th, 2011

Last time…
• We covered measures of central tendency:
  – Mode
  – Median
  – Mean
• And two measures of variability:
  – Range
  – Interquartile Range

Two More Measures of Variability
• Standard deviation
• Variance

The Standard Deviation
• Standard deviation (SD or σ): An index that conveys how much, on average, scores in a distribution vary
• SDs are based on deviation scores (x), calculated by subtracting the mean from each person’s original score: x = X - M

Standard Deviation Interpretation
• In a normal distribution, a fixed percentage of cases lie within certain distances from the mean.

Example
• We weigh 10 students and collect their weight in pounds:
  – 110 120 130 140 150 150 160 170 180 190
• What is the mean (M)? 150
• For the lightest person, their weight is the mean - 40
• For the heaviest person, their weight is the mean + 40

What’s a deviation?
• A “deviation” is how much each data point deviates from the mean
  – So for X1 the deviation is -40
  – And for X10 the deviation is +40
• So what’s a “standard deviation”?
• It’s some sort of measure of how much the “typical” data point deviates from the mean

Let’s go back to our data…
• Mean = 150

Data (weights in pounds)    Deviation from Mean
110                         -40
120                         -30
130                         -20
140                         -10
150                           0
150                           0
160                          10
170                          20
180                          30
190                          40
TOTAL                         0

Defining Standard Deviation
• The sum of all deviation scores in a distribution always = 0
• So, to compute SDs, deviation scores must be squared (x²) before being summed
• SD equation: SD = √(Σx² ÷ (N - 1))

Standard Deviation (cont’d)
Weights (pounds): 110 120 130 140 150 150 160 170 180 190
Deviation scores (x) for M = 150: -40 -30 -20 -10 0 0 10 20 30 40
Squared deviation scores (x²): 1600 900 400 100 0 0 100 400 900 1600
Sum of squared deviation scores: 1600+900+400+100+0+0+100+400+900+1600 = 6000
SD = √(6000 ÷ (N - 1)) = √(6000 ÷ 9) = 25.82

A little bit about notation
σ    “sigma” = standard deviation in the reference population
s    Lower case “s” = standard deviation in the sample
The textbook uses “SD” for both

Standard Deviation Interpretation
• Provides a “standard”: the SD indicates the average amount of deviation of scores from the mean
• Tells you how wrong, on average, the mean is as a summary of the overall distribution
• An SD provides valuable information when the distribution is normal:
  – Nearly all cases (about 99.7%) fall within three SDs above and below the mean

Standard Deviation Interpretation (cont’d)
• In a normal distribution, a fixed percentage of cases lie within certain distances from the mean:
  – 34.1% between the mean and 1 SD, on each side
  – 13.6% between 1 and 2 SDs, on each side
  – 2.3% beyond 2 SDs, on each side

SDs and Individual Scores
• A person who scores one SD below the mean has a higher score than 16% of the cases (2.3% + 13.6%)
• A person who scores one SD above the mean has a higher score than 84% of the cases (50.0% + 34.1%)

Standard Deviation: Advantages
• Takes all data into account in describing variability
• Is more stable as a measure of variability than the range or IQR
• Lends itself to computation of other measures often used in inferential statistics
• Is helpful in interpreting
individual scores when data are distributed approximately normally

Standard Deviation: Disadvantages
• Can be influenced by extreme scores
• Not as “intuitive” or as easy to interpret as the range

Variance
• An important variability concept in inferential statistics, but not used descriptively
• The variance = SD²
• In the earlier example, SD² = 25.82² = 666.67
• Not easily interpreted because it is not in units of the original data: it is in units squared (here, pounds squared)

More about notation
σ     “sigma” = standard deviation in the reference population
s     Lower case “s” = standard deviation in the sample
σ²    “sigma squared” = variance in the reference population
s²    Variance in the sample

Formulae for Variance
• Population variance: σ² = Σ(X - μ)² ÷ N
• Sample variance: s² = Σ(X - M)² ÷ (N - 1)

Measurement Scales and Descriptive Statistics

Scale                 Central Tendency Index    Variability Index
Nominal               Mode                      --
Ordinal               Median                    Range, IQR
Interval and ratio    Mean                      Standard deviation, Variance

Relative Standing
• Central tendency and variability indexes describe a distribution
• There are also descriptive statistics to describe individual scores, i.e., their relative standing or position in a distribution:
  – Percentile ranks
  – Standard scores

Percentiles
• Percentiles divide a distribution into hundredths
• Quartiles divide a distribution into quarters
• Deciles divide a distribution into tenths
• Each percentile, quartile, etc. can be determined in relation to a score in a distribution

Percentile Rank
• A percentile rank is the location of a given score in the distribution: it communicates what percentage of cases fall at or below that value
  – Score → What percentile rank?
  – Percentile → What score?

Percentiles and Outliers
• Outliers are often defined in relation to percentiles
• There are:
  – Mild outliers
  – Extreme outliers

NOT what we’re talking about
An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
- Grubbs (Wikipedia)
In this course (as per the textbook), an outlier is a value that lies more than 1.5 times the IQR below Q1 or above Q3

Outliers: Formal Definition
• A mild outlier is a score that lies between 1.5 and 3.0 times the value of the IQR below Q1 or above Q3
• An extreme outlier is a score that lies more than 3.0 times the value of the IQR below Q1 or above Q3

Box Plots
• A box plot (or box-and-whiskers plot) is a graphic depiction of a distribution that shows the median, the IQR, and the outer limits of values not considered outliers
  – Outlying cases can be shown on the box plot, with identifying information (e.g., an ID number)

Traditionally, the whiskers of a box plot extend across the full range of the data. But for the purposes of this course (due to the textbook’s insistence), the extent of the box plot is NOT the range, but rather those data points that are NOT outliers

Box Plots (cont’d)
• Bottom of “box” shows Q1
• Top of “box” shows Q3
• Horizontal line in box shows median
• “Whiskers” show outer limits of what is NOT an outlier
  – In SPSS, a circle O indicates value and ID of a mild outlier
  – An asterisk * is for an extreme outlier

Box Plot Illustration – p52 Textbook
Heart Rate Data:
• Q1 = 62
• Q2 = 66 = Median
• Q3 = 68
• IQR = 68 - 62 = 6, so 1.5 × IQR = 9 and 3.0 × IQR = 18
• “Whiskers” limits: 53 (= 62 - 9), 77 (= 68 + 9)
• Mild outliers: 50 (#106), 45 (#105)
• Extreme outliers (below 62 - 18 = 44 or above 68 + 18 = 86): 40 (#104), 90 (#103), 95 (#102), 100 (#101)

Box Plots Versus Histograms
• Outliers can be seen in histograms, but box plots give more useful information about degree of extremity and ID numbers
(Stolen from Wikipedia)

Standard Scores
• Also called z-score or z-statistic or z-value or normal score
• Is a measure of how far an observation is from the mean of its distribution
• The z-score only has meaning if you know the parameters of the reference population
  – i.e.: μ and σ

Standard Scores
• Standard scores: another index of “relative standing”, helpful in interpreting raw scores
• A standard score (also called a z score) is a score expressed in standard deviation units, as a relative distance from the mean

Standard Scores (cont’d)
• Standard score
equation: z = (X - M) ÷ SD
• That is, the mean is subtracted from an individual score, and the result is divided by the SD
• For example:
  – M = 100, SD = 25, X = 125, z = 1.0
  – M = 100, SD = 25, X = 50, z = -2.0

How is this useful?
• Very useful in standardized testing (like MCAT, GRE, SAT, etc.)
• Allows us to:
  – Calculate the probability of a score occurring within a normal distribution
  – Compare two scores that are from different normal distributions

Calculating a Probability Using a z-score
For a normally distributed variable (such as MCAT scores in Canada), 95% of observations fall within 1.96 SDs of the mean, i.e., between z = -1.96 and z = +1.96.

Example
• We know that the LSAT score in Canada is normally distributed. The mean mark is 60% and the SD is 15. So…
  – What is the lowest mark among those who were in the top 10% of performers?
  – (Why? Because law schools will only take the top 10% and need to know what mark to make their cut-off)

Example (cont’d)
• The top 10% of a normal distribution lies above z = 1.282, so:
  – 1.282 = (X - 60) ÷ 15
  – X = 60 + (1.282 × 15) = 79.2%
• We get the “1.282” by looking it up in a table, or using a z-score calculator:
  http://www.fourmilab.ch/rpkp/experiments/analysis/zCalc.html

Using z-scores to compare tests
• A student is in two classes, English and Math.
• She got 70% in English and 70% in Math and wants to know which class she’s doing better in
  – Why isn’t the answer automatically “English”?

Using z-scores to compare tests (cont’d)
Since these scores are from two different distributions, we need to standardise them into z-scores so that they can be directly compared. This gives us z = 0.67 for both English and Math.

Using z-scores to compare tests (cont’d)
How do we interpret this? z = 0.67 means that the student performed 0.67 SDs above the mean in both classes. This makes her above average in both classes. But she’s doing equally well in both.
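The z-score calculations on these slides can be sketched in a few lines of Python. This is only an illustration, assuming Python 3.8+ and its standard-library `statistics.NormalDist` in place of a printed z table; the function names `z_score`, `cutoff_for_top_fraction` and `fraction_above` are made up for this sketch and do not come from the course or the textbook.

```python
from statistics import NormalDist

# The standard normal distribution (mean 0, SD 1) stands in for the z table.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(x, mean, sd):
    """Express a raw score as its distance from the mean in SD units."""
    return (x - mean) / sd


def cutoff_for_top_fraction(mean, sd, top_fraction):
    """Lowest raw score still inside the top `top_fraction` of a normal distribution."""
    z = STANDARD_NORMAL.inv_cdf(1 - top_fraction)  # e.g. about 1.282 for the top 10%
    return mean + z * sd


def fraction_above(z):
    """Share of a normal distribution lying above a given z-score."""
    return 1 - STANDARD_NORMAL.cdf(z)


# The two standard-score examples (M = 100, SD = 25)
print(z_score(125, 100, 25))   # 1.0
print(z_score(50, 100, 25))    # -2.0

# LSAT example: mean 60, SD 15, cut-off for the top 10%
print(round(cutoff_for_top_fraction(60, 15, 0.10), 1))   # 79.2

# Share of the class above z = 0.67, as a percentage
print(round(fraction_above(0.67) * 100, 1))   # 25.1
```

The same `fraction_above` call answers "what percentile is this score?" for any z-score, which is what a z-score calculator or table lookup does.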
(If we use a z-score calculator, we’d find out that z = 0.67 means that she’s in the top 25.1% of the class.)

Standard Scores (cont’d)
• Standard scores have a mean of 0.0 and an SD of 1.0.
• But z scores can be transformed mathematically to have any mean and SD
• Most typical:
  – Mean = 500, SD = 100 (e.g., GRE, SAT)
  – Mean = 100, SD = 15 (e.g., IQ tests)
  – Mean = 50, SD = 10 (called T scores)

The Normal Distribution
• Central Limit Theorem:
  – Under “mild” conditions, the sum (or mean) of a large number of independent random variables will be distributed approximately “normally”
• For fun, go to:
  – http://www.math.csusb.edu/faculty/stanton/probstat/clt.html
  – This is an “applet” that you keep clicking on. It produces a graph of a random variable. You will see that it always ends up being a Normal curve

Properties of the Normal Distribution
• About 68% of values drawn from a normal distribution are within one standard deviation (σ) of the mean
• About 95% of the values lie within two standard deviations of the mean
• About 99.7% are within three standard deviations
• This fact is known as the 68-95-99.7 rule or the empirical rule or the 3-sigma rule

Homework
• P.57, A4, A5
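As a check on the arithmetic in this lecture, the following Python sketch recomputes the SD and variance of the ten weights from the deviation scores, and then the 68-95-99.7 percentages. It assumes only the Python standard library (3.8+); `sample_sd` and `share_within` are illustrative names, not functions from the textbook or SPSS.

```python
import math
from statistics import NormalDist, stdev

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]


def sample_sd(scores):
    """SD = square root of (sum of squared deviation scores / (N - 1))."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]        # deviation scores, x = X - M
    sum_of_squares = sum(d ** 2 for d in deviations)
    return math.sqrt(sum_of_squares / (n - 1))


def share_within(k):
    """Share of a normal distribution within k SDs of the mean."""
    standard_normal = NormalDist()                  # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


sd = sample_sd(weights)
print(round(sd, 2))        # 25.82
print(round(sd ** 2, 2))   # 666.67 (the variance, in pounds squared)

# Cross-check against the standard library's own sample SD
print(math.isclose(sd, stdev(weights)))   # True

# The 68-95-99.7 rule, as percentages
for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))   # 68.3, 95.4, 99.7
```

Dividing by N - 1 rather than N matches the SD equation used in the slides.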