PSYC 6130C UNIVARIATE ANALYSIS
Prof. James Elder

Introduction

What is (are) statistics?
• A branch of mathematics concerned with understanding and summarizing collections of numbers
• A collection of numerical facts
• Estimates of population parameters, derived from samples

PSYC 6130, PROF. J. ELDER

What is this course about?
• Applied statistics
• Emphasizes methods, not proofs
• Descriptive statistics
• Inferential statistics

Fall Term

Date        Title                                 Readings                              Notes
10-Sep-08   Introduction                          1.1-1.3
            Probability                           5.1-5.5, 5.7
            Descriptive statistics                2.1, 2.2, 2.5, 2.7-2.9, 2.12, 2.13
17-Sep-08   The normal distribution               3.1-3.4
24-Sep-08   Introduction to hypothesis testing    4
            t-tests                               7
1-Oct-08    Rosh Hashanah – No Classes
8-Oct-08    t-tests                               7                                     Lab 2
15-Oct-08   Statistical power and effect size     8                                     Assignment 1 due
22-Oct-08   Correlation and regression            9
29-Oct-08   One-way independent ANOVA             11
5-Nov-08    Multiple comparisons                  12.1-12.12
12-Nov-08   Multiple comparisons                  12.1-12.12
19-Nov-08   Two-way ANOVA                         13.1-13.11, 13.14                     Assignment 2 due
26-Nov-08   Review
3-Dec-08    Exam

Other notes (dates not shown): Lab 1, Lab 3, Lab 4

Winter Term

Date        Title                                     Readings         Deadlines
7-Jan-09    Repeated measures ANOVA                   14
14-Jan-09   Two-way mixed design ANOVA                14               Lab 5; deadline for choosing project topic
21-Jan-09   Reading Week
28-Jan-09   Multiple regression                       15               Lab 6
4-Feb-09    The general linear model                  16               Assignment 3 due; drop date is Feb 1
11-Feb-09   The binomial distribution                 5.6, 5.8-5.10    Lab 7
18-Feb-09   Reading Week – No Classes
25-Feb-09   Chi-square tests
4-Mar-09    Resampling and nonparametric techniques   18
11-Mar-09   Student Presentations
18-Mar-09   Student Presentations
25-Mar-09   Review
1-Apr-09    Exam

Other deadlines (dates not shown): Lab 8, Assignment 4 due

Some Background (Howell Ch. 1)

Variables and Constants
• Constants are properties that never change (e.g., the speed of light in a vacuum, $\sim 3 \times 10^{8}$ m/s).
• Most physiological and psychological parameters of interest vary considerably
  – Between individuals (e.g., intelligence quotient)
  – Within individuals (e.g., heart rate)
• Any variable whose variation is somewhat unpredictable is called a random variable (rv).

Scales of measurement
• Nominal scale: values are categories, having no meaningful correspondence to numbers.
• Ordinal scale: ordering is meaningful, but exact numerical values (if they exist) are not.
• Interval scale: values are numerically meaningful, and the interval between two values is meaningful.
  – Example: Celsius temperature scale. It takes approximately the same amount of energy to raise the temperature of a gram of water from 20 °C to 21 °C as it does to raise it from 30 °C to 31 °C.
• Ratio scale: the ratio of two values is also meaningful.
  – Example: Kelvin temperature scale. A gram of H2O at 300 K has twice the energy of a gram of H2O at 150 K.
  – Ratio scales require a 0-point corresponding to a complete lack of the substance being measured.
    • Example: a gram of H2O at 0 K has no heat (particles are motionless).

Continuous vs Discrete Variables
• A continuous variable may assume any real value within some range.
• A discrete variable may assume only a countable number of values: intermediate values are not meaningful.

Independent vs Dependent Variables
• Experiments involve independent and dependent variables.
  – The independent variable is controlled by the experimenter.
  – The dependent variable is measured.
  – We seek to detect and model effects of the independent variable on the dependent variable.
• Example: In a visual search task, subjects are asked to find the odd-man-out in a display of discrete items (e.g., a horizontal bar amongst vertical bars).
  – The number of items in the display is an independent variable.
  – Reaction time is the main dependent variable.
  – Typically, we observe a roughly linear relationship between the number of items and the reaction time.

Experimental vs Correlational Research
• Experimental study:
  – Researcher controls the independent variable.
  – Seek to detect effects on the dependent variable.
  – Direction of causation may be inferred (but may be indirect).
• Correlational study:
  – There are no independent or dependent variables.
  – No variables are under control of the researcher.
  – Seek to find statistical relationships (dependencies) between variables.
  – Direction of causation may not normally be inferred.

Correlational Studies: Examples
(figures not shown)

Populations vs Samples
• In human science, we typically want to characterize and make inferences not about a particular person (e.g., Uncle Bob) but about all people, or all people with a certain property (e.g., all people suffering from a bipolar disorder).
• These groups of interest are called populations.
• Typically, these populations are too large and inaccessible to study.
• Instead, we study a subset of the group, called a sample.
• In order to make reliable inferences about the population, samples are ideally randomly selected.
• The population properties of interest are called parameters.
• The corresponding measurements made on our samples are called statistics. Statistics are approximations (estimates) of parameters.

Different Types of Populations and Samples
• Outside of human science, populations do not necessarily refer to humans
  – e.g. populations may be of bees, algae, quarks, stock prices, pork belly futures, ozone levels, etc…
• In clinical and social psychology you will often be conducting large-n studies on human populations.
• In cognitive psychology, you will often be doing small-n within-subject studies involving repeated trials on the same subject.
  – Here, you may think of the ‘population’ as being the infinite set of responses you would obtain were you able to continue the experiment indefinitely.
  – The sample is the set of responses you were able to collect in a finite number of trials (e.g., 5000) on the same subject.

Summation Notation

i     X_i   Y_i
1     1     2
2     2     1
3     2     1
…     …     …
N     4     0

Let
$X_i$ = number of siblings for respondent $i$
$Y_i$ = number of children for respondent $i$

Then
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i,$$
where $N$ = number of respondents in the sample.

Some Summation Rules
1. We often abbreviate $\sum_{i=1}^{N} X_i$ as $\sum X_i$.
2. $\sum (X_i + Y_i) = \sum X_i + \sum Y_i$, since $(X_1 + Y_1) + (X_2 + Y_2) = (X_1 + X_2) + (Y_1 + Y_2)$ (associative property of addition).
   Similarly, $\sum (X_i - Y_i) = \sum X_i - \sum Y_i$.
3. $\sum C = NC$, where $C$ is a constant, since adding $C$ to itself $N$ times yields $N$ $C$'s.
4. $\sum C X_i = C \sum X_i$, since $CX_1 + CX_2 = C(X_1 + X_2)$ (multiplication is distributive over addition).
5. But note that $\sum X_i Y_i \neq \sum X_i \sum Y_i$, since $X_1 Y_1 + X_2 Y_2 \neq (X_1 + X_2)(Y_1 + Y_2) = X_1 Y_1 + X_1 Y_2 + X_2 Y_1 + X_2 Y_2$.

Summary
• What is (are) statistics
• Variables and constants
• Scales of measurement
• Continuous and discrete variables
• Independent and dependent variables
• Experimental and correlational research
• Populations and samples
• Summation notation

Descriptive Statistics (Howell, Ch 2)

Frequency Tables
1991 U.S. General Social Survey: Number of Brothers and Sisters

Value           Frequency   Percent   Valid Percent   Cumulative Percent
0                    74       4.88         4.92              4.92
1                   236      15.56        15.68             20.60
2                   276      18.19        18.34             38.94
3                   236      15.56        15.68             54.62
4                   209      13.78        13.89             68.50
5                   118       7.78         7.84             76.35
6                    80       5.27         5.32             81.66
7                    81       5.34         5.38             87.04
8                    58       3.82         3.85             90.90
9                    47       3.10         3.12             94.02
10                   34       2.24         2.26             96.28
11                   22       1.45         1.46             97.74
12                   11       0.73         0.73             98.47
13                    9       0.59         0.60             99.07
14                    5       0.33         0.33             99.40
15                    3       0.20         0.20             99.60
16                    1       0.07         0.07             99.67
17                    2       0.13         0.13             99.80
18                    1       0.07         0.07             99.87
21                    1       0.07         0.07             99.93
26                    1       0.07         0.07            100.00
Total valid        1505      99.21       100.00
Missing: DK           4       0.26
Missing: NA           8       0.53
Total missing        12       0.79
Total              1517     100.00

Bar Graphs and Histograms
(figures not shown)

Grouped Frequency Distributions
Statistics Canada 2001 Census: Age of Respondent

Age (X)   Frequency (f)
<5             581
5-9            661
10-14          740
15-19          701
20-24          689
25-29          674
30-34          731
35-39          903
40-44          930
45-49          838
50-54          746
55-59          608
60-64          434
65-69          383
70-74          345
75-79          288
80-84          174
85+             97

• What are the apparent limits?
• What are the real limits?

Percentiles and Percentile Ranks
• Percentile: The score at or below which a given % of scores lie.
• Percentile Rank: The percentage of scores at or below a given score.

Linear Interpolation to Compute Percentile Ranks
What if you have a 23-year-old respondent and would like to know her percentile rank?

Statistics Canada 2001 Census: Age of Respondent

Age       Frequency   Percent   Cumulative Percent
<5            581        5.5            5.5
5-9           661        6.3           11.8
10-14         740        7.0           18.8
15-19         701        6.7           25.5
20-24         689        6.5           32.0
25-29         674        6.4           38.4
30-34         731        6.9           45.4
35-39         903        8.6           54.0
40-44         930        8.8           62.8
45-49         838        8.0           70.8
50-54         746        7.1           77.9
55-59         608        5.8           83.6
60-64         434        4.1           87.8
65-69         383        3.6           91.4
70-74         345        3.3           94.7
75-79         288        2.7           97.4
80-84         174        1.7           99.1
85+            97        0.9          100.0
Total       10523      100.0

Let
$x$ = age (percentile)
$y$ = percentile rank

Then the linear (affine) interpolation model is
$$y = ax + b.$$
There are 2 unknowns ($a$ and $b$). If we have two known data points $(x_1, y_1)$ and $(x_2, y_2)$ on either side of the value of interest, we can solve for them:
$$y_1 = a x_1 + b, \qquad y_2 = a x_2 + b \quad\Rightarrow\quad a = \frac{y_2 - y_1}{x_2 - x_1}.$$
Thus
$$y = ax + b = ax + y_1 - a x_1 = y_1 + a(x - x_1),$$
$$y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1).$$

Linear Interpolation to Compute Percentiles
What if you want to know what the median age is, using the same census table?
To compute percentiles, simply swap the x's and y's in the formula:
$$x = x_1 + \frac{x_2 - x_1}{y_2 - y_1}\,(y - y_1).$$

Measures of Central Tendency
• The mode – applies to ratio, interval, ordinal or nominal scales.
• The median – applies to ratio, interval and ordinal scales.
• The mean – applies to ratio and interval scales.

        Mean    Median   Mode
AGE     37.1    37       41

The Mode
• Defined as the most frequent value (the peak)
• Applies to ratio, interval, ordinal and nominal scales
• Sensitive to sampling error (noise)
• Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks
Mode = 41

The Median
• Defined as the 50th percentile
• Applies to ratio, interval and ordinal scales
• Can be used for open-ended distributions
Median = 37

The Mean
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean: $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
• Applies only to ratio or interval scales
• Sensitive to outliers
$\bar{X} = 37.1$

Properties of the Mean
1. Suppose a constant $C$ is added (or subtracted) to every score in your sample: $X_i \to X_i + C$.
   Then the mean also increases (decreases) by $C$: $\bar{X} \to \bar{X} + C$.
2. Suppose every score in your sample is multiplied (divided) by a constant $C$: $X_i \to C X_i$.
   Then the mean is also multiplied (divided) by $C$: $\bar{X} \to C\bar{X}$.
3. $\sum (X_i - \bar{X}) = 0$

Properties of the Mean (cont'd)
Least-squares property: the mean minimizes the sum of squared deviations:
$$\sum (X_i - \bar{X})^2 \le \sum (X_i - X')^2 \quad \forall X'.$$
Proof: $\sum (X_i - X')^2$ has a minimum where
$$\frac{d}{dX'}\sum (X_i - X')^2 = 0 \quad\text{and}\quad \frac{d^2}{dX'^2}\sum (X_i - X')^2 > 0.$$
$$\frac{d}{dX'}\sum (X_i - X')^2 = -2\sum (X_i - X') = 0 \;\Rightarrow\; X' = \frac{1}{N}\sum X_i = \bar{X}$$
$$\frac{d^2}{dX'^2}\sum (X_i - X')^2 = 2N > 0$$

Measures of Variability (Dispersion)
• Range – applies to ratio, interval, ordinal scales
• Semi-interquartile range – applies to ratio, interval, ordinal scales
• Variance (standard deviation) – applies to ratio, interval scales

Range
• Interval between lowest and highest values
• Generally unreliable – changing one value (highest or lowest) can cause a large change in range.
Range = 79 drinks

Semi-Interquartile Range
• The interquartile range is the interval between the first and third quartile, i.e. between the 25th and 75th percentile.
• The semi-interquartile range is half the interquartile range.
• Can be used with open-ended distributions
• Unaffected by extreme scores

N             Valid     19769
              Missing    6004
Median                      4
Percentiles   25            2
              50            4
              75            7

SIQ = 2.5 drinks

Population Variance and Standard Deviation
$X_i - \mu$ is known as the deviation of sample $i$.
Thus $SS = \sum (X_i - \mu)^2$ is known as the sum of squared deviations.
The population variance $\sigma^2$ is simply the mean squared deviation:
$$\sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2$$
The population standard deviation is simply the square root of the variance:
$$\sigma = \sqrt{\frac{1}{N}\sum (X_i - \mu)^2}$$
The standard deviation is particularly sensitive to outliers, due to the squaring operation.

Sample Variance and Standard Deviation
$X_i - \bar{X}$ is known as the deviation of sample $i$.
Thus $SS = \sum (X_i - \bar{X})^2$ is known as the sum of squared deviations.
The mean squared sample deviation $\frac{1}{N}\sum (X_i - \bar{X})^2$ is a biased estimator of the population variance: it tends to underestimate $\sigma^2$.
A minor modification makes the sample variance $s^2$ unbiased:
$$s^2 = \frac{1}{N-1}\sum (X_i - \bar{X})^2$$
The corrected sample standard deviation is given by
$$s = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$
$s$ is not an unbiased estimator of $\sigma$, but is close enough for most purposes.

Degrees of Freedom
The degrees of freedom $df$ is the number of independent measurements available for estimating a population parameter.
The calculation of $s^2$ involves $\bar{X}$. Knowing $\bar{X}$ and $N-1$ of the sample values allows us to infer the value of the remaining sample value.
Thus only $N-1$ of the sample values are independent, and $df = N-1$.

Computational Formulas for Variance
The deviational formula for the sum of squares:
$$SS = \sum (X_i - \bar{X})^2$$
More efficient to use the computational formula:
$$SS = \sum X_i^2 - N\bar{X}^2$$
Why are these equivalent?
$$\sum (X_i - \bar{X})^2 = \sum (X_i^2 - 2X_i\bar{X} + \bar{X}^2) = \sum X_i^2 - 2\bar{X}\sum X_i + \sum \bar{X}^2 = \sum X_i^2 - 2N\bar{X}^2 + N\bar{X}^2 = \sum X_i^2 - N\bar{X}^2$$
Thus
$$s^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$$

Properties of the Standard Deviation
1. Suppose a constant $C$ is added (or subtracted) to every score in your sample: $X_i \to X_i + C$.
   Then the standard deviation does not change.
2. Suppose every score in your sample is multiplied (divided) by a constant $C$: $X_i \to C X_i$.
   Then the standard deviation is also multiplied (divided) by $C$: $s \to Cs$ (for $C > 0$; in general the factor is $|C|$).
   Proof:
   $$s_{old} = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$
   $$s_{new} = \sqrt{\frac{1}{N-1}\sum (CX_i - C\bar{X})^2} = C\sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2} = C s_{old}$$

Standard Deviation Example
$\bar{X} = 5.7$ drinks
$s = 5.8$ drinks
cf. SIQ = 2.5 drinks, range = 79 drinks

Skew
• The mean and median are identical for symmetric distributions.
• Skew tends to push the mean away from the median, toward the tail (but not always).
Median = 3, Mean = 6.7

Skewness
$$\text{Sample skewness} = \frac{N}{N-2}\cdot\frac{\sum (X_i - \bar{X})^3}{(N-1)\,s^3}$$
• Properties of skewness
  – Positive for positive skew (tail to the right)
  – Negative for negative skew (tail to the left)
  – Dimensionless
  – Invariant to shifting or scaling data (adding or multiplying constants)

Dealing with Outliers
• Trimming:
  – Throw out the top and bottom k% of values (k = 5, for example).
  – May be justified if there is evidence for a confounding process interfering with the dependent variable being studied
    • Example: participant blinks during presentation of a visual stimulus.
    • Example: participant misunderstands a question on a questionnaire.
• Transforming:
  – Scores are transformed by some function (e.g., log, square root)
  – Often done to reduce or eliminate skewness

Log-Transforming Data
(figure panels labelled skewness = 0.08 and skewness = 0.67)

End of Lecture 1 (Sept 10, 2008)

Kurtosis
$$\text{Sample kurtosis} = \frac{N(N+1)}{(N-2)(N-3)}\cdot\frac{\sum (X_i - \bar{X})^4}{(N-1)\,s^4} - 3\,\frac{(N-1)^2}{(N-2)(N-3)}$$
kurtosis > 0: leptokurtic (Laplacian)
kurtosis = 0: mesokurtic (Gaussian)
kurtosis < 0: platykurtic

Summary
• Measures of central tendency
• Measures of dispersion
• Skew
• Kurtosis
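
Worked example: percentile ranks and percentiles by linear interpolation

The interpolation formulas from the percentile slides can be tried out on the Statistics Canada 2001 Census age table. The sketch below is only an illustration under stated assumptions: it treats ages as completed years, so the real upper limit of the 15-19 group is taken to be 20 (the slides leave the choice of real limits as a question), and it omits the open-ended 85+ group. The names `POINTS`, `interpolate`, `percentile_rank` and `percentile` are made up for this example.

```python
# Cumulative percents from the Statistics Canada 2001 Census age table.
# ASSUMPTION (not stated on the slides): ages are completed years, so the
# real upper limit of the "15-19" group is 20, of "20-24" is 25, and so on.
# The open-ended "85+" group is left out because it has no upper limit.
UPPER_LIMITS = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
CUM_PERCENT = [5.5, 11.8, 18.8, 25.5, 32.0, 38.4, 45.4, 54.0, 62.8,
               70.8, 77.9, 83.6, 87.8, 91.4, 94.7, 97.4, 99.1]

# (age, cumulative percent) pairs, starting from age 0 with 0 percent below it.
POINTS = [(0, 0.0)] + list(zip(UPPER_LIMITS, CUM_PERCENT))


def interpolate(x, x1, y1, x2, y2):
    """Linear interpolation: y = y1 + (y2 - y1) / (x2 - x1) * (x - x1)."""
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)


def percentile_rank(age):
    """Percentile rank (y) for a given age (x)."""
    for (x1, y1), (x2, y2) in zip(POINTS, POINTS[1:]):
        if x1 <= age <= x2:
            return interpolate(age, x1, y1, x2, y2)
    raise ValueError("age is outside the tabulated range 0-85")


def percentile(rank):
    """Age (x) for a given percentile rank (y): same formula, x and y swapped."""
    for (x1, y1), (x2, y2) in zip(POINTS, POINTS[1:]):
        if y1 <= rank <= y2:
            return interpolate(rank, y1, x1, y2, x2)
    raise ValueError("rank is outside the tabulated range 0-99.1")


if __name__ == "__main__":
    print("Percentile rank of a 23-year-old:", round(percentile_rank(23), 1))
    print("Interpolated 50th percentile (median age):", round(percentile(50.0), 2))
```

Under these assumptions the 23-year-old falls between the points (20, 25.5) and (25, 32.0), so her percentile rank is 25.5 + 1.3 × 3 = 29.4. A different choice of real limits (for example 19.5 and 24.5) would shift the result slightly.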
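
Worked example: variance, skewness and kurtosis formulas

The dispersion and shape formulas from this lecture can be checked numerically. The sketch below uses a small made-up set of scores (not data from the course) and hypothetical function names. It compares the deviational and computational sums of squares, computes $s^2$ with the $N-1$ correction, and evaluates the sample skewness and kurtosis formulas as written on the slides.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def ss_deviational(xs):
    """SS = sum of (X_i - mean)^2."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def ss_computational(xs):
    """SS = sum of X_i^2 minus N * mean^2."""
    return sum(x * x for x in xs) - len(xs) * mean(xs) ** 2


def sample_variance(xs):
    """Unbiased sample variance: SS / (N - 1)."""
    return ss_deviational(xs) / (len(xs) - 1)


def sample_sd(xs):
    return math.sqrt(sample_variance(xs))


def sample_skewness(xs):
    """N / (N - 2) * sum (X_i - mean)^3 / ((N - 1) * s^3)."""
    n, m, s = len(xs), mean(xs), sample_sd(xs)
    return n / (n - 2) * sum((x - m) ** 3 for x in xs) / ((n - 1) * s ** 3)


def sample_kurtosis(xs):
    """Slide formula for sample kurtosis (zero for a Gaussian)."""
    n, m, s = len(xs), mean(xs), sample_sd(xs)
    fourth = sum((x - m) ** 4 for x in xs) / ((n - 1) * s ** 4)
    return (n * (n + 1)) / ((n - 2) * (n - 3)) * fourth \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


# Made-up scores for illustration only (long tail to the right).
scores = [0, 1, 1, 2, 2, 3, 4, 7]

if __name__ == "__main__":
    print("SS (deviational):  ", ss_deviational(scores))
    print("SS (computational):", ss_computational(scores))
    print("s^2:", sample_variance(scores), " s:", sample_sd(scores))
    print("skewness:", sample_skewness(scores))
    print("kurtosis:", sample_kurtosis(scores))

    # Properties of the standard deviation: a shift leaves s unchanged,
    # while multiplying every score by C > 0 multiplies s by C.
    shifted = [x + 10 for x in scores]
    scaled = [3 * x for x in scores]
    print("s after adding 10:", sample_sd(shifted))
    print("s after multiplying by 3:", sample_sd(scaled))
```

For these scores the mean is 2.5 and both sum-of-squares formulas give 34, so $s^2 = 34/7$. The computational formula can lose precision in floating point when the mean is large relative to the spread, so the deviational form is the safer default in code even though the computational form is quicker by hand.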