01 Descriptive Statistics
PSYC 6130C UNIVARIATE ANALYSIS
Prof. James Elder
Introduction
What is (are) statistics?
• A branch of mathematics concerned with understanding
and summarizing collections of numbers
• A collection of numerical facts
• Estimates of population parameters, derived from
samples
PSYC 6130, PROF. J. ELDER
What is this course about?
• Applied statistics
• Emphasizes methods, not proofs
• Descriptive statistics
• Inferential statistics
Fall Term
Date | Title | Readings | Notes
10-Sep-08 | Introduction; Probability; Descriptive statistics | 1.1-1.3; 5.1-5.5, 5.7; 2.1, 2.2, 2.5, 2.7-2.9, 2.12, 2.13 |
17-Sep-08 | The normal distribution | 3.1-3.4 |
24-Sep-08 | Introduction to hypothesis testing; t-tests | 4; 7 | Lab 1
1-Oct-08 | Rosh Hashanah – No Classes | |
8-Oct-08 | t-tests | 7 | Lab 2
15-Oct-08 | Statistical power and effect size | 8 | Assignment 1 due
22-Oct-08 | Correlation and regression | 9 |
29-Oct-08 | One-way independent ANOVA | 11 |
5-Nov-08 | Multiple comparisons | 12.1-12.12 |
12-Nov-08 | Multiple comparisons | 12.1-12.12 |
19-Nov-08 | Two-way ANOVA | 13.1-13.11, 13.14 | Assignment 2 due
26-Nov-08 | Review | |
3-Dec-08 | Exam | |
The Notes column also lists Lab 3 and Lab 4 (dates not recoverable from the transcript).
Winter Term
Date | Title | Readings | Deadlines
7-Jan-09 | Repeated measures ANOVA | 14 |
14-Jan-09 | Two-way mixed design ANOVA | 14 | Lab 5; Deadline for choosing project topic
21-Jan-09 | Reading Week | |
28-Jan-09 | Multiple regression | 15 | Lab 6
4-Feb-09 | The general linear model | 16 | Assignment 3 due, drop date is Feb 1
11-Feb-09 | The binomial distribution | 5.6, 5.8-5.10 | Lab 7
18-Feb-09 | Reading Week – No Classes | |
25-Feb-09 | Chi-square tests | |
4-Mar-09 | Resampling and nonparametric techniques | 18 |
11-Mar-09 | Student Presentations | |
18-Mar-09 | Student Presentations | |
25-Mar-09 | Review | |
1-Apr-09 | Exam | |
Also listed without a recoverable date: Lab 8; Assignment 4 due; reading 6.
Some Background
(Howell Ch. 1)
Variables and Constants
• Constants are properties that never change (e.g., the
speed of light in a vacuum, ~3 × 10^8 m/s).
• Most physiological and psychological parameters of
interest vary considerably
– Between individuals (e.g., intelligence quotient)
– Within individuals (e.g., heart rate)
• Any variable whose variation is somewhat unpredictable
is called a random variable (rv).
Scales of measurement
• Nominal scale: values are categories, having no
meaningful correspondence to numbers.
Scales of measurement
• Ordinal scale: ordering is meaningful, but exact
numerical values (if they exist) are not.
Scales of measurement
• Interval scale: values are numerically meaningful, and
interval between two values is meaningful.
– Example: Celsius temperature scale. It takes the same amount
of energy to raise the temperature of a gram of water from 20 °C
to 21 °C as it does to raise it from 30 °C to 31 °C.
• Ratio scale: ratio of two values is also meaningful.
– Example: Kelvin temperature scale. A gram of H2O at 300 K has
twice the energy of a gram of H2O at 150 K.
– Ratio scales require a 0-point corresponding to a complete lack
of the substance being measured.
• Example: a gram of H2O at 0 K has no heat (particles are
motionless).
Continuous vs Discrete Variables
• A continuous variable may assume any real value within
some range
Continuous vs Discrete Variables
• A discrete variable may assume only a countable
number of values: intermediate values are not
meaningful.
Independent vs Dependent Variables
• Experiments involve independent and dependent variables.
– The independent variable is controlled by the experimenter.
– The dependent variable is measured.
– We seek to detect and model effects of the independent variable on the
dependent variable.
• Example: In a visual search task, subjects are asked to find the
odd-man-out in a display of discrete items (e.g., a horizontal bar
amongst vertical bars).
– The number of items in the display is an independent variable.
– Reaction time is the main dependent variable.
– Typically, we observe a roughly linear relationship between the number
of items and the reaction time.
Experimental vs Correlational Research
• Experimental study:
– Researcher controls the independent variable.
– Seek to detect effects on the dependent variable.
– Direction of causation may be inferred (but may be indirect).
• Correlational study:
– There are no independent or dependent variables.
– No variables are under control of the researcher.
– Seek to find statistical relationships (dependencies) between
variables.
– Direction of causation may not normally be inferred.
Correlational Studies: Examples
Populations vs Samples
• In human science, we typically want to characterize and make
inferences not about a particular person (e.g., Uncle Bob) but about
all people, or all people with a certain property (e.g., all people
suffering from a bipolar disorder).
• These groups of interest are called populations.
• Typically, these populations are too large and inaccessible to study.
• Instead, we study a subset of the group, called a sample.
• In order to make reliable inferences about the population, samples
are ideally randomly selected.
• The population properties of interest are called parameters.
• The corresponding measurements made on our samples are called
statistics. Statistics are approximations (estimates) of parameters.
Different Types of Populations and Samples
• Outside of human science, populations do not necessarily refer to
humans
– e.g. populations may be of bees, algae, quarks, stock prices, pork belly
futures, ozone levels, etc…
• In clinical and social psychology you will often be conducting large-n
studies on human populations.
• In cognitive psychology, you will often be doing small-n within-subject studies involving repeated trials on the same subject.
– Here, you may think of the ‘population’ as being the infinite set of
responses you would obtain were you able to continue the experiment
indefinitely.
– The sample is the set of responses you were able to collect in a finite
number of trials (e.g., 5000) on the same subject.
Summation Notation
Let $X_i$ = number of siblings for respondent $i$
$Y_i$ = number of children for respondent $i$
i | X_i | Y_i
1 | 1 | 2
2 | 2 | 1
3 | 2 | 1
… | … | …
N | 4 | 0
Then
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$
where $N$ = number of respondents in sample
Some Summation Rules
1. Often abbreviate $\sum_{i=1}^{N} X_i$ as $\sum X_i$
2. $\sum (X_i + Y_i) = \sum X_i + \sum Y_i$
since $(X_1 + Y_1) + (X_2 + Y_2) = (X_1 + X_2) + (Y_1 + Y_2)$ (associative property of addition)
Similarly, $\sum (X_i - Y_i) = \sum X_i - \sum Y_i$
3. $\sum C = NC$, where $C$ is a constant,
since adding C to itself N times yields N C's.
4. $\sum CX_i = C\sum X_i$
since $CX_1 + CX_2 = C(X_1 + X_2)$ (multiplication is distributive over addition)
But note that
5. $\sum X_iY_i \ne \sum X_i \sum Y_i$
since $X_1Y_1 + X_2Y_2 \ne (X_1 + X_2)(Y_1 + Y_2) = X_1Y_1 + X_1Y_2 + X_2Y_1 + X_2Y_2$
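The summation rules can be checked numerically. The following is a minimal Python sketch; the lists X and Y and the constant C are made-up illustrative values, not survey data.

```python
# Hypothetical toy scores (not taken from the survey)
X = [1, 2, 2, 4]
Y = [2, 1, 1, 0]
C = 3
N = len(X)

# Rule 2: the sum of (X_i + Y_i) equals sum(X) + sum(Y)
rule2 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

# Rule 3: summing the constant C over N terms gives N * C
rule3 = sum(C for _ in range(N)) == N * C

# Rule 4: a constant factor can be pulled out of the sum
rule4 = sum(C * x for x in X) == C * sum(X)

# Rule 5: the sum of products is NOT the product of sums
sum_of_products = sum(x * y for x, y in zip(X, Y))  # 2 + 2 + 2 + 0 = 6
product_of_sums = sum(X) * sum(Y)                   # 9 * 4 = 36

print(rule2, rule3, rule4, sum_of_products, product_of_sums)
```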
Summary
• What is (are) statistics
• Variables and constants
• Scales of measurement
• Continuous and discrete variables
• Independent and dependent variables
• Experimental and correlational research
• Populations and samples
• Summation Notation
Descriptive Statistics
(Howell, Ch 2)
Frequency Tables
1991 U.S. General Social Survey: Number of Brothers and Sisters
Number of siblings | Frequency | Percent | Valid Percent | Cumulative Percent
0 | 74 | 4.88 | 4.92 | 4.92
1 | 236 | 15.56 | 15.68 | 20.60
2 | 276 | 18.19 | 18.34 | 38.94
3 | 236 | 15.56 | 15.68 | 54.62
4 | 209 | 13.78 | 13.89 | 68.50
5 | 118 | 7.78 | 7.84 | 76.35
6 | 80 | 5.27 | 5.32 | 81.66
7 | 81 | 5.34 | 5.38 | 87.04
8 | 58 | 3.82 | 3.85 | 90.90
9 | 47 | 3.10 | 3.12 | 94.02
10 | 34 | 2.24 | 2.26 | 96.28
11 | 22 | 1.45 | 1.46 | 97.74
12 | 11 | 0.73 | 0.73 | 98.47
13 | 9 | 0.59 | 0.60 | 99.07
14 | 5 | 0.33 | 0.33 | 99.40
15 | 3 | 0.20 | 0.20 | 99.60
16 | 1 | 0.07 | 0.07 | 99.67
17 | 2 | 0.13 | 0.13 | 99.80
18 | 1 | 0.07 | 0.07 | 99.87
21 | 1 | 0.07 | 0.07 | 99.93
26 | 1 | 0.07 | 0.07 | 100.00
Total valid | 1505 | 99.21 | 100.00 |
Missing: DK | 4 | 0.26 | |
Missing: NA | 8 | 0.53 | |
Missing: Total | 12 | 0.79 | |
Total | 1517 | 100.00 | |
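The three percentage columns can be recomputed from the frequencies alone. The sketch below assumes that Percent uses all 1517 respondents as the denominator, while Valid Percent and Cumulative Percent use only the 1505 valid responses; that reading is consistent with the tabulated values. Variable names are illustrative.

```python
from itertools import accumulate

# Frequencies for 0..18, 21 and 26 siblings, copied from the table
values = list(range(19)) + [21, 26]
freq = [74, 236, 276, 236, 209, 118, 80, 81, 58, 47, 34,
        22, 11, 9, 5, 3, 1, 2, 1, 1, 1]
n_missing = 12                      # DK (4) + NA (8)

n_valid = sum(freq)                 # 1505
n_total = n_valid + n_missing       # 1517

# Percent: share of ALL respondents, including missing
percent = [100 * f / n_total for f in freq]
# Valid percent: share of respondents who gave a valid answer
valid_percent = [100 * f / n_valid for f in freq]
# Cumulative percent: running total of valid percent
cumulative_percent = [100 * c / n_valid for c in accumulate(freq)]

for v, f, p, vp, cp in zip(values, freq, percent, valid_percent,
                           cumulative_percent):
    print(f"{v:>2} {f:>4} {p:6.2f} {vp:6.2f} {cp:7.2f}")
```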
Bar Graphs and Histograms
Grouped Frequency Distributions
Statistics Canada 2001 Census: Age of Respondent
X (age) | f
<5 | 581
5-9 | 661
10-14 | 740
15-19 | 701
20-24 | 689
25-29 | 674
30-34 | 731
35-39 | 903
40-44 | 930
45-49 | 838
50-54 | 746
55-59 | 608
60-64 | 434
65-69 | 383
70-74 | 345
75-79 | 288
80-84 | 174
85+ | 97
• What are the apparent limits?
• What are the real limits?
Percentiles and Percentile Ranks
• Percentile: The score at or below which a given % of
scores lie.
• Percentile Rank: The percentage of scores at or below a
given score
Linear Interpolation to Compute Percentile Ranks
What if you have a 23-year-old respondent and would like to know her percentile rank?
Statistics Canada 2001 Census: Age of Respondent
Age | Frequency | Percent | Cumulative Percent
<5 | 581 | 5.5 | 5.5
5-9 | 661 | 6.3 | 11.8
10-14 | 740 | 7.0 | 18.8
15-19 | 701 | 6.7 | 25.5
20-24 | 689 | 6.5 | 32.0
25-29 | 674 | 6.4 | 38.4
30-34 | 731 | 6.9 | 45.4
35-39 | 903 | 8.6 | 54.0
40-44 | 930 | 8.8 | 62.8
45-49 | 838 | 8.0 | 70.8
50-54 | 746 | 7.1 | 77.9
55-59 | 608 | 5.8 | 83.6
60-64 | 434 | 4.1 | 87.8
65-69 | 383 | 3.6 | 91.4
70-74 | 345 | 3.3 | 94.7
75-79 | 288 | 2.7 | 97.4
80-84 | 174 | 1.7 | 99.1
85+ | 97 | 0.9 | 100.0
Total | 10523 | 100.0 |
Let $x$ = age (percentile)
$y$ = percentile rank
Then the linear (affine) interpolation model is:
$$y = ax + b$$
There are 2 unknowns ($a$ and $b$). If we have two known data points $(x_1, y_1)$ and $(x_2, y_2)$ near the value of interest, we can solve for them:
$$y_1 = ax_1 + b$$
$$y_2 = ax_2 + b$$
$$\Rightarrow a = \frac{y_2 - y_1}{x_2 - x_1}$$
Thus
$$y = ax + b$$
$$= ax + y_1 - ax_1$$
$$= y_1 + a(x - x_1)$$
$$= y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1)$$
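The interpolation formula can be applied to the 23-year-old as a minimal Python sketch. It rests on an assumption that the slides do not state: that age is recorded in completed years, so the 20-24 interval has real limits 20 and 25, with cumulative percent 25.5 at age 20 and 32.0 at age 25. A different choice of real limits (for example 19.5 and 24.5) would give a slightly different answer. The function name is illustrative.

```python
def interpolate(x, x1, y1, x2, y2):
    """Linear interpolation: y = y1 + (y2 - y1) / (x2 - x1) * (x - x1)."""
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)

# ASSUMPTION: real limits of the 20-24 interval are 20 and 25.
# Cumulative percent is 25.5 below the interval and 32.0 at its top.
rank_23 = interpolate(23, x1=20, y1=25.5, x2=25, y2=32.0)

print(round(rank_23, 1))  # 29.4
```

Under this assumption, roughly 29% of respondents are 23 or younger.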
Linear Interpolation to Compute Percentiles
Statistics Canada 2001 Census: Age of Respondent (same frequency table as on the preceding slide)
What if you want to know what the median age is?
To compute percentiles, simply swap the x's and y's in the formula:
$$x = x_1 + \frac{x_2 - x_1}{y_2 - y_1}(y - y_1)$$
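The swapped formula can be used to estimate the median age, shown here as a minimal Python sketch. It assumes that age is recorded in completed years, so the 35-39 interval (the one containing the 50th percentile) has real limits 35 and 40, with cumulative percent 45.4 at age 35 and 54.0 at age 40. The slides do not state which real limits were intended, and the function name is illustrative.

```python
def percentile(y, x1, y1, x2, y2):
    """Inverse interpolation: x = x1 + (x2 - x1) / (y2 - y1) * (y - y1)."""
    return x1 + (x2 - x1) / (y2 - y1) * (y - y1)

# ASSUMPTION: real limits of the 35-39 interval are 35 and 40.
# Cumulative percent is 45.4 below the interval and 54.0 at its top.
median_age = percentile(50, x1=35, y1=45.4, x2=40, y2=54.0)

print(round(median_age, 1))  # 37.7
```

Under this assumption the median falls at about 37.7 years, that is, a respondent who has completed 37 years.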
Measures of Central Tendency
• The mode – applies to ratio, interval, ordinal or nominal
scales.
• The median – applies to ratio, interval and ordinal scales
• The mean – applies to ratio and interval scales
AGE: Mean = 37.1, Median = 37, Mode = 41
The Mode
• Defined as the most frequent
value (the peak)
• Applies to ratio, interval, ordinal
and nominal scales
• Sensitive to sampling error
(noise)
• Distributions may be referred to
as unimodal, bimodal or
multimodal, depending upon the
number of peaks
Mode = 41
The Median
• Defined as the 50th percentile
• Applies to ratio, interval and
ordinal scales
• Can be used for open-ended
distributions
Median = 37
The Mean
$$\text{Population mean } \mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$
$$\text{Sample mean } \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
• Applies only to ratio or interval scales
• Sensitive to outliers
$\bar{X} = 37.1$
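The three measures of central tendency can be compared on a small sample using Python's standard `statistics` module. The scores below are hypothetical and chosen to include one outlier, to show how the mean is pulled toward it while the median and mode are not.

```python
import statistics

# Hypothetical scores with one outlier (40)
scores = [2, 3, 3, 5, 7, 10, 40]

mean = statistics.mean(scores)      # 70 / 7 = 10
median = statistics.median(scores)  # middle of the sorted scores = 5
mode = statistics.mode(scores)      # most frequent value = 3

print(mean, median, mode)
```

Dropping the outlier leaves the median near the middle of the data but moves the mean from 10 to 5.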
Properties of the Mean
1. Suppose a constant C is added (or subtracted) to every score in your sample:
$X_i \to X_i + C$
Then the mean also increases (decreases) by C:
$\bar{X} \to \bar{X} + C$
2. Suppose every score in your sample is multiplied (divided) by a constant C:
$X_i \to CX_i$
Then the mean is also multiplied (divided) by C:
$\bar{X} \to C\bar{X}$
3. $\sum (X_i - \bar{X}) = 0$
Properties of the Mean (Cntd…)
Least-squares property: the mean minimizes the sum of squared deviations:
$$\sum (X_i - \bar{X})^2 \le \sum (X_i - X)^2 \quad \forall X$$
Proof:
$\sum (X_i - X)^2$ has a minimum where
$$\frac{d}{dX}\sum (X_i - X)^2 = 0 \quad \text{and} \quad \frac{d^2}{dX^2}\sum (X_i - X)^2 > 0$$
$$\frac{d}{dX}\sum (X_i - X)^2 = -2\sum (X_i - X) = 0 \;\Rightarrow\; X = \frac{1}{N}\sum X_i = \bar{X}$$
$$\frac{d^2}{dX^2}\sum (X_i - X)^2 = 2N > 0$$
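The properties of the mean can be checked numerically. The sketch below uses hypothetical scores and a coarse grid of candidate values around the mean; it illustrates the least-squares property but is not a substitute for the proof.

```python
X = [2, 3, 3, 5, 7]            # hypothetical scores
N = len(X)
mean = sum(X) / N              # 4.0
C = 10

def mean_of(xs):
    return sum(xs) / len(xs)

# Properties 1 and 2: shifting or scaling every score shifts or scales the mean
shifted_mean = mean_of([x + C for x in X])   # 14.0
scaled_mean = mean_of([C * x for x in X])    # 40.0

# Property 3: deviations from the mean sum to zero
sum_dev = sum(x - mean for x in X)

def ss(c):
    """Sum of squared deviations of the scores about the value c."""
    return sum((x - c) ** 2 for x in X)

# Least-squares property: among candidates from mean - 2 to mean + 2
# (step 0.1), the sum of squares is smallest at the mean itself
candidates = [mean + k / 10 for k in range(-20, 21)]
best = min(candidates, key=ss)

print(shifted_mean, scaled_mean, sum_dev, best, ss(mean))
```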
Measures of Variability (Dispersion)
• Range – applies to ratio, interval, ordinal scales
• Semi-interquartile range – applies to ratio, interval,
ordinal scales
• Variance (standard deviation) – applies to ratio, interval
scales
Range
• Interval between lowest and highest values
• Generally unreliable – changing one value (highest or
lowest) can cause large change in range.
Range = 79 drinks
Semi-Interquartile Range
• The interquartile range is the interval between the first and third
quartile, i.e. between the 25th and 75th percentile.
• The semi-interquartile range is half the interquartile range.
• Can be used with open-ended distributions
• Unaffected by extreme scores
N: Valid = 19769, Missing = 6004
Median = 4
Percentiles: 25th = 2, 50th = 4, 75th = 7
SIQ = 2.5 drinks
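The contrast between the range and the semi-interquartile range can be sketched in Python. The scores below are hypothetical (not the drinks data), and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one of several conventions; other software may compute quartiles slightly differently.

```python
import statistics

def spread(scores):
    """Return the range and semi-interquartile range of a list of scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"range": max(scores) - min(scores), "siq": (q3 - q1) / 2}

scores = [0, 1, 2, 3, 4, 5, 7, 9, 79]    # hypothetical, one extreme score
altered = [0, 1, 2, 3, 4, 5, 7, 9, 20]   # same data, extreme score reduced

print(spread(scores))    # range 79, SIQ 2.5
print(spread(altered))   # range 20, SIQ 2.5
```

Changing only the highest score changes the range from 79 to 20 while the semi-interquartile range stays the same.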
Population Variance and Standard Deviation
$X_i - \mu$ is known as the deviation of sample $i$
Thus $SS = \sum (X_i - \mu)^2$ is known as the sum of squared deviations.
The population variance $\sigma^2$ is simply the mean squared deviation:
$$\sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2$$
The population standard deviation is simply the square root of the variance:
$$\sigma = \sqrt{\frac{1}{N}\sum (X_i - \mu)^2}$$
The standard deviation is particularly sensitive to outliers, due to the squaring operation.
Sample Variance and Standard Deviation
$X_i - \bar{X}$ is known as the deviation of sample $i$
Thus $SS = \sum (X_i - \bar{X})^2$ is known as the sum of squared deviations.
The mean squared sample deviation
$$\frac{1}{N}\sum (X_i - \bar{X})^2$$
is a biased estimator of the population variance: it tends to underestimate $\sigma^2$.
A minor modification makes the sample variance $s^2$ unbiased:
$$s^2 = \frac{1}{N-1}\sum (X_i - \bar{X})^2$$
The corrected sample standard deviation is given by
$$s = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$
$s$ is not an unbiased estimator of $\sigma$, but is close enough for most purposes.
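The bias of the mean squared sample deviation can be seen exactly on a tiny example. The sketch below assumes a hypothetical population of four values and enumerates every equally likely sample of size 2 drawn with replacement, then averages the two variance estimates over all samples.

```python
from itertools import product

population = [1, 2, 3, 4]   # hypothetical population
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # 1.25

n = 2
# All 16 equally likely samples of size 2, drawn with replacement
samples = list(product(population, repeat=n))

def ss(sample):
    """Sum of squared deviations about the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

# Average of each estimator over every possible sample
mean_biased = sum(ss(s) / n for s in samples) / len(samples)          # divide by N
mean_unbiased = sum(ss(s) / (n - 1) for s in samples) / len(samples)  # divide by N - 1

print(sigma2, mean_biased, mean_unbiased)  # 1.25 0.625 1.25
```

Dividing by N underestimates the population variance on average, while dividing by N - 1 recovers it.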
Degrees of Freedom
The degrees of freedom $df$ is the number of independent measurements
available for estimating a population parameter.
The calculation of $s^2$ involves $\bar{X}$. Knowing $\bar{X}$ and $N - 1$ of the sample values
allows us to infer the value of the remaining sample value. Thus only
$N - 1$ of the sample values are independent, and $df = N - 1$.
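A small Python illustration of why only N - 1 values are free once the sample mean is known. The numbers are hypothetical.

```python
# Hypothetical sample of N = 5 scores whose mean is known to be 4.0
N = 5
mean = 4.0
known = [2, 3, 3, 5]          # N - 1 of the sample values

# The sample values must sum to N * mean, so the last one is forced
last = N * mean - sum(known)

print(last)  # 7.0
```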
Computational Formulas for Variance
The deviational formula for the sum of squares: $SS = \sum (X_i - \bar{X})^2$
More efficient to use the computational formula: $SS = \sum X_i^2 - N\bar{X}^2$
Why are these equivalent?
$$\sum (X_i - \bar{X})^2 = \sum (X_i^2 - 2X_i\bar{X} + \bar{X}^2)$$
$$= \sum X_i^2 - 2\bar{X}\sum X_i + \sum \bar{X}^2$$
$$= \sum X_i^2 - 2N\bar{X}^2 + N\bar{X}^2$$
$$= \sum X_i^2 - N\bar{X}^2$$
Thus
$$s^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$$
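The equivalence of the two formulas can be confirmed on a small hypothetical sample, with Python's `statistics.variance` (which divides by N - 1) as a cross-check.

```python
import statistics

X = [2, 3, 3, 5, 7]     # hypothetical scores
N = len(X)
mean = sum(X) / N       # 4.0

# Deviational formula: sum of squared deviations about the mean
ss_deviational = sum((x - mean) ** 2 for x in X)           # 16.0

# Computational formula: sum of squares minus N times the squared mean
ss_computational = sum(x ** 2 for x in X) - N * mean ** 2  # 96 - 80 = 16.0

s2 = ss_computational / (N - 1)                            # 4.0

print(ss_deviational, ss_computational, s2, statistics.variance(X))
```

One caveat when doing this in floating-point arithmetic: if the mean is very large relative to the spread, the computational formula subtracts two nearly equal numbers and can lose precision, so the deviational form is often the safer choice in software.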
Properties of the Standard Deviation
1. Suppose a constant C is added (or subtracted) to every score in your sample:
$X_i \to X_i + C$
Then the standard deviation does not change.
Properties of the Standard Deviation (cntd…)
2. Suppose every score in your sample is multiplied (divided) by a constant C :
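$X_i \to CX_i$
Then the standard deviation is also multiplied (divided) by C, or more precisely by $|C|$, since a standard deviation cannot be negative:
$s \to |C|s$
Proof:
$$s_{old} = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$
$$\Rightarrow s_{new} = \sqrt{\frac{1}{N-1}\sum (CX_i - C\bar{X})^2} = |C|\sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2} = |C|\,s_{old}$$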
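Both properties of the standard deviation can be checked with `statistics.stdev` (which uses the N - 1 denominator). The scores and constants below are hypothetical.

```python
import statistics

X = [2, 3, 3, 5, 7]                                   # hypothetical scores
s = statistics.stdev(X)                               # 2.0

# Property 1: adding a constant leaves the standard deviation unchanged
s_shifted = statistics.stdev([x + 10 for x in X])     # 2.0

# Property 2: multiplying by C multiplies the standard deviation by |C|
s_scaled = statistics.stdev([3 * x for x in X])       # 6.0
s_negated = statistics.stdev([-3 * x for x in X])     # also 6.0, not -6.0

print(s, s_shifted, s_scaled, s_negated)
```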
Standard Deviation Example
$\bar{X} = 5.7$ drinks
$s = 5.8$ drinks
cf. SIQ = 2.5 drinks
range = 79 drinks
Skew
• The mean and median are identical for symmetric distributions.
• Skew tends to push the mean away from the median, toward the tail (but not always).
Median = 3
Mean = 6.7
Skewness
$$\text{Sample skewness} = \frac{N}{N-2}\,\frac{\sum (X_i - \bar{X})^3}{(N-1)s^3}$$
• Properties of skewness
– Positive for positive skew (tail to the right)
– Negative for negative skew (tail to the left)
– Dimensionless
– Invariant to shifting or scaling data (adding or multiplying
constants)
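The skewness formula translates directly into Python. The function name and the test data below are illustrative, and scaling is shown only for a positive constant.

```python
import math

def sample_skewness(xs):
    """Sample skewness: N/(N-2) * sum((X_i - mean)^3) / ((N-1) * s^3)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / (n - 2) * sum((x - mean) ** 3 for x in xs) / ((n - 1) * s ** 3)

symmetric = [1, 2, 3, 4, 5]                 # hypothetical, no skew
right_tail = [1, 1, 2, 2, 3, 10]            # hypothetical, tail to the right
left_tail = [-x for x in right_tail]        # mirror image, tail to the left
rescaled = [3 * x + 7 for x in right_tail]  # shifted and scaled copy

print(sample_skewness(symmetric))    # zero for symmetric data
print(sample_skewness(right_tail))   # positive
print(sample_skewness(left_tail))    # negative
print(sample_skewness(rescaled))     # same as right_tail, up to rounding
```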
Dealing with Outliers
• Trimming:
– Throw out the top and bottom k% of values (k=5%, for example).
– May be justified if there is evidence for a confounding process
interfering with the dependent variable being studied
• Example: participant blinks during presentation of a visual stimulus
• Example: participant misunderstands a question on a
questionnaire.
• Transforming
– Scores are transformed by some function (e.g., log, square root)
– Often done to reduce or eliminate skewness
Log-Transforming Data
skewness=0.08
skewness=0.67
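The effect of a log transform on skewness can be sketched in Python. The scores are hypothetical, strictly positive (so the logarithm is defined), and are not the data behind the skewness values quoted on the slide.

```python
import math

def sample_skewness(xs):
    """Sample skewness: N/(N-2) * sum((X_i - mean)^3) / ((N-1) * s^3)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / (n - 2) * sum((x - mean) ** 3 for x in xs) / ((n - 1) * s ** 3)

# Hypothetical right-skewed scores
raw = [1, 2, 2, 3, 4, 6, 10, 20, 50]
logged = [math.log(x) for x in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)

print(round(skew_raw, 2), round(skew_log, 2))
```

For these scores the log transform pulls in the long right tail, so the skewness of the transformed data is much closer to zero.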
End of Lecture 1
Sept 10, 2008
Kurtosis
$$\text{Sample kurtosis} = \frac{N(N+1)}{(N-2)(N-3)}\,\frac{\sum (X_i - \bar{X})^4}{(N-1)s^4} - 3\,\frac{(N-1)^2}{(N-2)(N-3)}$$
kurtosis>0: leptokurtic (Laplacian)
kurtosis=0: mesokurtic (Gaussian)
kurtosis<0: platykurtic
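The kurtosis formula can be written out in Python in the same way. The function name and both data sets are illustrative: one flat sample and one with most scores at the centre plus two extreme scores.

```python
import math

def sample_kurtosis(xs):
    """Sample kurtosis with the small-sample correction terms from the slide."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    fourth = sum((x - mean) ** 4 for x in xs)
    term1 = n * (n + 1) / ((n - 2) * (n - 3)) * fourth / ((n - 1) * s ** 4)
    term2 = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return term1 - term2

flat = [1, 2, 3, 4, 5]                        # hypothetical, evenly spread
heavy = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]     # hypothetical, heavy tails

print(round(sample_kurtosis(flat), 2))    # -1.2 (platykurtic)
print(round(sample_kurtosis(heavy), 2))   # 4.5 (leptokurtic)
```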
Summary
• Measures of central tendency
• Measures of dispersion
• Skew
• Kurtosis