Lincoln Jiang
Statistical Consultant
Western Michigan University
The Graduate College
Graduate Center for Research and Retention
Definition of Statistics
 Statistics is the art of making numerical conjectures
about puzzling questions.
--- Statistics Fourth Edition
by Freedman
Definition of Statistics
 Statistics is a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and
presentation of data.
---From Wikipedia
Basic Terms
 Variables
 Characteristics that can take on any number of different
values
 Values
 Possible numbers or categories that a variable can
have
 Scores
 A particular person’s value on a variable
Types of Data
 Qualitative data
--nonnumeric
eg: types of material {straw, sticks, bricks}
 Quantitative
-- numeric
Discrete data
--numeric data that have a finite or countable number of
possible values
eg: counting numbers, {1,2,3,4,5}
Continuous data
--numeric data that can take any value in an interval, so
the number of possible values is uncountably infinite
eg: Real numbers
Types of Scale
 Nominal---has no order and thus only gives names or labels to
various categories.
Variables assessed on a nominal scale are called categorical variables
 Ordinal---has order, but the interval between measurements is not
meaningful.
 Interval---has meaningful intervals between measurements, but
there is no true starting point (zero).
Eg: temperature on the Celsius scale
 Ratio---the highest level of measurement. Ratios between
measurements as well as intervals are meaningful because there is a
true starting point (zero).
Eg: length, time, plane angle, energy
EX
Collecting Data
“Twenty-five percent of Americans doubt that the
Holocaust ever occurred.”
--- a news report in 1993
 Census
 Sample Survey
Why Study Samples?
 Often not practical to study an entire population
 Instead, researchers attempt to make samples
representative of populations
 Random selection
 Each member of population has an equal chance of being sampled
 Good but difficult
 Haphazard selection
 Take steps to ensure samples do not differ from the population in
systematic ways
 Not as good but much more practical
Sample vs. Population
 Sample
 Relatively small number of
instances that are studied
in order to make
inferences about a larger
group from which they
were drawn
 Population
 The larger group from
which a sample is drawn
Sample vs. Population Examples
 Population
a. pot of beans
b. larger circle
c. histogram
Sample
a. spoonful
b. smaller circle
c. shaded
scores
Sampling Methods
 Simple Random Sampling
 Systematic sampling
 Stratified sampling
 Cluster sampling
 Other samplings: Quota sampling, Mechanical sampling and so on
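The first three methods listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 ID numbers, the two strata, and the function names are all hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Simple random sampling: every member has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Systematic sampling: take every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(strata, n_per_stratum, seed=0):
    """Stratified sampling: draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

population = list(range(1, 101))  # hypothetical population of 100 ID numbers
srs = simple_random_sample(population, 10)
sys_sample = systematic_sample(population, step=10, start=4)
strata = {"low": population[:50], "high": population[50:]}  # hypothetical strata
strat = stratified_sample(strata, 5)
```

In practice the starting point of a systematic sample is usually chosen at random; it is fixed here to keep the example easy to follow.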
After Collecting…….Before
Analyzing….
Frequency Tables
 Frequency table
 Shows how many times each value was used for a
particular variable
 Percentage of scores of each value
 Grouped frequency table
 Range of scores in each of several equally sized intervals
Steps for Making a
Frequency Table
1. Make a list of each possible
value, from highest to lowest
2. Go one by one through the data,
making a mark for each data point
next to its value on the list
3. Make a table showing how many
times each value on your list
was used
4. Figure the percentage of data for
each value
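The four steps above can be sketched in Python as follows. The ten stress ratings and the function name `frequency_table` are made up for illustration; they are not the data behind the slides.

```python
from collections import Counter

def frequency_table(scores, possible_values):
    """Return (value, frequency, percent) rows, from highest value to lowest."""
    counts = Counter(scores)          # step 2: tally each score
    total = len(scores)
    rows = []
    for value in sorted(possible_values, reverse=True):  # step 1: list every possible value
        freq = counts[value]          # step 3: a Counter gives 0 for unused values
        percent = round(100 * freq / total, 1)           # step 4: percentage
        rows.append((value, freq, percent))
    return rows

scores = [8, 7, 7, 10, 8, 7, 5, 8, 7, 9]   # hypothetical stress ratings on a 0-10 scale
table = frequency_table(scores, range(0, 11))
for value, freq, percent in table:
    print(f"{value:>6} {freq:>9} {percent:>7}")
```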
A Frequency Table
Stress rating   Frequency   Percent (%)
10              14           9.3
9               15           9.9
8               26          17.2
7               31          20.5
6               13           8.6
5               18          11.9
4               16          10.6
3               12           7.9
2                3           2.0
1                1           0.7
0                2           1.3
A Grouped Frequency Table
Stress rating interval   Frequency   Percent (%)
10-11                    14           9
8-9                      41          27
6-7                      44          29
4-5                      34          23
2-3                      15          10
0-1                       3           2
Frequency Graphs
 Histogram
 Depicts
information from a
frequency table or a
grouped frequency
table as a bar graph
EX2
Shapes of Frequency Distributions
 Unimodal
 Having one peak
 Bimodal
 Having two peaks
 Multimodal
 Having two or more
peaks
 Rectangular
 Having no peaks
Symmetrical vs. Skewed Frequency
Distributions
 Symmetrical distribution
 Approximately equal numbers of observations above
and below the middle
 Skewed distribution
 One side is more spread out than the other, like a tail
 Direction of the skew
 Right or left (i.e., positive or negative)
 Side with the fewer scores
 Side that looks like a tail
Skewed Frequency Distributions
 Skewed right (b)
 Fewer scores right of the peak
 Positively skewed
 Can be caused by a floor effect
 Skewed left (c)
 Fewer scores left of the peak
 Negatively skewed
 Can be caused by a ceiling effect
Ceiling and Floor Effects
 Ceiling effects
 Occur when scores can go no
higher than an upper limit and
“pile up” at the top
 e.g., scores on an easy exam, as
shown on the right
 Causes negative skew
 Floor effects
 Occur when scores can go no
lower than a lower limit and pile
up at the bottom
 e.g., household income
 Causes positive skew
Kurtosis
 Degree to which tails of the distribution are “heavy” or
“light”
 Heavy tails = higher kurtosis (b)
 Light tails = lower kurtosis (c)
 Normal distribution = zero excess kurtosis (a)
Measures of Central Tendency
 Central tendency = representative or typical value in a
distribution
 mean, the median and the mode
can measure central tendency.
 Mean
 Computed by
 Summing all the scores (sigma, Σ)
 Dividing by the number of scores (N)
$M = \frac{\sum X}{N}$
Measures of Central Tendency
 Mean
 Often the best measure of central tendency
 Most frequently reported in research articles
 Think of the mean as the “balancing point” of the
distribution
Measures of Central Tendency
 Mode
 Most common single number in a distribution
 If distribution is symmetrical and unimodal, the mode =
the mean
 Typical way of describing central tendency of a nominal
variable
Measures of Central Tendency
 Median
 Middle value in a group of scores
 Point at which
 half the scores are above
 half the scores are below
 Unaffected by extremity of individual scores
 Unlike the mean
 Preferable as a measure of central tendency when a
distribution has some extreme scores
Measures of Central Tendency
 Examples of means as
balancing points of various
distributions
 There does not have to be a score
exactly at the mean
 Note that a score’s distance
from the balancing point
matters in addition to the
number of scores above or
below it
Measures of Central Tendency
 Examples of means and
modes
Measures of Central Tendency
 Steps to computing the median
1. Line up scores from highest to lowest
2. Figure out how many scores to the middle
 Add 1 to number of scores
 Divide by 2
3. Count up to middle score
 If there is 1 middle score, that’s the median
 If there are 2 middle scores, median is their average
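The three steps above translate almost directly into code. The sketch below is one possible reading of them; the function name `median_by_steps` and the example scores are illustrative assumptions.

```python
def median_by_steps(scores):
    """Median following the slide's steps: sort, find the middle position, average if needed."""
    if not scores:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(scores, reverse=True)   # step 1: highest to lowest
    middle = (len(ordered) + 1) / 2          # step 2: (N + 1) / 2
    if middle.is_integer():                  # step 3: exactly one middle score
        return ordered[int(middle) - 1]      # positions are 1-based, list indexes 0-based
    upper = ordered[int(middle) - 1]         # two middle scores: take both neighbours
    lower = ordered[int(middle)]
    return (upper + lower) / 2

odd_result = median_by_steps([3, 1, 2])            # one middle score
even_result = median_by_steps([4, 1, 3, 2])        # two middle scores
outlier_result = median_by_steps([1, 2, 3, 100])   # the extreme score does not move the median
```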
Ex3
Measures of Variation
 Variation = how spread out
data is
 Variance
 Measure of variation
 Average of each score’s
squared deviations
(differences) from the mean
Measures of Variation
 Steps to computing the variance
 1. Subtract the mean from each data point: $x_i - \bar{x}$
 2. Square each deviation value: $(x_i - \bar{x})^2$
 3. Add up the squared deviation scores: $\sum (x_i - \bar{x})^2$
 4. Divide sum by the number of scores: $\frac{\sum (x_i - \bar{x})^2}{n}$
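A minimal Python sketch of these four steps, using made-up scores. It divides by n, as the slide does; note that many software packages divide by n - 1 when estimating a population variance from a sample, so their output can differ slightly.

```python
def variance(scores):
    """Variance as on the slide: mean of the squared deviations (divides by n)."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]   # step 1: subtract the mean
    squared = [d ** 2 for d in deviations]    # step 2: square each deviation
    total = sum(squared)                      # step 3: add them up
    return total / n                          # step 4: divide by the number of scores

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5
var = variance(scores)
```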
Ex4
Measures of Variation
 Standard deviation
 Another measure of variation, roughly the average
amount that scores differ from the mean
 Used more widely than variance
 Abbreviated as “SD”
 To compute standard deviation
 Compute variance
 Simply take the square root
 SD is square root of variance
 Variance is SD squared
$SD = \sqrt{\text{Variance}}$
$\text{Variance} = SD^2$
Two Branches of
Statistical Methods
 Descriptive statistics
 Summarize and describe a group of numbers such as the
results of a research study
 Inferential statistics
 Allow researchers to draw conclusions and inferences
that are based on the numbers from a research study,
but go beyond these numbers
The Normal Curve
 Often seen in social and behavioral science research
and in nature generally
 Particular characteristics
 Bell-shaped
 Unimodal
 Symmetrical
 Average tails
Bean Machine
Z Scores
 Indicates how many standard deviations an
observation is above or below the mean
 If Z > 0, the observation is above the mean
 If Z < 0, the observation is below the mean
 Z score of 1.0 is one SD above the mean
 Z score of -2.5 is two-and-a-half SDs below the mean
 Z score of 0 is at the mean
$Z = \frac{X - M}{SD}$
Z Scores
 When values in a distribution are converted to Z
scores, the distribution will have
 Mean of 0
 Standard deviation of 1
 Useful
 Allows variables to be compared to one another
 Provides a generalized standard of comparison
Z Scores
 To compute a Z score,
subtract the mean from a
raw score and divide by
the SD
 To convert a Z score back
to a raw score, multiply
the Z score by the SD
and then add the mean
$Z = \frac{X - M}{SD}$
$X = (Z)(SD) + M$
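Both conversions are one-liners in code. The sketch below uses a hypothetical distribution with mean 5 and SD 2; the function names are illustrative.

```python
def z_score(x, mean, sd):
    """Raw score to Z score: subtract the mean, divide by the SD."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Z score back to raw score: multiply by the SD, add the mean."""
    return z * sd + mean

mean, sd = 5, 2                   # hypothetical mean and standard deviation
z = z_score(7, mean, sd)          # a raw score of 7 is one SD above the mean
x = raw_score(-2.5, mean, sd)     # two-and-a-half SDs below the mean
round_trip = raw_score(z_score(6.5, mean, sd), mean, sd)
```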
Ex5
Confidence Interval
 A confidence interval (CI)
is a particular kind of
interval estimate of a
population parameter.
 How likely the interval is
to contain the parameter is
determined by the
confidence level
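As a rough illustration, the sketch below computes a confidence interval for a population mean. It assumes a normal approximation (critical value of about 1.96 for 95%); for small samples a critical value from the t distribution would normally be used instead. The sample values and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for the mean: M +/- z * (s / sqrt(n))."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)        # stdev() divides by n - 1
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 when confidence = 0.95
    return m - z * standard_error, m + z * standard_error

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.3, 4.7, 5.0]   # hypothetical measurements
low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
```

A higher confidence level gives a wider interval for the same data.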
Animation
ex6
 "95% confidence interval"
Correlation
 A statistic for describing the relationship between two
variables
 Examples
 Price of a bottle of wine and its quality
 Hours of studying and grades on a statistics exam
 Income and happiness
 Caffeine intake and alertness
Graphing Correlations on a Scatter
Diagram
 Scatter diagram
 Graph that shows the degree and
pattern of the relationship between
two variables
 Horizontal axis
 Usually the variable that does the
predicting
 e.g., price, studying, income, caffeine
intake
 Vertical axis
 Usually the variable that is predicted
 e.g., quality, grades, happiness,
alertness
Graphing Correlations on a Scatter
Diagram
 Steps for making a
scatter diagram
1. Draw axes and assign
variables to them
2. Determine the range of
values for each variable
and mark the axes
3. Mark a dot for each
person’s pair of scores
Correlation
 Linear correlation
 Pattern on a scatter diagram is
a straight line
 Example above
 Curvilinear correlation
 More complex relationship
between variables
 Pattern in a scatter diagram is
not a straight line
 Example below
Correlation
 Positive linear correlation
 High scores on one variable
matched by high scores on
another
 Line slants up to the right
 Negative linear correlation
 High scores on one variable
matched by low scores on
another
 Line slants down to the right
Correlation
 Zero correlation
 No line, straight or otherwise,
can be fit to the relationship
between the two variables
 Two variables are said to be
“uncorrelated”
Correlation Review
a. Negative linear
correlation
b. Curvilinear correlation
c. Positive linear
correlation
d. No correlation
Correlation Coefficient
 Correlation coefficient, r, indicates the
precise degree of linear correlation
between two variables
 Computed by taking “cross-products”
of Z scores
 Multiply Z score on one variable by Z score
on the other variable
 Compute average of the resulting products
 Can vary from
 -1 (perfect negative correlation)
 through 0 (no correlation)
 to +1 (perfect positive correlation)

$r = \frac{\sum Z_X Z_Y}{N}$
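The cross-product recipe can be sketched as follows. The standard deviation here divides by N, matching the variance formula used earlier in these slides; the data and function names are illustrative assumptions.

```python
from math import sqrt

def z_scores(values):
    """Convert a list of raw scores to Z scores (SD computed with N in the denominator)."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

def correlation(xs, ys):
    """Correlation coefficient r: the average cross-product of paired Z scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    cross_products = [a * b for a, b in zip(zx, zy)]
    return sum(cross_products) / len(xs)

hours = [1, 2, 3, 4, 5]                 # hypothetical hours of studying
grades_up = [52, 60, 68, 76, 84]        # rises in a perfect straight line
grades_down = [84, 76, 68, 60, 52]      # falls in a perfect straight line
r_positive = correlation(hours, grades_up)
r_negative = correlation(hours, grades_down)
```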
Linear Correlation Examples
Correlation and Causality
 When two variables are
correlated, three possible
directions of causality
 X->Y
 X<-Y
 X<-Z->Y
 Inherent ambiguity in
correlations
 Knowing that two variables are
correlated tells you nothing
about their causal relationship
Prediction
 Correlations can be used to make predictions about
scores
 Predictor
 X variable
 Variable being predicted from
 Criterion
 Y variable
 Variable being predicted
 Sometimes called “regression”
Multiple Correlation and
Multiple Regression
 Multiple correlation
 Association between criterion variables and two or more
predictor variables
 Multiple regression
 Making predictions about criterion variables based on
two or more predictor variables
 Unlike prediction from one variable, standardized
regression coefficient is not the same as the ordinary
correlation coefficient
Proportion of Variance
Accounted For
 Correlation coefficients
 Indicate strength of a linear relationship
 Cannot be compared directly
 e.g., an r of .40 is more than twice as strong as an r of .20
 To compare correlation coefficients, square them
 An r of .40 yields an $r^2$ of .16; an r of .20 an $r^2$ of .04
 Squared correlation indicates the proportion of variance
on the criterion variable accounted for by the predictor
variable
 R-square
Most Commonly Used Statistical Techniques
Linear Regression (Predicts the value of one numerical
variable given another variable)
- How much does the maximum legibility distance of
Highway signs decrease when age is increased?
Data on winning bid price for 12
Saturn cars on eBay in July 2002
• Simple linear regression is a data analysis
technique that tries to find a linear pattern
in the data.
•In linear regression, we use all of the data
to calculate a straight line which may be
used to predict Price based on Miles.
• Since Miles is used to predict Price, Miles
is called an `Explanatory (Independent)
Variable' while Price is called a `Response
(Dependent) Variable'.
•The slope of the line is -.05127, which means that predicted Price tends to drop 5
cents for every additional mile driven, or about $512.70 for every 10,000 miles.
•The intercept (or Y-intercept) of the line is $8136; this should not be interpreted
as the predicted price of a car with 0 mileage because the data provides information
only for Saturn cars between 9,300 miles and 153,260 miles
•We can now use the line to predict the selling price of a car with 60000 miles.
What is the height or Y value of the line at X=60000? The answer is
$\hat{Y} = 8136 - 0.05127 \times 60000 \approx 5059.80$ dollars.
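The sketch below shows how such a line might be fitted and used. The 12 eBay data points are not reproduced in the slides, so `fit_line` is demonstrated on a few made-up points; only the slope (-0.05127) and intercept (8136) are taken from the text above. Function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Height of the fitted line at x."""
    return intercept + slope * x

# Made-up points lying exactly on y = 1 + 2x, just to exercise fit_line
demo_slope, demo_intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Slope and intercept quoted on the slide for the Saturn data
SLOPE = -0.05127
INTERCEPT = 8136
predicted_price = predict(60000, SLOPE, INTERCEPT)   # roughly 5059.80 dollars
```

The prediction is only sensible for mileages inside the observed range of 9,300 to 153,260 miles.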
Most Commonly Used Statistical Techniques
 T-test (for the means)
- What is the mean time that college students watch TV
per day?
- What is the mean pulse rate of women?
Hypothesis Testing
 Procedure for deciding whether the outcome of a study
supports a particular theory or practical innovation
Core Logic of Hypothesis Testing
 Approach can seem curious or even backwards
 Researcher considers the probability that the
experimental procedure had no effect and that the
observed result could have occurred by chance alone
 If that probability is sufficiently low, researcher will…
 Reject the notion that experimental procedure had no effect
 Affirm the hypothesis that the procedure did have an effect
The Null Hypothesis and the
Research Hypothesis
 Null hypothesis (H0)
 Opposite of desired result
 Usually that manipulation had no effect
 Research hypothesis (H1)
 Also called the “alternative hypothesis”
 Opposite of the null hypothesis
 What the experimenter desired or expected all along—
that the manipulation did have an effect
One-tailed vs. Two-tailed
Hypothesis Tests
 Directional prediction
 Researcher expects experimental procedure to have an
effect in a particular direction
 One-tailed significance tests may be used
 Nondirectional prediction
 Researcher expects experimental procedure to have an
effect but does not predict a particular direction
 Two-tailed significance test appropriate
 Takes into account that the sample could be extreme at
either tail of the comparison distribution
One-tailed vs. Two-tailed
Hypothesis Tests
 Two-tailed tests
 More conservative than one-tailed tests
 Some believe that two-tailed tests should always be
used, even when an experimenter makes a directional
prediction
Significance Level Cutoffs for One- and Two-Tailed Tests
 The .05 significance level
 The .01 significance level
Decision Errors
 When the right procedure leads to the wrong
conclusion
 Type I Error
 Reject the null hypothesis when it is true
 Conclude that a manipulation had an effect when in fact
it did not
 Type II Error
 Fail to reject the null when it is false
 Conclude that a manipulation did not have an effect
when in fact it did
P-value
 The p-value is the probability of obtaining a result at least as
extreme as the one that was actually observed,
assuming that the null hypothesis is true.
 Frequent misunderstandings
 For more details, please refer to Wikipedia.
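To make the definition concrete, the sketch below computes p-values for a test statistic that is assumed to follow a standard normal distribution when the null hypothesis is true (a Z test). That assumption, and the function names, are illustrative and not part of the slides.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def one_tailed_p_value(z):
    """Probability of a Z at least this large (upper tail) if the null is true."""
    return 1 - STANDARD_NORMAL.cdf(z)

def two_tailed_p_value(z):
    """Probability of a Z at least this extreme in either direction if the null is true."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

p_two = two_tailed_p_value(1.96)   # close to .05
p_one = one_tailed_p_value(1.96)   # close to .025
```

For the same observed Z, the two-tailed p-value is twice the one-tailed value, which is why two-tailed tests are the more conservative choice.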
Decision Errors
 Setting a strict significance level (e.g., p < .001)
 Decreases the possibility of committing a Type I error
 Increases the possibility of committing a Type II error
 Setting a lenient significance level (e.g., p < .10)
 Increases the possibility of committing a Type I error
 Decreases the possibility of committing a Type II error
Test Statistic
 Value computed from sample information
 Basis for rejecting / not rejecting the null hypothesis
 Used to compute the p-value
T-test
 A t-test is most
commonly applied when
the test statistic would
follow a normal
distribution if the value
of a scaling term in the
test statistic were known.
When the scaling term is
unknown and is replaced
by an estimate based on
the data, the test statistic
follows a Student's t
distribution.
t-test
 One-sample t test
 Two-sample t test
 Independent two-sample
 Equal sample size, equal variance
 Unequal sample size, equal variance
 Dependent two-sample
The Hypothesis Testing Process
1. Restate the research question as a research
hypothesis and a null hypothesis about the
populations
2. Set the level of significance, α.
3. Collect the sample and compute the test statistic.
4. Assume H0 is true, compute the p-value.
5. If p-value < α, reject H0.
6. State your conclusion.
SUMMARY OF HYPOTHESIS TESTS
Ex7,8
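A one-sample t test can be walked through the process roughly as follows. The pulse rates and the null value of 70 are hypothetical. Instead of computing a p-value, this sketch compares the t statistic with the tabled two-tailed cutoff for df = 9 at the .05 level (2.262), which leads to the same reject / do not reject decision.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: population mean equals mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # SD estimated from the data (n - 1)
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1

# H0: mean pulse rate is 70; H1: mean pulse rate differs from 70
sample = [72, 75, 78, 80, 69, 74, 77, 81, 73, 76]   # hypothetical pulse rates
mu0 = 70
alpha = 0.05                 # level of significance
t_critical = 2.262           # two-tailed cutoff from a t table, df = 9, alpha = .05

t, df = one_sample_t(sample, mu0)
reject_null = abs(t) > t_critical
conclusion = "reject H0" if reject_null else "fail to reject H0"
```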
Most Commonly Used Statistical Techniques
 Analysis of Variance (testing differences of means for 2
or more groups)
- Is GPA related to where a student likes to sit (front,
middle, back)?
- Which internet search engine is the fastest?
Analysis of Variance
 Abbreviated as “ANOVA”
 Used to compare the means of more than two groups
 Null hypothesis is that all populations being studied
have the same mean
 Reject null if at least one population has a mean that
differs from the others
 Actually works by analyzing variances
Two Different Ways of Estimating
Population Variance
 Estimate population variance from variation within
each group
 Is not affected by whether or not null hypothesis is true
 Estimate population variance from variation between
groups
 Is affected by whether or not null hypothesis is true
Two Important Questions
1. How to estimate population variance from variation
between groups?
2. How is that estimate affected by whether or not the
null is true?
Estimate population variance from variation
between means of groups
 First, variation among means
of samples is related directly
to the amount of variation
within each population from
which samples are taken
 The more variation within
each population, the more
variation in means of samples
taken from those populations
 Note that populations on the
right produce means that are
more scattered
Estimate population variance from variation
between means of groups
 And second, when null is false
there is an additional source of
variation
 When null hypothesis is true
(left), variation among means of
samples caused by
 Variation within the populations
 When null hypothesis is false
(right), variation among means
of samples caused by
 Variation within the populations
 And also by variation among the
population means
Basic Logic of ANOVA
 ANOVA entails a
comparison between two
estimates of population
variance
 Ratio of between-groups
estimate to within-groups
estimate called an F ratio
 Compare obtained F value to
an F distribution
$F = \frac{\text{Between Groups}}{\text{Within Groups}}$
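A minimal sketch of the F ratio for a one-way ANOVA, using the usual mean-square form of the two variance estimates (between-groups sum of squares divided by k - 1, within-groups sum of squares divided by N - k). The three small groups and the function name are made up for illustration.

```python
def f_ratio(groups):
    """F = between-groups variance estimate / within-groups variance estimate."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of scores
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between groups: how far each group mean lies from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: how far each score lies from its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # hypothetical scores for three groups
f_value = f_ratio(groups)
```

The obtained F would then be compared with an F distribution with k - 1 and N - k degrees of freedom.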
Assumptions of an ANOVA
 Populations follow a normal curve
 Populations have equal variances
 As for t tests, ANOVAs often work fairly well even
when those assumptions are violated
Rejecting the Null Hypothesis
 A significant F tells you that at least one of the
means differs from the others
 Does not indicate how many differ
 Does not indicate which one(s) differ
 For more specific conclusions, a researcher must
conduct follow-up t tests
 Problem: Lots of t tests increases the chances of
finding a significant result just by chance (i.e.,
increases chances beyond p = .05)
ANOVA (continued)
 Procedure that allows one to examine two or more
variables in the same study
 Efficient
 Allows for examination of interaction effects
 An ANOVA with only one variable is a one-way
ANOVA, an ANOVA with two variables is a two-way
ANOVA, and so on
Main Effects vs. Interactions
 A main effect refers to the effect of one variable,
averaging across the other(s)
 An interaction effect refers to a case in which the effect
of one variable depends on the level of another
variable
Main Effects vs. Interactions
Most Commonly Used Statistical Techniques
 Chi-square test of independence (Relationship of 2
categorical variables)
- With whom is it easier to make friends?
- Does the opinion on legalization of marijuana depend
on one’s religion?
Chi-Square Tests
 Hypothesis testing procedure for nominal variables
 Focus on number of people/items in each category (e.g., hair
color, political party, gender)
 Compare how well an observed distribution fits an
expected distribution
 Expected distribution can be based on
 A theory
 Prior results
 Assumption of equal distribution across categories
Chi-Square Test for
Goodness of Fit
 Single nominal variable
 Degrees of freedom = number of categories minus 1
Chi-Square Statistic
 Compares observed frequency distribution to expected
frequency distribution
 Compute difference between observed and expected and
square each one
 Divide each by its expected frequency
 Sum them
$\chi^2 = \sum \frac{(O - E)^2}{E}$
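The statistic is short enough to compute by hand, but a sketch in code makes the steps explicit. The observed counts below are hypothetical, and the expected counts assume an equal distribution across three categories.

```python
def chi_square(observed, expected):
    """Sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10, 20, 30]        # hypothetical counts in three categories
expected = [20, 20, 20]        # equal distribution of the same 60 observations
chi2 = chi_square(observed, expected)
df = len(observed) - 1         # goodness of fit: number of categories minus 1
```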
Ex9
Chi-Square Distribution
 Compare obtained chi-square
to a chi-square distribution
 Does mismatch between
observed and expected
frequency exceed what would
be expected by chance alone?
Chi-Square Test for
Independence
 Two nominal variables
 Independence means no
relation between
variables
 To determine degrees of
freedom…
$df = (N_{Columns} - 1)(N_{Rows} - 1)$
 Contingency table
 Lists number of
observations for each
combination of
categories
 To determine expected
frequencies…
$E = \left(\frac{R}{N}\right)(C)$
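The sketch below applies both formulas to a small contingency table. Here R is taken to be the row total, C the column total, and N the total number of observations for the cell in question. The 2 x 2 table and the function names are hypothetical.

```python
def expected_frequencies(table):
    """Expected count for each cell: (row total / N) * column total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r / n * c for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table[0]) - 1) * (len(table) - 1)   # (columns - 1)(rows - 1)
    return chi2, df

table = [[10, 20],     # hypothetical counts: rows are one nominal variable,
         [20, 10]]     # columns are the other
expected = expected_frequencies(table)
chi2, df = chi_square_independence(table)
```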
Most Commonly Used Statistical Techniques
 Correlation (Relationship of 2 numerical variables)
- Is there a connection between the average verbal SAT
and the percent of graduates who took the SAT in a
state?
Other Statistical Techniques
 Factor analysis (reducing independent variables which are highly
correlated)
 Cluster analysis (grouping observations with similar
characteristics)
 Correspondence Analysis (grouping the levels of 2 or more
categorical variables)
 Time Series Analysis
 And so on……..
Inference with highest confidence level
Presentation of Data
 FOR CATEGORICAL DATA
---Bar Chart
---Pie Chart
Presentation of Data
 FOR NUMERICAL DATA
--- Stem-and-Leaf Plot
--- Histogram
--- Boxplot
Overview of Statistical Techniques
Questions?
or
Comments ?
Upcoming Workshops
 10/26/2009 Overview of SPSS
 12/02/2009
Overview of SAS
How to lie with statistics
1. The Sample with Built-in Bias.
2. Well-Chosen Average.
3. The Gee-Whiz Graph.
4. Correlation and Causation.