Chapter 1
Introduction to Statistics
Section 1.1
Fundamental Statistical Concepts
Objectives
• Explain the purpose of statistics.
• Decide what tasks to complete before you
analyze your data.
• Distinguish between populations and
samples.
What Is Statistics?
HEIGHT: 5'4", 5'10", 5'2", 5'8", 6'1", 5'8", 6', 5'5", 5'11", 5'
Descriptive Statistics
[Figure: the HEIGHT values summarized by MIN = 5', AVERAGE = 5'7.3", and MAX = 6'1"]
Inferential Statistics
[Figure: the same HEIGHT sample, with MIN = 5', AVERAGE = 5'7.3", and MAX = 6'1"]
Defining the Problem
Before you begin any analysis, you should
complete certain tasks.
1. Outline the purpose of the study.
2. Document the study questions.
3. Define the population of interest.
4. Determine the need for sampling.
5. Define the data collection protocol.
Cereal Example
[Figure: a box of Rise n Shine cereal labeled 15 ounces]
Defining the Problem
The purpose of the study is to determine
whether Rise n Shine cereal boxes contain 15
ounces of cereal.
The study question is whether the average
amount of cereal in Rise n Shine boxes is equal
to 15 ounces.
Population
[Figure: many boxes of Rise n Shine cereal, representing the population]
Sample
[Figure: a smaller group of Rise n Shine boxes, representing the sample]
Simple Random Sampling
[Figure: a row of Rise n Shine boxes illustrating simple random sampling]
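A simple random sample can be sketched in SAS as follows. This is only an illustration: the data set names, the population size of 500 boxes, the sample size of 20, and the seed are all assumptions, and PROC SURVEYSELECT (part of SAS/STAT) is not a procedure used in this chapter.

```sas
/* Hypothetical sampling frame: one row per cereal box */
data work.population;
   do box_id = 1 to 500;
      output;
   end;
run;

/* METHOD=SRS draws a simple random sample without replacement,  */
/* so every box has the same chance of being selected.           */
/* SEED= makes the selection reproducible.                       */
proc surveyselect data=work.population
                  method=srs
                  n=20
                  seed=12345
                  out=work.sample;
run;
```

A convenience sample, by contrast, would take whichever boxes are easiest to reach (for example, the first 20 rows), which gives no such guarantee.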
Convenience Sampling
[Figure: a row of Rise n Shine boxes illustrating convenience sampling]
Parameters and Statistics
Statistics are used to approximate population
parameters.
                      Population Parameters    Sample Statistics
Mean                  μ                        x̄
Variance              σ²                       s²
Standard Deviation    σ                        s
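For reference, the sample statistics named above are usually computed as shown here for a sample of n values x_1, ..., x_n. The slides do not give these formulas, so this is the standard textbook form rather than a quotation from the course.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}
```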
Levels of Measurement
The two levels of measurement of data used in
this course are
• continuous
• discrete.
Describing Your Data
The goals when you are describing data are to
• screen for unusual data values
• inspect the spread and shape of continuous
variables
• characterize the central tendency
• draw preliminary conclusions about your
data.
Process of Data Analysis
[Diagram: Population -> Random Sample -> Describe -> Sample Statistics -> Make Inferences]
Section 1.2
Examining Distributions
Objectives
• Examine distributions of data.
• Explain and interpret measures of location,
dispersion, and shape.
• Use the MEANS and UNIVARIATE procedures
to produce summary statistics.
• Use the UNIVARIATE procedure to generate
stem-and-leaf plots, box-and-whisker plots,
normal probability plots, and histograms.
Cereal Data Set
[Figure: the Rise n Shine cereal data set, with the variables WEIGHT and ID NUMBER]
Distributions
When you examine the distribution of values
for the variable WEIGHT, you can find out
• the range of possible data values
• the frequency of data values
• whether the data values accumulate in the
middle of the distribution or at one end.
Symmetric Distributions
[Figure: FREQUENCY versus WEIGHT for a symmetric distribution]
Skewed Distributions
[Figure: FREQUENCY versus WEIGHT for a skewed distribution]
Normal Distribution
Examples of Normal Distributions
[Figure: normal curves with standard deviations of 0.5, 1.0, and 1.5]
Measures of Central Tendency
The mean is the balancing point of your data.
[Figure: the weights 15.02, 14.98, 15.01, 15.00, and 14.99 balanced at their mean]
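As a worked check using the five weights shown on the slide, the mean is 15.00, and the deviations from it cancel exactly, which is what "balancing point" means:

```latex
\bar{x} = \frac{15.02 + 14.98 + 15.01 + 15.00 + 14.99}{5} = \frac{75.00}{5} = 15.00

\sum_{i=1}^{5} (x_i - \bar{x}) = 0.02 - 0.02 + 0.01 + 0.00 - 0.01 = 0
```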
Percentiles
[Figure: FREQUENCY versus WEIGHT with the 40th percentile marked; 40% of the values fall below it and 60% above]
Measures of Dispersion
[Figure: two FREQUENCY versus WEIGHT distributions, each centered at 15.00]
Measures of Shape
[Figure: three FREQUENCY versus WEIGHT panels: Skewed to Left, Symmetric, Skewed to Right]
Measures of Shape
[Figure: three distributions: Light-tailed, Normal, Heavy-tailed]
The MEANS Procedure
PROC MEANS DATA=SAS-data-set <options>;
VAR variables;
RUN;
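A minimal sketch of the syntax above, assuming a data set named work.cereal. The weights are the five values from the central tendency slide; the data set name and the variable names idnumber and weight are assumptions made for this illustration.

```sas
/* Illustrative data: box ID and weight in ounces */
data work.cereal;
   input idnumber weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* Statistic keywords in the PROC statement choose what is printed. */
/* MAXDEC=3 limits the printed values to three decimal places.      */
proc means data=work.cereal n mean median std min max maxdec=3;
   var weight;
run;
```

If no statistic keywords are listed, PROC MEANS prints N, the mean, the standard deviation, the minimum, and the maximum by default.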
The UNIVARIATE Procedure
PROC UNIVARIATE DATA=SAS-data-set <options>;
VAR variables;
ID variable;
HISTOGRAM variables / <options>;
PROBPLOT variables / <options>;
RUN;
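The statements above might be combined as in the following sketch. The data set work.cereal and the variables idnumber and weight are assumptions for illustration; the NORMAL options are one common choice, not necessarily the ones used in the course demonstration.

```sas
/* Illustrative data: box ID and weight in ounces */
data work.cereal;
   input idnumber weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

proc univariate data=work.cereal;
   var weight;                 /* analysis variable                        */
   id idnumber;                /* labels extreme observations by box ID    */
   histogram weight / normal;  /* histogram with a fitted normal curve     */
   /* probability plot with a reference line from the estimated mean/std  */
   probplot weight / normal(mu=est sigma=est);
run;
```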
Descriptive Statistics
This demonstration illustrates using the
MEANS and UNIVARIATE procedures to
calculate descriptive statistics for
continuous variables.
Graphical Displays of Distributions
PROC UNIVARIATE produces three kinds of plots
for examining the distribution of your data values:
• stem-and-leaf plots
• box-and-whisker plots
• normal probability plots.
PROC UNIVARIATE can also generate histograms
and graphically enhanced normal probability plots.
Stem-and-Leaf Plots
9 01338
8 0012347789
7 0013455667799
6 03568
5 8
4
3 9
2 0
1 4
Multiply Stem.Leaf by 10**1
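To read the plot, join each stem and leaf as Stem.Leaf and multiply by 10. For example, the top row (stem 9, leaves 0 1 3 3 8) and the bottom row (stem 1, leaf 4) decode as:

```latex
9.0 \times 10^{1} = 90, \quad
9.1 \times 10^{1} = 91, \quad
9.3 \times 10^{1} = 93, \quad
9.3 \times 10^{1} = 93, \quad
9.8 \times 10^{1} = 98

1.4 \times 10^{1} = 14
```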
Box-and-Whisker Plots
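[Figure: schematic box-and-whisker plot on a vertical scale from 0 to 100, with this legend]
upper whisker: maximum point within 1.5 IQ units of the box
top of box: 75th percentile
line inside box: 50th percentile (median)
bottom of box: 25th percentile
lower whisker: minimum point within 1.5 IQ units of the box
0: point more than 1.5 IQ units from the box
*: point more than 3 IQ units from the box
The mean is denoted by +.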
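Taking "IQ units" to mean the interquartile range, the thresholds in the legend can be written out as follows, where Q_1 and Q_3 are the 25th and 75th percentiles. This is a sketch of the usual schematic box plot rules, stated here as an assumption about how the slide's legend is meant to be read.

```latex
\mathrm{IQR} = Q_3 - Q_1

\text{whiskers extend to the most extreme points inside }
\left[\, Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR} \,\right]

\text{plotted as } 0: \quad
Q_3 + 1.5\,\mathrm{IQR} < x \le Q_3 + 3\,\mathrm{IQR}
\quad \text{or} \quad
Q_1 - 3\,\mathrm{IQR} \le x < Q_1 - 1.5\,\mathrm{IQR}

\text{plotted as } *: \quad
x > Q_3 + 3\,\mathrm{IQR}
\quad \text{or} \quad
x < Q_1 - 3\,\mathrm{IQR}
```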
Normal Probability Plots
[Figure: five example normal probability plots, numbered 1 through 5]
Examining Distributions
This demonstration illustrates using PROC
UNIVARIATE to generate stem-and-leaf plots,
box-and-whisker plots, normal probability
plots, and histograms.
Section 1.3
Confidence Intervals for the Mean
Objectives
• Explain and interpret the confidence intervals
for the mean.
• Explain the central limit theorem.
• Calculate confidence intervals using
the MEANS procedure.
Point Estimates
[Figure: sample statistics shown as point estimates of the corresponding population parameters]
Variability among Samples
[Figure: different samples yield different sample means, for example a mean of 15.02 and a mean of 15.03]
Standard Error of the Mean
A statistic that measures the variability of your
estimate is the standard error of the mean.
It differs from the sample standard deviation
because
• the sample standard deviation deals with the
variability of your data
• the standard error of the mean deals with the
variability of your sample mean.
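The relationship between the two quantities is the standard one shown below, where s is the sample standard deviation and n is the sample size. The numbers in the second line are hypothetical and are used only to show the arithmetic.

```latex
s_{\bar{x}} = \frac{s}{\sqrt{n}}

\text{for example, } s = 0.05,\; n = 25
\;\Rightarrow\;
s_{\bar{x}} = \frac{0.05}{\sqrt{25}} = \frac{0.05}{5} = 0.01
```

Because n appears in the denominator, larger samples give a smaller standard error even when the spread of the data stays the same.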
Confidence Intervals
[Figure: confidence intervals, with panels labeled 95% Confidence and 5% Confidence]
Assumptions about Confidence Intervals
The types of confidence intervals in this course
make the assumption that the sample means are
normally distributed.
Distribution of Sample Means
[Figure: the distribution of Weight compared with the distribution of Mean of Weight]
Normal Distribution
Useful Probabilities for Normal Distributions
[Figure: normal curve showing about 68% of values within one standard deviation of the mean, 95% within two, and 99% within three]
Confidence Intervals
[Figure: Distribution of the Sample Means, with the central 95% region marked]
Central Limit Theorem
To satisfy the assumption of normality, you
can either
• verify that the population distribution is
approximately normal, or
• apply the central limit theorem.
The central limit theorem states that the
distribution of sample means is approximately
normal provided that the sample size is large
enough.
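The theorem can be explored with a small simulation such as the sketch below. Everything in it is an assumption chosen for illustration: the exponential population (picked because it is clearly skewed), 1000 samples, a sample size of 30, the seed, and the data set names.

```sas
/* Draw 1000 samples of size 30 from a skewed population */
data work.sim;
   call streaminit(12345);          /* reproducible random numbers */
   do sample = 1 to 1000;
      do i = 1 to 30;
         x = rand('EXPONENTIAL');   /* right-skewed values */
         output;
      end;
   end;
run;

/* Compute the mean of each sample (data are already in SAMPLE order) */
proc means data=work.sim noprint;
   by sample;
   var x;
   output out=work.sample_means mean=xbar;
run;

/* Examine the distribution of the 1000 sample means */
proc univariate data=work.sample_means;
   var xbar;
   histogram xbar / normal;
run;
```

Although the individual values are strongly skewed, the histogram of the sample means should look much closer to a normal curve. Rerunning with a smaller inner loop limit shows how the approximation depends on the sample size.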
Central Limit Theorem
Confidence Intervals
This demonstration illustrates calculating
confidence intervals using PROC MEANS.
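A minimal sketch of such a call, assuming a data set work.cereal with a weight variable (both names, and the five illustrative weights, are assumptions rather than the course data):

```sas
/* Illustrative data: box ID and weight in ounces */
data work.cereal;
   input idnumber weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* CLM requests two-sided confidence limits for the mean.   */
/* ALPHA=0.05 gives a 95% interval (this is the default).   */
proc means data=work.cereal n mean stderr clm alpha=0.05;
   var weight;
run;
```

Changing ALPHA= to 0.01 would request a 99% interval, which is wider for the same data.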
Section 1.4
Hypothesis Testing
Objectives
• Define some common terminology related
to hypothesis testing.
• Perform hypothesis testing using the
UNIVARIATE procedure.
• Compare the means of paired groups using
the TTEST procedure.
Judicial Analogy
Hypothesis
Significance Level
Collect Evidence
Decision Rule
Coin Example
[Figure: a sequence of coin flips: H, T, T, H, H]
Coin Analogy
Hypothesis
Significance Level
Collect Evidence
Decision Rule
Types of Errors
You used a decision rule to make a decision, but
was the decision correct?
                     ACTUAL
DECISION             Fair Coin       Not Fair Coin
Fair Coin            correct         Type II error
Not Fair Coin        Type I error    correct
Modified Coin Experiment
Which coins are fair?
Heads    Tails    p-value
55       45       .27
40       60       .04
63       37       < .01
15       85       < .01
Statistical Hypothesis Test
H0: equality
H1: difference
[Diagram: Set Hypothesis -> Set Significance Level -> Collect Data (Rise n Shine, 15 oz.) -> Decision Rule (p-value)]
Comparing α and the p-Value
In general, you
• reject the null hypothesis if p < α
• fail to reject the null hypothesis if p ≥ α.
Performing a Test of Hypothesis
To test the null hypothesis H0: μ = μ0, SAS
software calculates the t statistic
$$t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$$
Two-Sided Test of Hypothesis
The test of hypothesis is two-sided if the null
is rejected when the actual value of interest is
either less than or greater than the hypothesized
value.
H0: μ = 15.00
H1: μ ≠ 15.00
Two-Sided Test of Hypothesis
[Figure: distribution of the T statistic on an axis from -3 to 3, for the two-sided test]
One-Sided Test of Hypothesis
In many situations, you are only interested
in one direction. Perhaps you only want evidence
that the mean is significantly lower than fifteen.
For example, instead of testing
H0: μ = 15 versus H1: μ ≠ 15
you test
H0: μ ≥ 15 versus H1: μ < 15
One-Sided Test of Hypothesis
[Figure: distribution of the T statistic on an axis from -3 to 3, for the one-sided test]
Hypothesis Testing
This demonstration illustrates using PROC
UNIVARIATE to perform hypothesis testing.
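One way this might look for the cereal question is sketched below. The data set work.cereal, the variable names, and the five weights are assumptions for illustration only.

```sas
/* Illustrative data: box ID and weight in ounces */
data work.cereal;
   input idnumber weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* MU0= sets the hypothesized mean for the location tests. */
/* Without it, the tests use a hypothesized value of 0.    */
proc univariate data=work.cereal mu0=15;
   var weight;
run;
```

The "Tests for Location" table in the output reports the Student's t statistic and its two-sided p-value, which is then compared with the chosen significance level.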
Paired Samples
[Figure: Sales BEFORE and Sales AFTER ADVERTISING]
The TTEST Procedure
PROC TTEST DATA=SAS-data-set;
CLASS variable;
VAR variables;
PAIRED variable*variable;
RUN;
Paired t-Test
This demonstration illustrates using PROC
TTEST to conduct a paired sample t-test.
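A hedged sketch of a paired test for the advertising example follows. The sales figures, the store identifier, and the names work.sales, before, and after are all hypothetical.

```sas
/* Hypothetical sales for the same six stores before and after advertising */
data work.sales;
   input store before after;
   datalines;
1 120 128
2 135 139
3 110 118
4 142 141
5 128 136
6 117 125
;
run;

/* PAIRED analyzes the within-store differences (before - after). */
/* No CLASS or VAR statement is used for a paired analysis.       */
proc ttest data=work.sales;
   paired before*after;
run;
```

Because each store serves as its own control, the test is carried out on the six differences rather than on two independent groups.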
Section 1.5
Two-Sample t-Tests
Objectives
• Recognize and validate the assumptions of
a two-sample t-test.
• Analyze two populations with the TTEST
procedure.
Cereal Example
[Figure: boxes of Morning and Rise n Shine cereal]
Comparing Two Populations
[Figure: the two populations, Morning and Rise n Shine, labeled 1 and 2]
Assumptions
• independent observations
• normally distributed data for each group
• equal variances for each group.
F Test for Equality of Variances
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
$$F = \frac{\max(s_1^2,\, s_2^2)}{\min(s_1^2,\, s_2^2)}$$
Test Statistics and p-Values
F Test for equal variances:
H0: σ1² = σ2²
Variance Test:             F' = 1.51     DF = (3,3)   Prob > F' = 0.7446
t-Tests for equal means:
H0: μ1 = μ2
Unequal Variance t-test:   T = 7.4017    DF = 5.8     Prob > |T| = 0.0004
Equal Variance t-test:     T = 7.4017    DF = 6.0     Prob > |T| = 0.0003
Test Statistics and p-Values
F Test for equal variances:
H0: σ1² = σ2²
Variance Test:             F' = 15.28    DF = (9,4)   Prob > F' = 0.0185
t-Tests for equal means:
H0: μ1 = μ2
Unequal Variance t-test:   T = -2.4518   DF = 11.1    Prob > |T| = 0.0320
Equal Variance t-test:     T = -1.7835   DF = 13.0    Prob > |T| = 0.0979
Testing for Equality of Means
This demonstration illustrates using PROC
TTEST to test for the equality of means for
two groups.
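A two-sample comparison of the two cereals might be set up as in this sketch. The weights, the brand labels, and the names work.brands, brand, and weight are hypothetical.

```sas
/* Hypothetical weights (ounces) for boxes of two cereal brands */
data work.brands;
   length brand $ 12;
   input brand $ weight;
   datalines;
Morning    14.98
Morning    15.01
Morning    15.03
Morning    14.97
RiseNShine 15.02
RiseNShine 15.05
RiseNShine 14.99
RiseNShine 15.04
;
run;

/* CLASS names the grouping variable (exactly two levels). */
/* VAR names the response to compare between the groups.   */
proc ttest data=work.brands;
   class brand;
   var weight;
run;
```

The output includes both an equal variance and an unequal variance t-test, along with the F test for equality of variances. A common reading is to look at the F test first: when its p-value is small, the equal variance assumption is doubtful and the unequal variance t-test is the safer one to report.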
Section 1.6
Output Delivery System
Objectives
• Introduce the Output Delivery System (ODS).
• Examine some simple statements in ODS.
• Use ODS to capture some specific
UNIVARIATE procedure output.
• Use ODS to generate a report in the HTML
format.
• Use ODS to generate data sets with specific
PROC UNIVARIATE output.
Output Delivery System
[Diagram: SAS procedure computes results -> output object created in ODS -> ODS converts data component into SAS data set]
ODS Statements
• TRACE provides information about the output
object, such as the name and path.
• LISTING opens, manages, or closes the Listing
destination.
• OUTPUT creates a SAS data set from an output
object.
Output Delivery System
This demonstration illustrates the Output
Delivery System by introducing some
simple concepts and building on that
knowledge.
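The objectives of this section could be sketched along these lines. The data set work.cereal, its variables, the output data set name work.measures, and the file name cereal_report.html are assumptions for illustration; Moments and BasicMeasures are output object names that PROC UNIVARIATE produces and that ODS TRACE lists in the log.

```sas
/* Illustrative data: box ID and weight in ounces */
data work.cereal;
   input idnumber weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* Step 1: write the name and path of each output object to the log */
ods trace on;
proc univariate data=work.cereal;
   var weight;
run;
ods trace off;

/* Step 2: send selected objects to an HTML report and */
/* capture one of them as a SAS data set               */
ods html file='cereal_report.html';         /* hypothetical file name  */
ods select Moments BasicMeasures;            /* keep only these objects */
ods output BasicMeasures=work.measures;      /* object -> data set      */
proc univariate data=work.cereal;
   var weight;
run;
ods html close;
```

Running the traced step first is a practical way to find the exact object names before writing the ODS SELECT and ODS OUTPUT statements.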
Section 1.7
Exercises
Section 1.8
Chapter Summary
Section 1.9
Solutions to Exercises