Introductory Statistical Concepts
Disclaimer
– I am not an expert SAS programmer.
– Nothing that I say is confirmed or denied by Texas
A&M University.
2
Why Are We Here?
• Deming
– To Learn
– To Have Fun
Question: Who was Deming?
3
Poll: What type of organization do you work for?
• Business
• Government
• Education
• Nonprofit
• Other
4
Purpose of These Lectures
• A review of the statistical concepts used in most
of the SAS Analytics Lecture Series.
• We will look at questions such as the following:
– What is the nature of statistical analyses?
– Why are population parameters so important?
– What is really being tested when you see a p-value?
– Why does regression handle missing data so well?
– What are residual analyses?
5
Descriptive Statistics
The Population
(Very important concepts)
Variable of Interest
The Distribution
Parameters
Mean
Median
Mode
Range
Variance
Etc
7
Learning Outcomes
• You will learn
– basic statistical concepts
– the definition of mean, median, mode and standard deviation
– the difference between populations and samples
– the difference between parameters and estimates
– about confidence intervals
– how to test a statistical hypothesis
– how to run a regression analysis
8
Parameters
• Characteristics of the variable of interest
• It is how we describe the variable of interest
• Parameters are unknown
9
Parameters
(Characteristics)
• Central Tendency
– Mode
– Median
– Mean
• Measures of Variability
– Range
– Variance
– Standard Deviation
10
Variability
Change in the Data
What is an Index?
How SUNNY is SUNNY?
THE UV Index
12
Air Quality Index
What Does It Mean?
13
DOW JONES INDUSTRIAL AVERAGE INDEX
What does 10,971.16 really mean?
Which is “better”: a DJIA of 10,000
or a DJIA of 12,000?
14
Variability Index
• A Simple One
• Find the Largest Value
• Find the Smallest Value
• Let Range = R = Largest – Smallest
15
A More Complex Variation Index
• The Standard Deviation
• Statisticians use this index to indicate variability
• You will see it written as σ or S or s
• Widely available from SAS, Excel, and other statistical packages
16
Details of the More Complex Index
• Example – Suppose that we observe the following three numbers: 1, 4, 7
• The mean of these numbers is:
(1 + 4 + 7)/3 = 4
• We now subtract the mean from each number, square each difference, and add the squares:
(1-4)*(1-4) + (4-4)*(4-4) + (7-4)*(7-4) = 18
• We divide by n - 1 = 2 and take the square root:
The Standard Deviation = sqrt(18/2) = 3
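The two variability indexes above (the range and the standard deviation) can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function names `value_range` and `sample_std` are made up for illustration and are not part of SAS or of the lecture materials.

```python
import math

def value_range(values):
    """Range R = largest value minus smallest value."""
    return max(values) - min(values)

def sample_std(values):
    """Sample standard deviation using the n - 1 divisor from the slide."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared differences from the mean: 18 for the values 1, 4, 7
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

data = [1, 4, 7]
print(value_range(data))  # 6
print(sample_std(data))   # 3.0
```

The standard library function `statistics.stdev` uses the same n - 1 divisor and should give the same result.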
17
What does this Mean?
• By itself, it may be confusing to some.
• Comparing populations, we can use it to say
which population varies the most.
• Let us look at an applet.
18
Using Graphs to Determine Variability
• Box Plot
[Figure: box plots of Total Violent Crime (axis from 0 to 400000) by State, for CALIFORN and NEW_YORK, N = 35 each]
19
Distributions
Known Distribution
• With a known distribution, we know the
following:
– the shape
– the mean
– the variability (standard deviation)
– and/or some other information
21
Classical Distributions─Normal
22
Normal─Overlay
23
Classical Distributions─Uniform
24
Survey
• The following are called parameters of the
population:
– mean, median, mode
– variance, standard deviation, range, inter-quartile range
(IQR)
• In general, are these known or unknown?
– Known = yes (select using your seat indicator)
– Unknown = no (select using your seat indicator)
25
MPG─Histogram
Compare with “true” values!
26
Simulated Sample
• In this example, we simulated taking a sample
of size 1000 from one population of cars
weighing 3000 pounds with a normal
distribution with mean=24 and standard
deviation=1.
• You can practice this after class.
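For practice outside SAS, the simulation described above can be sketched in Python. The mean of 24, the standard deviation of 1, and the sample size of 1000 come from the slide; the seed and the variable names are arbitrary choices made here, so the numbers will not match the lecture's output exactly.

```python
import random
import statistics

random.seed(2024)  # arbitrary seed, used only so the run is repeatable

# "True" population values stated on the slide
TRUE_MEAN = 24.0
TRUE_SD = 1.0
SAMPLE_SIZE = 1000

# Draw one simulated sample of mpg values from a normal distribution
sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]

# Estimates computed from the sample, to compare with the "true" values
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # n - 1 divisor

print(f"sample mean = {sample_mean:.3f} (true value {TRUE_MEAN})")
print(f"sample sd   = {sample_sd:.3f} (true value {TRUE_SD})")
```

The estimates should land close to, but not exactly on, the population values. That gap is the difference between a parameter and an estimate.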
27
Section 1.2
Populations and Samples
Objectives
– Understand the relationships between
• populations and samples
• parameters and estimates.
– Look at an overview of hypothesis testing.
29
Population
Parameters
Mean, Variance, Median,
Mode, Distribution, …
30
Example
• Mpg of American-made cars that weigh
between 2000 and 3500 pounds and were
built in the 1970s.
• Parameters – mean, variance, and so on
• In general, we do not know the parameters.
31
Purpose of Statistical Analyses
– Estimate the parameters. (Make guesses.)
• Example: What is the population mean?
– Test hypotheses about the parameters. (Ask
questions.)
• Example: Is the population mean = 30 mpg?
32
Role of Samples
– Taking a sample of the population enables you to
• make estimates of the population parameters
• answer the questions about the population
parameters.
33
Population and Sample
Population parameters: Mean, Variance, Median, Mode, Distribution, …
Sample: Sample mean, Sample variance
Inference: Estimates, Tests of hypotheses
34
Example: cars_american
• This is a sample of American-made cars that
weigh between 2000 and 3500 pounds and
that were built in the 1970s.
• We are interested in the mpg.
• Use summary statistics to analyze the data.
35
Results of Summary Statistics
36
Results of Histogram
continued...
37
Results of Histogram
38
Sampling Distribution
Applet
sampling_dist
• This demonstration illustrates how
to estimate and plot the sampling
distribution of various statistics.
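The idea behind the applet can also be sketched in Python: take many samples, compute the mean of each, and look at how those means are distributed. The population values (mean 24, standard deviation 1) are borrowed from the earlier simulated-sample slide; the sample size of 25, the 2000 repetitions, and the function name are assumptions made only for this illustration.

```python
import random
import statistics

random.seed(7)  # arbitrary seed for repeatability

def sampling_distribution_of_mean(population_mean, population_sd, n, repetitions):
    """Return the sample means from repeated normal samples of size n."""
    means = []
    for _ in range(repetitions):
        sample = [random.gauss(population_mean, population_sd) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means

means = sampling_distribution_of_mean(24.0, 1.0, n=25, repetitions=2000)

# The means center on the population mean, with a spread near sd / sqrt(n) = 0.2
print(f"mean of sample means = {statistics.mean(means):.3f}")
print(f"sd of sample means   = {statistics.stdev(means):.3f}")
```

The sample means vary much less than individual observations do, which is why a sample mean is a useful estimate of the population mean.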
39
View/Application Share: Demo: Sampling Distributions Applet
40
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.h...
41
Confidence Intervals on the Population Mean
• Level of Comfort
• 50% {21.57 to 22.21}
• 95% {20.96 to 22.82}
• 99.9% {20.30 to 23.48}
What does this mean?
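One way to see why a higher level of comfort gives a wider interval is to compute intervals at several levels from the same sample. The sketch below uses a large-sample normal approximation (statistical packages typically use the t distribution instead) and a simulated, hypothetical mpg sample, because the lecture's data set is not reproduced here. It will not match the intervals on the slide.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, level):
    """Approximate confidence interval for the population mean.

    Uses the normal approximation: mean +/- z * (s / sqrt(n)).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z cuts off (1 - level) / 2 in each tail of the standard normal
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * standard_error, mean + z * standard_error

random.seed(11)  # arbitrary seed
# Hypothetical mpg sample, not the lecture's cars_american data
mpg_sample = [random.gauss(22.0, 4.5) for _ in range(90)]

for level in (0.50, 0.95, 0.999):
    low, high = mean_confidence_interval(mpg_sample, level)
    print(f"{level:.1%}: {low:.2f} to {high:.2f}")
```

Each interval is centered on the sample mean. Only the multiplier `z` changes with the level, so more comfort always costs a wider interval.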
42
Test That the Population Mean = 30 mpg
• Use t-test → One Sample t-test
• Requirements for running this test:
– Large n > 35
– Or leftovers are normal
• What is the p-value or sig value?
43
Testing Mean = 30
$H_0: \mu_{mpg} = 30$
$H_A: \mu_{mpg} \neq 30$
44
Conclusions of the Test
• Choose an alpha level, usually alpha=.05.
• If sig < alpha, then reject the null hypothesis.
• Otherwise, fail to reject.
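The decision rule above can be sketched in Python for the one-sample test of mean = 30. This is an illustration under stated assumptions, not the SAS procedure: the p-value uses a large-sample normal approximation to the t distribution (reasonable under the slide's "large n" requirement), and the data are simulated, hypothetical mpg values. The function name `one_sample_test` is made up.

```python
import math
import random
import statistics
from statistics import NormalDist

def one_sample_test(sample, hypothesized_mean, alpha=0.05):
    """Two-sided test of H0: population mean = hypothesized_mean."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_statistic = (statistics.mean(sample) - hypothesized_mean) / standard_error
    # Large-sample normal approximation to the t distribution (assumption)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_statistic)))
    decision = "reject" if p_value < alpha else "fail to reject"
    return t_statistic, p_value, decision

random.seed(3)  # arbitrary seed
# Hypothetical mpg sample with n > 35
mpg_sample = [random.gauss(22.0, 4.5) for _ in range(100)]

t_statistic, p_value, decision = one_sample_test(mpg_sample, 30.0)
print(f"t = {t_statistic:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

The hypothesis being tested is about the population mean, not the sample mean. The sample supplies only the evidence.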
45
Sig and p-values
• When you see a sig value or p-value:
– You know that some hypothesis is being tested.
– You know whether or not the hypothesis is being
rejected.
– You probably do not know what the hypothesis
really is.
• Ask yourself these questions:
– What are the population parameters being tested?
– How is what is being tested related to those
parameters?
46
Requirements for Doing This Test
• Large n → n > 35
• Or leftovers are normally distributed.
• Use a histogram to check for normality.
47
Populations─Which Ones are Similar?
48
Populations─Which Ones are Similar?
• Take samples.
49
Take Samples
• Use the samples to answer this question:
• “Which populations are similar?”
• Statistical translations:
• “Which populations are similar?” is the same as asking…
• Are the following the same:
– distribution?
– mean?
– variance?
50
Background/Requirements
• Before we jump into the analysis, we must ask
the following questions:
– How many populations are there?
– How many population parameters are we
interested in and what are they?
– What tests do we want to do, and what are the
requirements for doing those?
– Are we using everything we “know?”
51
Example
• Suppose that we are interested in the mpg of American and European cars.
• How many populations are there?
American Cars: Mpg (Distribution, Mean, Variance)
European Cars: Mpg (Distribution, Mean, Variance)
52
Poll: How many populations are there?
• One - MPG
• Two - American and European
• Depends on the sample size
53
Parameters
Population 1: American Cars
Variable of interest: mpg
Distribution: Normal?
Mean: $\mu_A$
Variance: $\sigma_A^2$
Population 2: European Cars
Variable of interest: mpg
Distribution: Normal?
Mean: $\mu_E$
Variance: $\sigma_E^2$
54
Analyses
1. We want to look at the distributions.
2. We want to estimate the parameters.
3. We want to answer these questions:
• Are the population means the same?
• Are the population variances the same?
55
Example: Our Data Set car_am_eu
• Suppose that we are interested in the mpg of American and European cars.
American Cars: Mpg (Distribution, Mean, Variance) → Sample
European Cars: Mpg (Distribution, Mean, Variance) → Sample
56
Results from the Sample
continued...
57
Results
Tests of Normality: Miles per Gallon
Country of Origin   Kolmogorov-Smirnov(a) Statistic   df    Sig.
American            .110                              248   .000
European            .111                              70    .033
a. Lilliefors Significance Correction
58
Box Plots
American
European
59
Histograms
American
European
60
Poll: Are the populations the same?
• Yes
• No
61
Conclusion Based on Sample Numbers
and Graphs
• Easy: based on the samples, the
populations are different. No statistical
jargon is needed.
• But I must have a p-value for my boss, for
my paper, and so on.
62
Formal Tests
• The classical approach to determining whether
two populations are the same is to test
whether the two population means are equal.
• But first we check to see whether the two
population variances are equal:
$H_0: \sigma_A^2 = \sigma_E^2$
• $H_0: \mu_A = \mu_E$
continued...
63
Formal Tests
• We use t-test → Two Sample.
[Output figure with callouts “Test 1” and “Test 2”]
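The two comparisons (variances first, then means) can be sketched in Python. Several things here are assumptions rather than the lecture's method: the variance check is a simple ratio of sample variances, not whatever variance test appears in the output; the mean comparison is an unequal-variance statistic with a normal-approximation p-value; and the data are simulated with made-up means and standard deviations. Only the sample sizes (248 and 70) are taken from the normality table.

```python
import math
import random
import statistics
from statistics import NormalDist

def variance_ratio(sample_a, sample_b):
    """Ratio of the larger sample variance to the smaller (1.0 means equal)."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)

def two_sample_mean_test(sample_a, sample_b):
    """Unequal-variance statistic for H0: the two population means are equal."""
    standard_error = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    statistic = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    # Large-sample normal approximation for the two-sided p-value (assumption)
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value

random.seed(5)  # arbitrary seed
# Hypothetical data; only the sample sizes come from the lecture
american = [random.gauss(20.0, 6.0) for _ in range(248)]
european = [random.gauss(27.0, 6.5) for _ in range(70)]

print(f"variance ratio = {variance_ratio(american, european):.2f}")
statistic, p_value = two_sample_mean_test(american, european)
print(f"statistic = {statistic:.2f}, p-value = {p_value:.4f}")
```

Each statistic answers a question about a pair of population parameters: the ratio addresses the variances, and the second statistic addresses the means.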
64
Section 1.3
Simple Linear Regression
Objectives
– Identify the following:
• the population parameters
• the appropriate model
• the number of populations sampled
• the correct hypotheses
• what should be tested for normality
• what “equal variances” means.
66
MPG Example
[Figure: four populations of cars, at Weight = 2300, 2600, 2900, and 3000, with means $\mu_1, \ldots, \mu_4$ and variances $\sigma_1^2, \ldots, \sigma_4^2$]
Take a sample of size 1 from each population!
67
Data
• We would be in deep trouble with a sample
of size 1 from each population.
• We have eight unknown population
parameters.
• Can you name them?
• But what do we “know”?
68
Survey
• Name the population parameters.
69
Essential Part and Leftovers
• We want to “model” the data as follows:
• MPG = Essential Part + Leftover
• or
• MPG = Mean + Leftover
70
“Know” or Assumptions
• First, we “know” that
$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma^2$
• Second, each population mean is related to weight by the following:
$\mu_i = a + b \cdot \text{weight}_i$
• THE POPULATION MEANS FALL ON A STRAIGHT LINE!!!!
• How many unknowns are there now?
71
Poll: How many unknowns are there?
• 1
• 2
• 3
• 4
• 5
• n
72
Graph
73
Observed, Essential Part, Leftover
74
The Official Regression Model
mpg = a + b*weight + leftover
or
mpg = a + b*weight + error
or
$\text{mpg} = a + b \cdot \text{weight} + \varepsilon$
or
$\text{mpg} = \beta_0 + \beta_1 \cdot \text{weight} + \varepsilon$
• The errors are “known” to be normal with mean 0 and variance $\sigma^2$.
75
Main Assumptions
• The means of the populations fall on a straight line.
• All of the variances are equal ($\sigma^2$).
• The errors are “known” to be normal with mean 0 and variance $\sigma^2$.
76
Assumptions for Simple
Linear Regression
Appendix A
• This demonstration illustrates the
fundamental concepts of simple
linear regression.
77
View/Application Share: Demo:
Linear.doc
78
The Principle of Least Squares:
How Can We Estimate the Unknown Parameters?
• Let $\text{leftover}_i = \text{mpg}_i - (\text{essential part})_i$
or
$\text{leftover}_i = \text{mpg}_i - (a + b \cdot \text{weight}_i)$
or
$r_i = \text{mpg}_i - (a + b \cdot \text{weight}_i)$
• Now, choose a and b so that $r_1^2 + r_2^2 + r_3^2 + r_4^2$ is as small as possible.
or
• Minimize $(r_1^2 + r_2^2 + r_3^2 + r_4^2)$.
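For a straight line, the values of a and b that minimize the sum of squared leftovers have a well-known closed form, sketched below in Python. The four weights are the ones from the MPG example; the mpg values are hypothetical numbers invented for this illustration, and the function names are made up. This is not the SAS output from the lecture.

```python
def least_squares_line(weights, mpgs):
    """Return (a, b) minimizing the sum of squared residuals r_i = mpg_i - (a + b*weight_i)."""
    n = len(weights)
    mean_weight = sum(weights) / n
    mean_mpg = sum(mpgs) / n
    # b = sum of cross-deviations / sum of squared weight deviations
    sxy = sum((w - mean_weight) * (m - mean_mpg) for w, m in zip(weights, mpgs))
    sxx = sum((w - mean_weight) ** 2 for w in weights)
    b = sxy / sxx
    a = mean_mpg - b * mean_weight
    return a, b

def sum_of_squared_residuals(a, b, weights, mpgs):
    """r_1^2 + r_2^2 + ... for the line mpg = a + b*weight."""
    return sum((m - (a + b * w)) ** 2 for w, m in zip(weights, mpgs))

weights = [2600, 2900, 3000, 2300]  # the four weights from the MPG example
mpgs = [24.0, 21.5, 20.0, 26.5]     # hypothetical observations, one per population

a, b = least_squares_line(weights, mpgs)
best = sum_of_squared_residuals(a, b, weights, mpgs)
print(f"a = {a:.2f}, b = {b:.4f}, minimized sum of squares = {best:.3f}")

# Any other line gives a larger sum of squared residuals
print(sum_of_squared_residuals(a + 0.1, b, weights, mpgs) > best)
```

Changing a or b away from the fitted values always increases the sum of squared residuals, which is what "as small as possible" means on the slide.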
79
OUTPUT_0
80
OUTPUT
81
OUTPUT_1
82
OUTPUT_2
83
OUTPUT_3
84
OUTPUT_4
85
Missing Values
• Suppose that we want to estimate the mean
mpg when weight=2500.
• Predicted (Estimated) Mean MPG = 44.05 - 0.0078*weight
• Why does this work?
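As a quick check of the arithmetic, the fitted line can be evaluated at weight = 2500 in Python. The sketch reads the slide's formula as 44.05 minus 0.0078 times weight, on the assumption that the operator lost in the slide text is a minus sign (mpg falls as weight rises). The constant and function names are made up for illustration.

```python
INTERCEPT = 44.05   # estimated a from the slide
SLOPE = -0.0078     # estimated b; the minus sign is an assumption (see above)

def predicted_mean_mpg(weight):
    """Estimated population mean mpg for cars of the given weight."""
    return INTERCEPT + SLOPE * weight

print(round(predicted_mean_mpg(2500), 2))  # 24.55
```

One possible reading of why this works: the model assumes that all the population means fall on one straight line, so once a and b are estimated from the weights that were observed, the same line gives an estimate for a weight that was never sampled.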
86
Survey
• Can anyone explain why this works?
87
Conclusion
– Simple linear regression is very powerful.
– But it is based on assumptions (what we “know”).
– We need to check assumptions (residual
analyses).
88