Chapter 7
Scatterplots, Association, and Correlation
Examining Relationships

Relationship between two variables

Examples:
• Height and Weight
• Alcohol and Body Temperature
• SAT Verbal Score and SAT Math Score
• High School GPA and College GPA

Two Types of Variables

• Response Variable (Dependent)
    • Measures an outcome of the study.
• Explanatory Variable (Independent)
    • Used to explain the response variable.
• Example: Alcohol and Body Temp
    • Explanatory Variable: Alcohol
    • Response Variable: Body Temperature

Two Types of Variables

• Calling a variable explanatory does not mean that it causes the response variable.
    • It only helps explain the response.
• Sometimes there are no true response or explanatory variables.
    • Ex. Height and Weight
    • Ex. SAT Verbal and SAT Math Scores

Graphing Two Variables

• Plot of explanatory variable vs. response variable
    • Explanatory variable goes on horizontal axis (x)
    • Response variable goes on vertical axis (y)
    • If response and explanatory variables do not exist, you can plot the variables on either axis.
• This plot is called a scatterplot.
• This plot can only be used if explanatory and response variables are both quantitative.

Scatterplots


• Scatterplots show patterns, trends, and relationships.
• When interpreting a scatterplot (i.e., describing the relationship between two variables), always look at the following:
    • Overall Pattern
        • Form
        • Direction
        • Strength
    • Deviations from the Pattern
        • Outliers

Interpreting Scatterplots

• Form
    • Is the plot linear or is it curved?
• Strength
    • Does the plot follow the form very closely or is there a lot of scatter (variation)?

Interpreting Scatterplots

• Direction
    • Is the plot increasing or is it decreasing?
    • Positively Associated
        • Above (below) average values of one variable tend to be associated with above (below) average values of the other variable.
    • Negatively Associated
        • Above (below) average values of one variable tend to be associated with below (above) average values of the other variable.

Example – Scatterplot

The following survey was conducted in
the U.S. and in 10 countries of Western
Europe to determine the percentage of
teenagers who had used marijuana and
other drugs.
Example – Scatterplot
Percent who have used

Country             Marijuana   Other Drugs
Czech Republic          22           4
Denmark                 17           3
England                 40          21
Finland                  5           1
Ireland                 37          16
Italy                   19           8
Northern Ireland        23          14
Norway                   6           3
Portugal                 7           3
Scotland                53          31
United States           34          24

Example – Scatterplot
[Figure: scatterplot "Percent who have used Marijuana vs Other Drugs", with Marijuana on the x-axis (0 to 60) and Other Drugs on the y-axis (0 to 35). Image not available.]

Example – Scatterplot

• The variables are interchangeable in this example.
    • In this example, Percent of Marijuana is being used as the explanatory variable (since it is on the x-axis).
    • Percent of Other Drugs is being used as the response since it is on the y-axis.

Example - Scatterplot
• The form is linear.
• The strength is fairly strong.
• The direction is positive, since larger values on the x-axis tend to go with larger values on the y-axis.
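
The "above average goes with above average" description of positive association can be checked directly on the survey percentages. The sketch below is only an illustration, not part of the original slides: it counts the countries whose two percentages fall on the same side of their respective means. The variable names are made up for this example.

```python
# Percentages from the survey table, in the same country order
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]

mean_m = sum(marijuana) / len(marijuana)
mean_o = sum(other_drugs) / len(other_drugs)

# A positive product of deviations means both values are above average
# or both are below average for that country.
same_side = sum(
    1
    for m, o in zip(marijuana, other_drugs)
    if (m - mean_m) * (o - mean_o) > 0
)

print(same_side, "of", len(marijuana), "countries are on the same side of both means")
```

Most countries landing on the same side of both means is what a positive direction looks like in the raw numbers.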

Example - Scatterplot
• Negative association: outside temperature and amount of natural gas used

[Figure: scatterplot of Gas (y-axis) vs. Temp (x-axis). Image not available.]

Correlation
• The strength of the linear relationship between two quantitative variables can be described numerically.
• This numerical method is called correlation.
• Correlation is denoted by r.

Correlation

• A way to measure the strength of the linear relationship between two quantitative variables.

$$ r = \frac{1}{n-1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x s_y} $$

Correlation

• Steps to calculate correlation:
    • Calculate the mean of x and y
    • Calculate the standard deviation for x and y
    • Calculate $(x - \bar{x})(y - \bar{y})$
    • Plug all numbers into formula
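
The four steps can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not code from the original slides: the function name `correlation` and the toy data are made up, and the standard deviations use the n - 1 denominator.

```python
import math

def correlation(xs, ys):
    """Sample correlation r, computed step by step."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    # Step 1: means of x and y
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Step 2: sample standard deviations (n - 1 in the denominator)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

    # Step 3: products of deviations, summed over all points
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Step 4: plug everything into the formula
    return sum_products / ((n - 1) * s_x * s_y)

# Toy data (not from the slides): points lying exactly on y = 2x + 1
r_line = correlation([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r_line, 6))
```

Points that fall exactly on an increasing straight line should give r equal to 1, up to floating-point rounding.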


Correlation
[Figure: scatterplot "Femur vs. Humerus", with Femur on the x-axis (0 to 80) and Humerus on the y-axis (0 to 100). Image not available.]

Calculating r.

Femur (x):    38  56  59  63  74
Humerus (y):  41  63  70  72  84

• Set up a table with columns for x, y, $x - \bar{x}$, $y - \bar{y}$, $(x - \bar{x})^2$, $(y - \bar{y})^2$, and $(x - \bar{x})(y - \bar{y})$.

Calculating r.

          x     y    x - x̄   y - ȳ   (x - x̄)^2   (y - ȳ)^2   (x - x̄)(y - ȳ)
         38    41     -20     -25       400         625           500
         56    63      -2      -3         4           9             6
         59    70       1       4         1          16             4
         63    72       5       6        25          36            30
         74    84      16      18       256         324           288
Total   290   330       0       0       686        1010           828

Calculating r

• Recall: $\bar{y} = \frac{\sum y}{n}$
• So,

$$ \bar{x} = \frac{290}{5} = 58 \qquad \bar{y} = \frac{330}{5} = 66 $$

Calculating r

• Recall: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
• So,

$$ s_x = \sqrt{\frac{686}{4}} \approx 13.1 \qquad s_y = \sqrt{\frac{1010}{4}} \approx 15.9 $$
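
As a cross-check on the hand calculation, the means and standard deviations can be reproduced with Python's standard `statistics` module. This is a sketch for verification only; the variable names are made up for the example. `statistics.stdev` uses the n - 1 denominator, matching the formula recalled here.

```python
import statistics

femur = [38, 56, 59, 63, 74]    # x
humerus = [41, 63, 70, 72, 84]  # y

x_bar = statistics.mean(femur)
y_bar = statistics.mean(humerus)

# Sample standard deviations (n - 1 in the denominator)
s_x = statistics.stdev(femur)
s_y = statistics.stdev(humerus)

print("means:", x_bar, y_bar)
print("standard deviations:", round(s_x, 1), round(s_y, 1))
```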
Calculating r.

• Put everything into the formula:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} = \frac{828}{(5 - 1)(13.1)(15.9)} = 0.994 $$
Properties of r
• r has no units (i.e., just a number)
• Measures the strength of a LINEAR association between two quantitative variables


    • If the data have a curvilinear relationship, the correlation may not be strong even if the data follow the curve closely.

Properties of r

• r always ranges in value from -1 to 1
    • r = 1 indicates a straight increasing line
    • r = -1 indicates a straight decreasing line
    • r = 0 indicates no LINEAR relationship
    • As r moves away from 0, the linear relationship between variables is stronger
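
A small made-up data set illustrates these values of r, including the point that r = 0 rules out only a LINEAR relationship. In the sketch below, which is not from the original slides, y is determined exactly by x through y = x^2, yet r comes out as 0 because the pattern is a symmetric curve. The helper `correlation` is a compact, algebraically equivalent form of the formula for r, since $(n - 1) s_x s_y = \sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}$.

```python
import math

def correlation(xs, ys):
    """Sample correlation r (compact form of the same formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]           # made-up x values
r_up = correlation(xs, [2 * x + 1 for x in xs])     # straight increasing line
r_down = correlation(xs, [-2 * x + 1 for x in xs])  # straight decreasing line
r_curve = correlation(xs, [x ** 2 for x in xs])     # perfect curve, not a line

print(r_up, r_down, r_curve)
```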

Properties of r
• Changing the scale of x or y will not change the value of r
• Not resistant to outliers
• Strong correlation ≠ Causation
    • A strong linear relationship between two variables is NOT proof of a causal relationship!
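
The first two properties can be seen numerically with the femur and humerus data. The sketch below is an illustration under stated assumptions: the factor of 10 stands in for a hypothetical change of units, and the extra point (80, 40) is an invented outlier, not a real observation. Rescaling leaves r unchanged, while the single outlier pulls r from about 0.99 down to roughly 0.27.

```python
import math

def correlation(xs, ys):
    """Sample correlation r (compact form of the same formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

femur = [38, 56, 59, 63, 74]
humerus = [41, 63, 70, 72, 84]

r_original = correlation(femur, humerus)

# Hypothetical change of units: multiply every femur value by 10
r_rescaled = correlation([10 * x for x in femur], humerus)

# Invented outlier (80, 40): long femur paired with a short humerus
r_outlier = correlation(femur + [80], humerus + [40])

print(round(r_original, 3), round(r_rescaled, 3), round(r_outlier, 3))
```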

The following is some output from JMP
where I considered Blood Alcohol Content
and Number of Beers. The explanatory
variable is the number of beers. Blood
alcohol content is the response variable.
[Figure: "Bivariate Fit of BAC By Beers": scatterplot of BAC (y-axis, 0 to 0.2) vs. Beers (x-axis, 0 to 10). Image not available.]

Summary of Fit

RSquare                       0.803536
RSquare Adj                   0.788424
Root Mean Square Error        0.02092
Mean of Response              0.076
Observations (or Sum Wgts)    15
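
The output reports RSquare rather than r. Assuming this is a simple linear fit of BAC on Beers (which the title "Bivariate Fit" suggests), RSquare is the square of the correlation, so r can be recovered with a square root. The sketch below is not part of the original slides. It takes the positive root on the assumption that BAC increases with the number of beers, and it also checks that the value 0.788424 in the output agrees with the usual adjusted R-square formula for one explanatory variable.

```python
import math

r_square = 0.803536   # RSquare from the JMP output
n = 15                # number of observations

# Positive root: assumes BAC tends to increase with the number of beers
r = math.sqrt(r_square)

# Adjusted R-square for one explanatory variable: 1 - (1 - R^2)(n - 1)/(n - 2)
r_square_adj = 1 - (1 - r_square) * (n - 1) / (n - 2)

print(round(r, 4), round(r_square_adj, 5))
```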