Download (b/w) two measurement or quantitative variables

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Probability wikipedia , lookup

Statistics wikipedia , lookup

Transcript
Chapter 10: Relationships between (b/w) two measurement or quantitative
variables
These variables that involve measuring or counting. E.g. age, weight, height, quiz
averages related to exam score
Deterministic Relationship: we have a perfect correspondence b/w the two
variables.
E.g. Centimeters to inches
Statistical Relationship: Four students have 92.2% average on quizzes. Exam
scores were 73.3%, 86.7%, 86.7%, and 96.7%.
A statistically significant relationship is a relationship observed in a sample that
would have been unlikely to occur if really there is no relationship in the larger
population.
If we can show a linear relationship exists we can employ regression methods to
construct a line.
1. Plot the two variables: make a scatter plot of exam scores and the quiz
averages.
2. To measure the strength and direction of a linear relationship between two
measurement variables we use correlation. The symbol used to represent
the sample correlation is r (lower case r).
Positive
Negative
Which of the following correlations indicate the strongest linear relationship?
A. -.9 (Stronger because it is closer to 1 or -1) Negative slope
B. 1.15 (beyond possible range)
C. .85 (second strongest because it is close to 1 or -1) Positive slope
D. 0 (no correlation)
Our quiz average and exam 1 the correlation was r=0.65
Next we run a Statistical test of the significance on the correlation. We want to look
for statistical significance.
Example: You are a suspect in a criminal investigation.
To begin with you are considered innocent. Assuming this is true, do you
believe it likely or unlikely to gather evidence against that assumption?
Probability Value (p-value): the probability the data produces a result assuming
the null hypothesis is true. The p-value is the probability that the sample data
produces a result given that the true or population correlation is zero. Therefore the
smaller the p-value the stronger the evidence against the true correlation being
zero.
1. Null Hypothesis, Ho
2. Legal example: Ho: Innocent
3. Class example: Ho: correlation is 0
4. Small p-values are signs of evidence against the Ho. We base the idea of small
by using a baseline value called Level of Significance.
5. The most common level of significance is 5% or 0.05.
6. If the p-value is less than 0.05 we decide to reject the null hypothesis and
conclude the opposite.
7. Correlation between quiz average and exam scores r= .650, P-value was
.0004.
8. (P-value < .05) We reject null hypothesis and conclude the population
correlation is not 0.
9. This would indicate then a linear relationship between quiz averages and
exam score.
10. Algebra: Y=mx+b, Y=mx(slope)+b -> y-intercept
11. Statistics: y = b0 + b1 x,
y = b0 + b1 x=> b0(y-intercept)+b1(slope)x
12. Exam: 33.1+0.60x
13. (Response variable) Exam= 33.1+0.60 quiz Avg x (explanatory variable)
14. Slope: for a unit increase in x, Y will increase/decrease its units by the value
of the slope.
What Exam score would we predict for an exam score of 70?
Exam = 33.1 + 0.60 * 70
=33.1 + 42
= 75.1
- Remember slope and correlation are directly relate.
o I.e. if slope is positive and so is correlation and vice versa.
- Correlation has a range: -1  r  1, where sign only reflects direction
- P-value: 0  p-value  1
o The smaller the p-value the stronger the evidence against null
hypothesis.
- Just because two variables are related does not mean one variable causes the
other to occur.
- Be aware of extrapolation (extending results beyond range of data used in
analysis.)
o E.S. Study don on age and weight for babies 0 to 22 months.
o Regression equation: Weight= 6 + 0.9 Age in months
o My age: weight= 6 + 0.9(552)= 503 pound
- When we reject a known hypothesis we have statistically significantly
evidence. This does not necessarily practical evidence. E.S. calf rises in
jumping.
- Outliers can raise issues with regression.
o There are two types of outliers:
 General- Point A

Influential- Point B
 Has big impact on regression,
-

When I provided correlation values did I ever use units?
o Correlation is unit-free so if we were to change the units on one
variable (e.g. change quiz average from % to decimal) the correlation
would be unchanged.
Ch 12 and 13: Comparing 2 Categorical Variables
What is you gender? M/F
Do you believe female’s sports should receive equal finances as men’s sports?
To “organize” responses data we create a table where we have “r” number of rows
fore one variable and “c” number of columns for the other variable where r and c are
the number of levels for each variable.
(“R x C” rows by columns)
For rows we put explanatory variables; response on columns
No
72
181
Females
Males
Yes
466
158
Instead of just looking at counts, we look at percents.
Females
Males
Total
No
72
181
253
Yes
466
158
624
Total
538
339
877
% No
13.38
53.39
100
What is the probability a student selected from survey is Yes? 624/877= .7115
% Yes
86.62
46.61
100
What proportion said yes? 624
What percentage said yes? 624/877 *100= 71.15%
What is the risk that a student said yes?
Risk: number in-group of interest (or the number with trait) divided by total
number.
624/877=.7115
What are the odds?
Odds: # in group of interest/# in group not of interest or # with trait / #
without trait.
624/253= 2.5 to 1 (simplified)
The odds of a student saying yes are 2.5 to 1.
Horseracing: Betting 2 to 1, trait of interest is loosing, out of 3 races you will loose
two times.
What is the probability, risk, and proportion that a male student surveyed says
yes?
# with trait (158) / total in group (339)= .4661, Percentage= 46.61%,
Odds= # with trait / # without trait= 158/181 (male yes/ male no) = 7/8, for every
15 males 7 will say yes, 8 will say no.
Relative Risk (RR) (“more likely”)
What is RR that females say yes to males say yes?
Risk female yes/ Risk male risk=?
Female risk= 466/538
Male risk=
158/339 (denominator risk known as baseline risk)
RR= 1.85 (meaning females are 1.85 time more likely to say yes than males.)
Increased Risk
IR= (change in risk/baseline risk)*100 OR (relative risk-1.0)*100
Odds Ratio (OR)
What is OR that females say yes to male say yes?
Female odds=
466/72
Male odds=
158/181
OR= 7.4 (meaning the odds females say yes is 7.4 times the odds males say
yes.)
The importance of knowing baseline risk:
Hypothetical: A study reports that females who binge drink are 10 times more
likely to develop liver disease than females who do not binge drink.
Without knowing baseline risk for females who do not binge drink in getting liver
disease we can’t make an educated interpretation of this relative risk of 10 times.
What if baseline risk was 1/10,000?
That make the risk of females that do binge drink getting liver disease 10 * 1/10,000
or 1/1000.
Ch 13
#1
Female
Male
No
269/50%
170/ 50%
Yes
269/ 50%
169/ 50%
Total
538
339
#2
Female
Male
No
160/ 30%
102/ 30%
Yes
371/ 70%
237/ 70%
Total
538
339
#3
Female
Male
No
72/ 13%
181/ 53%
Yes
466/ 86%
158/ 46%
Total
538
339
We conduct a Chi-square test of Independence
Ho: There is no relationship in the population between the two variables (i.e.
Independent)
Ho: There is a relationship in the population between the two variables (i.e.
dependent)
We get a p-value comparing to .05; if p-value less than .05 we reject Ho and conclude
a relationship exists between the two variables.
Step 1: Get data and create Observed Data table (chart#3)
Step 2: using this observed data, create an expected counts table under assumption
of interdependence. We find this by finding the row and column totals compared to
overall total.
Expected Table
Female
Male
No
538/253=877
339/253=877
Yes
538/624=877
339/624=877
Expected Count
No
Yes
Female
155.2 (28.85%)
382.8 (71.15%)
Male
97.8 (28.85%)
241.2 (71.15)
Step 3: Calculate a chi-square statistic by:
Chi-square=> X2= (observed-Expected)^2 / Expected
X2= (72-155.2)^2/155.2
+ (466-382.8)^2/382.8
+(181-97.8)^2/97.8
+(158-241.2)^2/242.2
Total= 161.18
Step 4: We get a p-value e.g. p-value is approximately .000
Step 5: Compare p-value to .05 e.g. .000 is less than .005, so we reject Ho and
conclude that relationship exists in the population between gender and attitude
about equal financial support.
(Can’t draw a cause because of no random assignment.)
(Random selection, the population is strictly that sample.)
A primary concern in tests of independence is the existence of confounding or
lurking variables. Over looking such variables can lead to a phenomenon known as
Simpsons Paradox, where the results reported from aggregate data differ from
results when the lurking confounding variable is included.
Classic example of Simpson’s Paradox
In 1972 Supreme Court ruled that racial disparities in assigning punishment in
capital cases be eliminated. From 1976 through 1977 a gentleman named Revolet
looked at capital case data in Florida.
Defendant
White
Black
Total
Death Penalty
No
141
149
290
Death Penalty
Yes
19
17
36
Total
%YES
160
166
11.9%
10.2%
Considered the lurking variable of victims race: Victim being White
Defendant
No
Yes
Total
% Yes
White
132
19
151
12.6%
Black
52
11
63
17.5%
Victim being Black
Defendant
No
White
9
Black
97
Yes
0
6
Total
9
103
No
141
149
%
0%
Chapter 16: Probability
Q: What is the probability that PSU men’s basketball team wins this Tuesday at Ohio
State?
10%, 25%, 37% All arrived by personal opinion.
My estimate is 24.5%. My probability was based on “relative frequency” In this case
PSU has won 27 out of 137 road games since joining Big 10.
Which of the following (or both) are true?
1. Law of large numbers: Over a long run period the expected mean or
proportion of the population would be experienced. E.g. flip a coin 1000
times; we would expect about 500 heads and tails.
2. Law of small numbers: If a pattern occurs. E.g. 6 tails in a row then we are
more likely to see a head on the next flip. Not True!
Probability is the chance that something is the case or that an event will occur.
Range of probabilities is 0 to 1.
Trial is one number of repetitions in an experiment/study.
Sample Space is the set all possible outcomes. E.g. coin flip->head, tail; die roll->
1,2,3,4,5,6
Subset of Sample Space: E.g. get a tail on a coin flip; E.g. get an even number when
die rolled.
Complement is a set of outcomes not in the event of interest. E.g. if event of interest
is “get a tail on coin flip” a complement is “not a tail” I.E. get a head. E.g. If event is
get an even number on a roll of die” I.E. we get a 2 or 4 or 6. Complement would be
1, 3, or 5.
Intersection is when two or more events overlap. i.e. share common outcomes. E.g.
let event E be get even number on die roll (2,4,6), Let S be event we get a 6.
Union: the event of all outcomes of two or more events. E.g. Let E be even numbers
on role of die, let f be get a 5, the union of events E and F would be (2,4,5,6)
Conditional Probability is the probability that event occurs given that another
event has already happened. E.g. Let E be event get an even number on roll of die.
Say we know the outcome is an even number, what is probability we got a 6? 1/3
Independent trials one event does not influence the other event. E.g. coin flip.
Mutually Exclusive of Disjoint Events: events that do not have any common
outcomes. Both events cannot occur at the same time. E.g. let E be even number on
roll of die, Let F be we get a 5.
Simple Events: are events of only one outcome. E.g. getting a 5.
Notation:
If we define two events as A, and B then
P(A): is probability that event A occurs
P(AC ) or P(A1): the complement of A
P(AorB) or P(AUB: The union of A and B
A stat class has 4 teaching assistants.
3 females( Lauren, Rona, Leila)
1 male (Tom)
Each Ta is on chare of one section, Let event W be the Ta is female and let event M
be TA is male.
1) Which event is a simple event? Event M
2) Are the events W and M mutually exclusive? Yes Since the event do not
overlap. I.E. share any common outcomes.
3) If two students, unknown to each other, randomly select a TA are their
choices independent? Yes
4) What is probability that a student randomly selects Lauren? ¼
5) You are told your section TA is female. What is probability that Lauren is
your TA? 1/3
Review
30 questions
22-24 from this exam
8 from last exam
From Exam 1
-Table 8.1, percentiles
-Interpretation, Emperical rule, Identifiing shape
-Random Assignment vs. Random Selection
Study quizzes, exam, and notes
From Exam 2
Survey: asked students their gender and if they have ever cheated while in a
committed relationship.
No
Yes
Female
59
30
89
Male
29
9
38
127
What is the risk of females cheating? 30/89
Odds? 30/59
From the table, is the probability a randomly selected student cheats a personal
opinion or a relative frequency? Relative frequency
What would probability be? 39/127
What is the odds ratio for men having cheated compared to women?
(9/29)/(30/59)
What is the relative risk of males to females?
(9/38)/(30/89)
The P-value: .262, what conclusion can be made? Failing to reject/not statistically
significant, we cannot say a relationship exists between gender and cheating.
Did this data come from an observational study or an experiment? Observational
study because of no random assignment.
What if the student’s surveyed were randomly selected? Still Observational Study.
If the results were statistically significant could we say gender causes the cheating
(less than .005)? No because we had no random assignment.
Which statistical method would be more appropriate for an analyzing the variables
weight and height? Regression or Chi-Squared?
Regression for quantitative/non categorical since both variables are measurement.
Chi-Squared for categorical.
Which following p-values represents the weakest statistical evidence?
1.5 (larger than one so its thrown out)
.9 (weakest)
0.01
0.001 (strongest)
-0.05 (Negative so its thrown out)
Which following values represents the strongest correlation?
1.5 (larger than one so its thrown out)
.9 strongest
0.01
0.001
0.08