Chapter 1: Data Analysis
Statistics is the science of data.
Data Analysis is the process of organizing, displaying, summarizing, and asking questions about data.
Definitions:
Individuals – objects (people, animals, things) described by a set of data
Variable – any characteristic of an individual
Categorical Variable – places an individual into one of several groups or categories
Quantitative Variable – takes numerical values for which it makes sense to find an average
Why the distinction is important
You will receive NO credit (really!) on the AP exam if you construct a graph that isn't appropriate for that type of data.

Type of Variable    Appropriate Graph
Categorical         Pie Chart, Bar Graph
Quantitative        Dotplot, Stemplot, Histogram

Examining the Distribution of a Quantitative Variable
The purpose of a graph is to help us understand the data. After you make a graph, always ask, "What do I see?"
In any graph, look for the overall pattern and for striking departures from that pattern.
Describe the overall pattern of a distribution by its:
• Shape
• Center
• Spread
Don't forget your SOCS!
Note individual values that fall outside the overall pattern. These departures are called outliers.
Displaying Quantitative Data: Construct a Boxplot
Consider our NY travel times data. Construct a boxplot.

Travel times (minutes): 10, 30, 5, 25, 40, 20, 10, 15, 30, 20, 15, 20, 85, 15, 65, 15, 60, 60, 40, 45
Sorted: 5, 10, 10, 15, 15, 15, 15, 20, 20, 20, 25, 30, 30, 40, 40, 45, 60, 60, 65, 85

Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
Recall, the maximum (85) is an outlier by the 1.5 × IQR rule.
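As a quick check of the five-number summary and the 1.5 × IQR rule, here is a minimal Python sketch (assuming NumPy is available; quartiles are computed as medians of the lower and upper halves, the convention used in class, which can differ slightly from software defaults):

```python
import numpy as np

# NY travel times (minutes), from the slide
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

x = np.sort(np.array(times))
n = len(x)

# Quartiles as medians of the lower and upper halves
median = np.median(x)
q1 = np.median(x[: n // 2])
q3 = np.median(x[(n + 1) // 2 :])
iqr = q3 - q1

# 1.5 x IQR rule for outliers
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low_fence) | (x > high_fence)]

print(f"Min={x[0]}, Q1={q1}, M={median}, Q3={q3}, Max={x[-1]}")
print(f"IQR={iqr}, fences=({low_fence}, {high_fence}), outliers={outliers}")
# Expected: Min=5, Q1=15.0, M=22.5, Q3=42.5, Max=85; 85 flagged as an outlier
```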
Describing Quantitative Data: Transformations
Think about this like a curve on a test. If I add 5 points to everyone's test, how will the class average change? How will the spread of the grades change?
Adding, subtracting, multiplying, and dividing all the numbers in a data set is called "transforming" the data.
ONE OF THE CENTRAL CONCEPTS FROM THIS CHAPTER is knowing how transformations affect measures of center and spread. I WILL TEST ON THIS. OFTEN. A LOT.
• Adding or subtracting: the mean gets added or subtracted by the same amount; the standard deviation doesn't change.
• Multiplying or dividing: the mean gets multiplied or divided by the same amount; the standard deviation gets multiplied or divided by the same amount.
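A short sketch of this idea, using a hypothetical set of test scores (NumPy assumed):

```python
import numpy as np

# Hypothetical test scores to illustrate the effect of transformations
scores = np.array([70, 75, 80, 85, 90])
print(scores.mean(), scores.std(ddof=1))     # 80.0, ~7.9

# Adding a constant: the mean shifts by 5, the standard deviation is unchanged
curved = scores + 5
print(curved.mean(), curved.std(ddof=1))     # 85.0, ~7.9

# Multiplying by a constant: mean and standard deviation are both multiplied by 1.1
scaled = scores * 1.1
print(scaled.mean(), scaled.std(ddof=1))     # 88.0, ~8.7
```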
Describing Location in a Distribution: Measuring Position with z-Scores
A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.
Definition:
If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is
z = (x − mean) / (standard deviation)
A standardized value is often called a z-score.
Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?
z = (86 − 80) / 6.07 ≈ 0.99
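A one-line check of Jenny's z-score in Python:

```python
# z-score: z = (x - mean) / (standard deviation), using the numbers from the example
x, mean, sd = 86, 80, 6.07
z = (x - mean) / sd
print(round(z, 2))   # ~0.99
```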
Mean and Median of a Density Curve
• Symmetric: Mean = Median
• Skewed Left: Mean < Median
• Skewed Right: Mean > Median
The median of a density curve is the equal-areas point, where ½ of the area is to the left and ½ of the area is to the right.
The mean of a density curve is the balance point, where the curve would balance if it were made of solid material.

Normal Distributions: The Standard Normal Distribution
All Normal distributions are the same if we measure in units of size σ from the mean µ as center.
Definition:
The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(µ, σ) with mean µ and standard deviation σ, then the standardized variable
z = (x − µ) / σ
has the standard Normal distribution, N(0, 1).
Scatterplots and Correlation: Explanatory and Response Variables
Most statistical studies examine data on more than one variable. In many of these settings, the two variables play different roles.
Definition:
A response variable (the dependent variable) measures an outcome of a study.
An explanatory variable (the independent variable) may help explain or influence changes in a response variable.
Note: In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don't involve direct causation.
Scatterplots and Correlation: Measuring Linear Association with Correlation
A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables.
Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is.
Definition:
The correlation r measures the strength of the linear relationship between two quantitative variables.
• r is always a number between -1 and 1.
• r > 0 indicates a positive association.
• r < 0 indicates a negative association.
• Values of r near 0 indicate a very weak linear relationship.
• The strength of the linear relationship increases as r moves away from 0 toward -1 or 1.
• The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
Scatterplots and Correlation: Facts about Correlation
How correlation behaves is more important than the details of the formula. Here are some important facts about r.
1. Correlation makes no distinction between explanatory and response variables.
2. r does not change when we change the units of measurement of x, y, or both.
3. The correlation r itself has no unit of measurement.
Cautions:
• Correlation requires that both variables be quantitative.
• Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
• Correlation is not resistant. r is strongly affected by a few outlying observations.
• Correlation is not a complete summary of two-variable data.
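A small sketch of fact 2, using hypothetical height and weight data (NumPy assumed): changing the units leaves r unchanged, and r itself carries no units.

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds)
height_in = np.array([62, 65, 68, 70, 73])
weight_lb = np.array([120, 140, 155, 170, 190])

r = np.corrcoef(height_in, weight_lb)[0, 1]

# Convert to centimeters and kilograms: the correlation is identical
r_metric = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]
print(round(r, 4), round(r_metric, 4))
```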
Least-Squares Regression: Correlation and Regression Wisdom
Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, be aware of their limitations.
1. The distinction between explanatory and response variables is important in regression.
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.
Definition:
An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
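A minimal sketch of an influential point, using hypothetical data and NumPy's polyfit for the least-squares line: removing the point that is an outlier in the x direction markedly changes the slope.

```python
import numpy as np

# Hypothetical data; the last point is an outlier in the x direction
x = np.array([1, 2, 3, 4, 5, 20.0])
y = np.array([2, 4, 5, 7, 9, 5.0])

slope_all, intercept_all = np.polyfit(x, y, 1)        # fit with the x-outlier
slope_wo, intercept_wo = np.polyfit(x[:-1], y[:-1], 1)  # fit without it

# The slope changes dramatically, so the point is influential
print(round(slope_all, 2), round(slope_wo, 2))   # ~0.05 vs ~1.7
```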
Sampling and Surveys: Example – Sampling at a School Assembly
Describe how you would use the following sampling methods to select 80 students to complete a survey.
(a) Simple Random Sample
(b) Stratified Random Sample
(c) Cluster Sample
Observational Study versus Experiment
The distinction between observational study and experiment is one of the most important in statistics.
Definition:
An observational study observes individuals and measures variables of interest but does not impose treatment on the individuals.
An experiment deliberately imposes some treatment on individuals to measure their responses.
In contrast to observational studies, experiments don't just observe individuals or ask them questions. They actively impose some treatment in order to measure the response. When our goal is to understand cause and effect, experiments are the only source of fully convincing data.
The Completely Randomized Comparative Experiment
Experimental units are split by random assignment into Group 1, which receives Treatment 1, and Group 2, which receives Treatment 2 (a placebo); the responses of the two groups are then compared.
• Random assignment illustrates the principle of RANDOMIZATION.
• Choosing an adequately large sample ensures REPLICATION.
• Having a control group illustrates the principle of CONTROL.
Inference for Experiments
In an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation.
We can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see if only chance were operating. If they are, we call them statistically significant.
Definition:
An observed effect so large that it would rarely occur by chance is called statistically significant.
A statistically significant association in data from a well-designed experiment does imply causation.
Probability Rules: Basic Rules of Probability
• For any event A, 0 ≤ P(A) ≤ 1.
• If S is the sample space in a probability model, P(S) = 1.
• In the case of equally likely outcomes,
  P(A) = (number of outcomes corresponding to event A) / (total number of outcomes in sample space)
• Complement rule: P(A^C) = 1 − P(A)
• Addition rule for mutually exclusive events: If A and B are mutually exclusive, P(A or B) = P(A) + P(B).
Probability Rules: Venn Diagrams and Probability
Recall the example on gender and pierced ears. We can use a Venn diagram to display the information and determine probabilities.
Define events A: is male and B: has pierced ears.
Conditional Probability and Independence
When knowledge that one event has happened does not change the likelihood that another event will happen, we say the two events are independent.
Definition:
Two events A and B are independent if the occurrence of one event has no effect on the chance that the other event will happen. In other words, events A and B are independent if P(A | B) = P(A) and P(B | A) = P(B).
Example:
Are the events "male" and "left-handed" independent? Justify your answer.
P(left-handed | male) = 3/23 ≈ 0.13
P(left-handed) = 7/50 = 0.14
These probabilities are not equal, therefore the events "male" and "left-handed" are not independent.
Conditional Probability and Independence: Calculating Conditional Probabilities
General Multiplication Rule: P(A ∩ B) = P(A) • P(B | A)
If we rearrange the terms in the general multiplication rule, we can get a formula for the conditional probability P(B | A).
Conditional Probability Formula: To find the conditional probability P(B | A), use the formula
P(B | A) = P(A ∩ B) / P(A)
Conditional Probability and Independence: Example – Who Reads the Newspaper?
In Section 5.2, we noted that residents of a large apartment complex can be classified based on the events A: reads USA Today and B: reads the New York Times. The Venn diagram below describes the residents.
What is the probability that a randomly selected resident who reads USA Today also reads the New York Times?
P(A ∩ B) = 0.05 and P(A) = 0.40, so
P(B | A) = P(A ∩ B) / P(A) = 0.05 / 0.40 = 0.125
There is a 12.5% chance that a randomly selected resident who reads USA Today also reads the New York Times.
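The same calculation as a tiny Python sketch:

```python
# Conditional probability from the newspaper example: P(B | A) = P(A and B) / P(A)
p_a = 0.40          # P(reads USA Today)
p_a_and_b = 0.05    # P(reads both papers)

p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)   # 0.125, i.e., a 12.5% chance
```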
Discrete and Continuous Random Variables
There are two types of random variables: discrete and continuous.
Discrete random variables have a finite (countable) number of possible values.
• The table of possible values of x and the associated probabilities is called a PROBABILITY DISTRIBUTION.
• They usually arise out of COUNTING something.
• We use a histogram to graph a discrete random variable.
Continuous random variables take on all values in an interval of numbers. So, there are an infinite number of possible values.
• They usually arise out of MEASURING something.
• The graph of a continuous RV is a density curve.
Example: Young Women's Heights
Read the example on page 351. Define Y as the height of a randomly chosen young woman. Y is a continuous random variable whose probability distribution is N(64, 2.7).
What is the probability that a randomly chosen young woman has height between 68 and 70 inches?
P(68 ≤ Y ≤ 70) = ?
z = (68 − 64) / 2.7 ≈ 1.48 and z = (70 − 64) / 2.7 ≈ 2.22
P(1.48 ≤ Z ≤ 2.22) = P(Z ≤ 2.22) − P(Z ≤ 1.48) = 0.9868 − 0.9306 = 0.0562
There is about a 5.6% chance that a randomly chosen young woman has a height between 68 and 70 inches.
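A sketch of the same calculation, assuming SciPy is available for the standard Normal cdf (the equivalent of Table A):

```python
from scipy.stats import norm

# Heights example: Y ~ N(64, 2.7); find P(68 <= Y <= 70)
mu, sigma = 64, 2.7

# Standardize, then use the standard Normal cdf
z_low = (68 - mu) / sigma    # ~1.48
z_high = (70 - mu) / sigma   # ~2.22
prob = norm.cdf(z_high) - norm.cdf(z_low)
print(round(prob, 4))        # ~0.056, about a 5.6% chance
```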
Transforming and Combining Random Variables: Linear Transformations
How does multiplying or dividing by a constant affect a random variable?
Effect on a Random Variable of Multiplying (Dividing) by a Constant
Multiplying (or dividing) each value of a random variable by a number b:
• Multiplies (divides) measures of center and location (mean, median, quartiles, percentiles) by b.
• Multiplies (divides) measures of spread (range, IQR, standard deviation) by |b|.
• Does not change the shape of the distribution.
Note: Multiplying a random variable by a constant b multiplies the variance by b².
Transforming and Combining Random Variables: Linear Transformations
How does adding or subtracting a constant affect a random variable?
Effect on a Random Variable of Adding (or Subtracting) a Constant
Adding the same number a (which could be negative) to each value of a random variable:
• Adds a to measures of center and location (mean, median, quartiles, percentiles).
• Does not change measures of spread (range, IQR, standard deviation).
• Does not change the shape of the distribution.
Transforming and Combining Random Variables: Combining Random Variables
We can perform a similar investigation to determine what happens when we define a random variable as the difference of two random variables. In summary, we find the following:
Mean of the Difference of Random Variables
For any two random variables X and Y, if D = X − Y, then the expected value of D is
E(D) = µD = µX − µY
In general, the mean of the difference of several random variables is the difference of their means. The order of subtraction is important!
Variance of the Difference of Random Variables
For any two independent random variables X and Y, if D = X − Y, then the variance of D is
σD² = σX² + σY²
In general, the variance of the difference of two independent random variables is the sum of their variances.
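A simulation sketch of these rules, with hypothetical independent Normal random variables (NumPy assumed): the mean of D = X − Y is µX − µY, while the variances add.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random variables with hypothetical means and SDs
x = rng.normal(loc=10, scale=3, size=1_000_000)   # mu_X = 10, sigma_X = 3
y = rng.normal(loc=4, scale=2, size=1_000_000)    # mu_Y = 4,  sigma_Y = 2

d = x - y
print(round(d.mean(), 2))   # ~6  = mu_X - mu_Y
print(round(d.var(), 2))    # ~13 = sigma_X^2 + sigma_Y^2 (variances ADD)
```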
Today, we'll learn about the basis for the binomial calculations: the binomial formula.
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
Notice the C(n, k) term. This is a combinatorial. It is read "n choose k."
p stands for the probability of success, n represents the number of observations, and k is the value of X for which you're asked to find the probability.
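A minimal sketch of the binomial formula in Python, using a hypothetical n, p, and k (math.comb supplies the "n choose k" term):

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) for a binomial RV with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical example: exactly 3 successes in 10 trials with p = 0.25
print(round(binomial_pmf(10, 0.25, 3), 4))   # ~0.2503
```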
Comparison of Binomial to Geometric

Binomial:
• Each observation has two outcomes (success or failure).
• The probability of success is the same for each observation.
• The observations are all independent.
• There is a fixed number of trials.
• So, the random variable is how many successes you get in n trials.

Geometric:
• Each observation has two outcomes (success or failure).
• The probability of success is the same for each observation.
• The observations are all independent.
• There is a fixed number of successes (1).
• So, the random variable is how many trials it takes to get one success.
Mean & Variance of a Geometric RV
The formula for the mean of a geometric RV is
µX = 1 / p
The formula for the variance of a geometric RV is
σX² = (1 − p) / p²
These formulas are NOT given to you on the exam.
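A quick simulation check of these formulas (NumPy assumed; p = 0.2 is a hypothetical value):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2   # hypothetical success probability

# Simulate many geometric RVs (number of trials needed to get the first success)
trials = rng.geometric(p, size=1_000_000)

print(round(trials.mean(), 2))   # ~5.0  = 1/p
print(round(trials.var(), 2))    # ~20.0 = (1 - p)/p^2
```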
We've studied two large categories of RVs: discrete and continuous.
• Among the discrete RVs, we've studied the binomial and geometric.
  – The graph of a binomial RV can be skewed left, symmetric, or skewed right, depending on the value of p.
  – The graph of a geometric RV is ALWAYS skewed right. Always.
  – Other discrete RVs can be given to you in the form of a table.
• Among the continuous RVs, we've studied the Normal RVs.
  – To find probabilities of a Normal RV, convert to a z-score and use Table A.
What Is a Sampling Distribution? Parameters and Statistics
As we begin to use sample data to draw conclusions about a wider population, we must be clear about whether a number describes a sample or a population.
Definition:
A parameter is a number that describes some characteristic of the population. In statistical practice, the value of a parameter is usually not known because we cannot examine the entire population.
A statistic is a number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.
Remember s and p: statistics come from samples and parameters come from populations.
We write µ (the Greek letter mu) for the population mean and x̄ ("x-bar") for the sample mean. We use p to represent a population proportion. The sample proportion p̂ ("p-hat") is used to estimate the unknown parameter p.
PCFS

Sample Proportions
• Parameter: p = the proportion of _____ who …
• Conditions:
  Random: SRS
  Normality: np ≥ 10 and n(1 − p) ≥ 10
  Independence: Population ≥ 10n
• Formula: z = (x − mean) / (std. dev)
• Sentence

Sample Means
• Parameter: µ = the mean …
• Conditions:
  Random: SRS
  Normality: Population ~ Normal, OR "since n > 30, the CLT says the sampling distribution of x-bar is Normal."
  Independence: Population ≥ 10n
• Formula: z = (x − mean) / (std. dev)
• Sentence
Confidence Intervals – puzzles for math nerds
statistic ± (critical value) × (standard deviation of statistic)

Mean:
• Parameter: µ
• Statistic: x̄
• Standard deviation of the statistic: σ / √n
• Standard error of the statistic: s / √n

Proportion:
• Parameter: p
• Statistic: p̂
• Standard deviation of the statistic: √( p(1 − p) / n )
• Standard error of the statistic: √( p̂(1 − p̂) / n )
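As one worked instance of "statistic ± critical value × standard error," here is a sketch of a 95% t interval for a mean, using hypothetical data and SciPy for the critical value:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; 95% t interval: x-bar +/- t* · s/sqrt(n)
sample = np.array([12.1, 9.8, 11.4, 10.7, 12.9, 11.2, 10.1, 11.8])
n = len(sample)
xbar, s = sample.mean(), sample.std(ddof=1)

t_star = stats.t.ppf(0.975, df=n - 1)     # critical value for 95% confidence
margin = t_star * s / np.sqrt(n)
print((round(xbar - margin, 2), round(xbar + margin, 2)))
```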
Characteristics of the t-distributions
• They are similar to the normal distribution. They are symmetric, bell-shaped, and are centered around 0.
• The t-distributions have more spread than a normal distribution. They have more area in the tails and less in the center than the normal distribution. That's because using s to estimate σ introduces more variation.
• As the degrees of freedom increase, the t-distribution more closely resembles the normal curve. As n increases, s becomes a better estimator of σ.
Interpreting P-Values
The null hypothesis H0 states the claim that we are seeking evidence against. The probability that measures the strength of the evidence against a null hypothesis is called a P-value.
Definition:
The probability, computed assuming H0 is true, that the statistic would take a value as extreme as or more extreme than the one actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0 provided by the data.
• Small P-values are evidence against H0 because they say that the observed result is unlikely to occur when H0 is true.
• Large P-values fail to give convincing evidence against H0 because they say that the observed result is likely to occur by chance when H0 is true.
Types of Errors

                            Truth about the population
Conclusion based on sample  H0 true              H0 false (Ha true)
  Reject H0                 Type I error         Correct conclusion
  Fail to reject H0         Correct conclusion   Type II error

Type I Error: Reject H0 when H0 is true.
Type II Error: Fail to reject H0 when H0 is false.
Recall: PCFS for a One Prop Z Interval
P: p = proportion of _________ who _____________
C: Random: SRS
   Normality: np̂ ≥ 10 and n(1 − p̂) ≥ 10
   Independence: pop ≥ 10n
F: One Prop Z Interval (_____, ______)
S: We are ______% confident that the interval captures the true proportion of __________ who _________.
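A sketch of the "F" step with hypothetical counts (NumPy and SciPy assumed); the conditions in "C" should still be checked first:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 104 successes out of 250 sampled; 95% one proportion z interval
successes, n = 104, 250
p_hat = successes / n

z_star = stats.norm.ppf(0.975)             # critical value for 95% confidence
se = np.sqrt(p_hat * (1 - p_hat) / n)      # standard error of p-hat
print((round(p_hat - z_star * se, 3), round(p_hat + z_star * se, 3)))
# Conditions: n*p_hat = 104 >= 10 and n*(1 - p_hat) = 146 >= 10
```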
PCFS for a One Prop Z Test
How many differences can you spot?
P: p = proportion of _________ who _____________
   H0: p = ______
   Ha: p ≠ / < / > ______
C: Random: SRS
   Normality: np0 ≥ 10 and n(1 − p0) ≥ 10
   Independence: pop ≥ 10n
F: One Prop Z Test
   z = ________, P-value = ________
S: Since the P-value is less than / greater than α, we reject / fail to reject H0. We conclude / cannot conclude that __________________________.
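A sketch of the test statistic and P-value with hypothetical counts and a one-sided alternative (NumPy and SciPy assumed); note that the standard error uses p0, not p-hat:

```python
import numpy as np
from scipy import stats

# Hypothetical one proportion z test of H0: p = 0.5 vs Ha: p > 0.5,
# with 140 successes in 250 trials
successes, n, p0 = 140, 250, 0.5
p_hat = successes / n

z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)   # test statistic uses p0
p_value = 1 - stats.norm.cdf(z)                 # one-sided (greater than) alternative
print(round(z, 2), round(p_value, 4))           # compare the P-value to alpha
```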
Cautions
When conducting a matched-pairs t-test, you have to be careful.
• The parameter you're studying is µD – the mean DIFFERENCE. You also have to write which order you subtracted.
• The boxplot you construct is the distribution of DIFFERENCES.
• The name of the test (what you write in the formula section) is specifically a MATCHED-PAIRS T-TEST.
• Your conclusion should state something about the mean DIFFERENCE in ___...
Two Proportion Z Interval
Notice that now there are two of pretty much everything:
• Two parameters: p1 and p2
• Two sample sizes: n1 and n2
• Two populations
• Two statistics: p̂1 and p̂2
And the sentence talks about the difference in proportions.

Parameters: p1 = the proportion of ____ who _____
            p2 = the proportion of ____ who _____
Conditions: • Two independent random samples (or random assignment to treatment groups)
            • Normality: n1p̂1 ≥ 10; n1(1 − p̂1) ≥ 10; n2p̂2 ≥ 10; n2(1 − p̂2) ≥ 10
            • Independence: Population 1 ≥ 10n1 and Population 2 ≥ 10n2
Formula:    Two Prop Z Interval (______, ______)
Sentence:   We are ___________% confident that the interval (____, _____) captures the true difference in proportion of _______ and ________ who ______________.
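A sketch of the interval with hypothetical counts for the two groups (NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical data for a 95% two proportion z interval
x1, n1 = 90, 200    # group 1: successes, sample size
x2, n2 = 60, 180    # group 2: successes, sample size
p1_hat, p2_hat = x1 / n1, x2 / n2

z_star = stats.norm.ppf(0.975)
se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
diff = p1_hat - p2_hat
print((round(diff - z_star * se, 3), round(diff + z_star * se, 3)))
# Interpreted as: we are 95% confident this interval captures the true
# difference in proportions p1 - p2.
```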
Why this gets difficult
• Hypothesis tests or confidence intervals in isolation aren't too hard. For example, if you know in your homework you're only constructing Two Prop Z Intervals, that shouldn't be too hard.
• It's often very hard to tell the difference between matched pairs data and two sample data.
• On each of the following slides, determine if the data are single sample, matched pairs, or two sample settings.
Example 1
Example 2
Example 3
Example 4
Example 5
Inference Summary

Means (Hypothesis Tests and Confidence Intervals):
• One-sample t procedures
• Matched pairs t procedures
• Two-sample t procedures

Proportions (Hypothesis Tests and Confidence Intervals):
• One Proportion Z procedures
• Two Proportion Z procedures
χ² Tests – Comparison Chart

Goodness of Fit
• Hypotheses: H0: The sample distribution matches the hypothesized distribution. Ha: The sample distribution does not match the hypothesized distribution.
• Conditions: SRS; all of the expected counts are at least 5.
• Test: Enter observed counts in L1 and expected counts in L2 (you may have to calculate these). Use the χ² GOF command. df = # of categories minus 1.

Equal Proportions
• Hypotheses: H0: p1 = p2 = p3 = … Ha: Not all of the proportions are equal.
• Conditions: Independent SRSs or random assignment; all of the expected counts are at least 5.
• Test: Enter observed counts in [A]. Use the χ² Test command. *LOOK AT THE EXPECTED COUNTS IN [B] TO CHECK CONDITIONS!

Association
• Hypotheses: H0: There is no relationship between the two variables. Ha: There is some relationship between the two variables.
• Conditions: SRS; all of the expected counts are at least 5.
• Test: Enter observed counts in [A]. Use the χ² Test command. *LOOK AT THE EXPECTED COUNTS IN [B] TO CHECK CONDITIONS!
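For the goodness-of-fit column, a minimal sketch with hypothetical observed counts, assuming SciPy is available (its chisquare function plays the role of the calculator's χ² GOF command):

```python
from scipy.stats import chisquare

# Hypothetical goodness-of-fit test: are four categories equally likely?
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]          # all expected counts are at least 5

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 4))   # df = 4 - 1 = 3
```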
So, the difference between the equal proportions test and the association test is this: ARE WE COMPARING SEVERAL POPULATIONS, or DID THE DATA ARISE BY CLASSIFYING OBSERVATIONS INTO CATEGORIES?