MAT 135
Introductory Statistics and Data Analysis
Adjunct Instructor
Kenneth R. Martin
Lecture 13
November 30, 2016
Agenda
• Housekeeping
– HW #8
– Readings
– Final Exam
Confidential - Kenneth R. Martin
Housekeeping
• HW #8 – Due Monday, December 5, noon,
electronically
• HW #8 – Solution posted December 5, 1pm
Housekeeping
• Read, Chapter 1.1 – 1.4
• Read, Chapter 14.1 – 14.2
• Read, Chapter 10.1
• Read, Chapter 2
• Read, Chapter 3
• Read, Chapter 4
• Read, Chapter 5
• Read, Chapter 6
• Read, Chapter 8
Housekeeping
• Final Exam
– Wednesday, December 7
– Open book, open notes
Continuous vs. Discrete vs. Attribute Data
• Continuous: infinite # of possible measurements in a continuum (scale shown from 0 to 10)
• Discrete, Count: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• Discrete, Ordinal: “low”/“small”/“short”, “medium”/“mid”, “high”/“large”/“tall”
• Discrete, Nominal or Categorical: Group A, Group B, Group C, Group D, Group E, Group F; defines several groups - no order
• Attribute, Binary: “bad”/“no-go”/“group #1” or “good”/“go”/“group #2”; defines TWO groups - no order
Probability - Review
Theorem 1:
• Probability lies between 0 and 1
– Probability of 1.000 means an event is certain to occur.
– Probability of 0 means the event is certain to NOT occur.
Probability - Review
Theorem 2:
If P(H) = Probability of H occurring
Then
P(not H) = 1.000 - P(H)
or, writing H̄ for “not H”,
P(H̄) = 1.000 - P(H)
Statistics
Histogram – as more data are collected and the class intervals
are narrowed, it begins to resemble a smooth polygon or curve.
Probability - Review
Definition, Theorem 5:
• Correspondingly, the total area under a continuous
probability distribution (normal curve) is also equal to
1.000. However, the tails of the curve never
touch the x-axis. Thus, area can be used to estimate
probabilities.
Statistics
Cumulative Distribution Function – Cross Section
f(X) = PDF
$\int_{-\infty}^{+\infty} f(X)\,dx = 1.000$
• Sum under entire curve = 1.000
[Figure: f(X) plotted against X]
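The statement that the area under the whole curve equals 1.000 can be checked numerically. The sketch below is only an illustration: the curve (mean 60, SD 20) is a hypothetical choice, the helper name `area_under_pdf` is made up for this example, and a range of ±8 SD stands in for −∞ to +∞.

```python
from statistics import NormalDist

def area_under_pdf(dist, lo, hi, steps=10_000):
    """Approximate the area under dist.pdf between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width

# Hypothetical normal curve: mean 60, SD 20.
curve = NormalDist(mu=60, sigma=20)

# +/- 8 SD is used as a practical stand-in for -infinity to +infinity.
total_area = area_under_pdf(curve, 60 - 8 * 20, 60 + 8 * 20)
print(round(total_area, 6))
```

The tails beyond ±8 SD hold a negligible share of the area, so the sum comes out at 1.000 to several decimal places.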
Statistics
Continuous Probability Distribution (aka. CRV)
• A function of a Continuous Random Variable (CRV) that describes
the likelihood that the variable falls within a given range of
values, found by the integral of its probability density
function (i.e. corresponding area under f(x) curve).
– We shall calculate probabilities for a CRV over ranges
Statistics
Probability Density Function (cont. prob. dist.)
f(X) = PDF
$P(a \le X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)$
= Entire area under curve to section (b)
minus entire area under curve to section (a)
• Sum under entire curve = 1.0
• Curve typically read left to right
[Figure: f(X) plotted against X, with points a and b marked on the X axis]
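The range rule F(b) − F(a) can be sketched with Python's standard library. The curve (mean 60, SD 20) and the cut points a = 40 and b = 80 are hypothetical values chosen for illustration, and `prob_between` is a helper name invented here.

```python
from statistics import NormalDist

def prob_between(dist, a, b):
    """P(a <= X <= b) = F(b) - F(a): area up to b minus area up to a."""
    return dist.cdf(b) - dist.cdf(a)

# Hypothetical normal curve: mean 60, SD 20.
curve = NormalDist(mu=60, sigma=20)

# Area between a = 40 and b = 80, i.e. within 1 SD of the mean.
p_range = prob_between(curve, 40, 80)
print(f"{p_range:.4f}")
```

Because a and b sit one SD either side of the mean, the result is the familiar 0.6827 (about 68% of the area).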
Statistics
Cumulative Distribution Function
f(X) = PDF
$P(X < t) = \int_{-\infty}^{t} f(X)\,dx = F(t)$
[Figure: f(X) plotted against X; F(t) is the area under the curve to the left of t]
Statistics
Cumulative Distribution Function
f(X) = PDF
F(t) + R(t) = 1.0
[Figure: f(X) plotted against X; F(t) is the area to the left of t and R(t) is the area to the right of t]
Statistics
Normal Curve
• AKA, Gaussian distribution of a CRV.
• Mean, Median, and Mode have approximately the same value.
– Associated with mean (μ) at center and dispersion (σ)
– X ~ N(μ, σ) [when a random variable x is distributed normally]
– Observations have equal likelihood on both sides of mean
*** When normally distributed, Mean is used to describe Central
Tendency
• The graph of the associated probability density function
is called “Bell Shaped”
Statistics
Various Normal Curves
Statistics
Standardized Normal Value
• There are infinitely many combinations of mean and SD for normal
curves.
– Thus, two normal curves with a different mean or SD will differ in location or shape.
• To find the area under any normal curve, we can use the two
methods previously described (rectangles or integration).
– Or, we can use the Standard Normal Approach, using
tables to find the area under the curve, and thus
probabilities.
Standard Normal Distribution:
N(0, 1)
Statistics
Standardized Normal Value
• The Standard Normal Distribution has a mean of 0 and an SD of 1.
• The Standard Normal Transformation (z-Transformation) converts
any normal distribution with any mean and any SD to a
Standard Normal Distribution with mean 0 and SD 1.
• The Standard Normal Distribution is expressed in “z-score” units
along the associated x-axis. A z-score specifies the number of
SD units a value is above or below the mean (i.e. z = +1
indicates a value 1 SD above the mean).
• A formula converts a value x to a z-score using the mean (μ) and SD (σ): $z = (x - \mu) / \sigma$
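A minimal sketch of the z-Transformation follows. The mean (100), SD (15) and value (130) are hypothetical, and `z_score` is a helper name chosen for this example. It also checks that the area to the left of x on the original curve matches the area to the left of z on the standard normal curve, which is what makes a single z table usable for any normal curve.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of SD units that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical normal curve: mean 100, SD 15.
mean, sd = 100, 15
x = 130

z = z_score(x, mean, sd)

# Area to the left of x on the original curve ...
area_original = NormalDist(mu=mean, sigma=sd).cdf(x)
# ... equals the area to the left of z on the standard normal curve N(0, 1).
area_standard = NormalDist(mu=0, sigma=1).cdf(z)

print(z, round(area_original, 4), round(area_standard, 4))
```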
Statistics
Normal Curve - Distribution of Data
Statistics
Standard Normal Curve - Distribution of Data (z-scores)
Statistics
Normal Curve - Distribution of Data
Statistics
Standard Normal Distribution (z-scores)
Statistics
Standardized Normal Value
Statistics
Normal distribution example
Statistics
Standard Normal Distribution example
Statistics
Standardized Normal Table
Statistics
Example
A medical device catheter must have a diameter of 12.50
mm, with a tolerance of 0.05 mm, to function properly. If the
process is centered at 12.50 mm, with a dispersion (σ) of
0.02 mm, what percent of catheters must be scrapped and
what percent can be reworked? How can the process center
be changed to eliminate the scrap? What is the associated
rework percentage?
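One way to set up the first part of the catheter problem is sketched below. It rests on assumptions that the slide does not state: the tolerance is taken as ±0.05 mm (limits of 12.45 mm and 12.55 mm), undersized parts are treated as scrap, and oversized parts are treated as reworkable. The function name `out_of_spec` is made up for this example.

```python
from statistics import NormalDist

TARGET = 12.50     # mm, required diameter
TOLERANCE = 0.05   # mm, ASSUMED to apply on both sides of the target
SIGMA = 0.02       # mm, process dispersion

def out_of_spec(center, sigma=SIGMA):
    """Return (fraction below the lower limit, fraction above the upper limit)."""
    process = NormalDist(mu=center, sigma=sigma)
    lower, upper = TARGET - TOLERANCE, TARGET + TOLERANCE
    below = process.cdf(lower)          # area to the left of the lower limit
    above = 1.0 - process.cdf(upper)    # area to the right of the upper limit
    return below, above

below, above = out_of_spec(center=12.50)

# ASSUMPTION: undersized parts are scrap, oversized parts can be reworked.
print(f"scrap: {below:.2%}, rework: {above:.2%}")
```

With the process centered at 12.50 mm the limits sit at z = ±2.5, so each tail holds about 0.62% of parts. Calling `out_of_spec` with a different `center` shows how moving the process center trades scrap against rework.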
Statistics
Standardized Normal Value
Example:
Lightbulb burnout time is estimated by monitoring
50 bulbs. Xbar = 60 days; s = 20 days.
***Assume the average and sample SD represent
the population, thus μ & σ. Assume normal dist.
How many bulbs work 100 or more days?
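The lightbulb question can be worked through as a short sketch, using only the values stated on the slide (50 bulbs, mean 60 days, SD 20 days, normal distribution). The variable names are chosen for this example.

```python
from statistics import NormalDist

bulbs = 50
mean_days, sd_days = 60, 20     # Xbar and s, taken as mu and sigma
threshold = 100                 # days

# z-score of the 100-day threshold.
z = (threshold - mean_days) / sd_days

# P(X >= 100) is the area to the right of z on the standard normal curve.
p_100_or_more = 1.0 - NormalDist().cdf(z)

expected_bulbs = bulbs * p_100_or_more
print(f"z = {z}, P = {p_100_or_more:.4f}, bulbs = {expected_bulbs:.2f}")
```

Here z = 2, the right-tail area is about 0.0228, and 50 × 0.0228 is roughly 1.1, so about one bulb would be expected to last 100 days or more.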
Inferential Statistics & Sampling Distributions
Hypothesis Testing
• Hypothesis – a statement or proposed explanation
for an observation, phenomenon, or a problem
that can be tested.
• Hypothesis Testing – a method for testing a
hypothesis about a parameter in a population,
using data measured in a sample.
Hypothesis Testing
Hypothesis testing helps us determine how likely it is that a
sample statistic would be selected if the hypothesis regarding the
population were true, and thus whether the evidence against that
hypothesis is sufficiently strong.
Hypothesis Testing – Role and Purpose
• To provide an OBJECTIVE BASIS for evaluating the evidence in our
data
• To help us determine if what we THINK WE SEE in the graphical
displays is STRONGLY SUPPORTED by the data
• To quantify the RISK that our conclusions might be incorrect
• Hypothesis tests help us answer the practical question:
Is there a real difference between:
– the mean (average) of two or more groups
– the spread (variation) in one group and the spread in another group
– the proportion of defects in one group and proportion of defects in
another group
– the average count (or rate of occurrence) in one group and average
count in another group
Hypothesis Testing – Role and Purpose
[Diagram: POPULATION → Sampling Scheme → SAMPLE → Measure → Data!]
Hypothesis Testing helps
determine if what we see in
the sample is likely to be true
for the whole population
Hypothesis Testing – 4 Steps
• Step 1: State the null and alternative hypotheses
• Null Hypothesis (H0) – a statement about the population
parameter (such as the mean) that is assumed to be true
• Starting point, which we will test to determine whether the null is likely to be
true or not. States that there is no difference between two parameters.
• Example: Children in the U.S. watch an average of 30 hours of TV
per week. H0: µ = 30
• Alternative Hypothesis (Ha) – a statement that contradicts the
null hypothesis
• We think the null is wrong; Ha allows us to state what we think is
wrong. States that there is a difference between two parameters.
• Example: Children in the U.S. watch more or less than 30 hours of
TV per week. Ha: µ ≠ 30
• In any case, Ha can be stated as <, > or ≠ the value in H0
Hypothesis Testing – 4 Steps
• Step 2: Set the criteria for a decision
• Done by stating the level of significance
• Criterion of judgment upon which a decision is made regarding the
value stated in a null hypothesis
• Typically the level is set at 5% in research studies
• Based on the probability of obtaining a statistic measured
in a sample if the value stated in the null hypothesis were
true
• When the probability of obtaining a sample mean is less than 5%, if
the null hypothesis were true, we conclude the sample selected is
too unlikely and reject the null hypothesis
Hypothesis Testing – 4 Steps
• Step 3: Compute the test statistic
• The value of test statistic can be used to make a
decision regarding the null hypothesis
• A mathematical formula that identifies how far a sample outcome is
from the value stated in the null hypothesis
• It helps determine how likely the sample outcome is if the
population mean stated in the null is true
• The larger the (absolute) value of the test statistic, the further a
sample mean deviates from the population mean stated
in the null hypothesis
Hypothesis Testing – 4 Steps
• Step 4: Make a decision
• Based on the probability of obtaining a sample outcome,
given that the value stated in the null hypothesis is true
(represented by p value)
1. Reject the null hypothesis - the sample mean is associated with
a low probability of occurrence if the null is true
• p value <.05; “reached significance”
2. Retain the null hypothesis - the sample mean is associated with
high probability of occurrence when null is true
• p value >.05; “failed to reach significance”
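The four steps can be sketched for the TV-watching example (H0: µ = 30, Ha: µ ≠ 30). The slides do not name a specific test statistic or give data, so everything numeric here is a hypothetical assumption: a one-sample z statistic is used, with a sample mean of 32 hours, n = 36 children, and a known population SD of 6 hours. The function name `one_sample_z_test` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z test; returns (z, p_value, decision)."""
    # Step 3: test statistic = distance of the sample mean from mu0,
    # measured in standard errors.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: p value = area in both tails beyond |z| if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "retain H0"
    return z, p_value, decision

# Step 1: H0: mu = 30, Ha: mu != 30.
# Step 2: level of significance = 5%.
# HYPOTHETICAL data: sample mean 32 hours, sigma 6 hours, n = 36.
z, p_value, decision = one_sample_z_test(sample_mean=32, mu0=30, sigma=6, n=36)
print(f"z = {z:.2f}, p = {p_value:.4f}, {decision}")
```

With these made-up numbers z = 2.00 and the p value is about .0455, which is below .05, so the null would be rejected.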
Hypothesis Testing
• HO represents our “assumed working hypothesis”
(even if we don’t really think it’s true!)
• WHY? “Burden of proof” is placed on HA.
– i.e. Need to have strong evidence that
HA is true before we will “believe” it.
– HA sometimes called the “research
hypothesis” or the “research claim”
• Two possible outcomes:
– Reject HO and accept HA (“statistically
significant” results)
– Fail to reject HO (“not statistically
significant” results)
• Reject HO only if the data provides highly
convincing evidence that HO is false
• How convincing? Typically look for
at least 95% confidence that HO is false
Hypothesis Testing
When a citizen is placed on trial
for a given crime, the U.S. legal
system operates on the
following principle:
“The defendant is presumed
innocent until proven guilty
beyond a reasonable doubt.”
Under such an approach, what
is the null hypothesis, and what
is the alternative hypothesis?
Hypothesis Testing
• All statistical tests calculate something called a “P-value”
• 1 – “P-value” = A Confidence Level we have that H0 is false (and
therefore that HA is true)
• “P-value” = probability of obtaining a result at least as extreme as the
one observed through random chance alone (i.e. if the null hypothesis is true)
• Decision rule: We will reject H0 only if the P-value is less
than a chosen threshold (often .05, or 5%)
– Assures that we have at least 95% confidence that HA is true.
• Want more confidence? Specify a lower threshold for the P-value
– Threshold P-value = significance level (α level)
– Lower threshold values mean…
• Higher confidence when we reject HO
• More difficult to reject HO
When P-value is “Low”,
the “Null must Go”
Hypothesis Testing - Summary
p-value
- Probability that the observed behavior can be
explained purely by random variation.
Significance Level / Producer’s Risk = α
- Threshold which your p-value must be below to
reject the null.
- Represents the risk assumed for “incorrectly
rejecting the null”, or detecting a difference when
one does not actually exist.
Consumer’s Risk = β
- Represents the risk assumed for “incorrectly not
rejecting the null”, or not detecting a difference
when one actually exists.
Confidence Level (of test) = 1 – α
- Confidence you have in rejecting the Ho, or
claiming that a difference exists “going into” the
test. When rejecting null, the actual confidence in
your conclusion is 1 – p (value).
Power (of test) = 1 – β
- The probability that the test will detect a
difference (result in a p-value less than your α)
when there is truly a difference, for a given
“practical difference” and standard deviation.
- You decide the magnitude of the “practical”
difference you want to detect.
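The relationship between α, β and power can be illustrated with a small sketch. It assumes a one-sided (upper-tail) one-sample z test with known SD, which the slides do not specify, and all numbers (practical difference of 2, SD of 6, n = 36, α = .05) are hypothetical. The function name `z_test_power` is invented for this example.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(difference, sigma, n, alpha=0.05):
    """Power of an upper-tail one-sample z test for a given true difference."""
    std_normal = NormalDist()
    # Critical z value: reject H0 when the test statistic exceeds it.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # If the true mean is shifted by `difference`, the test statistic is
    # centered at difference / (sigma / sqrt(n)) instead of 0.
    shift = difference / (sigma / sqrt(n))
    # Power = probability that the statistic still exceeds z_crit.
    return 1 - std_normal.cdf(z_crit - shift)

# HYPOTHETICAL inputs: practical difference 2, SD 6, n = 36, alpha = .05.
power = z_test_power(difference=2, sigma=6, n=36)
beta = 1 - power    # consumer's risk
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Under these assumptions the power comes out at roughly 0.64. Re-running with a larger n, a larger practical difference, or a higher α raises the power and lowers β.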