MAT 135 Introductory Statistics and Data Analysis
Adjunct Instructor Kenneth R. Martin
Lecture 13
November 30, 2016

Agenda
• Housekeeping
– HW #8
– Readings
– Final Exam
Confidential - Kenneth R. Martin

Housekeeping
• HW #8 – Due Monday, December 5, noon, electronically
• HW #8 – Solution posted December 5, 1pm

Housekeeping
• Read, Chapter 1.1 – 1.4
• Read, Chapter 14.1 – 14.2
• Read, Chapter 10.1
• Read, Chapter 2
• Read, Chapter 3
• Read, Chapter 4
• Read, Chapter 5
• Read, Chapter 6
• Read, Chapter 8

Housekeeping
• Final Exam
– Wednesday, December 7
– Open book, open notes

Continuous vs. Discrete vs. Attribute Data
• Continuous: infinite # of possible measurements in a continuum
• Discrete: Count: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• Discrete: Ordinal: “low”/“small”/“short”, “medium”/“mid”, “high”/“large”/“tall”
• Discrete: Nominal or Categorical: Group A, Group B, Group C, Group D, Group E, Group F (defines several groups - no order)
• Attribute: Binary: “bad”/“no-go”/“group #1”, “good”/“go”/“group #2” (defines TWO groups - no order)

Probability - Review
Theorem 1:
• Probability ranges from 0 to 1
– Probability of 1.000 means an event is certain to occur
– Probability of 0 means the event is certain to NOT occur.

Probability - Review
Theorem 2:
If P(H) = Probability of H occurring,
then P(not H) = 1.000 - P(H), or P(H) = 1.000 - P(not H)

Statistics - Histogram
– until it begins to resemble a smooth polygon or curve.

Probability - Review
Definition, Theorem 5:
• Correspondingly, the total area under a continuous probability distribution (normal curve) is also equal to 1.000. However, the tails of the curve never touch the x-axis. Thus, area can be used to estimate probabilities.

Statistics - Cumulative Distribution Function – Cross Section
f(x) = PDF
$\int_{-\infty}^{+\infty} f(x)\,dx = 1.000$
• Sum under entire curve = 1.000

Statistics - Continuous Probability Distribution (aka CRV)
• A function of a Continuous Random Variable that describes the likelihood that the variable falls within a given range of values, given by the integral of its probability density function over that range (i.e. the corresponding area under the f(x) curve).
– We shall calculate probabilities for a CRV over ranges

Statistics - Probability Density Function (cont. prob. dist.)
f(x) = PDF
$P(a < X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)$
= Entire area under the curve up to section (b) minus entire area under the curve up to section (a)
• Sum under entire curve = 1.0
• Curve typically read left to right

Statistics - Cumulative Distribution Function
f(x) = PDF
$F(t) = P(X < t) = \int_{-\infty}^{t} f(x)\,dx$

Statistics - Cumulative Distribution Function
f(x) = PDF
F(t) + R(t) = 1.0, where F(t) is the area under the curve to the left of t and R(t) is the area to the right of t

Statistics - Normal Curve
• AKA Gaussian distribution of a CRV.
• Mean, Median, and Mode have approximately the same value.
– Associated with mean (μ) at center and dispersion (σ)
– X ~ N(μ, σ) [when a random variable X is distributed normally]
– Observations have equal likelihood on both sides of the mean
• *** When normally distributed, Mean is used to describe Central Tendency
• The graph of the associated probability density function is called “Bell Shaped”

Statistics - Various Normal Curves [figure]

Statistics - Standardized Normal Value
• There are infinitely many combinations of mean and SD for normal curves.
– Thus, normal curves with different means or SDs have different locations or shapes.
• To find the area under any normal curve, we can use the two methods previously described (rectangles or integration).
– Or, we can use the Standard Normal Approach, using tables to find the area under the curve, and thus probabilities.
Standard Normal Distribution: N(0,1)

Statistics - Standardized Normal Value
• Standard Normal Distribution has a Mean = 0 and a SD = 1
• Standard Normal Transformation (z-Transformation) converts any normal distribution with any mean and any SD to a Standard Normal Distribution with mean 0 and SD 1
• Standard Normal Distribution is distributed in “z-score” units along the associated x-axis. A z-score specifies the number of SD units a value is above or below the mean (i.e. z = +1 indicates a value 1 SD above the mean).
• A formula is used to convert a value x, given your mean and SD, to a z-score: z = (x - μ) / σ

[Figures: Normal Curve - Distribution of Data; Standard Normal Curve - Distribution of Data (z-scores); Standard Normal Distribution (z-scores); Standardized Normal Value; Normal distribution example; Standard Normal Distribution example; Standardized Normal Table; Examples]

Statistics - Example
A medical device catheter must have a diameter of 12.50 mm, with a tolerance of 0.05 mm, to function properly. If the process is centered at 12.50 mm, with a dispersion of 0.02 mm, what percent of catheters must be scrapped and what percent can be reworked? How can the process center be changed to eliminate the scrap? What is the associated rework percentage?

Statistics - Standardized Normal Value
Example: Lightbulb burnout time is estimated by monitoring 50 bulbs. Xbar = 60 days; s = 20 days.
*** Assume the average and sample SD represent the population, thus μ & σ. Assume normal dist.
How many bulbs work 100 or more days? See Example:

[Figures: Examples, with the x-axis running from -∞ to +∞]

Inferential Statistics & Sampling Distributions [figures]

Hypothesis Testing
• Hypothesis – a statement or proposed explanation for an observation, phenomenon, or a problem that can be tested.
• Hypothesis Testing – a method for testing a hypothesis about a parameter in a population, using data measured in a sample.

Hypothesis Testing
Hypothesis testing helps us decide whether the evidence is strong enough, by determining how likely it is that the observed sample statistic would have been obtained if the hypothesis regarding the population were true.

Hypothesis Testing – Role and Purpose
• To provide an OBJECTIVE BASIS for evaluating the evidence in our data
• To help us determine if what we THINK WE SEE in the graphical displays is STRONGLY SUPPORTED by the data
• To quantify the RISK that our conclusions might be incorrect
• Hypothesis tests help us answer the practical question: Is there a real difference between:
– the mean (average) of two or more groups
– the spread (variation) in one group and the spread in another group
– the proportion of defects in one group and proportion of defects in another group
– the average count (or rate of occurrence) in one group and average count in another group

Hypothesis Testing – Role and Purpose
POPULATION → (Sampling Scheme) → SAMPLE → (Measure) → Data!
Hypothesis Testing helps determine if what we see in the sample is likely to be true for the whole population

Hypothesis Testing – 4 Steps
• Step 1: State the null and alternative Hypothesis
• Null Hypothesis (H0) – a statement about the population parameter (such as the mean) that is assumed to be true
– Starting point, which we will test to determine if the null is likely to be true or not. There is not a difference between (2) parameters.
– Example: Children in the U.S. watch an average of 30 hours of TV per week. H0: µ = 30
• Alternative Hypothesis (Ha) – a statement that contradicts the null hypothesis
– We think the null is wrong; Ha allows us to state what we think is wrong. There is a difference between (2) parameters.
– Example: Children in the U.S. watch more or less than 30 hours of TV per week. Ha: µ ≠ 30
• In any case, Ha can be stated as <, > or ≠ the value in H0

Hypothesis Testing – 4 Steps
• Step 2: Set the criteria for a decision
– Done by stating the level of significance
– Criterion of judgment upon which a decision is made regarding the value stated in a null hypothesis
– Typically the level is set at 5% in research studies
– Based on the probability of obtaining a statistic measured in a sample if the value stated in the null hypothesis were true
– When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, we conclude the sample selected is too unlikely and reject the null hypothesis

Hypothesis Testing – 4 Steps
• Step 3: Compute the test statistic
– The value of the test statistic can be used to make a decision regarding the null hypothesis
– A mathematical formula that identifies how far a sample outcome is from the value stated in the null hypothesis
– It helps determine how likely the sample outcome is if the population mean stated in the null is true
– The larger the absolute value of the test statistic, the further a sample mean deviates from the population mean stated in the null hypothesis

Hypothesis Testing – 4 Steps
• Step 4: Make a decision
– Based on the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true (represented by the p value)
1. Reject the null hypothesis - the sample mean is associated with a low probability of occurrence if the null is true
• p value < .05; “reached significance”
2. Retain the null hypothesis - the sample mean is associated with a high probability of occurrence when the null is true
• p value > .05; “failed to reach significance”

Hypothesis Testing
• H0 represents our “assumed working hypothesis” (even if we don’t really think it’s true!)
• WHY? “Burden of proof” is placed on Ha.
– i.e. Need to have strong evidence that Ha is true before we will “believe” it.
– Ha is sometimes called the “research hypothesis” or the “research claim”
• Two possible outcomes:
– Reject H0 and accept Ha (“statistically significant” results)
– Fail to reject H0 (“not statistically significant” results)
• Reject H0 only if the data provides highly convincing evidence that H0 is false
• How convincing? Typically require a confidence level of at least 95% (a significance level of 5% or less)

Hypothesis Testing
When a citizen is placed on trial for a given crime, the U.S. legal system operates on the following principle: “The defendant is presumed innocent until proven guilty beyond a reasonable doubt.” Under such an approach, what is the null hypothesis, and what is the alternative hypothesis?

Hypothesis Testing
• All statistical tests calculate something called a “P-value”
• “P-value” = probability of obtaining a result at least as extreme as the one observed purely by random chance, assuming the null hypothesis is true
• 1 – “P-value” is often quoted as the confidence that H0 is false (and therefore that Ha is true); strictly, the P-value is computed assuming H0 is true, so a small P-value is evidence against H0 rather than the probability that H0 is false
• Decision rule: We will reject H0 only if the P-value is less than a chosen threshold (often .05, or 5%)
– Limits the risk of incorrectly rejecting a true H0 to 5%, i.e. a 95% confidence level for the test.
• Want more confidence? Specify a lower threshold for the P-value
– Threshold P-value = significance level (α level)
– Lower threshold values mean higher confidence when we reject H0, but H0 is more difficult to reject
• When P-value is “Low”, the “Null must Go”

Hypothesis Testing - Summary
• p-value
– Probability of observing a result at least as extreme as the one obtained through random variation alone, if the null is true.
• Significance Level / Producer’s Risk = α
– Threshold which your p-value must be below to reject the null.
– Represents the risk assumed for “incorrectly rejecting the null”, or detecting a difference when one does not actually exist.
• Consumer’s Risk = β
– Represents the risk assumed for “incorrectly not rejecting the null”, or not detecting a difference when one actually exists.
• Confidence Level (of test) = 1 – α
– Confidence you have in rejecting H0, or claiming that a difference exists, “going into” the test. When rejecting the null, 1 – p-value is often quoted as the actual confidence in your conclusion.
• Power (of test) = 1 – β
– The probability that the test will detect a difference (result in a p-value less than your α) when there is truly a difference, for a given “practical difference” and standard deviation.
– You decide the magnitude of the “practical” difference you want to detect.
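
The four steps and the summary terms above can be tried out numerically. The following is a minimal sketch, not the method used in the lecture: it assumes a two-sided one-sample z-test with a known population SD, and it reuses the TV example (H0: µ = 30). The sample size, sample mean, SD, and the function names `one_sample_z_test` and `power_two_sided` are hypothetical values and helpers chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution N(0, 1), used for all z-score lookups.
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 against Ha: mu != mu0.

    Assumes the population SD (sigma) is known and the sample mean
    is normally distributed.
    """
    standard_error = sigma / sqrt(n)
    # Step 3: test statistic = distance from the null value in SE units.
    z = (sample_mean - mu0) / standard_error
    # p-value: area in both tails beyond |z| under the standard normal curve.
    p_value = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    # Step 4: reject H0 only if the p-value is below alpha.
    return z, p_value, p_value < alpha


def power_two_sided(practical_diff, sigma, n, alpha=0.05):
    """Power (1 - beta) of the two-sided z-test for a given true difference."""
    standard_error = sigma / sqrt(n)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = practical_diff / standard_error
    # Probability that the test statistic lands in either rejection region
    # when the true mean is mu0 + practical_diff.
    upper = 1.0 - STD_NORMAL.cdf(z_crit - shift)
    lower = STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


# Hypothetical data: 100 children, sample mean 32 hours, assumed sigma = 10.
z, p_value, reject = one_sample_z_test(sample_mean=32.0, mu0=30.0,
                                       sigma=10.0, n=100, alpha=0.05)
power = power_two_sided(practical_diff=2.0, sigma=10.0, n=100, alpha=0.05)
beta = 1.0 - power  # consumer's risk for this practical difference

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

With these made-up numbers the sample mean sits 2 standard errors above 30, so the p-value falls just under the 5% threshold and H0 is rejected, while the power to detect a 2-hour difference is only a little over one half. Raising n or accepting a larger α would increase the power.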