Thoratec Workshop in Applied Statistics for QA/QC, Mfg, and R+D
Part 1 of 3: Basic Statistical Concepts
Instructor: John Zorich
www.JOHNZORICH.COM

Part 1 was designed for students who know high-school algebra but who have never had a college-level statistics course.

John Zorich's Qualifications:
• 20 years as a "regular" employee in the medical device industry (R&D, Mfg, Quality)
• ASQ Certified Quality Engineer (since 1996)
• Statistical consultant and trainer (since 1999) for many companies, including Siemens Medical, Boston Scientific, Stryker, and Novellus
• Instructor in applied statistics for Ohlone College, Silicon Valley Polytechnic Institute, and KEMA/DEKRA
• Past instructor in applied statistics for UC Santa Cruz Extension, ASQ Silicon Valley Biomedical Group, & TUV
• Publisher of 9 commercial, formally validated, statistical application Excel spreadsheets that have been purchased by over 80 companies worldwide. Applications include: Reliability, Normality Tests & Normality Transformations, Sampling Plans, SPC, Gage R&R, and Power.
You're invited to "connect" with me on LinkedIn.

Objectives
PART 1 (today's topics): Obtain an understanding of BASIC Statistics in general (its vocabulary, methods, & uses) as needed to understand Parts 2 and 3.
PART 2 (not today's topics): Learn INTERMEDIATE statistical applications and tests as needed to understand Part 3; also includes "reliability calculations", power calculations, and sample size determinations.
PART 3 (not today's topics): Become familiar with commonly used ADVANCED statistical applications (Reliability Plotting, Sampling Plans, SPC, Process Capability calculations, Equipment Control).

Self-teaching & Reference Texts RECOMMENDED by John Zorich
• Clements: Handbook of Statistical Methods in Manufacturing
• Kaminsky et al.: Statistics and Quality Control for the Workplace
• Mlodinow: The Drunkard's Walk: How Randomness Rules Our Lives
• Motulsky: Intuitive Biostatistics
• NIST Engineering Statistics Internet Handbook, at http://www.itl.nist.gov/div898/handbook/index.htm
• Philips: How to Think about Statistics

Main Topics in Today's Workshop
• Regulatory Requirements
• Population vs. Sample
• Parameter vs. Statistic
• Probability
• Law of Large Numbers
• Distributions (Charting and Graphing)
• Binomial Distribution
• Hypergeometric Distribution
• Normal Distribution
• Central Limit Theorem
• Standard Deviation and Standard Error
• Linear Regression & Correlation Coefficients

Regulatory Requirements
ISO 9001:2008 (8.1) and ISO 13485:2003 (8.1): "The organization shall plan and implement the monitoring, measurement, analysis and improvement processes needed to demonstrate conformity [to requirements]....This shall include determination of applicable methods, including statistical techniques, & the extent of their use."
21CFR820.250 (FDA): "Where appropriate, each manufacturer shall establish and maintain procedures for identifying valid statistical techniques required for establishing, controlling, and verifying the acceptability of process capability and product characteristics."

(as used in this class...)
Sample means part of a Population.
The sample could be the part of an individual batch or lot that was purchased or produced; you inspect the sample prior to applying an "approved" label on the entire batch or lot.
"Representative Sample": a sample that represents the population. It is typically not a "Random Sample" but rather is usually taken evenly from throughout the population (e.g., a few items taken from each box in the batch).
"Sample size" can be anything from 1 to over 1,000,000.
The term "one sample" or "a single sample" means the entire sample, no matter what the sample size is.

(as used in this class...)
Statistic is a mathematical summary value calculated from data taken from a Sample. All of the following are statistics:
• Avg thickness of every 100th cable produced last week
• Range of thicknesses in that sample
• Median thickness in that sample
Parameter is a mathematical summary value calculated from data taken from the entire Population; that is, from every data point in the entire population (e.g., average thickness of all cables produced last week).
"Statistics" as a science is the mathematical analysis of "statistics", not of parameters. Statistics is the science of using "statistics" to guesstimate "parameters".

As a group, let's discuss... Which are parameters and which statistics?
1. Baseball "stats"
Answer: Parameters, because baseball "stats" are calculated using all the data.
2. United States Census data
Answer: Some are statistics, because they are just a sample of the population (this is the preference of Democrats), whereas others are parameters because we attempt to count the entire US population (this is the preference of Republicans).
3. Average age of the people in this class
Answer: Depends. Is this class a population or a sample?

Probability (as used in this class) means...
The same as "chance" or "odds", but not based on a hunch or intuition or what has historically occurred.
• The following statements are not using probability in the sense we mean here today:
  He'll probably come home before 9pm.
  They'll probably win tonight's game.
  They haven't won a game in 6 weeks, so they're due!!
• Those are examples of "Adverbial Probability" (see Inductive Logic, by Hibbens, 1896, chapter 15).
• Instead, the Science of Statistics uses "Mathematical Probability".
"Mathematical Probability" is the same as the "theoretical expected frequency", that is, the # of times one type of event would happen (if no cheating occurs) divided by the total number of all possible equi-probable events; e.g.,...

Probability
1:1 = "Fifty-Fifty" = 1/2 = 0.50 = 50%
Those terms (above) all mean the same thing. They all mean that you have the same chance at winning as you have at losing, as opposed to...
1/4 = 0.25 = 25% chance or odds of...
1/10 = 0.10 = 10% probability of...
1/3 = 0.3333 (rounded) = 33.33%
1/6 = 0.1667 (rounded) = 16.67%
(By definition...) Probability / chance / likelihood never can exceed 1.00 = 100%, and never can be less than 0.00 = 0%.

Probability
[Chart: Probability of rolling a given number on 1 toss of 1 die. The NULL HYPOTHESIS is that the die is "honest". X-axis: number observed on the 1 die (1 to 6). Y-axis: probability (0.00 to 0.20).]

Probability
[Chart: Probability of observing a given sum on 1 toss of 2 dice. The NULL HYPOTHESIS is that the dice are "honest". X-axis: sum of numbers observed on 2 dice (2 to 12). Y-axis: probability (0.00 to 0.20).]

Probability
[Chart: Probability of observing heads on flip of 1 coin. The NULL HYPOTHESIS is that the coin is "honest". X-axis: number of heads observed (0 or 1). Y-axis: probability (0.0 to 0.6).]

Probability
[Chart: Probability of multiple heads in one toss of many coins. The Null Hypothesis is that coins are honest, i.e. probability of heads = 0.50. X-axis: number of observed HEADS (0 to 30). Y-axis: probability (0.00 to 0.20).]
When tossing 30 honest coins, the "true" average is 15 heads, but by chance we may see some other result.

Probability of Independent Events
The MULTIPLICATIVE RULE: The probability of Event A happening and Event B and Event C (assuming that they are independent events) is the product of their probabilities: Pa x Pb x Pc (where Pa is the probability of Event A, and so on).

-- Class Exercise --
Let's try answering these questions:

The MULTIPLICATIVE RULE (examples):
The chance of rolling 2 dice and obtaining a 5 on both of them is...
1/6 x 1/6 = 1/36 = 0.028 = 2.8%
The probability of flipping a coin 4 times and obtaining "heads" every time is...
1/2 x 1/2 x 1/2 x 1/2 = 1/16 = 0.0625 = 6.25%
Let's try that (flipping 4 coins & counting heads).
The likelihood of drawing 3 good parts from a lot of 100 million parts, 99% of which are good, is...
0.99 x 0.99 x 0.99 = 0.9703 = 97.03%

Probability
The MULTIPLICATIVE RULE (corollary):
Conditional Probability: If the probability changes after each sampling event, then the separate probabilities are not identical, because they are "conditional", not "independent"; e.g.,
What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is, 99 of which are good and one of which is bad)?
1st draw x 2nd draw x 3rd draw = 99/100 x 98/99 x 97/98 = 0.9700
The probability of a given draw is "conditional" upon what happened in the previous draw.
( do not use: 99/100 x 99/100 x 99/100 = 0.9703 )

Probability of Independent Events
Assuming that only one event can happen at a time, the sum of the probabilities of all possible events equals 1.000 exactly.
On a single die, only one number appears face up at a time. Therefore P1 + P2 + P3 + P4 + P5 + P6 = 1.00, where P1 is the probability of the #1 being face up, etc.
The ADDITIVE RULE: The probability of Event A happening or Event B or Event C, assuming that only one event can possibly happen at a time, is the sum of their probabilities: Pa + Pb + Pc (where Pa is the probability of Event A, and so on). In this case, there are assumed to be other possible events, i.e., Pd, Pe, etc.
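The two multiplicative-rule calculations above (independent draws vs. conditional draws) can be checked with a short Python sketch. This is only an illustration and is not taken from the workshop's Excel files; the function names are made up here, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

def prob_all_good_independent(p_good, draws):
    """Multiplicative rule for independent events: p x p x ... x p."""
    result = Fraction(1)
    for _ in range(draws):
        result *= p_good
    return result

def prob_all_good_without_replacement(good, lot_size, draws):
    """Conditional version: every good part drawn leaves one fewer good part in the lot."""
    result = Fraction(1)
    for k in range(draws):
        result *= Fraction(good - k, lot_size - k)
    return result

# 3 good parts from a huge lot that is 99% good (draws are effectively independent)
independent = prob_all_good_independent(Fraction(99, 100), 3)
# 3 good parts from a lot of exactly 100 parts, 99 of which are good
conditional = prob_all_good_without_replacement(99, 100, 3)

print(float(independent))  # 0.970299
print(float(conditional))  # 0.97
```

The small difference between the two results is the reason the slide warns against using 99/100 x 99/100 x 99/100 for a lot of only 100 parts.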
-- Class Exercise --
Let's try answering these questions:

The ADDITIVE RULE (examples):
The chance of rolling 2 dice and obtaining a total of either 2 or 12 is
1/36 + 1/36 = 2/36 = 0.056 = 5.6%
The probability of flipping 4 coins and obtaining either all heads or all tails is
1/16 + 1/16 = 2/16 = 0.125 = 12.5% (based upon our example a few slides ago)
The likelihood that an n = 1 sample is out-of-spec, if taken from a lot with 2% out-of-spec high & 5% out-of-spec low, is...
0.02 + 0.05 = 0.07 = 7%

Probability
PROBABILITY OF MULTIPLE HEADS IN ONE TOSS OF MANY COINS (assuming the coins are honest !!)
The Null Hypothesis is that coins are honest, i.e. probability of heads = 0.50
The probability of getting 3 or more heads in a single toss of 4 coins is about 31% = the sum of the individual histogram bar probabilities for getting 3 heads or 4 heads ( 0.25 + 0.0625 = 0.3125 ).
Calculation of each of these probabilities is simple to do by enumeration (presenter has demo file).
[Chart: probability histogram. X-axis: number of observed HEADS. Y-axis: probability.]

Probability
PROBABILITY OF MULTIPLE HEADS IN ONE TOSS OF MANY COINS (assuming the coins are honest !!)
Calculation of each of these probabilities is done the same way as for 4 coins, but is much more tedious because there are about 1.1 trillion (2^40) possibilities!!
The probability of getting 22 or more heads in a single toss of 40 coins is about 30% ( ≈ the sum of the individual histogram bar probabilities of 22 and above on the X-axis):
( 0.10 + 0.075 + 0.06 + 0.03 + 0.02 + 0.01 + 0.005 = 0.30 )
[Chart: probability histogram. X-axis: number of observed HEADS. Y-axis: probability.]

Probability
If the number of possible result values is "infinite" or very large, then the probability histogram is more conveniently represented by a smooth curve (such as this one) rather than a histogram like in previous slides. For example: individual weights of thousands of coins, or the individual avg weights of thousands of samples taken from a very large population of coins.
Always think of this area under the curve as filled with histogram bars that we are too cheap to print.
[Chart: smooth probability curve. X-axis = measured values, increasing in magnitude, from left to right. Y-axis = Probability or Frequency.]

Probability
The probability of getting a measurement equal to or greater than value "A" on the X-axis is exactly 0.30 = the fraction of the area under the curve that is to the right of that point on the X-axis (the red-shaded area equals 30% of the area under the entire curve).
e.g., x-axis = widths of cables made last week
[Chart: smooth probability curve with the area to the right of "A" shaded red. X-axis = measured values, increasing in magnitude, from left to right. Y-axis = Probability = Frequency.]
In the language of calculus, the red area is the integral of the distribution function, from "A" to infinity.
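The "add up the histogram bars" reasoning used on the coin-toss slides can be sketched in Python. This is a minimal illustration, not the presenter's demo file; the function names are hypothetical, and each bar height is computed as (number of ways to get that many heads) divided by (number of equally likely outcomes).

```python
from math import comb

def heads_probability(heads, coins):
    """Height of one histogram bar: exactly `heads` heads on `coins` honest coins."""
    return comb(coins, heads) / 2 ** coins

def at_least(heads, coins):
    """Additive rule: sum the bars for `heads`, heads + 1, ..., up to `coins`."""
    return sum(heads_probability(k, coins) for k in range(heads, coins + 1))

p_3_or_more_of_4 = at_least(3, 4)
p_22_or_more_of_40 = at_least(22, 40)

print(p_3_or_more_of_4)              # 0.3125
print(round(p_22_or_more_of_40, 3))  # about 0.318
```

The exact sums are close to the values read off the charts by eye. For a smooth curve, the same idea carries over with "sum of the bars" replaced by "area to the right of the point".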
Probability
The probability of getting a measurement equal to or greater than value "B" on the X-axis is exactly 0.05 ( = the fraction of the area under the curve that is to the right of that point on the X-axis; the red-shaded area equals 5% of the area under the entire curve).
e.g., x-axis = widths of cables made last week
[Chart: smooth probability curve with the area to the right of "B" shaded red. X-axis = measured values, increasing in magnitude, from left to right. Y-axis = Probability = Frequency.]
We will use this concept many times in Day 2. Do you understand it completely? (Let's examine it with Instructor's Excel files.)

The "Law of Large Numbers" (per JZ)
This "law", generalized, is somewhat self-evident, and was known in principle to Archimedes over 2 millennia ago.
It applies to calculated "statistics", such as averages & standard deviations, and says nothing about the distribution of raw data.
Possibly a better name for this law is the one used over 100 years ago: The "Law of Tendency":
"…the law of tendency is that the larger the number of instances, the greater [= better] will be the approximation to an accurate and definite result." (quote from pg 240 of Inductive Logic, 1896, by J.G. Hibbens, Scribner & Sons)

This quote shows that the "Law of Large Numbers" is part of our common language, but is unfortunately often applied incorrectly. It is misapplied here because the "Law of Large ("big") Numbers" itself has nothing to do with "statistical significance" (we will discuss "statistical significance" in Day 2 of this workshop).

[Chart from "Law of Large Numbers.xls" in Student files: Distribution of sample averages taken from 1st 250 rows (250 sample averages per each sample size). X-axis: SAMPLE SIZE (1 to 29). Y-axis: SAMPLE AVERAGE (70 to 130). Parameter = 100.]
Each mark on each line represents the avg of a different random sample taken from a uniformly distributed population, 75 to 125.
Law of Large Numbers translates (in this example) as...
The larger the sample size, the closer the calculated value is likely to be to 100 = the population value (i.e., the closer the statistic is likely to be to the population parameter).

-- Class Exercise --
If this population has an average value of 100, will the average value of a SMALL sample from this population, in the long run, be smaller or larger than 100?
Will the average value of a LARGE sample, in the long run, be larger or smaller than 100?
Answer: In the long run, both small & large samples will be close to 100. In the long run, sample avgs equal population avgs, no matter what the sample size or population shape (that is, in the long run, statistic = parameter).

Graphical Methods used to Describe Variability
Number Line
[Number line from 400 to 600 with data points plotted along it.]
The small red squares graphically depict the variability, or the "distribution", of the data.

Histograms and Line Charts

Bar Charts and Line Charts

Pareto Chart
[Pareto chart: Reasons why customers returned china place settings ordered over the internet from ZTC. Y-axis: number of returns in Jan 2005 (0 to 30). Bar values, largest to smallest: 25, 17, 12, 4, 2, 1.]
This is shown here only to "complete" a survey of types of charts. We won't mention Pareto charts in the rest of the workshop.

-- Class Exercise --
If a population distribution looks bimodal, the distribution of data in a SMALL sample from that population will, on average, look like...what?
[Chart: a bimodal distribution.] This (below) is bimodal.
The distribution of a LARGE sample from that population will, on average, look like...what?
Answer: On average, both samples will look ≈ bimodal. On average, samples look like the parent population, no matter what the sample size.
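The "Law of Large Numbers" chart described earlier (250 sample averages per sample size, drawn from a uniform population running from 75 to 125) can be imitated with a small simulation. This is a rough sketch and not the workshop's "Law of Large Numbers.xls" file; the seed, the three sample sizes shown, and the helper name are arbitrary choices made for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed (an arbitrary choice) so the sketch is repeatable

def sample_average(sample_size):
    """Average of one random sample from a uniform population, 75 to 125."""
    return mean(random.uniform(75, 125) for _ in range(sample_size))

spread_by_size = {}
for sample_size in (1, 5, 29):
    averages = [sample_average(sample_size) for _ in range(250)]
    # Spread of the 250 sample averages around their own centre
    spread_by_size[sample_size] = pstdev(averages)
    print(sample_size, round(mean(averages), 1), round(spread_by_size[sample_size], 1))
```

For every sample size the averages centre near 100, but their spread shrinks as the sample size grows, which is the pattern the chart shows.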
"Binomial Distribution" Histogram PROBABILITY OF MULTIPLE HEADS IN ONE TOSS OF MANY COINS The Null Hypothesis is that coins are dfdf honest, i.e. probability of heads = 0.50 0.20 The "binomial distribution" describes frequencies when there are only 2 possible outcomes, (e.g., head or tails on a coin, or a vote for or against a proposed law). Probability 0.15 0.10 0.05 0.00 0 5 10 15 20 25 Number of observed HEADS 30 30 coins at a time The formula for the "Binomial Distribution" is used to calculate, e.g., the probability of 26 heads appearing on a toss of 30 coins. Part of the formula includes the following calculation: 26 x 25 x 24 x 23 x ....... 4 x 3 x 2 x 1 = ??? = (approximately) 400 Million x Billion x Billion Prior to computers, such calculations were "impossible", except by idiot savants ( = the first "computers" --- they were actually sought after and well paid !) Calculation of Binomial Probability How to easily calculate the height of a single bar in a Binomial Distribution Probability Histogram… (MSExcel function) =binomdist(N,S,B,false) N = Number of heads observed in a given toss of coins S = Sample size = number of coins per toss B = Probability of getting heads on a single coin = 0.5 false = (tells Excel to give probability of single histogram bar) e.g., =binomdist(11,30,0.5,false) = 0.0509 (check that value vs. histogram a couple slides ago) Binomial distributions are symmetrical when probability = 0.500, but skewed when probability is any other value (the farther from 0.500, the more extreme is the skewness --- see next slide). "Binomial Distribution" Histogram IF A DICE HAD 10 SIDES, ONE OF WHICH HAD A STAR ON IT, dfdf PROBABILITY OF MULTIPLE STARS FACE UP IN TOSS OF 30 DICE 0.25 This situation is modeled by the Binomial distribution because we are looking at only 2 possible outcomes: Star or Not-a-Star. The probability of a star coming face up is = 1 / 10 = 10%. 
The corresponding binomial histogram has a peak at 30 x 10% = 3, but is not symmetrical (it is skewed to the right). Probability 0.20 0.15 0.10 0.05 0.00 0 5 10 15 20 25 Number of Stars Face Up in Toss of 30 Dice 30 "Hypergeometric Distribution" The "Binomial distribution" describes frequencies of independent events, where the probability of one result is NOT influenced by a previous result (e.g., coin tosses --- reference the "multiplicative rule" of probability calculation, discussed previously). The "Hypergeometric distribution" looks almost identical to the Binomial, but describes frequencies where the probability of one result is influenced by a previous result, and therefore are NOT independent (e.g., sampling from a lot of 100 parts, only 99 of which are good --- reference the "multiplicative rule" "corollary", discussed previously). The Hypergeometric Distribution is very difficult to calculate by hand, but... The MS Excel function of the probability for the "Hypergeometric distribution" is... =hypgeomdist(N,S,D,P) N = Observed number of items in the Sample that exhibit the sought-after characteristic (e.g., 7 "good" parts) S = Sample size (e.g., 8 parts) D = # of items in the Population that exhibit the soughtafter characteristic (e.g., 99 “good” parts ) P = Population Size (e.g., 100 parts in the lot) "Hypergeometric Distribution" (back in the discussion on "probability" we asked...) What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is 99 of which are good and one of which is bad)? Back then, we calculated it like so: 1st draw 2nd draw 3rd draw 99 / 100 x 98 / 99 x 97 / 98 = 0.9700 Now we can use the hypergeometric Excel function instead: =hypgeomdist( 3, 3, 99, 100 ) = 0.9700 If we had instead used the binomial Excel function, we would have obtained this wrong answer: =binomdist( 3, 3, 0.99, false ) = 0.9703 ( which equals 99/100 x 99/100 x 99/100 ) Binomial vs. 
Hypergeometric Formula As long as sample size is not more than 1% of lot size, the two formulae give the "same" result. For example... SmplSize = 10, LotSize = 1000 (= Sample is 1% of Lot) =hypgeomdist( 10, 10, 990, 1000 ) = 0.904 =binomdist( 10, 10, 0.99, false ) = 0.904 SmplSize = 100, LotSize = 1000 (= Sample is 10% of Lot) right =hypgeomdist( 100, 100, 990, 1000 ) = 0.347 wrong =binomdist( 100, 100, 0.99, false ) = 0.366 FYI: MS Excel cannot calculate every combination of Hypergeometric values --- for example... =hypgeomdist( 135, 135, 9900, 10000 ) = #NUM! =binomdist( 135, 135, 0.99, false ) = 0.258 Examples of Normal Distributions The single most-used distribution in statistical analysis is the Normal distribution. Each of these "normal" curves describes a population that has the same average value, but different degrees of variability within the population. ( X-axis is in the same units as the raw data. Y-axis is count, i.e., # of observed items of a given X-value.) Examples of Normal Distributions X-axis is in “standard units” (which we will discuss later). Y-axis is count, i.e., # of observed items of a given X-value.) "Normal Distribution" equation The equation for what we now call the "Normal distribution histogram" was discovered around 1730, as a way to simplify calculation of the Binomial distribution; only power & square root tables were needed (rather than idiot savants). The Normal distribution histogram has the "same" shape as the Binomial when sample size is large and the probabilities of the outcomes are exactly 50:50 (for example, a histogram describing the various possible number of heads in a toss of a 10,000 coins). The larger the sample (e.g., the more coins), the closer the Normal histogram shape is to the Binomial histogram shape. 
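The Excel comparisons on the "Binomial vs. Hypergeometric Formula" slide, and the claim that the Normal histogram approaches the Binomial one, can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the three function names are invented here to mirror the argument order of =binomdist(N,S,B,false) and =hypgeomdist(N,S,D,P), and `normal_bar` assumes the usual Normal density with mean S x B and standard deviation sqrt(S x B x (1 - B)).

```python
from math import comb, exp, pi, sqrt

def binom_pmf(n_hits, sample_size, p):
    """Height of one binomial bar, like =binomdist(N, S, B, false)."""
    return comb(sample_size, n_hits) * p ** n_hits * (1 - p) ** (sample_size - n_hits)

def hypergeom_pmf(n_hits, sample_size, pop_hits, pop_size):
    """Height of one hypergeometric bar, like =hypgeomdist(N, S, D, P)."""
    ways = comb(pop_hits, n_hits) * comb(pop_size - pop_hits, sample_size - n_hits)
    return ways / comb(pop_size, sample_size)  # exact integers until the final division

def normal_bar(x, mean, std_dev, width=1):
    """Height of a Normal histogram bar of the given width centred on x."""
    z = (x - mean) / std_dev
    return width * exp(-0.5 * z * z) / (std_dev * sqrt(2 * pi))

# Sample is 1% of the lot: the two formulae agree to 3 decimals
print(round(hypergeom_pmf(10, 10, 990, 1000), 3), round(binom_pmf(10, 10, 0.99), 3))    # 0.904 0.904
# Sample is 10% of the lot: they no longer agree
print(round(hypergeom_pmf(100, 100, 990, 1000), 3), round(binom_pmf(100, 100, 0.99), 3))  # 0.347 0.366
# The case for which the slide reports #NUM! in Excel
print(round(hypergeom_pmf(135, 135, 9900, 10000), 3))
# 15 heads on 30 honest coins: exact binomial bar vs. Normal approximation
print(round(binom_pmf(15, 30, 0.5), 4), round(normal_bar(15, 15, sqrt(30 * 0.25)), 4))  # 0.1445 0.1457
```

Because Python integers have no size limit, the hypergeometric bar for the 135-of-10,000 case can be evaluated directly.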
"Normal Distribution" equation Independently re-discovered ≈ 1800 by 2 astronomers (Gauss & Laplace); nowadays, sometimes called the " Gaussian curve " They used it to describe the distribution of errors in measurements; it became known as the " error curve "... ...because errors in measurements act like a binomial situation, that is, a very precise measurement can be only one of two possibilities, namely either greater than the true value or less than the true value (ignoring the remote possibility of being exactly equal to the true value). Renamed the " Normal Distribution " around 1900 after it was discovered that the "error curve" closely described the typical (i.e., the normal) distribution of many biological values (e.g., heights of humans, weights of walruses, lengths of lizards). "Normal Distribution Histogram" Y = # of items expected at X (divide by N to get probability) N = # of items examined (e.g., 225 people) This equation your "student" files i= width of eachlooks singleintimidating, bar ( = lengthbut of interval) on histogram (for binomial & other discreet distributions, i = 1 ) for you, contain a spreadsheet that does the calculations X = x-axis of a given histogram andmidpoint then automatically createsbar the histogram! μ = average or expected value of all N items σ = standard deviation of all N items (we'll explain in a few minutes what a "standard deviation" is) If a histogram of your measurement data does not mimic the histogram created by this equation, then your data may actually not be "normal" !! Let's examine "Student Normal Histogram.xls" "Normal (quantity) Histogram" Normal QUANTITY Distribution HISTOGRAM This was created using the Normal Distribution Histogram equation, with N = 225, i = 0.1, Avg = 5.5, & StdDev = 0.33. This could represent the distribution of a heights of 225 randomly selected people. The sum of all these bars = N = 225. 
[Chart: Normal QUANTITY Distribution HISTOGRAM. X-axis: 4.0 to 7.0. Y-axis: QUANTITY (0 to 30).]

"Normal (probability) Histogram"
Normal PROBABILITY Distribution HISTOGRAM
This was created from the previous chart by dividing each quantity by N. The sum of all these bars = 1.000, no matter what the sample size is ( N = 225, or N = 1,000,000,000 ).
[Chart: X-axis: 4.0 to 7.0. Y-axis: PROBABILITY (0.00 to 0.14).]

"Normal (probability) Curve"
Normal PROBABILITY Curve
This was created from the previous chart by drawing a smooth line from top to top of each bar, and then deleting the bars. The area under this curve is defined as = 1.000
[Chart: X-axis: 4.0 to 7.0. Y-axis: PROBABILITY (0.00 to 0.14).]
Always view such curves as really a histogram whose bars we are too cheap to print.

The "Central Limit Theorem"
(The text on this slide is a scanned image from Bowker & Lieberman, Engineering Statistics, 2nd ed., p. 100)
Let's examine STUDENT file: Central Limit Theorem.xls

CENTRAL LIMIT THEOREM translates as...
• for any population of raw data with any shaped distribution...
• in regard to the distribution of a large number of statistics taken repeatedly from the population (e.g., averages, ranges, standard deviations, etc.)...
• the distribution of the statistics will look more and more "normal" ("bell" shaped) the larger the sample size is; that is true because the value of a statistic will be somewhere near the parameter, either larger or smaller than it (ignoring the unlikely event of equaling the parameter); i.e., it has a binomial distribution, which, as we saw before, is modeled by the "Error Curve", which in modern times is called the "Normal distribution".
• the distribution of the statistics will never "be" Normal, except in cases when N is very large and the raw data population distribution is "normal". Often, the distribution of statistics is "t" shaped, as we will see in Day 2 of this course.

Distribution of Sample Avgs. vs. Population
Theoretical distribution of thousands of individual avgs taken from the population. Shape is due to Central Limit Theorem. Width is due to Law of Large Numbers.
[Chart: Distribution of sample averages taken from 1st 250 rows (250 sample averages per each sample size). X-axis: SAMPLE SIZE (1 to 29). Y-axis: SAMPLE AVERAGE (70 to 130).]
Let's look at this in more detail using MS Excel.

Numerical Expressions
• Range ( 1848 ? )
• Standard Deviation ( 1893 ? )
• Standard Error ( 1897 ? )
Another important term is the "Mean", which is another way to say the "average". "Mean" in that sense was coined about 1750.

What is an "average" ?
About a hundred years ago, "average" usually meant the "median" ("the median home price in Dallas is...").
However, in more modern times, the word "average", by itself, always refers to the sum of all the values, divided by the number of values (i.e., the "arithmetic mean"):
Value#1 + Value#2 + Value#3 + ( etc. ) + Value#N = Sum of all Values
Average = Sum of all Values / N

What is a "range" ?
The "range" of a set of numbers refers to the difference between the largest and smallest value in that set:
Range = Largest Value - Smallest Value
The range of height of people in this room is approximately...? (different for women than men...? )

What is a "range" ?
[Number line from 400 to 600 with data points plotted along it.]
This "number line" uses small red squares to graphically depict the variability, that is, the "distribution", of the data in a small sample.
The distance between the value on the far left-hand side and the value on the far right-hand side is the "range". In this data, the range looks to be about 200 units.
“Standard” calculations Standard XXX (the mathematical definition, for population parameter) ∑ ( Xi – Mean )2 # of data points in the Mean Standard XXX (when using a sample to guess what the population parameter Standard XXX is) ∑ ( Xi – Mean )2 # of data points in the Mean, minus Y Y = whole number, greater than zero; value depends on which "standard" statistic is being calculated. Standard Deviation & Standard Error Standard XXX (from a previous slide) XXX = "Deviation" (that is "Standard Deviation") when talking about raw data (e.g., heights of humans, and lengths of lizards). XXX = "Error" (that is, "Standard Error") when talking about calculated values (i.e., Statistics), for example: -- sample means ("Standard Error of the Mean"), or -- sample standard deviations ("Standard Error of the Standard Deviation"). 100 Samples per data point (each point = average Std Dev of all 100 ). Random samples taken from normal population with Std Dev = 10.0 Standard Deviation Calculated 12 10 8 Standard Deviation (n-1) 6 Standard Deviation (n) Other random samples would produce differently shaped curves; but, on average, the "n" curve would be farther away (on the low side) from the true value than the "n-1" curve. That is, the "n-1" statistic is a better estimator of the parameter than the "n" one. 4 2 0 1 10 100 1000 10000 Sample Size This is another example of how to think about the "Law of Large Numbers"; that is, the larger the sample size, the closer (on average) the "statistic" is to the "parameter". (revisited) Distribution of Sample Avgs. vs. Population Theoretical distribution of thousands of individual avgs taken from the population. As was stated on a previous slide, a distribution (of raw data or of statistics such as "averages") is "normal" if it's histogram mimics a "normal" one. Said differently, a distribution is "normal" if its distribution has characteristics that mimic that of the "normal probability curve", such as... 
+/– 1 StdXXX from Avg = 68.3 % of area under curve +/– 2 StdXXX from Avg = 95.5 % of area under curve +/– 3 StdXXX from Avg = 99.7 % of area under curve as seen in next few slides... Areas under the "normal" curve The darkened area equals 68.3 % of the area under the curve. 70 80 90 100 110 120 130 Areas under the "normal" curve The darkened area equals 95.5 % of the area under the curve. 70 80 90 100 110 120 130 Areas under the "normal" curve If a population with Avg =100, StdXXX = 10 is believed to be Normally distributed, then... (1 – 0.9973) / 2 of population (≈ 0.135%) is predicted to be below X = 70 70 80 The darkened area equals 99.73 % of area under curve. This +/− 3 interval is used extensively in “Statistical Process Control” (SPC). 90 100 110 120 130 Areas under the "normal" curve This +/− 2.58 interval is used extensively in “Gage R&R” and other “Metrology” methods. The darkened area equals 99.0 % of the area under the curve. 70 80 90 100 110 120 130 Areas under the "normal" curve , then... 10 +/– 1.96 Std XXX equals 95.00% of area under curve If Standard XXX is 0.045 --- 0.040 0.035 PROBABILITY This +/− 1.96 interval is used in some Reliability calculations & in some tests of “Significance”. The darkened area equals 95.0 % of the area under the curve. 0.030 0.025 0.020 0.015 0.010 0.005 0.000 67 70 73 80 80 87 90 93 100 100 110113 107 120 120 130 133 127 This is called a " Z " Table In a normal distribution, +/– Z std deviations from the Parameter Avg encompasses 2 x A of the population of numbers. +/– 1.96 standard deviations equals 2 x 0.4750 = 95.0% of the area under the normal curve +/– 3.00 standard deviations equals 2 x 0.4987 = 99.7% of the area under the normal curve Class exercise: Estimation of Std Dev Assuming this ≈ normal distribution of raw data, approximately what is the Std Deviation? Almost all of distribution is ≈ Mean +/– 30. 
If "normal", then 30 ≈ 3 StdDevs; therefore StdDeviation ≈ 10 70 80 90 100 110 120 130 Class exercise: Estimation of Std Error Assuming this ≈ normal distribution of Smpl Avgs, approximately what is the ≈ Standard Error? Almost all of distribution is ≈ Mean +/– 15. If "normal", then 15 ≈ 3 Std Errors; therefore Std Error ≈ 5 85 90 95 100 105 110 115 Calculating a "standard error" Any statistic from a single sample will likely not be identical to the parameter. For example, you can expect a sample mean to be off by some unknown amount from the population mean, i.e. to have some amount of "error". The "standard" amount of error to expect is called the "standard error". The theoretical definitions of two important standard errors are: Std Error of Mean = Std Dev of all possible (or at least a very large number of) sample averages (of a single sample size) taken from a Population. Std Error of StdDev = Std Dev of all possible (or at least a very large number of) "n-1" std deviations (of a single sample size) taken from a Population. Calculating a "standard error" Avg#1 Avg#2 Avg#3 Avg#4 etc. Avg#N ------------Std Dev of Avgs = Std Error of the Mean StdDev#1 StdDev#2 StdDev#3 StdDev#4 etc. StdDev#N ---------------- Std Dev of StdDevs = Std Error of the Std Deviation Practical formula for "Std Error of Mean" Standard Error of the (sample) Mean ( estimated from 1 sample ) Sample Standard Deviation Sample Size . Linear Regression & the Correlation Coefficient What is the meaning of a Linear Regression Correlation Coefficient? In 2009, a billion dollar manufacturing company submitted to a government regulatory agency a report from a product technical file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99). 
The regulatory personnel knew that such a claim is nonsense, and so they officially requested a literature or textbook reference that explained such a rationale. After a few rounds of emails and rewritings of the report (and still no literature reference), the company consulted a professional statistician, who recommended using a different statistical method to prove equivalency.

Understanding Linear Regression & the Correlation Coefficient
X (class = grade in school)    Y (minutes of study before being easily distracted)
2                              4.2
3                              5.9
5                              10.4
6                              11.5
Is there a linear relationship between class (= grade) in school and tendency toward distraction? How strong is it? How consistent is the relationship (that is, what is the degree of co-relation, more commonly called "correlation")? Let's use Excel to find out!

Understanding Linear Regression & the Correlation Coefficient
[Figure: linear regression plot of the data, X axis 0 to 7, Y axis 0 to 14; fitted line y = 1.91x + 0.36, R2 = 0.9897]
• This is a "linear regression plot" of the data.
• The "regression coefficient" is 1.91
• The "correlation coefficient" is " R " or " r " , that is, " r " = the square root of 0.9897 = 0.995

Understanding Linear Regression & the Correlation Coefficient
[Figure: same plot; fitted line y = 1.91x + 0.36, R2 = 0.9897]
• Linear regression puts the "best" straight line thru a plot of X vs. Y data points.
• The "regression coefficient" (= 1.91 = the slope of this line) tells us how STRONG the relationship is.

Understanding Linear Regression & the Correlation Coefficient
[Figure: same plot; fitted line y = 1.91x + 0.36, R2 = 0.9897. Annotation: This is an example of "Reliability Plotting", which is discussed in Day 3 of this workshop.]
• The linear regression equation (e.g., Y = 1.91X + 0.36 ) allows us to predict the Y value for a nearby X value.
• CLASS EXERCISE: What Y value do we expect at X = 1.0?
ANSWER: ( 1.91 times 1.0 ) + 0.36 = 2.27

Understanding Linear Regression & the Correlation Coefficient
MS Excel Spreadsheet functions...
linear regression coefficient:
=SLOPE( known_y's, known_x's )
correlation coefficient (same result given by either):
=CORREL( known_y's, known_x's )
or...
=CORREL( known_x's, known_y's )
Notice that the function formula for the slope cares about which data set is X and which is Y, but the formula for the correlation coefficient does not.

Are Correlation Coefficients the same if data sets are the same except for magnitude...???
YES !!
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for all three]

Does the Correlation Coefficient increase in size with additional data points...??
NO !!
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.962, 0.971]

Does a large Correlation Coefficient indicate that the data is truly linear...???
NO !! (notice how the lower-most 2 data sets show a slight curve) (the solid black lines are all straight, not curved)
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for all three]

If the data is close to the line, is the Correlation Coefficient always large...???
NO !!
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.791, 0.064. Annotation: slight slope to this lowest regression line]

Does a large Correlation Coefficient indicate that the X,Y data have a strong relationship (i.e., that the regression coefficient is large)...??
NO !!
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for all three. Annotation: slight slope & 2 dots per point in this lowest regression line]

Understanding Linear Regression & the Correlation Coefficient
$$|r| = \frac{\text{Se}}{\text{Sy}}$$
There are at least a dozen different formulas for the Correlation Coefficient. The instructor considers this the best formula for teaching the meaning of Correlation. The next few slides explain it....
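For readers who want to check these statements outside Excel, here is a minimal Python sketch using the study-time data from the earlier slide. The helper names `slope_intercept` and `correl` are made up for this illustration; they only mirror what Excel's SLOPE and CORREL report and are not part of the course materials. Se is taken as the "n-1" std deviation of the Y values predicted by the regression line, and Sy as the "n-1" std deviation of the observed Y values.

```python
from statistics import mean, stdev

# Data from the study-time slide: X = grade in school, Y = minutes of study.
xs = [2, 3, 5, 6]
ys = [4.2, 5.9, 10.4, 11.5]


def slope_intercept(known_ys, known_xs):
    """Least-squares slope and intercept (argument order mirrors Excel's SLOPE)."""
    x_bar, y_bar = mean(known_xs), mean(known_ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def correl(a, b):
    """Pearson correlation coefficient; the argument order does not matter."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    saa = sum((p - a_bar) ** 2 for p in a)
    sbb = sum((q - b_bar) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


slope, intercept = slope_intercept(ys, xs)
r = correl(ys, xs)

# Ye = predicted Y values on the regression line.
ye = [slope * x + intercept for x in xs]
se = stdev(ye)  # std deviation of the predicted Y values
sy = stdev(ys)  # std deviation of the observed Y values

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"r = {r:.3f}, Se / Sy = {se / sy:.3f}")
print(f"slope with X and Y swapped = {slope_intercept(xs, ys)[0]:.3f}")
```

The slope, intercept, and r should match the Excel chart (1.91, 0.36, and 0.995), and Se / Sy should equal r. Swapping X and Y changes the slope but not the correlation coefficient.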
Understanding Linear Regression & the Correlation Coefficient
Ye is calculated from the linear regression equation that is used to draw the "straight line" thru the data: Ye = y = aX + b
[Figure: linear regression plot, X axis 0 to 7, Y axis 0 to 14; fitted line y = 1.91x + 0.36, R2 = 0.9897]
The square root of 0.9897 = r = 0.995 = correlation coefficient (this chart & equation were produced by MS Excel)

Understanding Linear Regression & the Correlation Coefficient
Continuing with the data and equation from the previous slide (Ye = 1.91 ( X ) + 0.36):
observed X    observed Y    Ye
2             4.2           4.18
3             5.9           6.09
5             10.4          9.91
6             11.5          11.82
Std Dev       3.505 = Sy    3.487 = Se
r = 3.487 / 3.505 = 0.995 = same as on previous slide (this is not a trick; it is just one of many mathematically identical formulas for calculating the magnitude of “ r ”)

Understanding Linear Regression & the Correlation Coefficient
$$|r| = \frac{\text{Se}}{\text{Sy}}$$
The absolute value of the Correlation Coefficient is the ratio of 2 standard deviations: the numerator is the smallest possible standard deviation that can be expected in the Y data points ( = Se ), and the denominator is the observed standard deviation in the Y data points ( = Sy ). If the observed data were closer to the linear regression line, then Sy would be smaller and the Se / Sy ratio would be closer to 1.000.
THE CORRELATION COEFFICIENT THEREFORE IS A MEASURE OF VARIABILITY, OF HOW CONSISTENTLY THE PLOTTED DATA TRACKS TO THE LINEAR REGRESSION LINE.

Understanding Linear Regression & the Correlation Coefficient
$$|r| = \frac{\text{Se}}{\text{Sy}}$$
The Correlation Coefficient is... the fraction of the observed Y data variation ( = Sy, the std deviation of the observed Y values) that is explainable by a linear relationship between X and Y ( the variation “associated with” or “caused by” that linear relationship is Se, the std deviation of the predicted Y values).
The rest of the variation in the data is definitely due to something else (e.g., poor measurement equipment, poor measurement technique, other factors, random error, or... the fact that the data are NOT linearly related !!).

Assuming Y is dependent on X, what is the source (the "cause") of the variation in Y-values?
[Figure: two data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 and r = 0.064. Annotation: slight slope to this lowest regression line]
r = 0.955: Sometimes there is no "cause" (e.g., correlation between arm-length and leg-length).
r = 0.064: Almost no variation in Y is "caused" by relationship between X & Y (something else is the "cause", such as assay variation or measurement error).

What is the meaning of a (linear regression) Correlation Coefficient?
• The correlation coefficient is... an indicator of predictability in the data on the Y axis.
• It represents... the fraction of the variation in the Y-data that can be explained by an hypothesized linear relationship between X and Y.
• If that hypothesis is false, i.e., if the relationship between X and Y is not truly linear, then the Correlation Coefficient is meaningless.
$$r = \frac{\text{(stdev solid dots)}}{\text{(stdev hollow dots)}}$$

As mentioned earlier, in 2009, a billion dollar company submitted to a regulatory agency a report in a tech file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99). Have you learned enough to explain why that is nonsense?
[Figure: two plots of STRESSED (Y axis) vs. UNSTRESSED (X axis, 0 to 15), each with fitted line y = 0.5076x - 0.078, R2 = 0.9937; the Y axis runs 0 to 6 in one plot and 0 to 15 in the other]

Conclusion to: Understanding Linear Regression & the Correlation Coefficient:
[Figure: three data sets with regression lines, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for all three]
• Just because Excel lets you put a Linear Regression line thru data points does not mean the data is a straight line.
• Just because the Correlation Coefficient is large does not mean you have a straight line.
• You must use your judgment to determine if the line is straight, and if "yes", then and only then can you use the Linear Regression Equation and Correlation Coefficient to help you evaluate the relationship between your X and Y values.

How to implement what you learned today?
A new language (and some of its vocabulary) is primarily what you learned today. Like any language, you must speak it if you are to learn it well.
Read your company's SOP (or ??) on statistical techniques.
Ask to read some of the validation protocols and validation reports that relate to your work, and study their "statistics" section (or it might be called the "data analysis" section).
Ask your boss to explain statistical statements made in meetings, reports, or SOPs.
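One way to practice is to revisit the 2009 stressed vs. unstressed example with invented numbers. The sketch below is only an illustration: the unstressed values are hypothetical, and the stressed values are placed exactly on the regression line shown on the slide (y = 0.5076x - 0.078). None of these numbers are the company's data, and `correl` is a made-up helper name.

```python
from statistics import mean

# HYPOTHETICAL unstressed results, invented for this illustration.
unstressed = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
# Stressed results placed exactly on the slide's regression line (no scatter).
stressed = [0.5076 * u - 0.078 for u in unstressed]


def correl(a, b):
    """Pearson correlation coefficient of two equal-length data sets."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    saa = sum((p - a_bar) ** 2 for p in a)
    sbb = sum((q - b_bar) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


r = correl(unstressed, stressed)

print(f"r = {r:.3f}")
print(f"mean unstressed = {mean(unstressed):.2f}")
print(f"mean stressed   = {mean(stressed):.2f}")
```

With these made-up values the correlation coefficient is essentially 1, yet the stressed results average only about half of the unstressed results. A large r shows that the two data sets track each other consistently, not that they are equal.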