Handout on Chapter 3
Learning Objectives

Definition: Statistics is the science that deals with the collection of data, the analysis of data, and making inferences about a population using the information contained in a sample.

Population: A finite or infinite collection of measurements or individuals that comprises the totality of all possible measurements within the context of a particular statistical study.

Sample: A subset of measurements selected from the population of interest.

An example of Population and Sample

A nationwide survey was conducted to determine which issues were of greatest concern among Americans. Each respondent in the survey was randomly selected according to a sampling plan reflecting the proportion of individuals in categories defined by several demographic variables such as age, sex, income and geographic region. Participants were asked to specify the national problem that caused them the most concern. Some typical responses were poverty, drug abuse, unemployment, and the federal budget deficit.

(a) What is the response that will be measured in this survey?
(b) Define the population of interest to the experimenter.
(c) Describe the sampling procedure used by the experimenter.
(d) What demographic groupings might the experimenter consider as subpopulations within the main population to be studied concerning their response to the survey?

3.1 Describing Variation

Some variation in a process is unavoidable, because no two units of product produced by the same manufacturing process are identical. Statistics is the science of analyzing data and drawing inferences while taking the variation in the data into account.

3.1.1 The Stem-and-Leaf Plot (stem plot)

Suppose we have a set of data denoted by $x_1, x_2, \ldots, x_n$, and each number $x_i$ consists of at least two digits. To construct a stem plot, we divide each number $x_i$ into two parts: a stem, consisting of one or more of the leading digits, and a leaf, consisting of the remaining digits.
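The stem-and-leaf construction described above can be sketched in Python. This is a minimal illustration, assuming non-negative integer data where the last digit is taken as the leaf and all leading digits as the stem; the function name `stem_and_leaf` and the sample values are hypothetical and are not the Table 3.1 data.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a simple stem-and-leaf plot.

    Assumes non-negative integers. The stem is every digit except the
    last; the leaf is the last digit. Leaves are sorted within each stem.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # split off the final digit as the leaf
        leaves[stem].append(leaf)

    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {digits}")
    return lines


# Hypothetical cycle times in days (illustrative values only).
sample = [48, 41, 35, 36, 37, 26, 36, 46, 35, 47]
for line in stem_and_leaf(sample):
    print(line)
```

Printing the empty stems is a design choice: it keeps the vertical scale uniform, so the plot can be read like a histogram turned on its side.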
Example 3.1, page 64: A sample of the cycle times in days to process and pay employee health insurance claims in a large company is given in Table 3.1. The data and stem plot are presented below:

[Table 3.1 and its stem-and-leaf plot]

Figure 3.2 is also called a run chart.

3.1.2 The Histogram

Bar charts that depict data on a single measured characteristic are called histograms. The bars are formed by dividing the horizontal scale into a collection of classes and then counting the frequencies with which the measurements fall into these classes. A histogram gives a visual display of the data and is very useful for describing the shape of the data distribution. The shape of a histogram can be symmetric or skewed (left skewed or right skewed).

Example 3.2, page 67: The thickness of a metal layer on 100 silicon wafers resulting from a chemical vapor deposition (CVD) process in a semiconductor plant is presented in Table 3.2. Construct a histogram for these data.

Construction of a Histogram
• Group the values of the variable into bins (or classes, groups), then count the number of observations that fall into each bin
• Plot frequency (or relative frequency) versus the values of the variable

Shape of the layer thickness data? Reasonably symmetric, or bell shaped.

3.1.3 Numerical Summary of Data

Statistic: Any number or summary measure calculated from a set of sample data is called a statistic. A statistic is a function of the sample observations.

Sample Average: Suppose $x_1, x_2, \ldots, x_n$ are the observations in a sample. The most important measure of central tendency in the sample is the sample average (or sample mean):

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (3.1)$$

Sample Variance (or dispersion): The variability in the sample data is measured by the sample variance, defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad (3.2)$$

A short-cut formula for the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

The square root of the sample variance is called the sample standard deviation (SD) and is denoted by s:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (3.3)$$

The main advantage of the sample standard deviation is that it is expressed in the original units of measurement. That means the mean and the SD have the same units. The sample variance and standard deviation of the metal thickness data are 180.2928 and 13.43, respectively.

3.1.4 The Box Plot

Stem plots and histograms are excellent graphic displays for focusing attention on key aspects of the shape of a distribution of data. However, they are not good tools for making comparisons among data sets. To construct a box plot, we need the following five-number summary.

Five-number summary: Minimum, First Quartile, Median, Third Quartile and Maximum.

Minimum: The smallest value in the data set.

Maximum: The largest value in the data set.

Median: The middle value of a data set. That is, the median of a set of measurements is the value of x such that at most half of the measurements are less than x and at most half of the measurements are greater than x.

First Quartile (Lower quartile): The middle value among the data points below the median, denoted by $Q_1$.

Third Quartile (Upper quartile): The middle value among the data points above the median, denoted by $Q_3$.

Interquartile Range (IQR) = $Q_3 - Q_1$

Example 3.4, page 71: The data in Table 3.4 are diameters (in mm) of holes in a group of 12 wing leading edge ribs for a commercial transport airplane. Construct and interpret the box plot of these data.
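The five-number summary defined above can be computed with a short Python sketch. It assumes the handout's convention that $Q_1$ and $Q_3$ are the medians of the data points below and above the median (the median itself is excluded when n is odd); statistical software often uses other quantile rules and may give slightly different quartiles. The function name and the sample values are hypothetical and are not the Table 3.4 data.

```python
from statistics import median


def five_number_summary(data):
    """Return min, Q1, median, Q3, max and IQR for at least two values.

    Q1 and Q3 are taken as the medians of the lower and upper halves of
    the sorted data, excluding the overall median when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                # points below the median
    upper = xs[half + 1:] if n % 2 else xs[half:]    # points above the median
    q1, q3 = median(lower), median(upper)
    return {
        "min": xs[0],
        "q1": q1,
        "median": median(xs),
        "q3": q3,
        "max": xs[-1],
        "iqr": q3 - q1,
    }


# Hypothetical measurements (illustrative values only).
print(five_number_summary([7, 1, 3, 9, 5, 11, 13, 15]))
```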
From the box plot of the hole diameter data in Example 3.4 we find minimum = 120.1, $Q_1$ = 120.35, median ($Q_2$) = 120.6, $Q_3$ = 120.9 and maximum = 121.3. Because the upper part of the data is more spread out than the lower part (maximum − $Q_3$ = 0.4 versus $Q_1$ − minimum = 0.25), we expect the data to be right skewed.

Comparative Box Plots

Figure 3.8 shows the comparative box plots for a manufacturing quality index on products at three manufacturing plants. We can see higher variability in plant 2, and both plants 2 and 3 need to raise their quality index performance.

Comments on Mean, Median, SD and IQR: The mean provides a better description of the center of a data set if the distribution of the data is symmetric, while the median provides a better description of the center of skewed (right or left) data. The standard deviation (SD) provides a better description of the variability of symmetric data, while the IQR provides a better description of the variability of a skewed data set.

3.1.5 Probability Distributions

Discrete and Continuous Probability Distributions

For a discrete (integer-valued) probability distribution:

$$P(X = a) = P(X \le a) - P(X \le a - 1)$$
$$P(a \le X \le b) = P(X \le b) - P(X \le a - 1)$$

For a continuous probability distribution:

$$P(X = a) = 0$$
$$P(a \le X \le b) = P(X \le b) - P(X \le a)$$

The population mean and population standard deviation

The mean is not necessarily the 50th percentile of the distribution (that is the median). The mean is not necessarily the most likely value of the random variable (that is the mode). However, for a mound-shaped (symmetric) distribution, the mean, median and mode are the same.

3.2 Important Discrete Distributions

3.2.1 The Hypergeometric Distribution

Suppose there are N items in a lot and D of these items are defective. A random sample of n items is selected from these N items without replacement.
If x denotes the number of defective items in the sample of size n, then x follows a hypergeometric distribution, defined as follows:

$$p(x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \qquad x = 0, 1, \ldots, \min(n, D) \qquad (3.8)$$

See the example on pages 76-77.

3.2.2 The Binomial Distribution

Consider a process that consists of a sequence of n independent trials. When the outcome of each trial is either "success" or "failure", the trials are called Bernoulli trials. If the probability of "success" on any trial, say p, is constant, then the number of successes x in n Bernoulli trials has the binomial distribution with parameters n and p, defined as follows:

$$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n$$

Extra Example 1: Suppose ten items will be tested from a lot. Each item can pass the test with probability 0.90 and fail with probability 0.10. Calculate the probability that (a) exactly 3 items will fail, (b) fewer than 3 items will fail, (c) between 2 and 4 items (inclusive) will fail.

3.2.3 The Poisson Distribution

The Poisson distribution is widely used in statistical quality control and improvement, frequently as the underlying probability model for count data.

Extra Example 2: For a certain manufacturing industry, the number of accidents averages 2 per week.
(a) Find the probability that at least 2 accidents will occur in a given week.
(b) Find the probability that no accident will occur in 2 weeks.
(c) What is the expected number of accidents in a given 28 days?

3.2.4 The Pascal Distribution (Negative Binomial Distribution)

The Pascal distribution, like the binomial distribution, has its basis in Bernoulli trials. Consider a sequence of independent trials, each with probability of success p, and let x denote the trial on which the rth success occurs. Then x is a Pascal random variable with the following probability distribution:

$$p(x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \qquad x = r, r+1, r+2, \ldots$$

• When r = 1 the Pascal distribution is known as the geometric distribution
• The geometric distribution has many useful applications in SQC

Extra Example 3: Suppose 10% of the engines manufactured on a certain assembly line are defective.
If engines are randomly selected one at a time and tested, find the probability that the third non-defective engine is found on the fifth trial. Find the mean and variance of the trial number on which the third non-defective engine is found.

3.3 Some Important Continuous Distributions

3.3.2 The Normal Distribution

The normal distribution is the most useful distribution in both the theory and the application of statistics. If x is a normal random variable with mean $\mu$ and variance $\sigma^2$, then the probability distribution of x is defined as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty$$

Standard Normal Distribution

Example 3.7, page 83
Example 3.8, page 84
Example 3.9, page 85

Linear Combinations of Normal Distributions

That means y is normally distributed with mean $\mu_y$ and variance $\sigma_y^2$, or in short, $y \sim N(\mu_y, \sigma_y^2)$.

Central Limit Theorem (CLT)

Practical interpretation: the sum (or average) of a sufficiently large number of independent random variables is approximately normally distributed, regardless of the distribution of each individual random variable in the sum.

3.3.3 The Exponential Distribution

Exercise 3.29, page 101.

The cumulative distribution function (cdf) of the exponential distribution is

$$F(a) = P(x \le a) = 1 - e^{-\lambda a}, \qquad a \ge 0$$

This cdf is very useful for solving problems involving the exponential distribution.

3.3.4 The Gamma Distribution

Result: If $x_1, x_2, \ldots, x_r$ are independent and exponentially distributed with parameter $\lambda$, then $y = x_1 + x_2 + \cdots + x_r$ follows a gamma distribution with parameters $\lambda$ and r.

Example 3.11, page 91.

3.4 Probability Plots
• Determining if a sample of data might reasonably be assumed to come from a specific distribution
• Probability plots are available for various distributions
• Easy to construct with computer software (MINITAB)
• Subjective interpretation

3.4.1 Normal Probability Plots

3.4.2 Other Probability Plots (page 95)

3.5 Some Useful Approximations

3.5.1 The Binomial Approximation to the Hypergeometric

Consider the hypergeometric distribution in equation (3.8).
If $n/N \le 0.10$, then the binomial distribution with parameters $p = D/N$ and n is a good approximation to the hypergeometric distribution. The approximation is better for small $n/N$, which is also called the sampling fraction.

See the example on page 96.

3.5.2 The Poisson Approximation to the Binomial

When n is large and p is small (p < 0.1), the Poisson probability distribution with $\lambda = np$ provides a good approximation to binomial probabilities.

Extra Example 4: When the circuit boards used in the manufacture of compact disc players are tested, the percentage of defectives is found to be 5%. Let X denote the number of defective boards in a random sample of size 100. Then X has a binomial distribution. What is the probability that none of the 100 boards is defective?

3.5.3 The Normal Approximation to the Binomial Distribution

If x is distributed as binomial with parameters n and p, then the binomial probability distribution can be approximated by a normal curve with $\mu = np$ and $\sigma = \sqrt{npq}$, where n = number of trials, p = probability of success, and q = 1 − p. The binomial probability $P(a \le x \le b)$ can be approximated by the normal probability $P[(a - 0.5) \le x \le (b + 0.5)]$ as long as n is large and the interval $np \pm 2\sqrt{npq}$ falls between 0 and n. The half-unit adjustment is called the correction for continuity. That means

$$P(a \le x \le b) \cong P[(a - 0.5) \le x \le (b + 0.5)]$$

Extra Example 5: Suppose that 25% of the fire alarms in a large city are false alarms. Let x denote the number of false alarms in a random sample of 100 alarms. Find the approximate probability that
(a) there will be at least 30 false alarms.
(b) there will be no more than 35 false alarms.
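As a numerical check on the continuity correction in Section 3.5.3, the following Python sketch compares exact binomial probabilities with their normal approximations for the setting of Extra Example 5 (n = 100, p = 0.25). The function names are illustrative and do not come from the handout; only the standard library is used, and the normal cdf is computed from the error function.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def binom_prob_exact(a, b, n, p):
    """Exact P(a <= x <= b) for x ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, b + 1))


def binom_prob_normal(a, b, n, p):
    """Normal approximation to P(a <= x <= b) with continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    upper = normal_cdf((b + 0.5 - mu) / sigma)
    lower = normal_cdf((a - 0.5 - mu) / sigma)
    return upper - lower


n, p = 100, 0.25

# (a) at least 30 false alarms: P(30 <= x <= 100)
print("P(x >= 30): exact", binom_prob_exact(30, n, n, p),
      "approx", binom_prob_normal(30, n, n, p))

# (b) no more than 35 false alarms: P(0 <= x <= 35)
print("P(x <= 35): exact", binom_prob_exact(0, 35, n, p),
      "approx", binom_prob_normal(0, 35, n, p))
```

Here $np \pm 2\sqrt{npq} = 25 \pm 8.66$ lies well inside the interval from 0 to 100, so the condition stated above for using the approximation is satisfied.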