Handout on Chapter 3

Learning Objectives
Definition: Statistics is the science of collecting data, analyzing data,
and making inferences about a population using the information contained
in a sample.
Population: A finite or infinite collection of measurements or individuals
that comprises the totality of all possible measurements within the context of
a particular statistical study.
Sample: A sample is a subset of measurements selected from the population
of interest.
An example of Population and Sample
A nationwide survey was conducted to determine which issues were of
greatest concern among Americans. Each respondent in the survey was
randomly selected according to a sampling plan reflecting the proportion of
individuals in categories defined by several demographic variables such as
age, sex, income and geographic region. Participants were asked to specify
the national problem that caused them the most concern. Some typical
responses were poverty, drug abuse, unemployment, and the federal budget
deficit.
(a) What is the response that will be measured in this survey?
(b) Define the population of interest to the experimenter.
(c) Describe the sampling procedure used by the experimenter.
(d) What demographic groupings might the experimenter consider as
subpopulations within the main population to be studied concerning their
response to the survey?
3.1 Describing Variation
Some variation in a process is unavoidable, because no two units of product
made by the same manufacturing process are identical. Statistics is the science
of analyzing data and drawing inferences while taking the variation in the data
into account.
3.1.1 The Stem-and-Leaf Plot (stem plot)
Suppose we have a set of data denoted by $x_1, x_2, \ldots, x_n$ and each
number $x_i$ consists of at least two digits. To construct a stem plot, we
divide each number $x_i$ into two parts: a stem, consisting of one or more of
the leading digits, and a leaf, consisting of the remaining digits.
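The splitting rule can be sketched in a few lines of Python. This is only a minimal illustration, assuming non-negative integer data and a one-digit leaf; the function name `stem_and_leaf` and the data values are hypothetical and are not the values from Table 3.1.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 48 -> stem 4, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical cycle times in days (illustration only).
data = [48, 21, 35, 16, 30, 27, 41, 33, 19, 25]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem followed by its ordered leaves, so the row lengths show the shape of the distribution.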
Example 3.1, page 64: A sample of the cycle times, in days, to process and pay
employee health insurance claims in a large company is given in Table 3.1.
The data and stem plot are presented below:
Figure 3.2 is also called a run chart.
3.1.2 The Histogram
Bar charts that depict data on a single measured characteristic are called
histograms. The bars are formed by dividing up the horizontal scale into a
collection of classes and then counting the class frequencies with which the
measurements fall into these classes. A histogram gives a visual display
of the data and is very useful for describing the shape of the data distribution. The
shape of the histogram could be symmetric or skewed (left skewed or right
skewed).
Example 3.2, page 67: The thickness of a metal layer on 100 silicon wafers
resulting from a chemical vapor deposition (CVD) process in a
semiconductor plant is presented in Table 3.2. Construct a histogram for
these data.
Construction of a Histogram
• Group values of the variable into bins (or classes, groups), then count
the number of observations that fall into each bin
• Plot frequency (or relative frequency) versus the values of the variable
Shape of the layer thickness data? Reasonably symmetric or bell shaped
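The two construction steps can be sketched as follows. This is a rough illustration, assuming equal-width bins of the form [lower, lower + width); the function name `frequency_table`, the bin settings and the readings are hypothetical and are not the Table 3.2 data.

```python
def frequency_table(values, start, width, n_bins):
    """Count how many observations fall into each equal-width bin."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)   # index of the bin containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical layer-thickness readings (illustration only).
data = [412, 418, 425, 431, 433, 436, 440, 441, 444, 447, 452, 455, 463, 478]
start, width, n_bins = 410, 20, 4

counts = frequency_table(data, start, width, n_bins)
for k, c in enumerate(counts):
    lo = start + k * width
    rel = c / len(data)                 # relative frequency
    print(f"[{lo}, {lo + width}): {'*' * c}  freq={c}  rel={rel:.2f}")
```

Plotting either `counts` or the relative frequencies against the bin positions gives the histogram; the text bars printed here are a crude stand-in for that plot.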
3.1.3 Numerical Summary of Data
Statistic: Any number or summary measure calculated from a set of sample
data is called a statistic. A statistic is a function of the sample observations.
Sample Average: Suppose $x_1, x_2, \ldots, x_n$ are the observations in a sample.
The most important measure of central tendency in the sample is the sample
average (or sample mean).
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (3.1)$$
Sample Variance (or dispersion): The variability in the sample data is
measured by the sample variance and defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad (3.2)$$
A short-cut method for sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$
The square root of the sample variance is called sample standard deviation
(SD) and denoted by s,
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (3.3)$$
The main advantage of the sample standard deviation is that it is
expressed in the original units of measurement. That means both the mean and
the SD have the same units of measurement.
The sample variance and standard deviation of metal thickness data are
180.2928 and 13.43 respectively.
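The summary formulas of this section can be checked numerically. The sketch below assumes a small hypothetical sample (not the metal thickness data) and a hypothetical helper name `sample_summary`; it computes the variance both from the definition and from the short-cut form, then confirms that the reported SD of 13.43 is the square root of the reported variance 180.2928.

```python
import math
import statistics

def sample_summary(xs):
    """Return (mean, variance, short-cut variance, standard deviation)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)                    # definition
    var_shortcut = (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)   # short-cut
    return mean, var, var_shortcut, math.sqrt(var)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical observations
mean, var, var_shortcut, sd = sample_summary(data)
print(mean, var, var_shortcut, round(sd, 4))

# Cross-check against the standard library's sample (n - 1) variance.
print(statistics.variance(data))

# Consistency of the values quoted in the handout.
print(round(math.sqrt(180.2928), 2))
```

The short-cut form can lose precision in floating point when the mean is large relative to the spread, so the definitional form is usually the safer one to compute with.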
3.1.4 The Box Plot
Stem plots and histograms are excellent graphic displays for focusing
attention on key aspects of the shape of a distribution of data. However,
they are not good tools for making comparisons among data sets. To
construct a box plot, we need the following five-number summary.
Five-number summary: Minimum, First Quartile, Median, Third Quartile
and Maximum.
Minimum: Minimum is the smallest value in the data set.
Maximum: Maximum is the largest value in the data set.
Median: Median is the middle most value of a data set. That is, the median
of a set of measurements is the value of x such that at most half of the
measurements are less than x and at most half of the measurements are
greater than x.
First Quartile (Lower quartile): First quartile is the middle value among the
data points below the median and is denoted by Q1.
Third Quartile (Upper quartile): Third quartile is the middle value among
the data points above the median and is denoted by Q3.
Interquartile Range (IQR) = Q3 - Q1
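A minimal sketch of these definitions follows, assuming the quartiles are taken as the medians of the lower and upper halves of the sorted data (with the middle value excluded from both halves when n is odd). Statistical packages use several other quartile conventions, so their output may differ slightly. The function name and data are hypothetical.

```python
import statistics

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]                              # points below the median
    upper = s[half + 1:] if n % 2 else s[half:]   # points above the median
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

data = [7, 1, 3, 9, 5, 11, 13, 15]   # hypothetical measurements
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1
print(minimum, q1, median, q3, maximum, iqr)
```

The box of a box plot spans Q1 to Q3 with a line at the median, so its length is the IQR.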
Example 3.4, page 71: The data in Table 3.4 are diameters (in mm) of holes in
a group of 12 wing leading edge ribs for a commercial transport airplane.
Construct and interpret the box plot of these data.
From the above box plot we find minimum = 120.1, Q1 = 120.35,
median (Q2) = 120.6, Q3 = 120.9 and maximum = 121.3. The upper whisker and the
upper half of the box are slightly longer than the lower ones, so the data
appear slightly right skewed.
Comparative Box plots
Figure 3.8 shows the comparative box plots for a manufacturing quality
index on products at three manufacturing plants. We can see higher
variability in plant 2, and both plants 2 and 3 need to raise their quality
index performance.
Comments on Mean, Median, SD and IQR:
The mean provides a better description of the center of a data set if the
distribution of the data is symmetric, while the median provides a better
description of the center of a skewed (right or left) data set. Similarly, the
standard deviation (SD) provides a better description of the variability of a
symmetric data set, while the IQR provides a better description of the
variability of a skewed data set.
3.1.5 Probability Distributions
Discrete and Continuous Probability Distributions
For a discrete probability distribution (X taking integer values):
P(X = a) = P(X ≤ a) − P(X ≤ a − 1)
P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a − 1)
For a continuous probability distribution:
P(X = a) = 0
P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a)
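The discrete rules can be illustrated with a fair six-sided die, which is an example chosen here for simplicity and not one from the handout. The sketch uses exact fractions so the results are easy to verify by hand.

```python
from fractions import Fraction

def die_cdf(k):
    """P(X <= k) for a fair six-sided die with faces 1..6."""
    k = max(0, min(6, k))
    return Fraction(k, 6)

# P(X = 3) = P(X <= 3) - P(X <= 2)
p_equal_3 = die_cdf(3) - die_cdf(2)

# P(2 <= X <= 5) = P(X <= 5) - P(X <= 1)
p_2_to_5 = die_cdf(5) - die_cdf(1)

# Subtracting P(X <= 2) instead would wrongly drop the outcome X = 2.
p_wrong = die_cdf(5) - die_cdf(2)

print(p_equal_3, p_2_to_5, p_wrong)
```

In the continuous case a single point carries no probability, which is why the lower limit is subtracted at a rather than at a − 1.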
The population mean and population standard deviation
The mean is not necessarily the 50th percentile of the distribution (that’s the
median). The mean is not necessarily the most likely value of the random
variable (that’s the mode). However, for a symmetric, mound-shaped
distribution, the mean, median and mode are the same.
3.2 Important Discrete Distributions
3.2.1 The Hypergeometric Distribution
Suppose there are N items in a lot and D of these items are defective. A
random sample of n items is selected from these N items without
replacement. If x denotes the number of defective items in the sample of size
n, then x follows a hypergeometric distribution, defined as follows:
$$p(x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \qquad x = 0, 1, 2, \ldots, \min(n, D) \qquad (3.8)$$
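A short sketch of evaluating hypergeometric probabilities with `math.comb` is given here. The function name `hypergeom_pmf` and the lot values N = 100, D = 5, n = 10 are assumptions chosen for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, D, n):
    """P(x defectives in a sample of n drawn without replacement
    from a lot of N items that contains D defectives)."""
    if x < max(0, n - (N - D)) or x > min(n, D):
        return 0.0
    return comb(D, x) * comb(N - D, n - x) / comb(N, n)

N, D, n = 100, 5, 10   # assumed illustrative lot
p0 = hypergeom_pmf(0, N, D, n)
p_at_most_1 = p0 + hypergeom_pmf(1, N, D, n)
total = sum(hypergeom_pmf(x, N, D, n) for x in range(n + 1))

print(round(p0, 4), round(p_at_most_1, 4), round(total, 6))
```

Summing the probabilities over all possible x is a quick sanity check that the function is a proper distribution.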
Example page 76-77
3.2.2 The Binomial Distribution
Consider a process that consists of a sequence of n independent trials. When
the outcome of each trial is either “success” or “failure”, the trials are called
Bernoulli trials. If the probability of “success” on any trial, say p, is constant,
then the number of successes x in n Bernoulli trials has the binomial
distribution with parameters n and p, defined as follows:
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n$$
Extra Example 1: Suppose ten items will be tested from a lot. Each item can
pass the test with probability 0.90 and fail with probability 0.10. Calculate
the probability that
(a) exactly 3 items will fail,
(b) fewer than 3 items will fail,
(c) between 2 and 4 items (inclusive) will fail.
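One way to work Extra Example 1 numerically is sketched below. It treats a failed test as the "success" being counted, so n = 10 and p = 0.10; the helper name `binom_pmf` is hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p_fail = 10, 0.10   # count failures, so p is the failure probability

p_a = binom_pmf(3, n, p_fail)                                # exactly 3 fail
p_b = sum(binom_pmf(x, n, p_fail) for x in range(0, 3))      # 0, 1 or 2 fail
p_c = sum(binom_pmf(x, n, p_fail) for x in range(2, 5))      # 2, 3 or 4 fail

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Part (b) stops at 2 because "fewer than 3" excludes 3, while part (c) includes both endpoints.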
3.2.3 The Poisson Distribution
The Poisson distribution is widely used in statistical quality control and
improvement, frequently as the underlying probability model for count
data.
Extra Example 2: For a certain manufacturing industry, the number of
accidents averages 2 per week.
(a) Find the probability that at least 2 accidents will occur in a given week.
(b) Find the probability that no accident will occur in 2 weeks.
(c) What is the expected number of accidents in a given 28 days?
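Extra Example 2 can be sketched numerically as follows, assuming the standard Poisson probability function p(x) = e^(−λ) λ^x / x! and assuming that the rate scales in proportion to the length of the interval. The helper name `poisson_pmf` is hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

rate_per_week = 2.0

# (a) at least 2 accidents in one week = 1 - P(0) - P(1)
p_a = 1 - poisson_pmf(0, rate_per_week) - poisson_pmf(1, rate_per_week)

# (b) no accident in 2 weeks: the mean over 2 weeks is 2 * 2 = 4
p_b = poisson_pmf(0, 2 * rate_per_week)

# (c) expected number in 28 days = 4 weeks
expected_c = rate_per_week * (28 / 7)

print(round(p_a, 4), round(p_b, 4), expected_c)
```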
3.2.4 The Pascal Distribution (Negative Binomial Distribution)
The Pascal distribution, like the binomial distribution, has its basis in
Bernoulli trials. Consider a sequence of independent trials, each with
probability of success p, and let x denote the trial on which the rth success
occurs. Then x is a Pascal random variable with the following probability
distribution:
$$p(x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \qquad x = r, r+1, r+2, \ldots$$
• When r = 1 the Pascal distribution is known as the geometric
distribution
• The geometric distribution has many useful applications in SQC
Extra Example 3: Suppose 10% of the engines manufactured on a certain
assembly line are defective. If engines are randomly selected one at a time
and tested, find the probability that the third non-defective engine is found
on the fifth trial. Find the mean and variance of the trial number on which
the third non-defective engine is found.
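A possible numerical treatment of Extra Example 3 is sketched below. Here a "success" is a non-defective engine, so p = 0.90 and r = 3. The mean r/p and variance r(1 − p)/p² are the standard results for the Pascal distribution; the helper name `pascal_pmf` is hypothetical.

```python
from math import comb

def pascal_pmf(x, r, p):
    """P(the r-th success occurs on trial x)."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

p_good, r = 0.90, 3

prob_fifth_trial = pascal_pmf(5, r, p_good)   # third good engine on trial 5
mean = r / p_good
variance = r * (1 - p_good) / p_good ** 2

print(round(prob_fifth_trial, 5), round(mean, 4), round(variance, 4))
```

Setting r = 1 in the same function gives the geometric distribution.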
3.3 Some Important Continuous Distributions
3.3.2 The Normal Distribution
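The normal distribution is the most useful distribution in both the theory and
application of statistics. If x is a normal random variable with mean µ and
variance σ², then the probability distribution of x is defined as follows:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty$$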
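As a rough sketch, normal probabilities can be computed with `statistics.NormalDist` from the Python standard library. The values µ = 40 and σ = 2 and the cut-off 35 are assumptions chosen for illustration. The code also shows that standardizing with z = (x − µ)/σ and using the standard normal distribution gives the same probability.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density evaluated directly from its formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 40.0, 2.0          # assumed illustrative parameters
dist = NormalDist(mu, sigma)

# P(x <= 35), computed directly and via the standard normal
p_direct = dist.cdf(35.0)
z = (35.0 - mu) / sigma        # z = -2.5
p_standard = NormalDist().cdf(z)

print(round(p_direct, 4), round(p_standard, 4))
print(normal_pdf(41.0, mu, sigma), dist.pdf(41.0))
```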
Standard Normal Distribution
Example 3.7, page 83
Example 3.8, page 84
Example 3.9, page 85
Linear Combinations of Normal Distribution
That means y is normally distributed with mean $\mu_y$ and variance $\sigma_y^2$,
or in short, $y \sim N(\mu_y, \sigma_y^2)$.
Central Limit Theorem (CLT)
Practical interpretation: the sum (or average) of a large number of independent
random variables is approximately normally distributed, regardless of the
distribution of each individual random variable in the sum.
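The practical interpretation can be checked with a small simulation. This sketch assumes sums of 12 independent Uniform(0, 1) variables, a choice made here for convenience since such a sum has mean 6 and variance 1. It measures how often the standardized sum falls within ±1.96, which should be close to the normal value of about 0.95. The function name, seed and sample sizes are hypothetical.

```python
import random
from statistics import NormalDist

def coverage_of_standardized_sums(n_terms=12, n_rep=20000, seed=1):
    """Fraction of standardized sums of uniforms that fall within +/- 1.96."""
    rng = random.Random(seed)
    mu = n_terms * 0.5                 # mean of the sum
    sigma = (n_terms / 12.0) ** 0.5    # each uniform term has variance 1/12
    hits = 0
    for _ in range(n_rep):
        total = sum(rng.random() for _ in range(n_terms))
        if abs((total - mu) / sigma) <= 1.96:
            hits += 1
    return hits / n_rep

frac = coverage_of_standardized_sums()
target = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)
print(f"simulated {frac:.3f} vs normal {target:.3f}")
```

Individual uniform variables are flat rather than bell shaped, yet their sum already behaves much like a normal variable.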
3.3.3 The Exponential Distribution
Exercise 3.29, page 101.
The cumulative distribution function (cdf) of the exponential distribution is
F(a) = P(x ≤ a) = 1 − e^(−λa), a ≥ 0
This cdf is very useful for solving problems involving the exponential
distribution.
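A minimal sketch of using the cdf follows. The rate λ = 0.001 per hour and the time limits are hypothetical values, and `expon_cdf` is a hypothetical helper name.

```python
import math

def expon_cdf(a, lam):
    """F(a) = P(x <= a) for an exponential random variable with rate lam."""
    return 1.0 - math.exp(-lam * a) if a > 0 else 0.0

lam = 0.001   # assumed failure rate per hour, so the mean life is 1/lam = 1000 hours

p_fail_by_1000 = expon_cdf(1000, lam)                        # P(x <= 1000)
p_survive_1000 = 1 - p_fail_by_1000                          # P(x > 1000)
p_between = expon_cdf(2000, lam) - expon_cdf(1000, lam)      # P(1000 < x <= 2000)

print(round(p_fail_by_1000, 4), round(p_survive_1000, 4), round(p_between, 4))
```

Because the distribution is continuous, interval probabilities are simple differences of the cdf at the two endpoints.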
3.3.4 The Gamma Distribution
Result: If $x_1, x_2, \ldots, x_r$ are independent exponential random variables
with parameter λ, then $y = x_1 + x_2 + \cdots + x_r$ is distributed as gamma
with parameters λ and r.
Example 3.11, page 91.
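For integer r, the gamma cdf has a closed form that is easy to evaluate. The sketch below assumes the standard identity P(y ≤ a) = 1 − Σ_{k=0}^{r−1} e^(−λa)(λa)^k / k!, which holds when r is a positive integer; the function name `erlang_cdf` and the numeric inputs are hypothetical.

```python
import math

def erlang_cdf(a, lam, r):
    """P(y <= a) where y is the sum of r independent exponential(lam) variables."""
    if a <= 0:
        return 0.0
    tail = sum(math.exp(-lam * a) * (lam * a) ** k / math.factorial(k)
               for k in range(r))
    return 1.0 - tail

# With r = 1 the gamma distribution reduces to the exponential distribution.
print(erlang_cdf(1.0, 1.0, 1), 1 - math.exp(-1.0))

# Sum of r = 2 exponentials with lam = 1, evaluated at a = 1.
print(round(erlang_cdf(1.0, 1.0, 2), 4))
```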
3.4 Probability Plots
• Determining if a sample of data might reasonably be assumed to come
from a specific distribution
• Probability plots are available for various distributions
• Easy to construct with computer software (MINITAB)
• Subjective interpretation
3.4.1 Normal Probability Plots
3.4.2 Other Probability Plots (page 95)
3.5 Some Useful Approximations
3.5.1 The Binomial Approximation to the Hypergeometric
Consider the hypergeometric distribution in equation (3.8). If n/N ≤ 0.10, then
the binomial distribution with parameters p = D/N and n is a good
approximation to the hypergeometric distribution. The approximation is
better for small n/N, which is also called the sampling fraction.
See example on page 96
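The effect of the sampling fraction can be seen in a small comparison. This sketch assumes two hypothetical lots with the same fraction defective p = D/N = 0.05 and the same sample size n = 10, one with n/N = 0.01 and one with n/N = 0.10, and reports the largest absolute difference between the hypergeometric and binomial probabilities. All function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(x, N, D, n):
    """Hypergeometric probability of x defectives in the sample."""
    if x < max(0, n - (N - D)) or x > min(n, D):
        return 0.0
    return comb(D, x) * comb(N - D, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def max_abs_diff(N, D, n):
    """Largest gap between the two distributions over x = 0..n."""
    p = D / N
    return max(abs(hypergeom_pmf(x, N, D, n) - binom_pmf(x, n, p))
               for x in range(n + 1))

small_fraction = max_abs_diff(1000, 50, 10)   # n/N = 0.01
large_fraction = max_abs_diff(100, 5, 10)     # n/N = 0.10
print(round(small_fraction, 4), round(large_fraction, 4))
```

The gap shrinks as n/N gets smaller, in line with the statement that the approximation is better for a small sampling fraction.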
3.5.2 The Poisson Approximation to the Binomial
When n is large and p is small (p < 0.1), the Poisson probability distribution
provides a good approximation to binomial probabilities with λ=np.
Extra Example 4: When the circuit boards used in the manufacture of
compact disc players are tested, the percentage of defectives is found to be
5%. Let X denote the number of defective boards in a random sample of size
100. Then X has a binomial distribution. What is the probability that none of
the 100 boards is defective?
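Extra Example 4 can be worked both exactly and with the Poisson approximation, as sketched below with n = 100, p = 0.05 and λ = np = 5. The variable names are hypothetical.

```python
import math

n, p = 100, 0.05

# Exact binomial probability of zero defectives: (1 - p)^n
exact = (1 - p) ** n

# Poisson approximation with lam = n * p: P(X = 0) = e^(-lam)
lam = n * p
approx = math.exp(-lam)

print(round(exact, 4), round(approx, 4))
```

Both values are below 1%, and the approximation differs from the exact value by less than 0.001.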
3.5.3 The Normal Approximation to the Binomial distribution
If x is distributed as binomial with parameters n and p, then the binomial
probability distribution can be approximated by a normal curve with
µ = np and σ = √(npq), where n = number of trials, p = probability of
success and q = 1 − p. The binomial probability P(a ≤ x ≤ b) can be approximated
by the normal probability P[(a − 0.5) ≤ x ≤ (b + 0.5)] as long as n is large and the
interval np ± 2√(npq) falls between 0 and n. The half-unit adjustment is called
the correction for continuity. That means
P(a ≤ x ≤ b) ≅ P[(a − 0.5) ≤ x ≤ (b + 0.5)]
Extra Example 5: Suppose that 25% of the fire alarms in a large city are false
alarms. Let x denote the number of false alarms in a random sample of 100
alarms. Find the approximate probability that
(a) there will be at least 30 false alarms.
(b) there will be no more than 35 false alarms.
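A sketch of Extra Example 5 with the continuity correction is given below, using `statistics.NormalDist` for the normal curve and an exact binomial sum for comparison. Here n = 100 and p = 0.25, so µ = 25 and σ = √18.75. The helper name `binom_cdf` is hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.25
q = 1 - p
mu, sigma = n * p, sqrt(n * p * q)
normal = NormalDist(mu, sigma)

# Validity check: np +/- 2*sqrt(npq) should lie between 0 and n.
interval_ok = 0 <= mu - 2 * sigma and mu + 2 * sigma <= n

def binom_cdf(k):
    """Exact P(x <= k) for the binomial(n, p) distribution above."""
    return sum(comb(n, i) * p ** i * q ** (n - i) for i in range(k + 1))

# (a) P(x >= 30) is approximated by the normal area to the right of 29.5
p_a_approx = 1 - normal.cdf(29.5)
p_a_exact = 1 - binom_cdf(29)

# (b) P(x <= 35) is approximated by the normal area to the left of 35.5
p_b_approx = normal.cdf(35.5)
p_b_exact = binom_cdf(35)

print(interval_ok)
print(round(p_a_approx, 4), round(p_a_exact, 4))
print(round(p_b_approx, 4), round(p_b_exact, 4))
```

The half-unit shift moves each limit to the edge of the bar for 30 or 35 in the binomial histogram, which is why 29.5 and 35.5 are used.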