MAT 135
Introductory Statistics and Data Analysis
Adjunct Instructor
Kenneth R. Martin
Lecture 7
October 12, 2016
Confidential - Kenneth R. Martin
Agenda
• Housekeeping
– Readings
– Exam #1 review
• Chapter 1, 14, 10, 2, & 3
Housekeeping
• Read, Chapter 1.1 – 1.4
• Read, Chapter 14.1 – 14.2
• Read, Chapter 10.1
• Read, Chapter 2
• Read, Chapter 3
Housekeeping
• Exam #1 Review
Statistics – Application to Research
Statistics
• Why collect samples ?
Population and Sample
[Figure: a Sampling Scheme draws a SAMPLE from the POPULATION; the sample is measured to produce data, and data from the SAMPLE are used to make conclusions about the POPULATION.]
– Often impractical to collect all the data from the entire population (e.g. U.S. census).
– Some test methods are destructive – we wouldn’t have any products or services left to ship to a customer!
– Too expensive to sample the entire population.
– Don’t have to collect 100% of the population! We can use inferential statistics to make sound conclusions about the population.
Statistics
Describing the Data
• Two methods to summarize the data:
– Graphical - Histogram
– Analytical - Central Tendency
Statistics
Central Tendency
• A statistical measure that describes the central value around which the data is distributed. It includes the Mean, Median, and Mode.
– However, Central Tendency does not tell us about data Variation / spread.
Statistics
Relationship of Central Tendency
*** Normal distribution: Mean = Median = Mode
Statistics
Frequency Distributions
Statistics
Various curves (Different data spreads, common means)
Statistics
Various curves (Different means, common data spreads)
Statistics
Various Normal Curves
Statistics
Measures of Variability - how the data is spread from its central value
• The central tendency does not indicate any level of variability (dispersion) from the mean.
A = {100, 200, 300, 400, 500}
B = {50, 150, 300, 450, 550}
C = {250, 300, 300, 300, 350}
• The mean and median are the same (300) for all three data sets, but the variability of the data differs between them.
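A minimal Python sketch, using only the standard `statistics` module, to check that sets A, B, and C share the same mean and median. The variable names (`data_sets`, `centers`) are illustrative choices, not part of the lecture.

```python
from statistics import mean, median

# The three data sets from the slide.
data_sets = {
    "A": [100, 200, 300, 400, 500],
    "B": [50, 150, 300, 450, 550],
    "C": [250, 300, 300, 300, 350],
}

# Map each set name to its (mean, median) pair.
centers = {name: (mean(values), median(values)) for name, values in data_sets.items()}

for name, (m, med) in centers.items():
    print(f"{name}: mean = {m}, median = {med}")
```

All three sets report a mean and median of 300, so central tendency alone cannot tell them apart.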
Statistics
Measures of Variability:
– Can be values from 0 to ∞ (infinity)
• 0 means no variability of data
• A large value indicates lots of variability of data
– Values can never be negative
– As soon as one value in a data set differs from another, variability exists
Statistics
Measures of Variability (Dispersion) - Range
Range (R) = Max. value – Min. value = $X_H - X_L$
As data set size increases, the accuracy of using range decreases.
Limit the usage of Range to ~ 10 readings.
Statistics
Measures of Variability – Range Example
A = {100, 200, 300, 400, 500}
B = {50, 150, 300, 450, 550}
C = {250, 300, 300, 300, 350}
$R_A$ = ?
$R_B$ = ?
$R_C$ = ?
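One way to check the three ranges is a short Python sketch that applies R = Max. value – Min. value to each set. The helper name `data_range` is a hypothetical choice for illustration.

```python
def data_range(values):
    """Range R = X_H - X_L: largest value minus smallest value."""
    return max(values) - min(values)

A = [100, 200, 300, 400, 500]
B = [50, 150, 300, 450, 550]
C = [250, 300, 300, 300, 350]

ranges = {"A": data_range(A), "B": data_range(B), "C": data_range(C)}
print(ranges)  # {'A': 400, 'B': 500, 'C': 100}
```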
Statistics
Measures of Variability
• So what is the limitation of all three of these Range
calculations ?
Statistics
Measures of Variability (Dispersion) - Variance
• Variance: a measure of variability, defined as the average squared distance that data points deviate from their mean.
• Variance calculations include all data points.
Statistics
Measures of Variability (Dispersion) - Variance
• Sum of Squares (SS): the sum (addition) of the squared deviations of values from their mean. The SS is the numerator of the variance formula.
Variance, σ², for the Population. μ is the population average: $\sigma^2 = \frac{SS}{N} = \frac{\sum (X - \mu)^2}{N}$
Variance, S², for a Sample. M is the sample average: $S^2 = \frac{SS}{n - 1} = \frac{\sum (X - M)^2}{n - 1}$
Statistics
Variance - Example
A = {100, 200, 300, 400, 500}
• In this case, notice that the SS of
both the population and the sample
will be the same
• Remember: PREMDAS
• What is σ² ?
• What is S² ?
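The example can be worked through as a short Python sketch: compute the Sum of Squares once, then divide by N for the population variance and by n − 1 for the sample variance. Variable names are illustrative only.

```python
A = [100, 200, 300, 400, 500]

mean_a = sum(A) / len(A)                # 300.0 (same value serves as mu or M here)
ss = sum((x - mean_a) ** 2 for x in A)  # Sum of Squares: squared deviations added up

pop_var = ss / len(A)                   # population variance: divide by N
sample_var = ss / (len(A) - 1)          # sample variance: divide by n - 1

print(ss, pop_var, sample_var)          # 100000.0 20000.0 25000.0
```

The SS is identical in both cases; only the denominator changes.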
Statistics
Measures of Variability (Dispersion) - Variance
• What is a big limitation with Variance ?
– What do you notice about the units of the mean, and the
units of Variance ?
Statistics
Measures of Dispersion – Standard Deviation
• Also called the Root Mean Square deviation, it is a measure of the spread (variability) of the data: roughly the average distance that data points deviate from their mean.
• Calculated by taking the square root of the Variance
Statistics
Measures of Dispersion – Standard Deviation
• When the data comes from the “population”, we shall use “σ” (sigma) to denote the Standard Deviation.
– The mean value will be represented by the Greek symbol μ (mu)
– The denominator does not have “uncertainty”, thus N
• When the data comes from a “sample”, we shall use “SD” to denote the Standard Deviation.
– The mean value will be represented by M or X̄ (X-bar)
– The denominator shows “uncertainty”, thus n-1
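Written out, the two cases can be sketched as follows. This assumes the usual textbook forms, taking the square root of the variance with N or n − 1 in the denominator as described above.

```latex
\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}
\qquad\qquad
SD = \sqrt{\frac{\sum (X - M)^2}{n - 1}}
```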
Statistics
Measures of Dispersion – Standard Deviation
• We typically want the standard deviation (variance) value to be as small as possible.
– We typically want to minimize variability !
• Standard deviation is generally a better measure than range for describing the data distribution precisely, because it uses every data point.
• Other formulas exist for Standard Deviation, but will not be covered.
Statistics
Standard Deviation - Example
A = {100, 200, 300, 400, 500}
• What do we notice about the units
of Standard Deviation and the units
of the mean ?
• The Mean and Standard
Deviation are typically reported
together.
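A hedged sketch of this example using Python's standard `statistics` module, where `pstdev` divides by N (population) and `stdev` divides by n − 1 (sample). Whether set A is treated as a population or a sample is left open here, so both are shown.

```python
from statistics import mean, pstdev, stdev

A = [100, 200, 300, 400, 500]

mean_a = mean(A)     # 300
sigma_a = pstdev(A)  # population standard deviation: sqrt(SS / N)
sd_a = stdev(A)      # sample standard deviation: sqrt(SS / (n - 1))

# Reported together, in the same units as the data.
print(f"mean = {mean_a}, sigma = {sigma_a:.2f}, SD = {sd_a:.2f}")
```

Unlike the variance, both results are in the same units as the data and the mean.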
Statistics
Standard Deviation - Example
B = {50, 150, 300, 450, 550}
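As a rough check, the sample standard deviation of B can be computed step by step and compared with sets A and C from the earlier slides. The helper name `sample_sd` is hypothetical, and treating each set as a sample (n − 1 denominator) is an assumption made for this sketch.

```python
import math

def sample_sd(values):
    """Sample standard deviation: sqrt(SS / (n - 1))."""
    m = sum(values) / len(values)
    ss = sum((x - m) ** 2 for x in values)  # Sum of Squares
    return math.sqrt(ss / (len(values) - 1))

A = [100, 200, 300, 400, 500]
B = [50, 150, 300, 450, 550]
C = [250, 300, 300, 300, 350]

sd = {"A": sample_sd(A), "B": sample_sd(B), "C": sample_sd(C)}
for name, value in sd.items():
    print(f"SD of {name} = {value:.2f}")
```

All three sets have a mean of 300, yet B has the largest standard deviation and C the smallest.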
Statistics
Measures of Dispersion – Coefficient of Variation
• CVar – Allows a comparison of standard deviations when the units of measure are not the same
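A small sketch of the idea, assuming the common definition CVar = (standard deviation / mean) × 100%, which is not spelled out in the text above. The rescaled data set is a hypothetical unit change (set A multiplied by 1000), used only to show that the coefficient does not depend on the unit of measure.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean (assumed definition)."""
    return stdev(values) / mean(values) * 100

A = [100, 200, 300, 400, 500]
A_rescaled = [x * 1000 for x in A]  # hypothetical: same measurements in a unit 1000x smaller

cv_a = coefficient_of_variation(A)
cv_rescaled = coefficient_of_variation(A_rescaled)

print(f"CVar(A) = {cv_a:.1f}%, CVar(A rescaled) = {cv_rescaled:.1f}%")
```

The standard deviations differ by a factor of 1000, but the two coefficients agree, which is what makes the comparison across units possible.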
Statistics
Coefficient of Variation - Example
Statistics
Box and Whisker Plot – Boxplot
• Simple graphical tool to summarize data.
• Need to determine 5 values (five-number summary) from data, to generate a boxplot:
1. Median (2nd Quartile)
2. Maximum data value
3. Minimum data value
4. 1st Quartile (1/4 of observations fall below it) [whisker end]
5. 3rd Quartile (3/4 of observations fall below it) [whisker end]
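The five values can be sketched in Python with the standard `statistics.quantiles` function. Quartile conventions vary between textbooks and software, so the default ("exclusive") method used here is an assumption and may not match hand calculations exactly. The data set is A from the earlier slides, and the function name `five_number_summary` is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return min, Q1, median (Q2), Q3, and max for a data set."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points that split the data into quarters
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

summary = five_number_summary([100, 200, 300, 400, 500])
print(summary)
```

For this data set the sketch gives Q1 = 150, median = 300, and Q3 = 450, between a minimum of 100 and a maximum of 500.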
Statistics
Box and Whisker Plot – Boxplot Example
• Process aim = 9.0 minutes
• Spec = ± 1.5 minutes
• n = 125
• R = 1.7
Statistics
Box and Whisker Plot - Boxplot Example
• Inside box is the median value, and approximately 50% of observations
• Whiskers extend from the box to extreme values
• Example:
1. Median; n=125: Median = 63rd value = 9.8
2. Max = 10.7
3. Min = 9.0
4. 1st Quartile = X(125 * 0.25) ~ average of the 31st & 32nd values = 9.6
5. 3rd Quartile = X(125 * 0.75) ~ average of the 94th & 95th values = 10.0
Statistics
Box and Whisker Plot - Boxplot Example
[Boxplot figure: Min = 9.0, Q1 = 9.6, Q2 (median) = 9.8, Q3 = 10.0, Max = 10.7]
• Long Whiskers denote the existence of values much larger than other values.
• For this example, mean median.
• Other variants exist, e.g. whiskers that end within 1.5*IQR of the box; all points beyond them are “outliers”, depicted as asterisks
– IQR = Interquartile Range
Statistics
Box and Whisker Plot - Boxplot Example
Statistics
Measures of Variability (Dispersion) - IQR
• IQR – Interquartile Range
IQR = Q3 – Q1
Statistics
Box and Whisker Plot - Boxplot Example
[Boxplot figure: Min = 9.0, Q1 = 9.6, Q2 (median) = 9.8, Q3 = 10.0, Max = 10.7]
• For this example, IQR = ?
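A short sketch for this example, applying IQR = Q3 – Q1 to the quartiles read from the boxplot. The fence calculation assumes the common convention of measuring 1.5*IQR outward from Q1 and Q3; results are rounded to two decimals to avoid floating-point noise.

```python
q1, q3 = 9.6, 10.0            # quartiles from the boxplot example
data_min, data_max = 9.0, 10.7

iqr = round(q3 - q1, 2)       # Interquartile Range

# Assumed convention: fences sit 1.5 * IQR beyond the box edges.
lower_fence = round(q1 - 1.5 * iqr, 2)
upper_fence = round(q3 + 1.5 * iqr, 2)

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("max beyond upper fence:", data_max > upper_fence)
print("min beyond lower fence:", data_min < lower_fence)
```

Under that assumption, the maximum of 10.7 falls just beyond the upper fence of 10.6 and would be drawn as an outlier in the 1.5*IQR variant, while the minimum of 9.0 sits exactly on the lower fence.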