CHEE320
Module 1: Graphical Methods for Analyzing
Data, and Descriptive Statistics
CHEE320 - Fall 2001
J. McLellan
Graphical Methods for Analyzing Data
What is the pattern of variability?
Techniques
• histograms
• dot plots
• stem and leaf plots
• box plots
• quantile plots
Histogram
• summary of frequency with which certain ranges of
values occur
• ranges - “bins”
• choosing bin size - influences ability to recognize
pattern
» too large - data clustered in a few bins - no indication of
spread of data
» too small - data distributed with a few points in each bin - no indication of concentration of data
» there are quantitative rules for choosing the number of
bins - typically automated in statistical software
• not automated in Excel!
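As a hedged illustration of the "quantitative rules" mentioned above, the sketch below computes a bin count with Sturges' rule and a bin width with the Freedman-Diaconis rule. The slides do not say which rule any particular package uses; these are simply two common choices, and the function names are illustrative. The data are the tooth discoloration values used later in these notes.

```python
import math
import statistics

def sturges_bin_count(n):
    """Sturges' rule: number of bins = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_width(data):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

tooth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

print("Sturges bin count:", sturges_bin_count(len(tooth)))
print("Freedman-Diaconis bin width:", round(freedman_diaconis_width(tooth), 2))
```

Either number is only a starting point; it is still worth trying a few neighbouring bin sizes and checking that the pattern is stable.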
Histogram - Important Features
Features to look for:
• symmetry?
• number of peaks
• max, min data values - range of values
• tails? - extreme data points
• spread in the data
• centre of gravity
[Figure: histogram of LCO90 (lco90.STA, 1v*768c); horizontal axis: LCO90 in bins of width 10, from <=630 to >730; vertical axis: number of observations, 0 to 300]
Dot Plots
• similar to histogram
» plot data by value on horizontal axis
» stack repeated values vertically
» look for similar shape features as for histogram
» e.g., data set for solder thickness
{0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1}
[Figure: dot plot of the solder thickness data; horizontal axis from 0.06 to 0.13]
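A minimal text-mode sketch of the dot plot construction described above, using the solder thickness data. The function name and the fixed two-decimal grid are assumptions made for illustration.

```python
from collections import Counter

def dot_plot_rows(data, decimals=2):
    """Return text rows of a dot plot: one column per grid value,
    with repeated values stacked vertically."""
    scale = 10 ** decimals
    counts = Counter(round(x * scale) for x in data)
    ticks = range(min(counts), max(counts) + 1)   # include empty grid values
    width = decimals + 4                          # room for a label plus spacing
    rows = []
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[t] >= level else " ").center(width)
                      for t in ticks)
        rows.append(row.rstrip())
    labels = "".join(f"{t / scale:.{decimals}f}".center(width) for t in ticks)
    rows.append(labels.rstrip())
    return rows

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
rows = dot_plot_rows(solder)
print("\n".join(rows))
```

The tallest stack marks the most frequent value, and gaps in the grid (here 0.08) remain visible, which a plain frequency table would hide.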
Stem and Leaf Plots
• illustrate variability pattern using the numerical data itself
• choose base division - “stem”
• build “leaves” by taking digit next to base division
Data (Tooth Discoloration by Fluoride):
12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
11, 15, 22, 21, 27, 25, 18, 21, 18, 20

Decimal point is 1 place to the right of the colon

Stems : Leaves
1 : 0124      (10-14)
1 : 58888     (15-19)
2 : 001112    (20-24)
2 : 557       (25-29)
3 :
3 : 6
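The construction above can be sketched in a few lines, assuming whole-number data with the tens digit as the stem and the units digit as the leaf. The function name and the `split` option (two rows per stem, as in the display above) are illustrative choices. When run on the full data set, the value 44 produces an additional `4 : 4` row.

```python
def stem_and_leaf(data, split=True):
    """Stem = tens digit, leaf = units digit. With split=True each stem
    gets two rows: leaves 0-4 first, then leaves 5-9."""
    leaves = {}
    for value in sorted(data):
        stem, leaf = divmod(int(value), 10)
        half = 1 if (split and leaf >= 5) else 0
        leaves.setdefault((stem, half), []).append(str(leaf))
    first, last = min(leaves), max(leaves)
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in halves:
            if first <= (stem, half) <= last:   # keep empty rows in between
                row = "".join(leaves.get((stem, half), []))
                lines.append(f"{stem} : {row}".rstrip())
    return lines

tooth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
lines = stem_and_leaf(tooth)
print("\n".join(lines))
```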
Stem and Leaf Plots
Solder example
» numbers viewed as 0.070, 0.080, 0.090, 0.100, 0.110,…
» decision - what is the stem?
• considerations similar to histogram - size of bins
Decimal point is 2 places to the left of the colon
 7 : 0
 8 :
 9 : 00
10 : 000
11 : 0
12 : 0
13 : 0
Box Plots
• graphical representation of “quartile” information and extreme data values
» quartiles - describe how data occurs - ordering
» 1st quartile - separates bottom 25% of data
» 2nd quartile (median) - separates bottom 50% of data
» 3rd quartile - separates bottom 75% of data
» add “whiskers” - extend from box to the most extreme data points within
• upper quartile + 1.5 * interquartile range
• lower quartile - 1.5 * interquartile range
» interquartile range (IQR) = Q3 - Q1
» plot outliers - data points outside Q3 + 1.5*IQR, Q1 - 1.5*IQR
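A minimal sketch of the quantities a box plot displays, using Python's standard `statistics.quantiles`. With `method="exclusive"` it interpolates at position k(N+1)/4 in the ordered data, which is the quartile convention given later in these notes; other software may use a different convention. The function name and the returned dictionary keys are illustrative.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # most extreme points inside the fences
        "whisker_high": max(inside),
        "outliers": outliers,
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
summary = box_plot_summary(solder)
for name, value in summary.items():
    print(name, value)
```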
Box Plot - for solder data
[Figure: box plot of solder thickness (jsolder.STA, 10v*10c); vertical axis: THICKNES, 0.065 to 0.135; legend: non-outlier max and min (whiskers), 25%-75% (box), median]
Interpretation
• no outliers
• relatively symmetric distribution
• longer tails on both sides
• fairly tightly clustered about centre
Box Plot - for teeth discoloration
[Figure: box plot of tooth discoloration (teethdisc.STA, 10v*20c); vertical axis: DISCOLOR, 8 to 32; legend: non-outlier max and min (whiskers), 25%-75% (box), median]
Interpretation
• no outliers
• asymmetric distribution - long lower tail
• some tails on both sides
• fairly tightly clustered at higher range of discoloration
Quantile Plots
• plot cumulative progression of data
» values vs. cumulative fraction of data
» comparison to standard distribution shapes
• e.g., normal distribution, lognormal distribution, …
» can be plotted on special axes to provide visual test for closeness to given distribution
• analogous to semi-log graphs
• e.g., test to see if data are normally distributed
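The sketch below builds the coordinates of a normal quantile plot: each ordered observation is paired with the standard normal quantile of its cumulative fraction. The plotting position (i - 0.5)/n is an assumption, since several conventions are in use, and the function name is illustrative. The data are the tooth discoloration values from the stem and leaf example.

```python
from statistics import NormalDist

def normal_quantile_pairs(data):
    """Pair each ordered observation with the standard normal quantile of
    its cumulative fraction, taken here as (i - 0.5) / n (an assumption)."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.5) / n), y)
            for i, y in enumerate(ordered, start=1)]

tooth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
pairs = normal_quantile_pairs(tooth)
for z, y in pairs:
    print(f"{z:6.2f}  {y}")
```

If the data were normally distributed, the observed values plotted against these theoretical quantiles would fall close to a straight line.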
Quantile Plot - teeth discoloration
[Figure: quantile-quantile plot of DISCOLOR (teethdisc.STA, 10v*20c); distribution: Normal; fitted line y = 20.798 + 7.916*x + eps; horizontal axis: theoretical quantile, -2.5 to 2.5, with a cumulative probability scale from .01 to .99; vertical axis: observed value, 5 to 50]
Interpretation
• data don’t follow linear progression
– underlying distribution not normal?
Note the irregular spacing - similar to “semi-log” paper - cumulative points should follow linear progression on this scale if distribution is normal.
Graphical Methods for Quality Investigations
• primary purpose - help organize information in quality
investigation
Examples
• Pareto Charts
• Fishbone diagrams - Ishikawa diagrams
Pareto Chart
• used to rank factors
• typically presented as a bar chart, listing in descending
order of significance
• significance can be determined by
» number count - e.g., of defects attributed to specific
causes
» by size of effect - e.g., based on coefficients in
regression model
Example - Circuit Defects
Number of defects attributed to each cause (data from Montgomery):
Cause                  Number of defects
Stamping_Oper_ID        1
Stamping_Missing        1
Sold._Short             1
Wire_Incorrect          1
Raw_Cd_Damaged          1
Comp._Extra_Part        2
Comp._Missing           2
Comp._Damaged           2
TST_Mark_White_Mark     3
Tst._Mark_EC_Mark       3
Raw_CD_Shroud_Re.       3
Sold._Splatter          5
Comp._Improper_1        6
Sold._Opens             7
Sold._Cold_Joint       20
Sold._Insufficient     40
Pareto Chart
• for circuit defect data
[Figure: Pareto chart of number of defects (NO_DEFCT) by cause, bars in descending order: Sold._Insufficient (40), Sold._Cold_Joint (20), Sold._Opens (7), Comp._Improper_1 (6), Sold._Splatter (5), TST_Mark_White_Mark (3), Tst._Mark_EC_Mark (3), Raw_CD_Shroud_Re. (3), Comp._Damaged (2), Comp._Extra_Part (2), Comp._Missing (2), Sold._Short (1), Wire_Incorrect (1), Raw_Cd_Damaged (1), Stamping_Oper_ID (1), Stamping_Missing (1); left axis: count, 0 to 100; right axis: cumulative percent, 0% to 100%]
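A minimal sketch of the ranking behind a Pareto chart, using the circuit defect counts from the example. It sorts causes in descending order of count and accumulates the percentage of all defects; the function name is illustrative.

```python
defects = {
    "Stamping_Oper_ID": 1, "Stamping_Missing": 1, "Sold._Short": 1,
    "Wire_Incorrect": 1, "Raw_Cd_Damaged": 1, "Comp._Extra_Part": 2,
    "Comp._Missing": 2, "Comp._Damaged": 2, "TST_Mark_White_Mark": 3,
    "Tst._Mark_EC_Mark": 3, "Raw_CD_Shroud_Re.": 3, "Sold._Splatter": 5,
    "Comp._Improper_1": 6, "Sold._Opens": 7, "Sold._Cold_Joint": 20,
    "Sold._Insufficient": 40,
}

def pareto_table(counts):
    """Rank causes by count (descending) and attach cumulative percentages."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    table, running = [], 0
    for cause, count in ranked:
        running += count
        table.append((cause, count, 100.0 * running / total))
    return table

table = pareto_table(defects)
for cause, count, cumulative in table:
    print(f"{cause:<22}{count:>4}{cumulative:>8.1f}%")
```

The cumulative column shows how quickly a few causes account for most of the defects, which is the point of the chart.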
Fishbone Diagrams
• organize causes in analysis
» have spine, with cause types branching from spine, and
sub-groups branching further
Example - factors influencing poor conversion in reactive extrusion
[Figure: fishbone diagram with “poor conversion” at the head of the spine; branch labels: catalyst used - metallocene/Ziegler-Natta; half-life; initiator type; polymer grade; barrel temperature; temperature control; temperature distribution along barrel]
Graphical Methods for Analyzing Data
Looking for time trends in data...
• Time sequence plot
– look for
» jumps - indicate shift in mean operation
» ramps to new values - indicate shift in mean operation
» meandering - indicates time correlation in data
» large amount of variation about general trend - indication of large variance
Time Sequence Plot
- for naphtha 90% point - indicates amount of heavy
hydrocarbons present in gasoline range material
[Figure: time sequence plot of naphtha 90% point; vertical axis: 90% point (degrees F), 390 to 480; horizontal axis: 0 to 270 in steps of 30; annotations: “excursion - sudden shift in operation” and “meandering about average operating point - time correlation in data”]
Graphical Methods for Analyzing Data
Monitoring process operation
• Quality Control Charts
– time sequence plots with added indications of variation
» account for fluctuations in values associated with natural
process noise
» look for significant jumps - shifts - that exceed normal
range of variation of values
» if significant shift occurs, stop and look for “assignable
causes”
» essentially graphical “hypothesis tests”
» can plot - measurements, sample averages, ranges,
standard deviations, ...
Example - Monitoring Process Mean
• is the average process operation constant?
• collect samples at time intervals, compute average,
and plot in time sequence plot
• indication of process variation - standard deviation
estimated from prior data
» propagates through sample average calculation
» if “s” is sample standard deviation and “n” is the number of observations in each sample, calculated averages will lie within $\pm 3 s / \sqrt{n}$ of the historical average about 99.7% of the time if the mean operation has NOT shifted
» values outside this range suggest that a shift in the mean
operation has occurred - alarm - “something has
happened”
Example - Monitoring Process Mean
• time sequence plot with these alarm limits is referred
to as a “Shewhart X-bar Chart”
» X-bar ($\bar{X}$) - sample mean of X
[Figure: Shewhart X-bar chart; legend: Mean: 74.0012, Proc. sigma: .009785, n: 5; upper control limit 74.0143, centre line 74.0012, lower control limit 73.9880; horizontal axis: Samples, 1 to 25; annotations: “upper and lower control limits”, “centre-line or target line - indicates mean when process is operating properly”, “no points exceed limits - in a state of statistical control”]
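As a hedged sketch of how the alarm limits are obtained, the code below computes centre ± 3 s/√n and flags sample averages that fall outside. The centre line, process sigma and sample size are the values shown in the chart legend (74.0012, 0.009785, n = 5); the list of sample means is hypothetical and only illustrates the check. Function names are illustrative.

```python
import math

def xbar_limits(centre, sigma, n, width=3.0):
    """Lower and upper control limits: centre -/+ width * sigma / sqrt(n)."""
    half_width = width * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width

def out_of_control(sample_means, lcl, ucl):
    """Return (sample number, mean) pairs that fall outside the limits."""
    return [(i, m) for i, m in enumerate(sample_means, start=1)
            if m < lcl or m > ucl]

lcl, ucl = xbar_limits(centre=74.0012, sigma=0.009785, n=5)
print(f"LCL = {lcl:.4f}, UCL = {ucl:.4f}")

# Hypothetical sample averages, NOT taken from the course data
hypothetical_means = [74.003, 73.995, 74.010, 74.021, 74.000]
flagged = out_of_control(hypothetical_means, lcl, ucl)
print("points outside limits:", flagged)
```

The computed limits agree with those printed on the chart to within rounding.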
Example - Monitoring Process Mean
• X-bar chart
[Figure: Shewhart X-bar chart; legend: Mean: 74.0022, Proc. sigma: .011832, n: 5; upper control limit 74.0181, centre line 74.0022, lower control limit 73.9864; horizontal axis: Samples, 1 to 25; annotation: “Point exceeds region of natural variation - significant shift has occurred”]
Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
» also referred to as “casement plots”
Scatterplots
… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
• quadratic?
» degree of scatter - does spread increase/decrease over
range?
• indication that variance isn’t constant over range of data
Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride
[Figure: scatterplot of DISCOLOR vs. FLUORIDE (teeth, 4v*20c); DISCOLOR axis: 5 to 50; annotation: “trend - possibly nonlinear?”]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: scatterplot of DISCOLOR vs. BRUSHING (teeth, 4v*20c); vertical axis: DISCOLOR, 5 to 50; horizontal axis: BRUSHING, 4 to 13; annotation: “significant trend? - doesn’t appear to be present”]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: scatterplot of DISCOLOR vs. BRUSHING (teeth, 4v*20c); vertical axis: DISCOLOR, 5 to 50; horizontal axis: BRUSHING, 4 to 13; annotation: “Variance appears to decrease as # of brushings increases”]
Scatterplot matrices
… are a table of scatterplots for a set of variables
Look for -
» systematic trend between “independent” variables and
dependent variables - to be described by estimated
model
» systematic trend between supposedly independent
variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically
with statistical software, manually using Excel
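A numerical companion to the scatterplot matrix is the table of pairwise correlation coefficients. The sketch below computes Pearson correlations for every pair of variables. The data are hypothetical (they are NOT the tooth discoloration data), and the variable names `x1`, `x2`, `y` are placeholders; `x2` is deliberately constructed to be nearly proportional to `x1`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for a in names for b in names}

# Hypothetical values for illustration only
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],   # nearly proportional to x1
    "y":  [5.0, 3.0, 6.0, 2.0, 4.0],
}
corr = correlation_matrix(columns)
for (a, b), r in corr.items():
    print(f"{a:>3} vs {b:<3} r = {r: .3f}")
```

A value near ±1 between two supposedly independent variables is the numerical counterpart of the systematic trend one would see in the corresponding panel of the matrix. A correlation coefficient only measures linear association, so it does not replace looking at the plots.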
Scatterplot Matrices - tooth data
[Figure: matrix plot (teeth, 4v*20c) of the variables FLUORIDE, AGE, BRUSHING and DISCOLOR]
Describing Data Quantitatively
Approach - describe the pattern of variability using a
few parameters
» efficient means of summarizing
Techniques
• average - (sample “mean”)
• sample standard deviation and variance
• median
• quartiles
• interquartile range
• ...
Sample Mean - “Average”
Given “n” observations $x_i$:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Notes -
» sensitive to extreme data values - outliers - value can be
artificially raised or lowered
Sample Variance
• based on the sum of squared deviations about the average
» squaring - notion of distance (squared)
» average - is the centre of gravity
• sample variance provides a measure of dispersion - spread - about the centre of gravity
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Note - there is an alternative form of this equation which is more convenient for computation.
Note that we divide by “n-1”, and NOT “n” - degrees of freedom argument
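The definitions of the sample mean and sample variance can be checked numerically on the solder data. The slide does not state the alternative form; the sketch assumes it is the usual shortcut $s^2 = \left(\sum x_i^2 - n\bar{x}^2\right)/(n-1)$. Function names are illustrative.

```python
def sample_mean(x):
    """Average of the observations."""
    return sum(x) / len(x)

def sample_variance(x):
    """Definition: squared deviations about the mean, divided by n - 1."""
    xbar = sample_mean(x)
    return sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)

def sample_variance_shortcut(x):
    """Assumed alternative form: (sum of x_i^2 - n * xbar^2) / (n - 1)."""
    n = len(x)
    return (sum(xi ** 2 for xi in x) - n * sample_mean(x) ** 2) / (n - 1)

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
print("mean     :", round(sample_mean(solder), 6))
print("variance :", round(sample_variance(solder), 6))
print("shortcut :", round(sample_variance_shortcut(solder), 6))
```

The shortcut needs only running sums, which made it convenient for hand calculation, but it subtracts two nearly equal numbers and can lose precision in floating-point arithmetic when the mean is large relative to the spread.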
Sample Standard Deviation
… is simply
\[ s = \sqrt{s^2} \]
• sample standard deviation provides a more direct link
to dispersion
» e.g., for Normal distribution
• about 95% of values lie within 2 standard devn’s of the mean
• about 99.7% of values lie within 3 standard devn’s of the mean
Range
• provides a measure of spread in the data
• defined as
maximum data value - minimum data value
• can be sensitive to extreme data points
• is often monitored in quality control charts to see if
process variance is changing
“Order” Statistics
… summarize the progression of observations in the
data set
Quartiles
» divide the data in quarters
Deciles
» divide the data in tenths ...
Quartiles
• order data - N data points $\{y_i\}$, i=1,…,N
• if N is odd,
» median is observation $y_{(N+1)/2}$
• if N is even,
» median is $\dfrac{y_{N/2} + y_{N/2+1}}{2}$
• i.e., midpoint between two middle points
Quartiles - Q1 and Q3
• Q1: Compute (N+1)/4 = A.B, where A is the integer part and 0.B the fractional part
$Q_1 = y_A + 0.B \, (y_{A+1} - y_A)$
• Q3: Compute 3(N+1)/4 = A.B
$Q_3 = y_A + 0.B \, (y_{A+1} - y_A)$
» i.e., interpolate between adjacent points
» Note - there are other conventions as well - e.g., for Q1,
take bottom half of data set, and take midpoint between
middle two points if there are an even number of points...
Quartiles - Example
• solder data set
» observations: 0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1
» ordered: 0.07, 0.09, 0.09, 0.1, 0.1, 0.1, 0.11, 0.12, 0.13
» 9 points --> median is 5th observation: 0.1
» Q1: (N+1)/4 = 2.5
• Q1 = 0.09 + 0.5*(0.09-0.09) = 0.09
» Q3: 3(N+1)/4 = 7.5
• Q3 = 0.11 + 0.5*(0.12-0.11) = 0.115
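The interpolation rule can be sketched directly: compute the 1-based position k(N+1)/4, split it into its integer and fractional parts, and interpolate between adjacent ordered points. The function names are illustrative, and clamping to the first or last observation when the position falls outside the data is an added assumption for very small samples.

```python
import math

def value_at_position(ordered, position):
    """Interpolate at a 1-based fractional position A.B in ordered data."""
    a = math.floor(position)
    frac = position - a
    if a < 1:                      # assumption: clamp below the first point
        return ordered[0]
    if a >= len(ordered):          # assumption: clamp above the last point
        return ordered[-1]
    return ordered[a - 1] + frac * (ordered[a] - ordered[a - 1])

def quartiles(data):
    """Q1, median and Q3 at positions (N+1)/4, 2(N+1)/4 and 3(N+1)/4."""
    y = sorted(data)
    n = len(y)
    return tuple(value_at_position(y, k * (n + 1) / 4) for k in (1, 2, 3))

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
q1, median, q3 = quartiles(solder)
print("Q1 =", round(q1, 4), " median =", round(median, 4), " Q3 =", round(q3, 4))
```

For even N the middle position falls halfway between the two central points, so the same rule reproduces the median definition given earlier.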
Robustness
… refers to whether a given descriptive statistic is
sensitive to extreme data points
Examples
• sample mean
» is sensitive to extreme points - extreme value pulls
average toward the extreme
• sample variance
» sensitive to extreme points - large deviation from the
sample mean leads to inflated variance
• median, quartiles
» relatively insensitive to extreme data points
Robustness - Solder Data Example
• replace 0.13 by 0.5 - output from Excel
thickness             With 0.13    With 0.5
Mean                  0.101111     0.142222
Median                0.1          0.1
Mode                  0.1          0.1
Standard Deviation    0.017638     0.134887
Sample Variance       0.000311     0.018194
Range                 0.06         0.43
Minimum               0.07         0.07
Maximum               0.13         0.5
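The comparison above can be reproduced with Python's standard `statistics` module instead of Excel; this is a sketch, and the function name and dictionary keys are illustrative.

```python
import statistics

def describe(data):
    """A few of the descriptive statistics from the comparison above."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),        # divides by n - 1
        "variance": statistics.variance(data),  # divides by n - 1
        "range": max(data) - min(data),
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
contaminated = [0.5 if x == 0.13 else x for x in solder]

before, after = describe(solder), describe(contaminated)
for name in before:
    print(f"{name:<10}{before[name]:>10.6f}{after[name]:>10.6f}")
```

One extreme point moves the mean by about 40% and inflates the standard deviation more than sevenfold, while the median does not change.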
Robustness
• Other robust statistics
» “m-estimator” - involves iterative filtering out of extreme
data values, based on data distribution
» trimmed mean - other bases for eliminating extreme data
point effect
» median absolute deviation
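Two of these statistics are simple enough to sketch. The code below computes a trimmed mean that discards the k smallest and k largest observations (k = 1 is an arbitrary choice here) and the median absolute deviation about the median. Function names are illustrative; no m-estimator is attempted. The data are the solder values with 0.13 replaced by 0.5.

```python
import statistics

def trimmed_mean(data, k=1):
    """Mean after discarding the k smallest and k largest observations."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away all of the data")
    return statistics.mean(ordered[k:len(ordered) - k])

def median_absolute_deviation(data):
    """Median of the absolute deviations from the median."""
    centre = statistics.median(data)
    return statistics.median(abs(x - centre) for x in data)

contaminated = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.5, 0.1]
print("mean         :", round(statistics.mean(contaminated), 6))
print("trimmed mean :", round(trimmed_mean(contaminated), 6))
print("MAD          :", round(median_absolute_deviation(contaminated), 6))
```

Both statistics stay close to the values for the uncontaminated data, unlike the ordinary mean and standard deviation.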