CHEE320 Module 1: Graphical Methods for Analyzing Data, and Descriptive Statistics
CHEE320 - Fall 2001
J. McLellan

Graphical Methods for Analyzing Data
What is the pattern of variability?
Techniques
• histograms
• dot plots
• stem and leaf plots
• box plots
• quantile plots

Histogram
• summary of frequency with which certain ranges of values occur
• ranges - “bins”
• choosing bin size - influences ability to recognize pattern
  » too large - data clustered in a few bins - no indication of spread of data
  » too small - data distributed with a few points in each bin - no indication of concentration of data
  » there are quantitative rules for choosing the number of bins - typically automated in statistical software
    • not automated in Excel!

Histogram - Important Features
[Figure: Histogram (lco90.STA 1v*768c); horizontal axis: LCO90, bins of width 10 from <=630 to >730; vertical axis: No of obs, 0 to 300]
• symmetry?
• number of peaks
• max, min data values - range of values
• tails? - extreme data points
• spread in the data
• centre of gravity

Dot Plots
• similar to histogram
  » plot data by value on horizontal axis
  » stack repeated values vertically
  » look for similar shape features as for histogram
  » e.g., data set for solder thickness {0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1}
[Figure: dot plot of solder thickness; horizontal axis: 0.06 to 0.13]
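The dot plot and the effect of bin size can be sketched in a few lines of Python, using the solder thickness data from the slide above. This is only an illustrative sketch: the function names `dot_plot` and `bin_counts` are hypothetical, the bins are taken as half-open intervals [lo, lo + width) for simplicity, and values are converted to integer thousandths so that floating-point round-off cannot push a point across a bin edge.

```python
from collections import Counter

# Solder thickness data from the dot plot slide.
thickness = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]


def dot_plot(values):
    """Return text rows 'value | dots', one row per distinct value."""
    counts = Counter(values)
    return [f"{v:.2f} | {'*' * counts[v]}" for v in sorted(counts)]


def bin_counts(values, start, width, n_bins, scale=1000):
    """Count values in n_bins half-open bins [lo, lo + width).

    Values are converted to integer multiples of 1/scale first, so that
    round-off cannot move a point across a bin edge.
    """
    lo = round(start * scale)
    w = round(width * scale)
    counts = [0] * n_bins
    for v in values:
        k = (round(v * scale) - lo) // w
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


for row in dot_plot(thickness):
    print(row)

# Narrow bins (width 0.01) versus wide bins (width 0.05).
narrow = bin_counts(thickness, start=0.07, width=0.01, n_bins=7)
wide = bin_counts(thickness, start=0.05, width=0.05, n_bins=2)
print("width 0.01:", narrow)
print("width 0.05:", wide)
```

With the wide bins all nine points fall into just two bins, so the spread of the data is hidden, which is the "too large" case described on the Histogram slide.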
Stem and Leaf Plots
• illustrate variability pattern using the numerical data itself
• choose base division - “stem”
• build “leaves” by taking digit next to base division

Tooth Discoloration by Fluoride
Data: 12.00 10.00 14.00 20.00 18.00 18.00 25.00 21.00 36.00 44.00 11.00 15.00 22.00 21.00 27.00 25.00 18.00 21.00 18.00 20.00

Decimal point is 1 place to the right of the colon
Range   Stems : Leaves
10-14       1 : 0124
15-19       1 : 58888
20-24       2 : 001112
25-29       2 : 557
            3 :
            3 : 6

Stem and Leaf Plots
Solder example
  » numbers viewed as 0.070, 0.080, 0.090, 0.100, 0.110,…
  » decision - what is the stem?
    • considerations similar to histogram - size of bins

Decimal point is 2 places to the left of the colon
 7 : 0
 8 :
 9 : 00
10 : 000
11 : 0
12 : 0
13 : 0

Box Plots
• graphical representation of “quartile” information and extreme data values
  » quartiles - describe how data occurs - ordering
  » 1st quartile - separates bottom 25% of data
  » 2nd quartile (median) - separates bottom 50% of data
  » 3rd quartile - separates bottom 75% of data
  » add “whiskers” - extend from box to the most extreme data points that still lie within
    • upper quartile + 1.5 * interquartile range (largest such point)
    • lower quartile - 1.5 * interquartile range (smallest such point)
  » interquartile range = Q3 - Q1
  » plot outliers - data points outside Q3 + 1.5*IQR, Q1 - 1.5*IQR

Box Plot - for solder data
[Figure: Box Plot (jsolder.STA 10v*10c); vertical axis: THICKNES, 0.065 to 0.135; legend: Non-Outlier Max, Non-Outlier Min, 75%, 25%, Median]
Interpretation
• no outliers
• relatively symmetric distribution
• longer tails on both sides
• fairly tightly clustered about centre
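The box plot quantities for the solder data can be checked numerically. The sketch below is an illustration under stated assumptions: it uses Python's `statistics.quantiles` with `method="exclusive"`, which interpolates at positions based on (N+1) and so should agree with the (N+1)/4 convention used later in these notes; other packages may use other quartile conventions. The function name `box_plot_summary` is hypothetical.

```python
import statistics

# Solder thickness data used for the box plot.
thickness = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]


def box_plot_summary(values):
    """Quartiles, 1.5*IQR fences, whisker ends and outliers for a box plot."""
    data = sorted(values)
    # 'exclusive' places the i-th quartile at position i*(N+1)/4.
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond the fences is plotted individually.
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


summary = box_plot_summary(thickness)
for name, value in summary.items():
    print(name, value)
```

For this data set the fences fall outside the minimum and maximum, so the outlier list is empty, consistent with the "no outliers" interpretation of the solder box plot.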
Box Plot - for teeth discoloration
[Figure: Box Plot (teethdisc.STA 10v*20c); vertical axis: DISCOLOR, 8 to 32; legend: Non-Outlier Max, Non-Outlier Min, 75%, 25%, Median]
Interpretation
• no outliers
• asymmetric distribution - long lower tail
• some tails on both sides
• fairly tightly clustered at higher range of discoloration

Quantile Plots
• plot cumulative progression of data
  » values vs. cumulative fraction of data
  » comparison to standard distribution shapes
    • e.g., normal distribution, lognormal distribution, …
  » can be plotted on special axes
    • analogous to semi-log graphs - to provide visual test for closeness to given distribution
    • e.g., test to see if data are normally distributed

Quantile Plot - teeth discoloration
[Figure: Quantile-Quantile Plot of DISCOLOR (teethdisc.STA 10v*20c); Distribution: Normal; fitted line y = 20.798 + 7.916*x + eps; horizontal axis: Theoretical Quantile, -2.5 to 2.5, with cumulative probability marks .01, .05, .1, .25, .5, .75, .9, .95, .99; vertical axis: Observed Value, 5 to 50]
Interpretation
• data don’t follow linear progression
  – underlying distribution not normal?
Note the irregular spacing of the cumulative probability marks - similar to “semi-log” paper - cumulative points should follow a linear progression on this scale if the distribution is normal.

Graphical Methods for Quality Investigations
• primary purpose - help organize information in quality investigation
Examples
• Pareto Charts
• Fishbone diagrams - Ishikawa diagrams

Pareto Chart
• used to rank factors
• typically presented as a bar chart, listing factors in descending order of significance
• significance can be determined by
  » number count - e.g., of defects attributed to specific causes
  » size of effect - e.g., based on coefficients in regression model
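The ranking behind a Pareto chart can be sketched as follows, using the circuit defect counts tabulated in the example that follows. This is a minimal illustration, not the method of any particular statistics package; the function name `pareto_table` is hypothetical, and the cumulative percentage column is the usual companion to the bars.

```python
# Defect counts by attributed cause (circuit defects example).
defects = {
    "Stamping_Oper_ID": 1,
    "Stamping_Missing": 1,
    "Sold._Short": 1,
    "Wire_Incorrect": 1,
    "Raw_Cd_Damaged": 1,
    "Comp._Extra_Part": 2,
    "Comp._Missing": 2,
    "Comp._Damaged": 2,
    "TST_Mark_White_Mark": 3,
    "Tst._Mark_EC_Mark": 3,
    "Raw_CD_Shroud_Re.": 3,
    "Sold._Splatter": 5,
    "Comp._Improper_1": 6,
    "Sold._Opens": 7,
    "Sold._Cold_Joint": 20,
    "Sold._Insufficient": 40,
}


def pareto_table(counts):
    """Return (cause, count, cumulative %) rows in descending order of count."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for cause, count in ranked:
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows


rows = pareto_table(defects)
for cause, count, cum_pct in rows:
    print(f"{cause:<22}{count:>4}{cum_pct:>8.1f}%")
```

Read from the top, the table shows how few causes account for most of the defects: here the two largest solder categories together make up 60 of the 98 recorded defects.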
Example - Circuit Defects
Data from Montgomery

Number of Defects Attributed to:
Stamping_Oper_ID         1
Stamping_Missing         1
Sold._Short              1
Wire_Incorrect           1
Raw_Cd_Damaged           1
Comp._Extra_Part         2
Comp._Missing            2
Comp._Damaged            2
TST_Mark_White_Mark      3
Tst._Mark_EC_Mark        3
Raw_CD_Shroud_Re.        3
Sold._Splatter           5
Comp._Improper_1         6
Sold._Opens              7
Sold._Cold_Joint        20
Sold._Insufficient      40

Pareto Chart
• for circuit defect data
[Figure: Pareto Chart & Analysis; NO_DEFCT; bars in descending order from Sold._Insufficient (40), Sold._Cold_Joint (20), Sold._Opens (7), Comp._Improper_1 (6), Sold._Splatter (5) down to the single-defect causes; left axis: count, 0 to 100; right axis: cumulative percent, 0% to 100%]

Fishbone Diagrams
• organize causes in analysis
  » have spine, with cause types branching from spine, and sub-groups branching further
Example - factors influencing poor conversion in reactive extrusion
[Figure: fishbone diagram; effect: poor conversion; branch labels: catalyst used - metallocene/Ziegler-Natta, half-life, initiator type, polymer grade, barrel temperature, temperature control, temperature distribution along barrel]

Graphical Methods for Analyzing Data
Looking for time trends in data...
• Time sequence plot – look for
  » jumps, ramps to new values - indicate shift in mean operation
  » meandering - indicates time correlation in data
  » large amount of variation about general trend - indication of large variance

Time Sequence Plot - for naphtha 90% point
• naphtha 90% point - indicates amount of heavy hydrocarbons present in gasoline range material
[Figure: Time Sequence Plot - Naphtha 90% Point; vertical axis: 90% point (degrees F), 390 to 480; horizontal axis: 0 to 270]
• excursion - sudden shift in operation
• meandering about average operating point - time correlation in data

Graphical Methods for Analyzing Data
Monitoring process operation
• Quality Control Charts – time sequence plots with added indications of variation
  » account for fluctuations in values associated with natural process noise
  » look for significant jumps - shifts - that exceed normal range of variation of values
  » if significant shift occurs, stop and look for “assignable causes”
  » essentially graphical “hypothesis tests”
  » can plot - measurements, sample averages, ranges, standard deviations, ...

Example - Monitoring Process Mean
• is the average process operation constant?
• collect samples at time intervals, compute average, and plot in time sequence plot
• indication of process variation - standard deviation estimated from prior data
  » propagates through sample average calculation
  » if “s” is the sample standard deviation and “n” is the number of observations in each sample, calculated averages will lie within $\pm 3s/\sqrt{n}$ of the historical average more than 99% of the time (about 99.7% for normally distributed averages) if the mean operation has NOT shifted
  » values outside this range suggest that a shift in the mean operation has occurred - alarm - “something has happened”

Example - Monitoring Process Mean
• time sequence plot with these alarm limits is referred to as a “Shewhart X-bar Chart”
  » X-bar ($\bar{X}$) - sample mean of X
[Figure: X-bar chart; Mean: 74.0012; Proc. sigma: approximately .0098; n: 5; upper control limit 74.0143, centre line 74.0012, lower control limit 73.9880; horizontal axis: Samples, 1 to 25]
• upper and lower control limits
• centre-line or target line - indicates mean when process is operating properly
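The alarm limits for an X-bar chart follow directly from the $\pm 3s/\sqrt{n}$ rule. The sketch below is illustrative only: the centre line 74.0012 and n = 5 are read from the first X-bar chart, the process sigma of 0.0098 is an approximate reading of that chart's header, and the list of sample averages is entirely hypothetical (the original measurements are not given in these notes). The function names are also hypothetical.

```python
import math


def xbar_limits(centre, sigma, n, width=3.0):
    """Lower and upper control limits: centre -/+ width * sigma / sqrt(n)."""
    half_width = width * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width


def points_outside(averages, lcl, ucl):
    """Return 1-based sample numbers whose average lies outside the limits."""
    return [i for i, avg in enumerate(averages, start=1)
            if avg < lcl or avg > ucl]


# Values read (approximately) from the first X-bar chart.
lcl, ucl = xbar_limits(centre=74.0012, sigma=0.0098, n=5)
print(f"LCL = {lcl:.4f}, UCL = {ucl:.4f}")

# Hypothetical sample averages, made up for illustration only.
hypothetical_averages = [74.003, 73.995, 74.010, 74.020, 74.001]
flagged = points_outside(hypothetical_averages, lcl, ucl)
print("samples outside limits:", flagged)
```

The computed limits land within rounding of the 73.9880 and 74.0143 shown on the chart, which supports reading the garbled formula as $3s/\sqrt{n}$. A flagged sample is an alarm to look for an assignable cause, not proof that one exists.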
• for the chart with mean 74.0012: no points exceed limits - process is in a state of statistical control

Example - Monitoring Process Mean
• X-bar chart
[Figure: X-bar chart; Mean: 74.0022; Proc. sigma: approximately .0118; n: 5; upper control limit 74.0181, centre line 74.0022, lower control limit 73.9864; horizontal axis: Samples, 1 to 25]
• point exceeds region of natural variation - significant shift has occurred

Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
  » also referred to as “casement plots”

Scatterplots
… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
  » nature of trend
    • linear?
    • exponential?
    • quadratic?
  » degree of scatter - does spread increase/decrease over range?
    • indication that variance isn’t constant over range of data

Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: FLUORIDE, 0.0 to 4.5; vertical axis: DISCOLOR, 5 to 50]
• trend - possibly nonlinear?

Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: BRUSHING, 4 to 13; vertical axis: DISCOLOR, 5 to 50]
• significant trend? - doesn’t appear to be present

Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: BRUSHING, 4 to 13; vertical axis: DISCOLOR, 5 to 50]
• variance appears to decrease as # of brushings increases
Scatterplot matrices
… are a table of scatterplots for a set of variables
Look for
  » systematic trend between “independent” variable and dependent variables - to be described by estimated model
  » systematic trend between supposedly independent variables - indicates that these quantities are correlated
    • correlation can negatively influence model estimation results
    • not independent information
• scatterplot matrices can be generated automatically with statistical software, manually using Excel

Scatterplot Matrices - tooth data
[Figure: Matrix Plot (teeth 4v*20c); variables: FLUORIDE, AGE, BRUSHING, DISCOLOR]

Describing Data Quantitatively
Approach - describe the pattern of variability using a few parameters
  » efficient means of summarizing
Techniques
• average - (sample “mean”)
• sample standard deviation and variance
• median
• quartiles
• interquartile range
• ...

Sample Mean - “Average”
Given “n” observations $x_i$:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Notes
  » sensitive to extreme data values - outliers - value can be artificially raised or lowered

Sample Variance
• sum of squared deviations about the average
  » squaring - notion of distance (squared)
  » average - is the centre of gravity
• sample variance provides a measure of dispersion - spread - about the centre of gravity
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Note - there is an alternative form of this equation which is more convenient for computation.
Note that we divide by “n-1”, and NOT “n” - degrees of freedom argument

Sample Standard Deviation
… is simply $s = \sqrt{s^2}$
• sample standard deviation provides a more direct link to dispersion
  » e.g., for Normal distribution
    • about 95% of values lie within 2 standard devn’s of the mean
    • more than 99% (about 99.7%) of values lie within 3 standard devn’s of the mean
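The sample mean, variance and standard deviation can be computed directly from their definitions. The sketch below uses the solder thickness data from earlier in the module. The second variance function is an assumption about which "alternative form" the slide has in mind: it uses the common identity $s^2 = \left(\sum x_i^2 - n\bar{x}^2\right)/(n-1)$, which is algebraically equivalent to the definition. The function names are hypothetical.

```python
import math

# Solder thickness data from earlier in the module.
thickness = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]


def sample_mean(values):
    """Average: sum of the observations divided by n."""
    return sum(values) / len(values)


def sample_variance(values):
    """Definition form: squared deviations about the mean, over n - 1."""
    n = len(values)
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Assumed alternative form: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sample_mean(values)
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


mean = sample_mean(thickness)
s2 = sample_variance(thickness)
s2_alt = sample_variance_shortcut(thickness)
s = math.sqrt(s2)

print(f"mean     = {mean:.6f}")
print(f"variance = {s2:.6f} (shortcut form: {s2_alt:.6f})")
print(f"std dev  = {s:.6f}")
```

The shortcut form needs only running sums of $x_i$ and $x_i^2$, which is convenient by hand, but it subtracts two nearly equal numbers and can lose precision when the spread is small relative to the mean. The definition form is the safer choice in software.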
Range
• provides a measure of spread in the data
• defined as maximum data value - minimum data value
• can be sensitive to extreme data points
• is often monitored in quality control charts to see if process variance is changing

“Order” Statistics
… summarize the progression of observations in the data set
Quartiles
  » divide the data in quarters
Deciles
  » divide the data in tenths
...

Quartiles
• order data - N data points $\{y_i\}$, i = 1,…,N
• if N is odd,
  » median is observation $y_{(N+1)/2}$
• if N is even,
  » median is $\frac{y_{N/2} + y_{N/2+1}}{2}$
    • i.e., midpoint between two middle points

Quartiles - Q1 and Q3
• Q1: Compute (N+1)/4 = A.B, where A is the integer part and B is the fractional part (e.g., 2.5 gives A = 2, B = 0.5)
  $Q_1 = y_A + B \, (y_{A+1} - y_A)$
• Q3: Compute 3(N+1)/4 = A.B
  $Q_3 = y_A + B \, (y_{A+1} - y_A)$
  » i.e., interpolate between adjacent points
  » Note - there are other conventions as well - e.g., for Q1, take bottom half of data set, and take midpoint between middle two points if there are an even number of points...

Quartiles - Example
• solder data set
  » observations: 0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1
  » ordered: 0.07, 0.09, 0.09, 0.1, 0.1, 0.1, 0.11, 0.12, 0.13
  » 9 points --> median is 5th observation: 0.1
  » Q1: (N+1)/4 = 2.5
    • Q1 = 0.09 + 0.5*(0.09 - 0.09) = 0.09
  » Q3: 3(N+1)/4 = 7.5
    • Q3 = 0.11 + 0.5*(0.12 - 0.11) = 0.115

Robustness
… refers to whether a given descriptive statistic is sensitive to extreme data points
Examples
• sample mean
  » is sensitive to extreme points - extreme value pulls average toward the extreme
• sample variance
  » sensitive to extreme points - large deviation from the sample mean leads to inflated variance
• median, quartiles
  » relatively insensitive to extreme data points
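The (N+1)/4 interpolation convention for quartiles, worked by hand in the solder example above, can be sketched as a small function. This is an illustration of that one convention only (the notes point out that others exist). The helper names are hypothetical, and the sketch assumes enough data points that both positions fall inside the ordered data. The tooth discoloration data from the stem and leaf slide is included to exercise the even-N median rule.

```python
import math


def interpolate_at(sorted_data, position):
    """Value at a 1-based, possibly fractional position A.B in ordered data."""
    a = math.floor(position)   # integer part A
    b = position - a           # fractional part B
    n = len(sorted_data)
    if a < 1 or a > n or (b > 0 and a == n):
        raise ValueError("position falls outside the data")
    y_a = sorted_data[a - 1]
    if b == 0:
        return y_a
    # y_A + B * (y_{A+1} - y_A)
    return y_a + b * (sorted_data[a] - y_a)


def median(values):
    """Middle point for odd N, midpoint of the two middle points for even N."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using positions (N+1)/4 and 3(N+1)/4."""
    data = sorted(values)
    n = len(data)
    q1 = interpolate_at(data, (n + 1) / 4)
    q3 = interpolate_at(data, 3 * (n + 1) / 4)
    return q1, median(data), q3


solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
teeth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

print("solder:", quartiles(solder))
print("teeth: ", quartiles(teeth))
```

For the solder data this reproduces Q1 = 0.09, median = 0.1 and Q3 = 0.115 up to floating-point round-off; the interquartile range then follows as Q3 - Q1.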
Robustness - Solder Data Example
• replace 0.13 by 0.5 - output from Excel

thickness             With 0.13    With 0.5
Mean                  0.101111     0.142222
Median                0.1          0.1
Mode                  0.1          0.1
Standard Deviation    0.017638     0.134887
Sample Variance       0.000311     0.018194
Range                 0.06         0.43
Minimum               0.07         0.07
Maximum               0.13         0.5

Robustness
• Other robust statistics
  » “m-estimator” - involves iterative filtering out of extreme data values, based on data distribution
  » trimmed mean - other bases for eliminating extreme data point effect
  » median absolute deviation
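The robustness comparison above can be reproduced, and extended to two of the robust statistics just listed, with Python's standard `statistics` module. The sketch makes some assumptions that are not from the slides: the trimmed mean here drops the k smallest and k largest points (k = 1), the median absolute deviation is taken as the median of the absolute deviations from the median with no scaling constant, and the m-estimator is not attempted. The function names are hypothetical.

```python
import statistics

original = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
# Replace the largest value, 0.13, by the extreme value 0.5.
modified = [0.5 if x == 0.13 else x for x in original]


def trimmed_mean(values, k=1):
    """Mean after discarding the k smallest and k largest observations."""
    data = sorted(values)
    if 2 * k >= len(data):
        raise ValueError("k is too large for this data set")
    return statistics.mean(data[k:len(data) - k])


def median_absolute_deviation(values):
    """Median of absolute deviations from the median (unscaled)."""
    centre = statistics.median(values)
    return statistics.median([abs(x - centre) for x in values])


def summary(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std dev": statistics.stdev(values),       # n - 1 denominator
        "variance": statistics.variance(values),   # n - 1 denominator
        "range": max(values) - min(values),
        "trimmed mean": trimmed_mean(values),
        "MAD": median_absolute_deviation(values),
    }


with_013 = summary(original)
with_05 = summary(modified)
for name in with_013:
    print(f"{name:<14}{with_013[name]:>12.6f}{with_05[name]:>12.6f}")
```

The mean, standard deviation, variance and range all move when 0.13 becomes 0.5, while the median, the trimmed mean and the median absolute deviation are unchanged for this data set, which is the sense in which they are robust.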