Module Two: Graphical and numerical exploration of univariate variables

In this module, we will demonstrate how to use Minitab to produce graphical and descriptive summaries of data. In particular, we shall focus on:
• How to construct each tool and when to apply it.
• For most situations, we will skip the ‘WHY’, the theory behind, except for a few essential concepts.
• The tools introduced here are general. They can be applied to inter-laboratory data analysis as well as to other data.

In general, the process of an inter-laboratory study can be described as follows:
[Diagram: Lab I and Lab J, each made up of an Operator, the System Measured, and Instruments, all operating within an External Environment.]

When planning an inter-laboratory study, one should:
1. Clearly identify the purpose of the study and list the needed instruments.
2. Make the uncontrollable environment as uniform as possible in order to eliminate unexplainable errors and uncertainties.
3. Make the experimental units as homogeneous as possible in order to reduce the random errors.
4. Make sure the testing process is under statistical control.
5. Collect variables that directly answer the purpose of the study.
6. Collect potentially confounding factors and covariates that may have an impact on the response variables.
7. Plan the study within a reasonable time period and at a reasonable cost.

Once the experiment is conducted and data are collected, a typical procedure for data analysis includes:
1. Data screening to take care of trivial mistakes and identify possible causes that may lead to these mistakes.
2. Graphical summary and descriptive summary to check for nontrivial outliers and identify causes that may lead to these outliers. If there is no clear reason for the unusual data values, they should not be deleted immediately.
3. Check the validity of statistical assumptions, such as:
   • whether the data follow a normal curve,
   • whether the variances are approximately constant among different factor levels.
4.
Conduct a thorough and appropriate data analysis. In many situations, the type of analysis has already been determined when the design of the experiment was determined.
5. Properly interpret the results based on both the quantitative evidence and the qualitative aspects of the study. Purely empirical data without a close connection to the study can be misleading. It is extremely important that the results and their interpretation follow logically from the study itself, and are not interpreted purely on the basis of numbers.
6. Write an appropriate report, depending on the audience. The report could be for a general audience or for technical experts.
7. In many situations, the results may lead to improvement or modification of the testing procedure, testing process and operator training, and/or lead to further studies.

In this module, we will analyze some inter-laboratory data and the data we collected in class using graphical and descriptive summaries.

When we collect data for a study, the very first thing we should be aware of is the type of data we collect. Different types of data should be summarized and analyzed differently.
• Qualitative (categorical) data: blood type, gender.
• Discrete data: number of defective parts from inspecting 10 parts; number of bacteria A in a ml of water.
• Continuous data: temperature, strength of a fiber. In most inter-laboratory testing, we most likely collect this type of response.

Graphical methods to show uncertainty and distribution of measurements
For qualitative or categorical data:
• Some useful tools are pie charts, bar charts, Pareto charts, stem & leaf plots.
For continuous data:
• Some useful tools are stem & leaf plots, histograms, box-plots, time series plots.
For demonstrating the relationship between two or more responses:
• Some useful tools are scatter plots, matrix plots.
For checking if a variable follows a normal curve:
• Some useful tools are Q-Q plots, normal probability plots.
Graphical tools are powerful for detecting outliers
For checking outliers (unusual cases) in a one-response situation:
• Some useful tools include box-plots, h-plots, k-plots.
For checking outliers (unusual cases) in a two-sample situation:
• Some useful tools include Youden’s plot, confidence region plots, side-by-side box-plots, scatter plots.
A detailed discussion will be given in a separate module on outlier detection.

Pie Chart, Bar Chart and Pareto Chart
A survey asked 400 individuals to rate the school quality in the USA. The data are summarized:

Rating   Frequency   Relative Frequency
A        35
B        260
C        93
D        12

Draw a pie chart, a bar chart and a Pareto chart.
Pareto chart: an ordered bar chart, starting with the highest bar and going down. It provides a quick check of which categories occur most often.

Hands-on exercise: identify types of data, use Minitab to construct pie, bar and Pareto charts.
A set of 20 students is randomly chosen from a university, and the following measurements are recorded in the table below.
1. Identify qualitative variables:
2. Identify discrete variables:
3. Identify continuous variables:
4. Construct a pie chart, bar chart and Pareto chart for the variable Year.

Student   GPA   Gender   Year   Major        Hour study/wk
1         2.0   M        3      Biology      3.6
2         3.2   F        1      Biology      7.4
3         2.5   F        4      Biology      4.8
4         2.8   M        3      Accounting   5.0
5         3.6   F        2      Accounting   6.5
6         3.1   M        3      Law          4.2
7         2.8   M        2      Law          3.8
8         2.4   M        2      Math         2.5
9         2.8   F        1      Math         5.2
10        2.6   M        3      Math         3.5
11        3.0   F        4      Math         6.8
12        3.2   M        2      Computer     9.3
13        3.7   M        1      Computer     7.2
14        2.7   F        3      Computer     5.3
15        2.9   M        2      Computer     4.2
16        2.5   M        4      Language     2.8
17        2.8   F        4      Language     3.8
18        3.2   F        1      Language     4.8
19        3.4   M        3      Engineer     4.4
20        3.1   F        3      Engineer     7.4

A Quick Guide to using Minitab
Minitab (version 13) has four windows:
• Worksheet window for data in column/row format.
• Session window for results, a text file which is editable.
• Graph window for graphs, which has a full screen editing capability.
• Project Manager window, which keeps all the history and program code.

Minitab has nine pull-down menus. Most of them are self-explanatory (File, Edit, Manip, Editor, Window, and Help). Three menus are related to data analysis:
• Calc menu: for data transformation and a variety of statistical distributions and random number generation.
• Stat menu: consists of a variety of statistical methods.
• Graph menu: for constructing a variety of graphs.

Create, manipulate, and retrieve data
Once Minitab is open, the Session and Worksheet windows are open by default.
• To create a data set, e.g., the 20 student records:
  - C1, C2, ... are for variables. The row underneath C1, C2 and above the first case is for the variable name (can have up to 512 characters).
  - Define your variable names and begin to enter the data just like in any spreadsheet. For the 20 student record data, you can define the variables to be ID in C1, GPA in C2, Gender in C3, Year in C4, Major in C5, and Hourstudy in C6.
  Note: C5 is a non-numeric variable. C5 changes to C5-T once you enter the data.
• Save the worksheet:
  - Go to the File menu. You can save your file as a Minitab Worksheet, a Minitab Project, or other types of data. A Minitab Worksheet saves only the data file and has the extension .MTW. A Minitab Project saves everything you have created, including any results or graphs from the analysis. It has the extension .MPJ.
• Editing data points:
  - Simply move the cursor to the data cell and modify the data as needed. In addition, the Edit menu and Editor menu can be used for editing your data.
• If you need to compute new variables or transform an existing variable:
  - Under the Calc menu, you can use Calculator or Column/Row Statistics to manipulate the data.
• Retrieving existing data:
  - If the existing data is in Minitab format, go to File, Open Project or Open Worksheet, find your data file, and double-click to open it. You are ready to work.
  - If your data is in Excel or text format (.txt or .dat), go to the File menu, click on Open Worksheet, click on ‘Files of type’ in the dialog, choose the correct file format, find the file, and open it. You are ready to work.

The basic structure of handling data, outputs and graphs in Minitab
The default windows are the Session window and the Worksheet window. The Graph window will appear whenever a graph is created. A typical procedure is:
1. Create a new data set or open an existing data set.
2. Conduct data screening, data manipulations and transformations as needed.
3. Conduct a descriptive summary and a graphical summary:
   1. Go to the Stat menu, choose Basic Statistics, then Display Descriptive Statistics, and fill in the dialog box. The results will be shown in the Session window.
   2. Go to the Graph menu, choose an appropriate graphical tool, and fill in the dialog box. The result will appear in the Graph window.
4. Conduct a specific data analysis procedure:
   1. Go to the Stat menu, choose the appropriate statistical procedure, and fill in the dialog box. For example:
      • Basic Statistics is for descriptive summaries, one-sample and two-sample tests, and normality tests.
      • Regression is for model building and model selection.
      • ANOVA is for balanced and generalized analysis of variance, and post-hoc tests.
      • DOE is for generating a variety of designs for experiments.
      • Control Charts offers a variety of control charts, and Quality Tools is for capability analysis, Gage R&R analysis, and so on.
      • Tables is for cross-tabulation and frequency summaries.
      • And there are many others.

How to edit a graph?
1. To edit a graph, click on the graph twice; two editing palettes will appear. The Tool palette gives you drawing tools, and the Attribute palette gives you tools for modifying the graph. There are other tools and functions in the Editor menu for editing graphs.
2.
To add any text, first select the ‘T’ icon on the Tool palette, then go to the position where the text will be created, press the left mouse button and drag out a text box, and enter your text, such as the title of the graph.
3. To edit any part of the graph, first click on the ‘Arrow’ sign on the Tool palette. Move the cursor to the graph, then highlight the part of the graph to be edited by pressing the left mouse button and dragging over the area. A dotted box will appear. You can thicken the line, change the color, change the font size and so on.
4. By default, the part of the graph that is created from the data values is locked. You may unlock it for editing: go to Editor, then choose ‘Unlock Data Display’.
5. To rotate the X-axis markers, Y-axis markers, or any text in the graph, highlight the text, go to the Editor menu, and choose Rotate Left or Rotate Right for a variety of angles.
6. Once the graph is edited, you can save it: go to the File menu, then Save Graph. You can print it. You can paste it into your reports.

How to integrate output and graphs into the report document?
1. You can copy the graph and paste it into your report: first go to Editor and select ‘View’, then go to Edit and choose ‘Copy Graph’. Then go to your report document and paste the graph.
2. You can edit any output in the Session window just like a word processing file: highlight and copy any part of the output from the Session window, and paste it into your report document.

Revisit: How to construct these charts using Minitab?
Recall the survey in which 400 individuals rated the school quality in the USA. The task is to draw a pie chart, a bar chart and a Pareto chart (an ordered bar chart, starting with the highest bar and going down).

In Minitab,
1. Enter School rate in C1 and Frequency in C2:

   C1            C2
   School rate   Frequency
   A             35
   B             260
   C             93
   D             12

2. For Pie Chart: Go to Graph, choose Pie Chart.
Click ‘Chart Table’, enter Categories in C1, Frequencies in C2. For the rest, you may leave the settings as they are, or change them to your liking. (NOTE: If the data are not summarized in table form, for example the variables Gender, Year and Major in the student information data, enter the variable into ‘Chart Data in ____’.)
3. For Bar Chart: Go to Graph, choose Chart. In the Graph box, choose ‘Mean’ for Function, enter C2 for Y, and enter C1 for X. You may change the Display box for different types of displays of the bar chart. Now click on Annotation and choose ‘Data Label’, then ‘Show Data Label’. This will show the frequencies on the chart.
4. Pareto Chart: Go to Stat, choose Quality Tools, then choose Pareto Chart. In the dialog box, click on ‘Chart Defect Table’, and enter Labels in C1, Frequencies in C2.

Hands-on exercise revisit: use Minitab to construct pie, bar and Pareto charts for the student information data.
1. Construct a pie chart, bar chart and Pareto chart for the variable Year.
   Steps for Bar Chart: Go to Graph, choose Chart. In the Graph box, choose ‘Count’ for Function, enter ‘Year’ for Y and also enter ‘Year’ for X.
   Note: Minitab only allows one pie chart to be constructed at a time. However, we can use Chart or Pareto Chart to construct side-by-side charts for different categories of another variable.
2. Construct bar graphs and Pareto charts for Year by Gender.
   For Bar Graph:
   1. Go to Graph, choose Chart. Choose ‘Count’ for Function, enter ‘Year’ for Y and also enter ‘Year’ for X.
   2. In the Display box, choose ‘Group’ under ‘For each’; the group variable is ‘Gender’.
   3. Click on Options. In the Options dialog box, under ‘Groups within X’, click on ‘Cluster’ and enter ‘Gender’. You can also click on Annotation to display Data Labels.
   For Pareto Chart:
   1. Go to Stat, choose Quality Tools, choose Pareto Chart.
   2. In the dialog box, click ‘Chart Defects in’ and enter ‘Year’, then enter ‘Gender’ in ‘By Variable’.
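The numbers behind the pie, bar and Pareto charts can also be checked outside Minitab. The following is a minimal Python sketch, not part of the Minitab workflow, that computes the relative frequencies, the pie-slice angles and the Pareto ordering for the school rating survey (A: 35, B: 260, C: 93, D: 12). The names `ratings` and `chart_summary` are illustrative assumptions.

```python
# School-quality ratings from the survey of 400 individuals.
ratings = {"A": 35, "B": 260, "C": 93, "D": 12}


def chart_summary(freq):
    """Return one row per category in Pareto order (highest count first).

    Each row is (category, count, relative frequency,
    pie angle in degrees, cumulative percent).
    """
    total = sum(freq.values())
    ordered = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rel = count / total
        rows.append((category, count, rel, rel * 360.0, 100.0 * running / total))
    return rows


summary = chart_summary(ratings)
for category, count, rel, angle, cumulative in summary:
    print(f"{category}: n={count:3d}  rel.freq={rel:.4f}  "
          f"angle={angle:6.1f} deg  cumulative={cumulative:6.2f}%")
```

The relative frequency column is what a pie chart displays as slice sizes, and the row order is the bar order of a Pareto chart.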
Another example:
The amount of money expended in fiscal year 1995 by the U.S. Department of Defense in various categories is shown in Table 1.6. Construct both a pie chart and a bar chart to describe the data. Compare the two forms of presentation.

Table 1.2
Category                     Amount (in billions)
Military personnel           $70.8
Operation and maintenance    90.9
Procurement                  55.0
Research and development     34.7
Military construction        6.8
Total                        $258.2

Refresh:
• The category of expenditure is ________ (qualitative or quantitative).
• The amount of the expenditure is ________ (qualitative or quantitative).

How to use a pie chart for these data?
Each “pie slice” represents the proportion of the total expenditures ($258.2 billion) corresponding to its particular category. For example, for the research and development category, the angle of the sector is
$\frac{34.7}{258.2} \times 360^\circ \approx 48.4^\circ$
In percentage:
$\frac{34.7}{258.2} \times 100\% \approx 13.4\%$

Bar chart for representing the amount in each category of expenditure

Graphical summary for continuous data: Stem-Leaf Plots and Histograms
The following lists the prices (in dollars) of 19 different brands of walking shoes. Construct a stem and leaf plot to display the distribution of the data.
90 65 75 70 70 68 70 70 60 68 70 74 65 75 70 40 70 95 65
Solution: [Figure: stem and leaf plot of the 19 shoe prices]

Hands-on activity for Stem & Leaf Plots
1. Use the student information data. Use Minitab to construct a stem-leaf plot for the variable Hourstudy, one for Male and one for Female.
   1. In Minitab, go to Graph, choose Stem and Leaf.
   2.
In the dialog box, enter ‘Hourstudy’ in the Variables box.
   3. Click on By Variable and enter ‘Gender’ into this box. You will notice that ‘Gender’ does not appear in the available list of variables. This is because ‘Gender’ is a text variable. For a stem-leaf plot, the ‘By Variable’ must be a numeric variable, so you first need to convert the Gender variable from text to numeric.
      Converting Gender from text to numeric:
      • Go to the Manip menu, choose Code, then select ‘Text to Numeric’.
      • In the dialog box, enter ‘Gender’ into the ‘Code Data in Columns’ box, and a new column, say C7, into the ‘Into columns’ box.
      • Enter Original value: M, New: 1 and Original value: F, New: 2.
   4. Now go back to step 3 and replace ‘Gender’ by ‘C7’. You are in business.

Relative Frequency Histograms
A relative frequency histogram for a quantitative data set is a bar graph in which the height of the bar (Y-axis) represents the proportion or relative frequency of occurrence for a particular class or subinterval of the variable being measured. The classes or subintervals of the variable are plotted along the X-axis.

Constructing a relative frequency histogram:
1. Choose the number of classes, usually between 5 and 15.
2. Calculate the approximate class width by dividing the difference between the largest and smallest values (Range = largest - smallest) by the number of classes.
3. Round the approximate class width up to a convenient number.
4. If the data are discrete, assign one or more integers to a class.
5. Locate the class boundaries. If the data are continuous, use the method of left inclusion: include the left class boundary point but not the right boundary point in the class.
   NOTE: Different methods may be used in different software. Some may use right inclusion. Some may add an additional decimal place for the class boundary.
6. Construct a table containing the classes, their boundaries, and their relative frequencies.
7. Construct the histogram like a bar graph.
X: the class boundaries; Y: the relative frequency. Each rectangular bar represents the relative frequency (or frequency) of the variable in each class.

Hands-on activities for constructing histograms using Minitab
Data: the 20 student information records
• Construct histograms for GPA and Hourstudy.
  1. Go to Graph, choose Histogram, and enter ‘GPA’ and ‘Hourstudy’ as graph variables into the X box.
  2. Use Annotation for Data Labels. Use Options to modify the type of histogram.
• Construct histograms for Hourstudy by Gender with the same X and Y scales.
  1. Go to Graph, choose Histogram, and enter ‘Hourstudy’ as the graph variable in the X box.
  2. In the Display box, select ‘Group’ in the ‘For each’ box, and enter ‘Gender’ into the ‘Group Variables’ box.
  3. Go to Frame, choose Multiple Graphs.
  4. Choose ‘Same X and Same Y’.
Exercise: Compare the histogram distributions of GPA between Male and Female.

Interpreting Graphs: What to observe from a graph?
What to look for as you describe the data:
- the degree of uncertainty: how widely the data spread.
- the center of the data set, such as the mean, median, mode.
- the shape of the distribution: is a normal curve a reasonable distribution?
- outliers: are there rare or unusual data? What may be the causes?
• Distributions are often described by their shapes:
- symmetric
- skewed to the right (long tail goes right)
- skewed to the left (long tail goes left)
- unimodal, bimodal, multimodal (one peak, two peaks, many peaks)

What kind of information can a histogram provide?
• Relative frequency can give us information such as:
  - the proportion of measurements that fall in a particular class or group of classes
  - the probability that a measurement drawn at random from a set will fall in a particular class or group of classes
• Different samples from the same population will produce different histograms.
• The shape of the histogram describes the distribution of the variable of interest. The following are common shapes we may find in real world applications:
[Figure: example histograms labeled Symmetric, Skew-to-right, Skew-to-left, Bimodal, Skew-to-right with outliers]
• Skew-to-right: Most values are small. Only a few are much larger. The long tail is on the right side.
• Skew-to-left: Most values are large. Only a few are much smaller. The long tail is on the left side.
• Bimodal: Two peaks.
• Outliers: Observations which are extremely far away from the majority. (We will discuss how to identify them more specifically.)

Q: Based on your common experience, what would you say about the distribution shape of the following variables, if we observe 200 data values? (Symmetric, Skew-to-right, Skew-to-left)
Adult height:
Entry level salary:
Salary for individuals who are 40 years or older:
Hours on the net per week:
Scores from an easy test:
Scores from a difficult test:

Q: Can you find a variable that has the distribution shape of
(a) Symmetric:
(b) Skew-to-right:
(c) Skew-to-left:

Describing Data with Numerical Measurements
When introducing numerical measurements, one cannot avoid introducing some commonly used notation. The main reason we compute numerical measurements is to use the sample information to make good sense of the unknown nature (or population). When we compute a sample measure, there is a corresponding population measure. For example, when we compute a sample mean from a data set, we usually try to use this sample mean to estimate the true mean of the unknown nature (or population mean). In order to make this clear, we use two different terms.
One for the sample, and one for the population.
• Measurements summarized from sample data: we call them statistics.
• Measurements from the unknown population: we call them parameters.

A table of some commonly used notations

Measurements from sample data | Corresponding population measurements
Sample mean (average), $\bar{x} = \frac{\sum x_i}{n}$ | Population mean, $\mu$
Sample median, m: the middle value when data are in ascending order | Population median
Sample mode, M: the observation(s) that occur most frequently | Population mode
Sample variance, $s^2$: a measure of uncertainty | Population variance, $\sigma^2$
Sample standard deviation, $s$: a measure of uncertainty | Population standard deviation, $\sigma$
Sample range, R: a measure of uncertainty | Population range
Relative frequency histogram | Probability distribution, P(x) or f(x)
Sample percentile, e.g., 70th percentile: 70% of the observations are less than the 70th percentile, and 30% are larger | Population percentile

Measure of the center of a data set:
• Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
• Sample median, m, of a set of n measurements is the value of x that falls in the middle position when the measurements are ordered from smallest to largest. The value .5(n + 1) indicates the position of the median in the ordered data set.
  - If .5(n+1) is an integer, the observation in that position of the ordered data set is the median.
  - If .5(n+1) is not an integer, the median is the average of the two adjacent middle observations.
• Sample mode: the data value(s) that occur most frequently. There may be more than one mode in a data set.

Measure of relative standing: Percentile
• A set of n measurements on the variable x has been arranged in order of magnitude. The pth percentile is the value of x that exceeds p% of the measurements and is less than the remaining (100 - p)%.
• The value p(n + 1), with p written as a proportion (e.g., .60 for the 60th percentile), indicates the position of the pth percentile in the ordered data set.
  - If p(n+1) is an integer, the observation in that position of the ordered data set is the pth percentile.
  - If p(n+1) is not an integer, the pth percentile is the average of the two adjacent observations.
Some commonly used percentiles: 25th percentile (Q1, or 1st quartile), 50th percentile (Q2, median, or 2nd quartile), 75th percentile (Q3, or 3rd quartile).

Hands-on activity: Obtain the average, median, Q1, Q3, 60th percentile, and mode for the following inter-laboratory testing results (in mg) from 8 labs by hand:

Notation:   x1   x2   x3   x4   x5   x6   x7   x8
Results:    84   74   78   84   92   86   84   80

Q-a. The value 78 was a typo. It should be 48. Compute the average, median, Q1, Q3 and mode again. How does the change of the value from 78 to 48 affect the average, median, Q1, Q3 and mode?
Q-b. Each sample was tested twice. The results from the 2nd test indicate that every lab was 5 mg lower than in the 1st test. Compute the average, median and mode. How does the reduction of 5 mg for each lab affect the average, median, Q1, Q3 and mode?

                      Mean   Median   Mode   Q1   Q3   60th percentile
1st test-incorrect
1st test-correct
2nd test

Explain your observations:

[Figure: Relative frequency distribution showing the effect of extreme values on the mean and median]
Note: The median is less sensitive than the average to extreme values. Why?

Measures of Variability
• Variability or dispersion is a very important characteristic of data. It measures the spread of data values.
• Example:
  - Scores of 20 students are all 80%. There is no variability.
  - Scores of 20 students range from 30% to 100%. There is a large variability.

Measure of Variability: The range, R, of a set of n measurements is defined as the difference between the largest and the smallest measurements. Range = Largest - Smallest.

Visualizing Variability using Histogram

Numerical Measures of Variability: Variance and Standard Deviation
What are they? How can they be used to measure data spread?
[Figure: deviations of points from the mean]
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2}$

Measure of Variability for Population:
• The variance of a population of N measurements is denoted by $\sigma^2$ and is given by the formula $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
• The population standard deviation is $\sigma = \sqrt{\sigma^2}$
• This measure will be relatively large for highly spread data and relatively small for less spread data.

Measure of Variability for Sample:
• The variance of a sample of n measurements is given by $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• The sample standard deviation is given by $s = \sqrt{s^2}$
• The shortcut method for calculating $s^2$: $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$, where $\sum x_i^2$ is the sum of the squares of the individual measurements and $(\sum x_i)^2$ is the square of the sum of the individual measurements.

The measures of center (mean, median, mode), the measure of relative standing (pth percentile), and the measures of variability (range, $s^2$, $s$) can be easily computed using Minitab. In Minitab:
1. Go to Stat, choose Basic Statistics, then choose Display Descriptive Statistics.
2. In the dialog box, enter the variable names.
3. One can choose to have some graphs of the variables, such as a histogram, by clicking on ‘Graphs’ in the dialog and choosing the wanted graphs.

Hands-on activity for numerical measurements using Minitab
High blood pressures from 8 patients before and after a medication are recorded:

Patient   1     2     3     4     5     6     7     8
Before    220   245   186   190   245   264   252   248
After     155   180   172   162   165   178   210   158
Improve   65

Q-a: Find the improvement. Compute the average and median improvement.
Q-b: Compute the sample variance, $s^2$, and sample standard deviation, $s$, of the improvement.
Q-c: Compute sample average - 2(s) and sample average + 2(s). What percentage of patients have improvements within this interval?

Points to remember about variance and standard deviation:
- The value of s is always greater than or equal to zero.
- The larger the value of $s^2$ or $s$, the greater the variability of the data set.
- If $s^2$ or $s$ is equal to zero, all measurements must have the same value.
- The standard deviation $s$ is computed in order to have a measure of variability in the same units as the observations.
• In real world applications, the shape of the distribution is usually related to the mean, median and standard deviation.

An Example: The gas price is a concern for people. A random sample of 40 stations gives the following data summary:
Sample mean = $1.85, Median = $1.82, s = $0.15
Q: Is the distribution of the gas prices more likely to be (a) symmetric, (b) skewed-to-right, or (c) skewed-to-left? And WHY?

On the Practical Significance of the Standard Deviation
NOTE: $s$ measures the uncertainty of observed data. $\bar{x} \pm s$ provides an interval of potential blood pressures. Furthermore, it also tells us that approximately 68% of blood pressures will be within the interval. It is a way of reporting measurement uncertainty. Based on the same thought, $\bar{x} \pm 2s$ is an interval of blood pressures which will cover approximately 95% of all possible blood pressures. WHY?

The Empirical Rule, and the Normal Curve
The above claim is correct in many real world situations. This is due to the fact that many real world variables follow a distribution, the normal curve, in which most observations are in the middle, around the mean. A few are small, and a few are large. The approximate proportions can be determined using the normal curve. This is described as the Empirical Rule. The following graph shows the rule.
[Figure: normal curve with the horizontal axis marked at $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$; 34% of the area lies on each side of $\mu$ within one $\sigma$, and 2.5% lies in each tail beyond $2\sigma$]

Empirical Rule: Given a distribution of measurements that is approximately mound-shaped:
- The interval $(\mu \pm \sigma)$ contains approximately 68% of the measurements.
- The interval $(\mu \pm 2\sigma)$ contains approximately 95% of the measurements.
- The interval $(\mu \pm 3\sigma)$ contains almost all of the measurements.
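As a rough illustration of how the Empirical Rule can be checked against data, the following Python sketch uses the blood pressure data from the hands-on activity (improvement = before - after) and counts the share of observations within 1, 2 and 3 sample standard deviations of the sample mean. This is a minimal sketch outside Minitab; the helper name `share_within` is an assumption, and with only 8 observations the observed shares can only loosely match 68%, 95% and nearly 100%.

```python
import statistics

# Blood pressure readings for the 8 patients in the hands-on activity.
before = [220, 245, 186, 190, 245, 264, 252, 248]
after = [155, 180, 172, 162, 165, 178, 210, 158]
improvement = [b - a for b, a in zip(before, after)]

mean = statistics.mean(improvement)
s2 = statistics.variance(improvement)  # sample variance, divides by n - 1
s = statistics.stdev(improvement)      # sample standard deviation, sqrt(s2)


def share_within(values, center, spread, k):
    """Proportion of values inside the interval center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


print(f"mean = {mean:.2f}, s^2 = {s2:.2f}, s = {s:.2f}")
for k, rule in ((1, "about 68%"), (2, "about 95%"), (3, "almost all")):
    observed = share_within(improvement, mean, s, k)
    print(f"within {k} s: observed {observed:.1%} (Empirical Rule: {rule})")
```

The same helper can be pointed at any other column of measurements to see how closely a data set follows the mound-shaped pattern.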
The Empirical Rule is often applied to identify rare (unusual, extreme) observations. If an observation falls outside the two-standard-deviation range, it has only about a 5% chance of occurring. Therefore, it is considered rare.

We will discuss the normal curve and learn how to apply it to real world situations. In most of our discussions on analyzing inter-laboratory testing data, detecting outliers, and quality control, the chance of occurrence of the response variable will be assumed to follow a normal curve.

Hands-on Activity for using the Empirical Rule to identify rare cases

2.5 A Check on the Calculation of s
• $\text{Range} \approx 4s$, or $s \approx \text{Range} / 4$
• Use the range approximation to detect gross errors in calculation, such as the failure to divide the sum of squares of deviations by (n - 1) or the failure to take the square root of $s^2$.

Measures of Relative Standing
Definition: The sample z-score is a measure of relative standing defined by
$z\text{-score} = \frac{x - \bar{x}}{s}$
• A z-score measures the distance between an observation and the mean, measured in units of standard deviation.
• An outlier is an unusually large or small observation.
• z-scores between -2 and +2 are highly likely.
• z-scores exceeding 3 in absolute value are very unlikely.

Definition: A set of n measurements on the variable x has been arranged in order of magnitude. The pth percentile is the value of x that exceeds p% of the measurements and is less than the remaining (100 - p)%.
• The value p(n + 1), with p written as a proportion, indicates the position of the pth percentile in the ordered data set.
  - If p(n+1) is an integer, the observation in that position of the ordered data set is the pth percentile.
  - If p(n+1) is not an integer, the pth percentile is the average of the two adjacent observations.
• Example 2.13 is an example of the use of a percentile. Figure 2.12 shows a percentile on a relative frequency histogram. Figure 2.13 illustrates the location of quartiles.
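Returning to the range check and the z-score defined above, the following Python sketch applies both to the eight inter-laboratory results used in the earlier hands-on activity (84, 74, 78, 84, 92, 86, 84, 80 mg). It is a minimal illustration outside Minitab; the variable names are assumptions.

```python
import statistics

# Inter-laboratory results (in mg) from 8 labs.
lab_results = [84, 74, 78, 84, 92, 86, 84, 80]

mean = statistics.mean(lab_results)
s = statistics.stdev(lab_results)  # sample standard deviation (n - 1)

# Range approximation: s should be roughly Range / 4.
range_estimate = (max(lab_results) - min(lab_results)) / 4

# z-score: distance from the mean in units of standard deviation.
z_scores = [(x - mean) / s for x in lab_results]

print(f"mean = {mean:.2f}, s = {s:.3f}, Range/4 = {range_estimate:.2f}")
for x, z in zip(lab_results, z_scores):
    flag = "unusual" if abs(z) > 2 else ""
    print(f"x = {x:3d}  z = {z:+.2f}  {flag}")
```

Here s and Range/4 are of the same order, so no gross calculation error is suggested, and every z-score lies between -2 and +2. If the value 78 is replaced by 48, as in Q-a of the earlier activity, the z-score of that observation becomes about -2.3.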
Example 2.13
Suppose you have been notified that your score of 610 on the Verbal Graduate Record Examination placed you at the 60th percentile in the distribution of scores. Where does your score of 610 stand in relation to the scores of others who took the examination?
Solution
Scoring at the 60th percentile means that 60% of all examination scores were lower than yours and 40% were higher.
• The median is the same as the 50th percentile.
• The 25th and 75th percentiles are called the lower and upper quartiles.
[Figure 2.12] [Figure 2.13]

Definition: A set of n measurements on the variable x has been arranged in order of magnitude.
• The lower quartile (first quartile), $Q_1$, is the value of x that exceeds one-fourth of the measurements and is less than the remaining three-fourths.
• The second quartile is the median.
• The upper quartile (third quartile), $Q_3$, is the value of x that exceeds three-fourths of the measurements and is less than one-fourth.
• When the measurements are arranged in order of magnitude, the lower quartile, $Q_1$, is the value of x in the position .25(n + 1).
• The upper quartile, $Q_3$, is the value of x in the position .75(n + 1).
• When these positions are not integers, the quartiles are found by interpolation, using the values in the two adjacent positions.
• See Example 2.14 for an illustration of how the lower and upper quartiles are determined. Figure 2.14 gives the Minitab output for the example.

Definition: The interquartile range (IQR) for a set of measurements is the difference between the upper and lower quartiles; that is, $IQR = Q_3 - Q_1$.
• The trimmed mean is the mean of the middle 90% of the measurements after excluding the smallest 5% and the largest 5%.
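The position rule for quartiles described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `value_at_percentile` is hypothetical, p is given as a proportion, non-integer positions are handled by linear interpolation between the two adjacent ordered values, and positions falling outside 1..n are clamped to the ends. Statistical software may use a different convention, so results can differ slightly from other packages. The data are the eight inter-laboratory results (in mg) from the earlier hands-on activity.

```python
def value_at_percentile(values, p):
    """Value at position p * (n + 1) in the ordered data (p as a proportion).

    Non-integer positions are resolved by linear interpolation between
    the two adjacent ordered observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)
    position = min(max(position, 1.0), float(n))  # clamp to 1..n (assumption)
    lower = int(position)       # 1-based index of the observation below
    fraction = position - lower
    if fraction == 0 or lower == n:
        return float(ordered[lower - 1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


lab_results = [84, 74, 78, 84, 92, 86, 84, 80]

q1 = value_at_percentile(lab_results, 0.25)      # position 2.25
median = value_at_percentile(lab_results, 0.50)  # position 4.5
q3 = value_at_percentile(lab_results, 0.75)      # position 6.75
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```

For the ordered data 74, 78, 80, 84, 84, 84, 86, 92, position 2.25 lies a quarter of the way from 78 to 80, which gives Q1 = 78.5; position 6.75 lies three quarters of the way from 84 to 86, which gives Q3 = 85.5 and IQR = 7.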
Data: Beta-Carotene In serum.MTW

Row   Laboratory   Sample   Material A   Material B   Material C   Material D
1     1            1        0.066        0.146        0.472        0.986
2     1            2        0.062        0.143        0.436        0.904
3     2            1        0.070        0.140        0.390        0.840
4     2            2        0.070        0.140        0.390        0.820
5     3            1        0.089        0.213        0.390        0.840
6     3            2        0.082        0.196        0.523        1.241
7     4            1        0.044        0.120        0.452        1.292
8     4            2        0.050        0.120        0.472        1.131
9     5            1        0.064        0.142        0.411        0.883
10    5            2        0.058        0.148        0.416        0.874
11    6            1        0.076        0.149        0.399        0.886
12    6            2        0.073        0.145        0.396        0.859
13    7            1        0.080        0.230        0.390        0.830
14    7            2        0.080        0.250        0.380        0.780
15    8            1        0.062        0.140        0.370        0.890
16    8            2        0.057        0.150        0.390        0.910
17    9            1        0.060        0.170        0.450        1.040
18    9            2        0.070        0.170        0.460        1.070
19    10           1        0.071        0.155        0.458        1.093
20    10           2        0.074        0.159        0.444        1.061
21    11           1        0.050        0.140        0.420        0.970
22    11           2        0.060        0.140        0.420        0.980
23    12           1        0.060        0.080        0.180        0.320
24    12           2        0.050        0.060        0.190        0.870
25    13           1        0.051        0.145        0.371        0.832
26    13           2        0.062        0.145        0.328        0.870
27    14           1        0.100        0.240        0.520        1.380
28    14           2        0.090        0.230        0.590        1.180
29    15           1        0.063        0.146        0.426        0.899
30    15           2        0.060        0.149        0.458        1.002
31    16           1        0.095        0.173        0.437        0.969
32    16           2        0.097        0.177        0.447        0.978
33    17           1        0.070        0.138        0.389        0.914
34    17           2        0.069        0.149        0.393        0.919
35    18           1        0.040        0.090        0.230        0.540
36    18           2        0.040        0.090        0.230        0.530

Project One: Analysis of data using graphical and numerical summaries
Study: An inter-laboratory test was conducted to investigate the Beta-Carotene content of four materials, A, B, C, D. Two samples were tested for each material.
Purpose: To study whether there is a difference between labs and within lab for each material tested.
Things to investigate:
1. Comparing distributions of the four materials.
2. Comparing means and medians of the four materials.
3. Comparing variability of the four materials.
4. Comparing distributions of each material between the two samples.
5. Comparing variability of each material between the two samples.
6. Are there any unusual observations from any lab for each material?
We will have team presentations for each project.