Mathematical Statistics
Instructor: Dr. Deshi Ye
Course homepage: http://www.cs.zju.edu.cn/people/yedeshi/

Course information
• What is this course for?
  – This course provides an elementary introduction to mathematical statistics with applications.
  – Topics include statistical estimation, hypothesis testing, confidence intervals, calculation of a P-value, nonparametric testing, curve fitting, analysis of variance, and factorial experimental design.

Grading
• Grades for the course are based on the following weighting:
  1) Class attendance: 10%
  2) Homework assignments: 26%
  3) Unit quizzes: 24% (12%, 12%)
  4) Final exam: 40%

Introduction
• Probability theory is devoted to the study of uncertainty and variability.
• Statistics can be described as the study of how to make inferences and decisions in the face of uncertainty and variability.

Brief History
• Blaise Pascal and Pierre de Fermat: the origins of probability are found in their work.
  – Concerning a popular dice game
  – Fundamental principles of probability theory
• Pierre de Laplace:
  – Before him, the field was mainly concerned with the analysis of games of chance.
  – Laplace applied probabilistic ideas to many scientific and practical problems.

A case study
• Visually inspecting data to improve product quality

Population and Sample
• Investigations typically concern a physical phenomenon, a production process, or manufactured units that share some common characteristics.
• Relevant data must be collected.
• Unit: the source of each measurement.
  – A single entity, usually an object or person
• Population: the entire collection of units.

Examples
Population                                   Unit      Variables
All students currently enrolled in school    student   GPA; number of credits
All books in library                         book      Replacement cost

Sample
• Statistical population: the set of all measurements corresponding to each unit in the entire population of units about which information is sought.
• Sample: the subset of measurements from a statistical population that is actually collected in the course of an investigation.
Ch2: Treatment of data
• Outline
  – Pareto diagrams, dot diagrams
  – Histograms (frequency distributions)
  – Stem-and-leaf display
  – Box-plot (quartiles and percentiles)
  – The calculation of the mean $\bar{x}$ and standard deviation $s$

Pareto Diagram
• For a computer-controlled lathe whose performance was below par, workers recorded the following causes and their frequencies:

Cause                     Frequency
power fluctuations         6
controller not stable     22
operator error            13
worn tool not replaced     2
other                      5

Minitab14
• 1. Stat->Quality tools->Pareto chart
• 2. Choose chart defects table

Output
[Figure: Pareto chart of the lathe causes]

Pareto diagram
• Pareto diagram: depicts Pareto's empirical law that any assortment of events consists of a few major and many minor elements.
• Typically, two or three elements will account for more than half of the total frequency.

Dot diagram
• Observations on the deviations of cutting speed from the target value set by the controller.
• Ex.: cutting speed − target speed
• 3 6 −2 4 7 4
• In Minitab: stat->dotplots->simple

Dot diagram
• This diagram visually summarizes the information that the lathe is generally running fast.

Data001: 80 measurements of emission (in tons) of sulfur oxides from an industrial plant
15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9  9.0 13.2
22.7  9.8  6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7
26.8 22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7
19.1 15.2 22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0
18.5 23.0 24.6 20.1 16.2 18.0  7.7 13.5 23.5 14.5
14.4 29.6 19.4 17.0 20.8 24.3 22.5 24.6 18.4 18.1
 8.3 21.9 12.3 22.3 13.3 11.8 19.3 20.0 25.7 31.8
25.9 10.5 15.9 27.5 18.1 17.9  9.4 24.1 20.1 28.5

Frequency distributions
• A frequency distribution is a tabular arrangement of data in which the data are grouped into intervals and the number of observations belonging to each interval is counted.
• Data presented in this manner are known as grouped data.
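The grouping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the Minitab workflow: the function name `frequency_table` is hypothetical, and the choice of seven half-open classes of width 4 starting at 5.0 is an assumption taken from the class limits used in the tables that follow. The percentage and running-total columns anticipate the variants of the frequency distribution defined on the later slides.

```python
# Sketch: group Data001 into equal-width, half-open classes [lower, upper).
# Assumption: 7 classes of width 4.0 starting at 5.0, i.e. [5, 9), ..., [29, 33).

DATA001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]


def frequency_table(data, start, width, n_classes):
    """Return a list of (lower, upper, count) for classes [lower, upper)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    return [
        (start + i * width, start + (i + 1) * width, c)
        for i, c in enumerate(counts)
    ]


table = frequency_table(DATA001, start=5.0, width=4.0, n_classes=7)
total = sum(count for _, _, count in table)

running = 0  # cumulative frequency up to the upper limit of each class
for lower, upper, count in table:
    running += count
    percent = 100 * count / total
    print(f"[{lower:4.1f}, {upper:4.1f})  {count:3d}  {percent:6.2f}%  {running:3d}")
```

Each printed row gives the class interval, its frequency, its percentage of the 80 observations, and the cumulative frequency.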
Class limits & frequency

Class limits     Frequency
5.0 – 8.9         3
9.0 – 12.9       10
13.0 – 16.9      14
17.0 – 20.9      25
21.0 – 24.9      17
25.0 – 28.9       9
29.0 – 32.9       2
Total            80

Class limit and width
• Lower class limit: the smallest value that can belong to a given interval.
• Upper class limit: the largest value that can belong to the interval.
• Class width: the difference between the upper class limit and the lower class limit.
• When designing the intervals to be used in a frequency distribution, it is preferable that the class widths of all intervals be the same.

Class limits & frequency

Class limits     Frequency
[5.0, 9.0)        3
[9.0, 13.0)      10
[13.0, 17.0)     14
[17.0, 21.0)     25
[21.0, 25.0)     17
[25.0, 29.0)      9
[29.0, 33.0)      2
Total            80

Variants of frequency distribution
• The cumulative frequency distribution is obtained by computing, for every interval, the cumulative frequency: the total frequency of all values less than the upper class limit of that interval.
• Relative frequency: the ratio of the number of observations in the interval to the total number of observations.
• The percentage frequency distribution is obtained by multiplying the relative frequency of each interval by 100%.

Cumulative frequency

Class limits     Cumulative frequency
Less than 5       0
Less than 9       3
Less than 13     13
Less than 17     27
Less than 21     52
Less than 25     69
Less than 29     78
Less than 33     80

Percentage distribution

Class limits     Frequency   Percentage
[5.0, 9.0)        3           3.75%
[9.0, 13.0)      10          12.5%
[13.0, 17.0)     14          17.5%
[17.0, 21.0)     25          31.25%
[21.0, 25.0)     17          21.25%
[25.0, 29.0)      9          11.25%
[29.0, 33.0)      2           2.5%
Total            80         100%

Histogram
• The most common form of graphical presentation of a frequency distribution is the histogram.
• Histogram: constructed of adjacent rectangles; the heights of the rectangles are the class frequencies, and the bases of the rectangles extend between successive class boundaries.

Histogram in Minitab
1. Graph->histogram->simple
2. Graph variables: c4
3. Edit bars: click the bars in the output figure; in Binning, for Interval type select midpoint, for Interval definition select midpoint/cutpoint, and then input 7 11 15 19 23 27 31.

Density histogram
• When a histogram is constructed from a frequency table having classes of unequal lengths, the height of each rectangle must be changed to
• Height = relative frequency / width.
• The area of the rectangle then represents the relative frequency for the class, and the total area of the histogram is 1.

[Figure: density histogram]

Cumulative histogram
• 1) Graph->histogram->simple
• 2) Dataview->Datadisplay: check "symbols" only. Smoother: check "lowess", with "0" for degree of smoothing and "1" for number of steps.

Stem-and-leaf Display
• A table of class limits and frequencies records how many data values fall in each class, but the original data points are lost.
• Stem-and-leaf: serves the same function as a histogram but preserves the original data points.
• Example: 10 numbers:
• 12, 13, 21, 27, 33, 34, 35, 37, 40, 40
• Frequency table

Class limits     Frequency
10 – 19          2
20 – 29          2
30 – 39          4
40 – 49          2

Stem-and-leaf
• Stem-and-leaf: each row has a stem, and each digit to the right of the vertical line is a leaf.
• The "stem" is the left-hand column, which contains the tens digits.
• The "leaves" are the lists in the right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties.
• Key: "4|0" means 40

Stem-and-leaf in Minitab
• The display has three columns:
  – The leaves (right): each value in the leaf column represents a digit from one observation.
  – The stem (middle): the stem value represents the digit immediately to the left of the leaf digit.
  – Counts (left): if the median value for the sample is included in a row, the count for that row is enclosed in parentheses. The values for rows above and below the median are cumulative.
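As a rough illustration of how such a display is built, the following Python sketch splits each of the ten example numbers into a tens digit (stem) and a ones digit (leaf). The function names `stem_and_leaf` and `render` are hypothetical, and the sketch assumes non-negative integers with a leaf unit of 1; it does not reproduce Minitab's count column.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to the sorted list of its leaves (ones digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return dict(rows)


def render(rows):
    """Format the display as text, one 'stem | leaves' line per stem."""
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return "\n".join(lines)


numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40]
rows = stem_and_leaf(numbers)
display = render(rows)
print(display)
# 1 | 23
# 2 | 17
# 3 | 3457
# 4 | 00
```

The number of leaves on each row is the class frequency, so the frequency table can always be recovered from the display, while the reverse is not possible.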
Stem-and-leaf for DATA001

Stem-and-leaf of frequencies  N = 80
Leaf Unit = 1.0

   2   0  67
   6   0  8999
  11   1  00111
  17   1  223333
  24   1  4445555
  32   1  66677777
 (13)  1  8888888999999
  35   2  0000000111
  25   2  222223333
  16   2  4444455
   9   2  66667
   4   2  889
   1   3  1

Ch2.5: Descriptive measures
• Mean: the sum of the observations divided by the sample size.

  $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• Median: the center, or location, of a set of data. If the observations are arranged in ascending or descending order:
  – If the number of observations is odd, the median is the middle value.
  – If the number of observations is even, the median is the average of the two middle values.

Example
• 15 14 2 27 13
• Mean: $\bar{x} = \frac{15 + 14 + 2 + 27 + 13}{5} = 14.2$
• Ordering the data from smallest to largest:
• 2 13 14 15 27
• The median is the third value in the ordered list, 14.

Sample variance
• Sample variance, based on the deviations from the mean $x_i - \bar{x}$:

  $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• Standard deviation $s$:

  $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

• Equivalent computing formula:

  $$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$$

Quartiles and Percentiles
• Quartiles: values in a given set of observations that divide the data into 4 equal parts.
• The first quartile, $Q_1$, is a value that has one fourth, or 25%, of the observations below it.
• The sample 100p-th percentile is a value such that at least 100p% of the observations are at or below this value, and at least 100(1−p)% are at or above this value.

Example
• Example in P34:

  $$Q_1 = \frac{14.7 + 15.2}{2} = 14.95, \qquad Q_2 = \frac{19.0 + 19.1}{2} = 19.05, \qquad Q_3 = \frac{22.9 + 23.0}{2} = 22.95$$

Boxplots
• A boxplot is a way of summarizing the information contained in the quartiles (or on an interval).
• Box length = interquartile range = $Q_3 - Q_1$

Modified boxplot
• Outlier: an observation too far from the third quartile.
• "Too far" means more than 1.5 × (interquartile range) beyond the third quartile; by the usual convention the same rule is applied below the first quartile.
• Modified boxplot: identifies outliers and reduces their effect on the shape of the boxplot.
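The descriptive measures above can be checked with a short Python sketch. All function names here are hypothetical. The percentile rule is an assumption chosen to satisfy the definition given above: if np is an integer k, average the k-th and (k+1)-th ordered values; otherwise round np up and take that ordered value. The outlier rule applies the 1.5 × IQR fence on both sides of the box, which is the common convention rather than something the slides spell out.

```python
import math

DATA001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]


def mean(xs):
    return sum(xs) / len(xs)


def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    # Odd n: the middle value; even n: the average of the two middle values.
    return s[mid] if len(s) % 2 == 1 else (s[mid - 1] + s[mid]) / 2


def variance(xs):
    """Sample variance from the definition: squared deviations over n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def variance_shortcut(xs):
    """Sample variance from the computing formula with sums of x and x^2."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


def percentile(xs, p):
    """Sample 100p-th percentile for 0 < p < 1 (assumed rule, see lead-in)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    s = sorted(xs)
    k = len(s) * p
    if abs(k - round(k)) < 1e-9:      # np is an integer: average two values
        k = int(round(k))
        return (s[k - 1] + s[k]) / 2
    return s[math.ceil(k) - 1]        # otherwise round up to the next position


def outliers(xs):
    """Observations more than 1.5 * IQR outside the quartiles."""
    q1, q3 = percentile(xs, 0.25), percentile(xs, 0.75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


sample = [15, 14, 2, 27, 13]
print("mean:", mean(sample), "median:", median(sample))
print("s^2:", variance(sample), "shortcut:", variance_shortcut(sample))
print("s:", math.sqrt(variance(sample)))
print("outliers in sample:", outliers(sample))

quartiles = [percentile(DATA001, p) for p in (0.25, 0.50, 0.75)]
print("Data001 quartiles:", [round(q, 2) for q in quartiles])
```

For the five-value example both variance formulas give 78.7 up to rounding, and applied to the 80 sulfur oxide measurements the assumed percentile rule yields the quartiles 14.95, 19.05, and 22.95 quoted in the example.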