STT 351: Sec. 003
Instructor: James F. Kelly
Office: C434 Wells Hall

Course Structure
• Text: Probability and Statistics for Engineering and the Sciences, Custom 8th Edition
• Rough Schedule:
  1. Descriptive Statistics (Week 1) EASY!
  2. Probability, Part I (Weeks 2-4)
     Exam 1 (Chapters 1, 2, and beginning of 3)
  3. Probability, Part II (Weeks 5-10)
     Exam 2 (Chapters 3, 4, and part of Chapter 5)
  4. Inferential Statistics (Weeks 11-15) TOUGH!
     Exam 3 (some of Chapter 6, Chapters 7, 8)
  5. Regression (if we have time)

Homework and Assignments
• I will assign a list of odd-numbered HW problems. Not graded. Do as many as possible for practice.
• There is a Student Solutions Manual that details all the steps for the answers to the odd-numbered problems.
• Ask questions in class or office hours.
• Stats Help Room (A102 Wells Hall)
• We will have FIVE (5) graded assignments.
• You may work in groups on graded assignments, but you must write up your own solutions. Please SHOW ALL WORK.
• Most Mid-Term and Final Exam problems will be similar to HW and assignment problems.

Software/Computing
• Some of the HW problems and assignments require software. You can use whatever tool you like.
• MATLAB: I will give a short MATLAB demo next week. Many probability distributions and stats algorithms are available. MATLAB is available at MSU computer labs and in the Engineering Labs.
• MINITAB: Graphical stats software. Book includes MINITAB examples.
• R: Scripting language with lots of stats routines built in.
• Other: Python, C++, FORTRAN.

Chapter 1: Descriptive Statistics

Deterministic Models
• In previous math classes, you have modeled science problems with calculus and differential equations.
• Example: Wind Resistance on a Vehicle
• $m \frac{dv}{dt} = F = -C v^2$
• $\frac{dv}{dt} + \frac{C}{m} v^2 = 0$, $v(0) = v_0$
where the constant C depends on the shape of the vehicle.
[Figure: Streamlines simulated by PowerFLOW (Exa Corp.)]
This is a deterministic model, which may be solved to give the velocity of the car at any future time. NO RANDOMNESS.
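To make "may be solved" concrete, here is a minimal Python sketch of the drag model above. Separating variables gives the closed form $v(t) = v_0 / (1 + (C/m) v_0 t)$, and a forward-Euler loop is included as a numerical check. The function names and every numerical value (v0, C/m, end time, step count) are illustrative assumptions, not values from the course.

```python
# Deterministic drag model: dv/dt = -(C/m) * v**2, v(0) = v0.
# All numerical values below are illustrative assumptions, not course data.

def velocity_exact(t, v0, c_over_m):
    """Closed-form solution v(t) = v0 / (1 + (C/m) * v0 * t)."""
    return v0 / (1.0 + c_over_m * v0 * t)

def velocity_euler(t_end, v0, c_over_m, steps=10_000):
    """Forward-Euler approximation of the same initial value problem."""
    dt = t_end / steps
    v = v0
    for _ in range(steps):
        v += dt * (-c_over_m * v * v)
    return v

v0 = 30.0          # initial speed in m/s (assumed)
c_over_m = 0.001   # C/m in 1/m (assumed)
t_end = 10.0       # time in seconds (assumed)

exact = velocity_exact(t_end, v0, c_over_m)
approx = velocity_euler(t_end, v0, c_over_m)
print(f"exact v({t_end}) = {exact:.4f} m/s")
print(f"Euler v({t_end}) = {approx:.4f} m/s")
```

Running it twice gives the same numbers both times: with fixed inputs, nothing in the model is random.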
Statistical (Stochastic) Models
• What if material parameters are not known exactly (uncertainty)?
• We can still solve the model, but the predicted velocity will contain error!
• By the way, this model (drag equation) is only valid for high velocity (technically, high Reynolds number). Hence, there is model uncertainty as well as data uncertainty.
• Statistics teaches us how to make intelligent judgments in the presence of uncertainty and variation.

Terminology
• Population: Collection of objects under study. Typically very large.
• Sample: Subset of the population we perform an experiment on (observe).
• Variable: Any characteristic whose value may change from one object to another in a population.
• Univariate: Data that consist of observations of a single variable.
• Multivariate: Data that consist of observations of more than one variable.
Given univariate or multivariate data from a sample, we wish to either describe a population (descriptive statistics) or draw some conclusion about the population (inferential statistics). Unless the sample is identical with the population, there is uncertainty in any conclusion we draw. We need to quantify this uncertainty.

Example 1.2: Flexural Strength of Concrete
• Population: All batches of concrete.
• Sample: n = 27 measurements.
• Variable: Flexural strength (MPa). Univariate.
• Sample Mean: 8.14 MPa
• What can we say about the population mean?
• In Chapter 7, we’ll discuss confidence intervals. With 95% confidence, the population mean is between 7.48 MPa and 8.80 MPa.
Data (MPa), in increasing order:
 5.9  6.3  6.3  6.5  6.8  6.8  7.0  7.0  7.2
 7.3  7.4  7.6  7.7  7.7  7.8  7.8  7.9  8.1
 8.2  8.7  9.0  9.7  9.7 10.7 11.3 11.6 11.8

Visualizing Data: “Stem and Leaf” and “Histograms”
Stem and Leaf Plot: Flexural Strength (MPa)
Stem: Ones digit
Leaf: Tenths digit
 5 | 9
 6 | 3 3 5 8 8
 7 | 0 0 2 3 4 6 7 7 8 8 9
 8 | 1 2 7
 9 | 0 7 7
10 | 7
11 | 3 6 8

Sec. 1.2: Pictorial and Tabular Methods
1. Stem-and-Leaf Displays (Table)
2. Dotplots (Graph)
3. Histograms (Graph)

Stem and Leaf
Consider a sample of size n where each observation consists of at least two digits. A quick summary is a stem and leaf plot.
1. Select one or more leading digits for stem values. Trailing digits are leaves.
2. List possible stem values in a vertical column.
3. Record the leaf for each observation beside the stem. Indicate units.

Example: Temperature Data
Average temperature over 51 days
Stem: Tens digit
Leaf: Ones digit
2 | 899
3 | 0012223333556667777888899
4 | 011122334455566667
5 | 0014
6 | 9
About 50% of the days had an average temperature in the 30’s. One outlier: 69 degrees.

Dotplot
• Each observation is represented by a dot above the corresponding location on the measurement axis.
• Dots are stacked vertically for repeated data.
• Gives info about location, spread, extremes, and gaps.

MATLAB Code (I will post MATLAB scripts on the website)
  A = importdata('exp01-08.txt');   % read the data file for the temperature example
  temp = A.data;                    % numeric column of temperatures
  stemleafplot(temp,0)              % stem and leaf display (posted script)
  dotplot(temp)                     % dotplot (posted script)

Discrete vs. Continuous Data
• Discrete: The set of possible values is finite or can be listed as an infinite sequence. Data that is counted. Example: Number of hits by a baseball team in a game.
• Continuous: The set of possible values consists of an entire interval on the number line. Example: pH of a chemical substance (real number between 0.0 and 14.0).

Histograms: Discrete Data
• Frequency: Number of times any particular value occurs in a data set.
• Relative Frequency: Fraction (or proportion) of times the value occurs.
  Relative frequency = Frequency / number of observations

Constructing a Histogram for Discrete Data
1. Determine frequency and relative frequency for each x value.
2. Mark possible x values on horizontal scale.
3. Above each value, create a rectangle whose height is the relative frequency (or frequency) of that value.
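The three construction steps for a discrete histogram can be sketched in Python as follows. The sample values are made up purely for illustration (they are not the data of any book example), and the text bars stand in for the rectangles.

```python
from collections import Counter

# Hypothetical sample: hits per game in ten games (made-up values).
hits = [5, 7, 8, 8, 9, 9, 9, 10, 11, 14]

n = len(hits)
frequency = Counter(hits)  # step 1: frequency of each x value
relative_frequency = {x: count / n for x, count in frequency.items()}

# Steps 2-3: one row per possible x value; bar length = frequency,
# printed number = relative frequency.
for x in range(min(hits), max(hits) + 1):
    bar = "#" * frequency.get(x, 0)
    rel = relative_frequency.get(x, 0.0)
    print(f"{x:>3} | {bar:<4} {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the table.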
Example 1.9: Hits in 9-Inning Baseball Games
The distribution is unimodal (single peak) and positively skewed (right tail is stretched compared with left tail).

Histograms: Continuous Data
• Class Interval: Subdivide the horizontal axis into intervals.
Example: miles per gallon (mpg) for autos. Mpg is measured, hence a continuous variable.
Construct class boundaries: 27.5-<28.0, 28.0-<28.5, 28.5-<29.0, …, 31.0-<31.5
An observation on a boundary is placed in the interval to the right of the boundary.

Constructing a Histogram for Continuous Data
1. Determine frequency/relative frequency for each class.
2. Mark class boundaries on the horizontal axis.
3. Above each class interval, draw a rectangle with height corresponding to frequency/relative frequency.

Example 1.10: Energy consumption in gas-heated homes.
• n = 90
• Histogram is approximately symmetric.
• Mean = Mode is about 10 BTU
Rule of Thumb: Number of classes around (number of observations)^(1/2)

MATLAB Code (Uses Statistics Toolbox)
  A = importdata('exp01-10.txt');   % read the data file for Example 1.10
  consumption = A.data;
  cint = 1:2:19;                    % class boundaries 1, 3, ..., 19
  histogram(consumption,cint,'Normalization','probability')   % heights are relative frequencies
  set(gca,'xtick',cint)             % put tick marks at the class boundaries
  xlabel('BTUIN')
  ylabel('relative frequency')

Chapter 1 HW (Not Graded)
• Sec 1.2: #11, 17
• Sec 1.3: #33, 35, 37, 39
• Sec 1.4: #45, 47, 49, 51
• Answers are in back of book.
• MATLAB Demo
• Download Data for Book Examples Here: http://www.stt.msu.edu/users/mcubed/ASCII-COMMA.zip

Histogram Shapes
• Histograms come in a variety of shapes. A unimodal histogram is one that rises to a single peak and then declines. A bimodal histogram has two different peaks. A multimodal histogram has more than two peaks.
• Bimodality can occur when the data set consists of observations on two quite different kinds of individuals or objects.
• For example, consider a large data set consisting of driving times for automobiles traveling between San Luis Obispo, California, and Monterey, California (exclusive of stopping time for sightseeing, eating, etc.).
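Returning to the class-interval construction for continuous data: the sketch below, in Python, counts the flexural strength observations of Example 1.2 into classes and applies the square-root rule of thumb. The unit-width boundaries 5-<6, …, 11-<12 are an assumption chosen for readability (they match the stems of the stem and leaf plot), not boundaries taken from the book.

```python
from bisect import bisect_right
from math import sqrt

# Flexural strength data (MPa) from Example 1.2.
strength = [5.9, 6.3, 6.3, 6.5, 6.8, 6.8, 7.0, 7.0, 7.2, 7.3, 7.4, 7.6, 7.7, 7.7,
            7.8, 7.8, 7.9, 8.1, 8.2, 8.7, 9.0, 9.7, 9.7, 10.7, 11.3, 11.6, 11.8]
n = len(strength)

# Rule of thumb: the number of classes is roughly sqrt(n).
suggested_classes = round(sqrt(n))

# Assumed class boundaries 5-<6, 6-<7, ..., 11-<12 (unit width).
boundaries = [float(b) for b in range(5, 13)]
counts = [0] * (len(boundaries) - 1)
for x in strength:
    # bisect_right places a value equal to a boundary in the class on its right.
    counts[bisect_right(boundaries, x) - 1] += 1

print(f"n = {n}, rule of thumb suggests about {suggested_classes} classes")
for lo, hi, count in zip(boundaries, boundaries[1:], counts):
    print(f"{lo:4.1f}-<{hi:4.1f}  freq = {count:2d}  rel. freq = {count / n:.3f}")
```

Note that the two observations equal to 7.0 land in the class 7-<8, as the boundary rule requires. The rule of thumb is only a starting point; here seven classes were used although the rule suggests about five.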
Example 1.12
• Figure 1.11(a) shows a Minitab histogram of the weights (lb) of the 124 players listed on the rosters of the San Francisco 49ers and the New England Patriots (teams the author would like to see meet in the Super Bowl) as of Nov. 20, 2009.
[Figure 1.11(a): NFL player weights, histogram]
• Figure 1.11(b) is a smoothed histogram (actually what is called a density estimate) of the data from the R software package.
[Figure 1.11(b): NFL player weights, smoothed histogram]
• Both the histogram and the smoothed histogram show three distinct peaks; the one on the right is for linemen, the middle peak corresponds to linebacker weights, and the peak on the left is for all other players (wide receivers, quarterbacks, etc.).
• A histogram is symmetric if the left half is a mirror image of the right half. A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail and negatively skewed if the stretching is to the left.
• Figure 1.12 shows “smoothed” histograms, obtained by superimposing a smooth curve on the rectangles, that illustrate the various possibilities.
[Figure 1.12: Smoothed histograms: (a) symmetric unimodal, (b) bimodal, (c) positively skewed, (d) negatively skewed]

1.3 Measures of Location
Copyright © Cengage Learning. All rights reserved.

The Mean
• For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar and useful measure of the center is the mean, or arithmetic average of the set. Because we will almost always think of the $x_i$’s as constituting a sample, we will often refer to the arithmetic average as the sample mean and denote it by $\bar{x}$. That is, $\bar{x} = (x_1 + x_2 + \cdots + x_n)/n = \sum_{i=1}^{n} x_i / n$.
• A physical interpretation of $\bar{x}$ demonstrates how it measures the location (center) of a sample. Think of drawing and scaling a horizontal measurement axis, and then represent each sample observation by a 1-lb weight placed at the corresponding point on the axis.
• The only point at which a fulcrum can be placed to balance the system of weights is the point corresponding to the value of $\bar{x}$ (see Figure 1.14).
• Just as $\bar{x}$ represents the average value of the observations in a sample, the average of all values in the population can be calculated. This average is called the population mean and is denoted by the Greek letter $\mu$. When there are N values in the population (a finite population), then $\mu$ = (sum of the N population values)/N.
• We will give a more general definition for $\mu$ that applies to both finite and (conceptually) infinite populations. Just as $\bar{x}$ is an interesting and important measure of sample location, $\mu$ is an interesting and important (often the most important) characteristic of a population.
• In the chapters on statistical inference, we will present methods based on the sample mean for drawing conclusions about a population mean.
• For example, we might use the sample mean $\bar{x} = 16.36$ computed in Example 1.14 as a point estimate (a single number that is our “best” guess) of $\mu$, the mean crack length for all specimens treated as described.
• The mean suffers from one deficiency that makes it an inappropriate measure of center under some circumstances: Its value can be greatly affected by the presence of even a single outlier (unusually large or small observation).
• For example, if a sample of employees contains nine who earn $50,000 per year and one whose yearly salary is $150,000, the sample mean salary is $60,000; this value certainly does not seem representative of the data.
• In such situations, it is desirable to employ a measure that is less sensitive to outlying values than $\bar{x}$, and we will momentarily propose one.
• However, although $\bar{x}$ does have this potential defect, it is still the most widely used measure, largely because there are many populations for which an extreme outlier in the sample would be highly unlikely.

The Median
• The word median is synonymous with “middle,” and the sample median is indeed the middle value once the observations are ordered from smallest to largest.
• When the observations are denoted by $x_1, \ldots, x_n$, we will use the symbol $\tilde{x}$ to represent the sample median.
• If n is odd, $\tilde{x}$ is the single middle value, i.e. the $((n + 1)/2)$th ordered value. If n is even, $\tilde{x}$ is the average of the two middle values, i.e. the $(n/2)$th and $(n/2 + 1)$th ordered values.

Example 1.15
• People not familiar with classical music might tend to believe that a composer’s instructions for playing a particular piece are so specific that the duration would not depend at all on the performer(s).
• However, there is typically plenty of room for interpretation, and orchestral conductors and musicians take full advantage of this.
• The author went to the Web site ArkivMusic.com and selected a sample of 12 recordings of Beethoven’s Symphony #9 (the “Choral,” a stunningly beautiful work), yielding the following durations (min) listed in increasing order:
• 62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0
• Here is a dotplot of the data:
[Figure 1.16: Dotplot of the data from Example 1.15]
• Since n = 12 is even, the sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values from the ordered list: $\tilde{x} = (66.4 + 67.4)/2 = 66.90$.
• Note that if the largest observation 79.0 had not been included in the sample, the resulting sample median for the n = 11 remaining observations would have been the single middle value 66.4 (the [n + 1]/2 = 6th ordered value, i.e. the 6th value in from either end of the ordered list).
• The sample mean is $\bar{x} = \sum x_i / n = 816.1/12 = 68.01$, a bit more than a full minute larger than the median.
• The mean is pulled out a bit relative to the median because the sample “stretches out” somewhat more on the upper end than on the lower end.
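The odd/even rule for the sample median can be checked on the durations from Example 1.15 with a short Python sketch. The function name `sample_median` and the variable names are my own choices for illustration.

```python
# Durations (min) of 12 recordings of Beethoven's Symphony #9, from Example 1.15.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4, 67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

def sample_median(values):
    """Middle ordered value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

x_tilde = sample_median(durations)        # n = 12: average of 6th and 7th values
x_bar = sum(durations) / len(durations)   # sample mean
print(f"median = {x_tilde:.2f}, mean = {x_bar:.2f}")

# Dropping the largest observation leaves n = 11, so the median is the 6th value.
print(f"median without 79.0 = {sample_median(durations[:-1]):.1f}")
```

Python's zero-based indexing means the 6th and 7th ordered values are `ordered[5]` and `ordered[6]`, which is why the even case uses `mid - 1` and `mid`.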
• The data in Example 1.15 illustrate an important property of $\tilde{x}$ in contrast to $\bar{x}$: The sample median is very insensitive to outliers. If, for example, we increased the two largest $x_i$’s from 75.7 and 79.0 to 85.7 and 89.0, respectively, $\tilde{x}$ would be unaffected.
• Thus, in the treatment of outlying data values, $\bar{x}$ and $\tilde{x}$ are at opposite ends of a spectrum. Both quantities describe where the data is centered, but they will not in general be equal because they focus on different aspects of the sample.
• The population mean $\mu$ and median $\tilde{\mu}$ will not generally be identical. If the population distribution is positively or negatively skewed, as pictured in Figure 1.16, then $\mu \neq \tilde{\mu}$.
[Figure 1.16: Three different shapes for a population distribution: (a) negative skew, (b) symmetric, (c) positive skew]

Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
• The median (population or sample) divides the data set into two parts of equal size. To obtain finer measures of location, we could divide the data into more than two such parts.
• Roughly speaking, quartiles divide the data set into four equal parts, with the observations above the third quartile constituting the upper quarter of the data set, the second quartile being identical to the median, and the first quartile separating the lower quarter from the upper three-quarters.
• Similarly, a data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on.
• Unless the number of observations is a multiple of 100, care must be exercised in obtaining percentiles.
• To paraphrase, the mean involves trimming 0% from each end of the sample, whereas for the median the maximum possible amount is trimmed from each end.
• A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.

Example 1.16
• The production of Bidri is a traditional craft of India. Bidri wares (bowls, vessels, and so on) are cast from an alloy containing primarily zinc along with some copper.
• Consider the following observations on copper content (%) for a sample of Bidri artifacts in London’s Victoria and Albert Museum (“Enigmas of Bidri,” Surface Engr., 2005: 333–339), listed in increasing order:
• 2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
• 3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
• Figure 1.17 is a dotplot of the data. A prominent feature is the single outlier at the upper end; the distribution is somewhat sparser in the region of larger values than is the case for smaller values.
[Figure 1.17: Dotplot of copper contents from Example 1.16]
• The sample mean and median are 3.65 and 3.35, respectively. A trimmed mean with a trimming percentage of 100(2/26) = 7.7% results from eliminating the two smallest and two largest observations; this gives $\bar{x}_{tr(7.7)} = 3.42$.
• Trimming here eliminates the large outlier and so pulls the trimmed mean toward the median.
• A trimmed mean with a moderate trimming percentage (someplace between 5% and 25%) will yield a measure of center that is neither as sensitive to outliers as is the mean nor as insensitive as the median.
• If the desired trimming percentage is $100\alpha\%$ and $n\alpha$ is not an integer, the trimmed mean must be calculated by interpolation. For example, consider $\alpha = .10$ for a 10% trimming percentage and n = 26 as in Example 1.16.
• Then $\bar{x}_{tr(10)}$ would be the appropriate weighted average of the 7.7% trimmed mean calculated there and the 11.5% trimmed mean resulting from trimming three observations from each end.
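A minimal Python sketch of the trimmed mean for the copper data of Example 1.16 follows. The slides say only "appropriate weighted average"; the sketch assumes that this means linear interpolation in the trimming percentage, so with $n\alpha = 2.6$ the weights are 0.4 on the 7.7% trimmed mean and 0.6 on the 11.5% trimmed mean. The function names `trim_k` and `trimmed_mean` are hypothetical.

```python
from math import floor

# Copper content (%) of 26 Bidri artifacts, from Example 1.16.
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

def trim_k(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def trimmed_mean(values, alpha):
    """100*alpha% trimmed mean.

    Assumption: when n*alpha is not an integer, interpolate linearly between
    trimming floor(n*alpha) and floor(n*alpha) + 1 observations from each end.
    """
    position = len(values) * alpha
    k = floor(position)
    weight = position - k
    if weight == 0:
        return trim_k(values, k)
    return (1 - weight) * trim_k(values, k) + weight * trim_k(values, k + 1)

print(f"mean               = {trimmed_mean(copper, 0.0):.2f}")
print(f"7.7% trimmed mean  = {trim_k(copper, 2):.2f}")
print(f"11.5% trimmed mean = {trim_k(copper, 3):.3f}")
print(f"10% trimmed mean   = {trimmed_mean(copper, 0.10):.3f}")
```

Under this interpolation assumption the 10% trimmed mean falls between the 11.5% and 7.7% trimmed means, closer to the 11.5% value because 10% is nearer to 11.5% than to 7.7%.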