STT 351: Sec. 003
Instructor: James F. Kelly
Office: C434 Wells Hall
Course Structure
• Text: Probability and Statistics for Engineering and the Sciences,
Custom 8th Edition
• Rough Schedule:
1. Descriptive Statistics (Week 1)
EASY!
2. Probability, Part I (Weeks 2-4)
Exam 1 (Chapters 1, 2, and beginning of 3)
3. Probability, Part II (Weeks 5-10)
Exam 2 (Chapters 3, 4, and part of Chapter 5)
4. Inferential Statistics (Weeks 11-15)
TOUGH!
Exam 3 (Some of Chapter 6, Chapters 7, 8)
5. Regression (if we have time)
Homework and Assignments
• I will assign a list of odd-numbered HW problems. Not Graded. Do as
many as possible for practice.
• There is a Student Solutions Manual that details all the steps for the
answers to the odd numbered problems.
• Ask questions in class or office hours.
• Stats Help Room (A102 Wells Hall)
• We will have FIVE (5) graded assignments.
• You may work in groups on graded assignments, but you must write
up your own solutions. Please SHOW ALL WORK.
• Most Mid-Term and Final Exam Problems will be similar to HW and
Assignment problems.
Software/Computing
• Some of the HW problems and assignments require software. You
can use whatever tool you like.
• MATLAB: I will give a short MATLAB demo next week. Many
probability distributions and stats algorithms are available. MATLAB
is available at MSU computer labs and in the Engineering Labs.
• MINITAB: Graphical Stats software. Book includes MINITAB examples.
• R: Scripting language with lots of stats routines built in.
• Other: Python, C++, FORTRAN.
Chapter 1: Descriptive
Statistics
Deterministic Models
• In previous math classes, you have modeled science
problems with calculus and differential equations.
• Example: Wind Resistance on a Vehicle
• $m \frac{dv}{dt} = F = -C v^2$
• $\frac{dv}{dt} + \frac{C}{m} v^2 = 0, \quad v(0) = v_0$
where the constant C depends on the shape of the vehicle.
Streamlines simulated by PowerFLOW (Exa Corp.)
Deterministic model which may be solved to give the velocity of the
car at any future time. NO RANDOMNESS.
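To make the "no randomness" point concrete, here is a minimal Python sketch. Separating variables in dv/dt = -(C/m) v^2 with v(0) = v0 gives v(t) = v0 / (1 + (C/m) v0 t). The numerical values of v0, C, and m below are hypothetical and chosen only for illustration; they are not from the lecture.

```python
def velocity(t, v0, c, m):
    """Closed-form solution of dv/dt = -(c/m) * v**2 with v(0) = v0."""
    return v0 / (1.0 + (c / m) * v0 * t)


# Hypothetical values, for illustration only (not from the lecture).
V0 = 30.0    # initial speed in m/s
C = 0.4      # drag constant in kg/m
M = 1200.0   # vehicle mass in kg

# The same inputs always give the same velocity: the model is deterministic.
for t in (0.0, 10.0, 60.0):
    print(f"t = {t:5.1f} s   v = {velocity(t, V0, C, M):6.2f} m/s")
```

If C were only known approximately, each plausible value of C would give a different curve, which is the uncertainty discussed on the next slide.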
Statistical (Stochastic) Models
• What if material parameters are not known exactly (uncertainty)?
• We can still solve the model, but the predicted velocity will contain error!
• By the way, this model (drag equation) is only valid for high velocity
(technically, high Reynolds number). Hence, there is model uncertainty as
well as data uncertainty.
• Statistics teaches us how to make intelligent judgments in the presence of
uncertainty and variation.
Terminology
• Population: Collection of objects under study. Typically very large.
• Sample: Subset of the population we perform an experiment on (observe).
• Variable: Any characteristic whose value may change from one object to another
in a population.
• Univariate: Data that consists of observations of a single variable.
• Multivariate: Data that consists of observations of more than one variable.
Given univariate or multivariate data from a sample, we wish to either describe a
population (descriptive statistics) or draw some conclusion about the population
(inferential statistics).
Unless the sample is identical with the population, there is uncertainty in any
conclusion we draw. We need to quantify this uncertainty.
Example 1.2: Flexural Strength of Concrete
• Population: All batches of
concrete.
• Sample: n = 27 measurements.
• Variable: Flexural Strength
(MPa). Univariate.
• Sample Mean: 8.14 MPa
• What can we say about the
population mean?
• In Chapter 7, we’ll discuss
confidence intervals. With 95%
confidence, the population
mean is between 7.48 MPa and
8.80 MPa.
Flexural strength data (MPa), n = 27, listed in increasing order:
5.9 6.3 6.3 6.5 6.8 6.8 7.0 7.0 7.2
7.3 7.4 7.6 7.7 7.7 7.8 7.8 7.9 8.1
8.2 8.7 9.0 9.7 9.7 10.7 11.3 11.6 11.8
Visualizing Data: “Stem and Leaf” and
“Histograms”
Stem and Leaf Plot
Stem: Digits to the left of the decimal point
Leaf: Tenths digit
 5 | 9
 6 | 33588
 7 | 00234677889
 8 | 127
 9 | 077
10 | 7
11 | 368
Flexural Strength (MPa)
Sec. 1.2: Pictorial and Tabular Methods
1. Stem-and-Leaf Displays (Table)
2. Dotplots (Graph)
3. Histograms (Graph)
Stem and Leaf
Consider a sample of size n where each observation consists of at least two
digits. A quick summary is a stem and leaf plot.
1. Select one or more leading digits for stem values. Trailing digits are
leaves.
2. List possible stem values in a vertical column.
3. Record the leaf for each observation beside the stem. Indicate units.
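The three steps above can be sketched in Python. This is a minimal illustration, not the script used in class: the function name `stem_and_leaf` and its `leaf_unit` parameter are my own, and the data are the 27 flexural strength values from Example 1.2.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=0.1):
    """Return the display as a list of lines; leaf = last digit at leaf_unit."""
    groups = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)       # e.g. 7.3 -> 73 when leaf_unit = 0.1
        stem, leaf = divmod(scaled, 10)     # step 1: split leading/trailing digits
        groups[stem].append(leaf)
    lines = []
    # Step 2: list every possible stem, including stems with no leaves.
    for stem in range(min(groups), max(groups) + 1):
        # Step 3: record the leaves beside the stem.
        leaves = "".join(str(d) for d in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Flexural strength (MPa) from Example 1.2.
strength = [5.9, 6.3, 6.3, 6.5, 6.8, 6.8, 7.0, 7.0, 7.2,
            7.3, 7.4, 7.6, 7.7, 7.7, 7.8, 7.8, 7.9, 8.1,
            8.2, 8.7, 9.0, 9.7, 9.7, 10.7, 11.3, 11.6, 11.8]

print("Stem: ones, Leaf: tenths (MPa)")
print("\n".join(stem_and_leaf(strength)))
```

The printed rows reproduce the flexural strength display shown earlier.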
Example: Temperature Data
Average temperature over 51 days
2|899
3|0012223333556667777888899
4|011122334455566667
5|0014
6|9
Stem: Tens digit
Leaf: Ones digit
About 50% of the days had average temperature in the 30’s.
One outlier: 69 degrees.
Dotplot
• Each observation is represented by a dot above corresponding
location.
• Dots are stacked vertically for repeated data.
• Gives info about location, spread, extremes, and gaps.
MATLAB Code
I will post MATLAB scripts on the website
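% Load the temperature data
A = importdata('exp01-08.txt');
temp = A.data;
% stemleafplot and dotplot do not appear to be built-in MATLAB functions;
% they are assumed to be among the scripts posted on the course website.
stemleafplot(temp,0)   % stem-and-leaf display
dotplot(temp)          % dotplot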
Discrete vs. Continuous Data
• Discrete: Set of possible values is finite or can be listed as an infinite
sequence. Data that is counted.
Example: Number of hits by a baseball team in a game.
• Continuous: Set of possible values consists of an entire interval on
the number line.
Example: pH of chemical substance (real number between 0.0 and
14.0).
Histograms: Discrete Data
• Frequency: Number of times any particular value occurs in a data set.
• Relative Frequency: Fraction (or proportion) of times the value
occurs.
Relative frequency = Frequency / number of observations
Constructing a Histogram for Discrete Data
1. Determine frequency and relative frequency for each x value.
2. Mark possible x values on horizontal scale.
3. Above each value, create a rectangle whose height is relative
frequency (or frequency) of that value.
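Step 1 of the construction can be sketched in Python as follows. The hit counts below are hypothetical values made up for illustration (they are not the Example 1.9 data), and `frequency_table` is an illustrative name.

```python
from collections import Counter


def frequency_table(observations):
    """Map each distinct value to (frequency, relative frequency)."""
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x], counts[x] / n) for x in sorted(counts)}


# Hypothetical number of hits in ten games (illustration only).
hits = [7, 9, 8, 10, 7, 6, 9, 8, 8, 12]

for x, (freq, rel) in frequency_table(hits).items():
    # A crude text histogram: one '#' per occurrence.
    print(f"{x:>2}  freq = {freq}  rel. freq = {rel:.2f}  {'#' * freq}")
```

The relative frequencies always add up to 1, which is a quick check on the table.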
Example 1.9: Hits in 9-Inning Baseball Games
Distribution is unimodal (single peak) and positively skewed (right tail is
stretched compared with left tail)
Histograms: Continuous Data
• Class Interval: Subdivide horizontal axis into intervals.
Example: miles per gallon (mpg) for autos. Mpg is measured, hence a
continuous variable. Construct class boundaries:
27.5-<28.0, 28.0-<28.5, 28.5-<29.0, …, 31.0-<31.5
An observation that falls on a boundary is placed in the interval to the right of that boundary.
Constructing a Histogram for Continuous Data
1. Determine frequency/relative frequency.
2. Mark class boundaries on the horizontal axis.
3. Above each class interval, draw a rectangle with height
corresponding to frequency/relative frequency.
Example 1.10: Energy consumption in gas-heated homes.
• N = 90
• Histogram is
approximately
symmetric.
• Mean=Mode is about
10 BTU
Rule of Thumb: Number of classes ≈ $\sqrt{\text{number of observations}}$
MATLAB Code (Uses Statistics Toolbox)
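% Load the energy-consumption data
A = importdata('exp01-10.txt');
consumption = A.data;
% Class boundaries 1, 3, ..., 19 give nine classes of width 2
cint = 1:2:19;
% Bar heights are relative frequencies
histogram(consumption,cint,'Normalization','probability')
set(gca,'xtick',cint)   % put a tick at each class boundary
xlabel('BTUIN')
ylabel('relative frequency')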
Chapter 1 HW (Not Graded)
• Sec 1.2: #11, 17
• Sec 1.3: #33, 35, 37, 39
• Sec 1.4: #45, 47, 49, 51
• Answers are in back of book.
• MATLAB Demo
• Download Data for Book Examples Here:
http://www.stt.msu.edu/users/mcubed/ASCII-COMMA.zip
Histogram Shapes
• Histograms come in a variety of shapes. A unimodal histogram is one that
rises to a single peak and then declines. A bimodal histogram has two
different peaks. A multimodal histogram has more than two peaks.
• Bimodality can occur when the data set consists of observations on two
quite different kinds of individuals or objects.
• For example, consider a large data set consisting of driving times for
automobiles traveling between San Luis Obispo, California, and Monterey,
California (exclusive of stopping time for sightseeing, eating, etc.).
Example 1.12
• Figure 1.11(a) shows a Minitab histogram of the weights (lb) of the
124 players listed on the rosters of the San Francisco 49ers and the
New England Patriots (teams the author would like to see meet in the
Super Bowl) as of Nov. 20, 2009.
[Figure 1.11(a): NFL player weights, histogram]
Example 1.12
cont’d
• Figure 1.11(b) is a smoothed histogram (actually what is called a
density estimate) of the data from the R software package.
[Figure 1.11(b): NFL player weights, smoothed histogram]
Example 1.12
cont’d
• Both the histogram and the smoothed histogram show three distinct
peaks; the one on the right is for linemen, the middle peak
corresponds to linebacker weights, and the peak on the left is for all
other players (wide receivers,
quarterbacks, etc.).
• A histogram is symmetric if the left half is a mirror image of the right
half. A unimodal histogram is positively skewed if the right or upper
tail is stretched out compared with the left or lower tail and
negatively skewed if the stretching is to the left.
Example 1.12
cont’d
• Figure 1.12 shows “smoothed” histograms, obtained by
superimposing a smooth curve on the rectangles, that illustrate the
various possibilities.
[Figure 1.12: Smoothed histograms: (a) symmetric unimodal; (b) bimodal; (c) positively skewed; (d) negatively skewed]
Sec. 1.3: Measures of Location
Copyright © Cengage Learning. All rights reserved.
The Mean
• For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar and useful
measure of the center is the mean, or arithmetic average of the set.
Because we will almost always think of the $x_i$’s as constituting a
sample, we will often refer to the arithmetic average as the sample
mean and denote it by $\bar{x}$:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The Mean
• A physical interpretation of $\bar{x}$ demonstrates how it measures the
location (center) of a sample. Think of drawing and scaling a
horizontal measurement axis, and then represent each sample
observation by a 1-lb weight placed at the corresponding point on the
axis.
• The only point at which a fulcrum can be placed to balance the
system of weights is the point corresponding to the value of $\bar{x}$ (see
Figure 1.14).
The Mean
• Just as $\bar{x}$ represents the average value of the observations in a sample,
the average of all values in the population can be calculated. This
average is called the population mean and is denoted by the Greek
letter $\mu$. When there are N values in the population (a finite
population), then
$\mu$ = (sum of the N population values)/N.
• We will give a more general definition for $\mu$ that applies to both finite
and (conceptually) infinite populations. Just as $\bar{x}$ is an interesting and
important measure of sample location, $\mu$ is an interesting and
important (often the most important) characteristic of a population.
The Mean
• In the chapters on statistical inference, we will present methods
based on the sample mean for drawing conclusions about a
population mean.
• For example, we might use the sample mean $\bar{x} = 16.36$ computed in
Example 1.14 as a point estimate (a single number that is our “best”
guess) of $\mu$, the true average crack length for all specimens treated as described.
The Mean
• The mean suffers from one deficiency that makes it an inappropriate
measure of center under some circumstances: Its value can be greatly
affected by the presence of even a single outlier (unusually large or
small observation).
• For example, if a sample of employees contains nine who earn
$50,000 per year and one whose yearly salary is $150,000, the
sample mean salary is $60,000; this value certainly does not seem
representative of the data.
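The salary example is easy to check with Python's standard `statistics` module. This is only a quick sketch of the arithmetic stated above.

```python
from statistics import mean

# Nine employees at $50,000 and one at $150,000, as in the example above.
salaries = [50_000] * 9 + [150_000]

with_outlier = mean(salaries)
without_outlier = mean(salaries[:9])
below_mean = sum(s < with_outlier for s in salaries)

print("mean with the outlier:   ", with_outlier)
print("mean without the outlier:", without_outlier)
print("employees earning less than the mean:", below_mean, "of", len(salaries))
```

A single large value moves the mean above what nine of the ten employees actually earn.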
The Mean
• In such situations, it is desirable to employ a measure that is less
sensitive to outlying values than $\bar{x}$, and we will momentarily propose
one.
• However, although $\bar{x}$ does have this potential defect, it is still the most
widely used measure, largely because there are many populations for
which an extreme outlier in the sample would be highly unlikely.
The Median
• The word median is synonymous with “middle,” and the sample
median is indeed the middle value once the observations are ordered
from smallest to largest.
• When the observations are denoted by $x_1, \ldots, x_n$, we will use the
symbol $\tilde{x}$ to represent the sample median.
The Median
Example 1.15
• People not familiar with classical music might tend to believe that a
composer’s instructions for playing a particular piece are so specific
that the duration would not depend at all on the performer(s).
• However, there is typically plenty of room for interpretation, and
orchestral conductors and musicians take full advantage of this.
Example 1.15
cont’d
• The author went to the Web site ArkivMusic.com and selected a
sample of 12 recordings of Beethoven’s Symphony #9 (the “Choral,” a
stunningly beautiful work), yielding the following durations (min)
listed in increasing order:
• 62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0
• Here is a dotplot of the data:
[Figure 1.16: Dotplot of the data from Example 1.15]
Example 1.15
cont’d
• Since n = 12 is even, the sample median is the average of the n/2 = 6th
and (n/2 + 1) = 7th values from the ordered list:
$\tilde{x} = \frac{66.4 + 67.4}{2} = 66.90$
• Note that if the largest observation 79.0 had not been included in the
sample, the resulting sample median for the n = 11 remaining
observations would have been the single middle value 66.4 (the [n +
1]/2 = 6th ordered value, i.e. the 6th value in from either end of the
ordered list).
Example 1.15
cont’d
• The sample mean is $\bar{x} = \sum x_i / n = 816.1/12 = 68.01$, a bit more than a full
minute larger than the median.
• The mean is pulled out a bit relative to the median because the
sample “stretches out” somewhat more on the upper end than on
the lower end.
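The even/odd rule for the median can be sketched in a few lines of Python, using the 12 durations from Example 1.15. The function name `sample_median` is illustrative; Python's built-in `statistics.median` follows the same rule.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n + 1)/2-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of n/2-th and (n/2 + 1)-th


# Durations (min) of 12 recordings of Beethoven's Symphony #9, Example 1.15.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

print("median, n = 12:", round(sample_median(durations), 2))
print("median, n = 11:", sample_median(durations[:-1]))   # 79.0 left out
print("mean,   n = 12:", round(sum(durations) / len(durations), 2))
```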
The Median
• The data in Example 1.15 illustrates an important property of $\tilde{x}$ in
contrast to $\bar{x}$: The sample median is very insensitive to outliers. If, for
example, we increased the two largest $x_i$’s from 75.7 and 79.0 to 85.7
and 89.0, respectively, $\tilde{x}$ would be unaffected.
• Thus, in the treatment of outlying data values, $\bar{x}$ and $\tilde{x}$ are at
opposite ends of a spectrum. Both quantities describe where the data
is centered, but they will not in general be equal because they focus
on different aspects of the sample.
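A short Python check of this insensitivity, using the Example 1.15 durations and the modification described above (the two largest values raised by 10 minutes each):

```python
from statistics import mean, median

original = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
            67.4, 68.4, 68.8, 70.8, 75.7, 79.0]
# Replace the two largest values, 75.7 and 79.0, with 85.7 and 89.0.
modified = original[:-2] + [85.7, 89.0]

median_shift = median(modified) - median(original)
mean_shift = mean(modified) - mean(original)

print("change in median:", median_shift)
print("change in mean:  ", round(mean_shift, 2))
```

The median does not move at all, while the mean shifts by 20/12 minutes, because the extra 20 minutes is spread over all 12 observations.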
The Median
The population mean and median will not generally be identical. If the
population distribution is positively or negatively skewed, as pictured in
Figure 1.16, then $\mu \neq \tilde{\mu}$, where $\tilde{\mu}$ denotes the population median.
[Figure 1.16: Three different shapes for a population distribution: (a) negative skew; (b) symmetric; (c) positive skew]
Other Measures of Location: Quartiles,
Percentiles, and Trimmed Means
• The median (population or sample) divides the data set into two parts
of equal size. To obtain finer measures of location, we could divide
the data into more than two such parts.
• Roughly speaking, quartiles divide the data set into four equal parts,
with the observations above the third quartile constituting the upper
quarter of the data set, the second quartile being identical to the
median, and the first quartile separating the lower quarter from the
upper three-quarters.
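A rough Python sketch of quartiles, using the Example 1.15 durations. Software packages use several slightly different conventions for quartiles; the call below uses the default ("exclusive") method of `statistics.quantiles`, which is an assumption on my part and may differ slightly from the textbook's definition.

```python
from statistics import median, quantiles

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

# Three cut points split the ordered data into four parts.
# Assumption: Python's default "exclusive" method; other conventions exist.
q1, q2, q3 = quantiles(durations, n=4)

print("first quartile: ", round(q1, 2))
print("second quartile:", round(q2, 2), "(the median)")
print("third quartile: ", round(q3, 2))
print("observations below the first quartile:", sum(x < q1 for x in durations))
```

With n = 12, three observations fall below the first quartile and three above the third, so each quarter holds a quarter of the data.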
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
• Similarly, a data set (sample or population) can be even more finely
divided using percentiles; the 99th percentile separates the highest
1% from the bottom 99%, and so on.
• Unless the number of observations is a multiple of 100, care must be
exercised in obtaining percentiles.
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
• To paraphrase, the mean involves trimming 0% from each end of the
sample, whereas for the median the maximum possible amount is
trimmed from each end.
• A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$. A 10% trimmed
mean, for example, would be computed by eliminating the smallest 10% and
the largest 10% of the sample and then averaging what remains.
Example 1.16
• The production of Bidri is a traditional craft of India. Bidri wares (bowls,
vessels, and so on) are cast from an alloy containing primarily zinc along
with some copper.
• Consider the following observations on copper content (%) for a sample of
Bidri artifacts in London’s Victoria and Albert Museum (“Enigmas of Bidri,”
Surface Engr., 2005: 333–339), listed in increasing order:
• 2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
• 3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
Example 1.16
cont’d
• Figure 1.17 is a dotplot of the data. A prominent feature is the single
outlier at the upper end; the distribution is somewhat sparser in the
region of larger values than is the case for smaller values.
Dotplot of copper contents from Example 1.16
Figure 1.17
Example 1.16
cont’d
• The sample mean and median are 3.65 and 3.35, respectively. A
trimmed mean with a trimming percentage of 100(2/26) = 7.7%
results from eliminating the two smallest and two largest
observations; this gives $\bar{x}_{tr(7.7)} = 3.42$.
• Trimming here eliminates the larger outlier and so pulls the trimmed
mean toward the median.
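The trimming in Example 1.16 can be reproduced with a short Python sketch. The function name `trimmed_mean` and its argument `k` (the number of observations dropped from each end) are illustrative choices, not notation from the text.

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Drop the k smallest and k largest observations, then average the rest."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large: nothing would remain")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Copper content (%) of the 26 Bidri artifacts in Example 1.16.
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

print("mean:               ", round(mean(copper), 2))
print("median:             ", round(median(copper), 2))
print("trimming percentage:", round(100 * 2 / len(copper), 1))
print("trimmed mean (k=2): ", round(trimmed_mean(copper, 2), 2))
```

Dropping two observations from each end removes the 10.1 outlier, which is why the trimmed mean lands between the median and the mean.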
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
• A trimmed mean with a moderate trimming
percentage—someplace between 5% and 25%—will yield a measure
of center that is neither as sensitive to outliers as is the mean nor as
insensitive as the median.
• If the desired trimming percentage is 100$\alpha$% and n$\alpha$ is not an integer,
the trimmed mean must be calculated by interpolation. For example,
consider $\alpha$ = .10 for a 10% trimming percentage and n = 26 as in
Example 1.16.
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
• Then $\bar{x}_{tr(10)}$ would be the appropriate weighted average of the 7.7%
trimmed mean calculated there and the 11.5% trimmed mean
resulting from trimming three observations from each end.
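One way to carry out this interpolation is sketched below in Python. The slide says only "appropriate weighted average"; the sketch assumes the weights come from linear interpolation in the trimming percentage, which is my assumption. The function names are illustrative.

```python
def trimmed_mean(values, k):
    """Drop the k smallest and k largest observations, then average the rest."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def interpolated_trimmed_mean(values, alpha):
    """Trimmed mean for proportion alpha per end, interpolating if n*alpha is not an integer."""
    exact = len(values) * alpha        # e.g. 26 * 0.10 = 2.6 observations per end
    k_low = int(exact)                 # trim 2 per end: 100(2/26) = 7.7%
    weight_high = exact - k_low        # assumed linear weight on the heavier trim
    if weight_high == 0:
        return trimmed_mean(values, k_low)
    low = trimmed_mean(values, k_low)
    high = trimmed_mean(values, k_low + 1)   # trim 3 per end: 100(3/26) = 11.5%
    return (1 - weight_high) * low + weight_high * high


# Copper content (%) of the 26 Bidri artifacts in Example 1.16.
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

print("7.7% trimmed mean: ", round(trimmed_mean(copper, 2), 3))
print("11.5% trimmed mean:", round(trimmed_mean(copper, 3), 3))
print("10% trimmed mean:  ", round(interpolated_trimmed_mean(copper, 0.10), 3))
```

Since 10% is closer to 11.5% than to 7.7%, the heavier trim gets the larger weight (0.6 versus 0.4 under this assumption).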