Chapter 1:
Looking at Data—Distributions (Part 2)
1.2 Describing Distributions with Numbers
Dr. Nahid Sultana
Objectives
 Measures of center: mean, median
 Mean versus median
 Measures of spread: quartiles, standard deviation
 Five-number summary and boxplot
 Choosing among summary statistics
 Changing the unit of measurement
Measures of center: The Mean
 The most common measure of center is the arithmetic average, or
mean, or sample mean.
 To calculate the average, or mean, add all values, then divide by
the number of individuals.
 It is the “center of mass.”
If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:
$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
or in more compact notation
$\bar{x} = \frac{1}{n} \sum x_i$
Measures of center: The Mean (cont…)
Find the mean:
Here are the scores on the first exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Find the mean first-exam score for these students.
Solution:
$\bar{x} = \frac{80 + 73 + 92 + 85 + 75 + 98 + 93 + 55 + 80 + 90}{10} = \frac{821}{10} = 82.1$
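As a quick check of the arithmetic, the mean of the ten exam scores can be computed in a few lines of Python. This is a minimal sketch; the helper name `mean` is chosen here for illustration and is not part of the slides.

```python
# Ten first-exam scores from the example.
scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

def mean(values):
    """Arithmetic mean: add all values, then divide by how many there are."""
    return sum(values) / len(values)

x_bar = mean(scores)
print(x_bar)  # 82.1
```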
Measuring Center: The Median
 Another common measure of center is the median.
 The median M is the midpoint of a distribution, the number such
that half of the observations are smaller and the other half are larger.
 To find the median of a distribution:
1. Arrange all observations from smallest to largest.
2. If the number of observations n is odd, the median M is the
center observation in the ordered list.
3. If the number of observations n is even, the median M is the
average of the two center observations in the ordered list.
Measuring Center: The Median (cont...)
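Find the median:
Here are the scores on the first exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Find the median first-exam score for these students.
Solution:
In order, the scores are 55, 73, 75, 80, 80, 85, 90, 92, 93, 98.
Since n = 10 is even, the median is the average of the two center
observations: M = (80 + 85)/2 = 82.5.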
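The three-step recipe for the median (sort, then take the center observation or the average of the two center observations) can be sketched in Python as follows. The function name `median` is an illustrative choice; Python's standard library also provides `statistics.median`, which gives the same result here.

```python
def median(values):
    """Median by the rule: sort, then take the middle (odd n) or the
    average of the two middle observations (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(median(scores))  # 82.5
```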
Comparing Mean and Median
 If the distribution is exactly symmetric, the mean and the median are
the same; in a roughly symmetric distribution they are close together.
 In a skewed distribution, the mean is usually farther out in the long
tail than is the median.
 The median is a measure of center that is resistant to skew and
outliers. The mean is not.
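To illustrate resistance, the sketch below changes one exam score to an extreme value and compares both measures of center. The altered value (55 replaced by 5) is a hypothetical recording error invented for this illustration; it does not come from the slides.

```python
from statistics import mean, median

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

# Hypothetical recording error: the lowest score 55 is entered as 5.
with_outlier = [5 if s == 55 else s for s in scores]

print(mean(scores), median(scores))              # 82.1 82.5
print(mean(with_outlier), median(with_outlier))  # 77.1 82.5
```

The mean drops by five points because of a single observation, while the median does not move.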
Measuring Spread: The Quartiles
 A measure of center alone can be misleading.
 A useful numerical description of a distribution requires both a measure of
center and a measure of spread.
 We describe the spread or variability of a distribution by giving several
percentiles.
 The median divides the data in two; half of the observations are above the
median and half are below the median. We could call the median the 50th
percentile.
 The lower quartile is the median of the lower half of the data; the upper
quartile is the median of the upper half of the data.
 With the median, the quartiles divide the data into four equal parts; 25% of
the data are in each part.
Measuring Spread: The Quartiles (Cont.)
Calculate the quartiles and the interquartile range:
1. Arrange the observations in increasing order and locate the
median M.
2. The first quartile Q1 is the median of the lower half of
the data, excluding M.
3. The third quartile Q3 is the median of the upper half of
the data, excluding M.
4. The interquartile range is IQR = Q3 − Q1.
Measuring Spread: The Quartiles (Cont.)
Example: Here are the scores on the first exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Find the quartiles for these first-exam scores.
Solution: In order, the scores are:
55  73  75  80  80  85  90  92  93  98
The median is M = (80 + 85)/2 = 82.5.
Q1 = 75, the median of the first five numbers: 55, 73, 75, 80, 80.
Q3 = 92, the median of the last five numbers: 85, 90, 92, 93, 98.
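The median-of-halves rule used in this example can be sketched in Python. The function name `quartiles` is an illustrative choice. Note that statistical software often uses other quartile conventions (interpolation between observations), so library functions may return slightly different values from this rule on other data sets.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule:
    Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle observation when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the center observation if n is odd
    return median(lower), median(ordered), median(upper)

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
q1, m, q3 = quartiles(scores)
print(q1, m, q3)  # 75 82.5 92
print(q3 - q1)    # 17 (the interquartile range)
```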
The Five-Number Summary
The five-number summary of a distribution consists of
 The smallest observation (Min)
 The first quartile (Q1)
 The median (M)
 The third quartile (Q3)
 The largest observation (Max)
written in order from smallest to largest.
Minimum   Q1   M   Q3   Maximum
Boxplots
A boxplot is a graph of the five-number summary.
 Draw a central box from Q1 to Q3.
 Draw a line inside the box to mark the median M.
 Extend lines from the box out to the minimum and maximum
values that are not outliers.
Boxplots (Cont…)
Example: Here are the scores on the first exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Make a boxplot for these first-exam scores.
Solution:
In order, the scores are:
55, 73, 75, 80, 80, 85, 90, 92, 93, 98
Min = 55
Q1 = 75
M = 82.5
Q3 = 92
Max = 98
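The five numbers a boxplot is drawn from can be computed directly. The sketch below uses the same median-of-halves rule as the quartile example; the function name `five_number_summary` is chosen here for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the center observation if n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(five_number_summary(scores))  # (55, 75, 82.5, 92, 98)
```

The central box runs from the second to the fourth value, the line inside the box sits at the third, and the whiskers extend toward the first and last.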
Boxplots for skewed data
Comparing Boxplots to Histograms
Suspected Outliers: 1.5 × IQR Rule
 Outliers are troublesome data points, and it is important to be able to
identify them.
 The interquartile range IQR is the distance between the first
and third quartiles:
IQR = Q3 − Q1
 IQR is used as part of a rule of thumb for identifying outliers.
The 1.5 × IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 × IQR above
the third quartile or below the first quartile.
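As a sketch of how the rule works in practice, the code below applies it to the ten exam scores from the earlier examples, using Q1 = 75 and Q3 = 92. The function name `suspected_outliers` is hypothetical.

```python
def suspected_outliers(values, q1, q3):
    """Flag observations more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = [v for v in values if v < low_fence or v > high_fence]
    return low_fence, high_fence, flagged

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
low, high, flagged = suspected_outliers(scores, q1=75, q3=92)
print(low, high)  # 49.5 117.5
print(flagged)    # []
```

Here IQR = 17, so 1.5 × IQR = 25.5. The lowest score, 55, is above the lower cutoff of 49.5, so no score is flagged as a suspected outlier.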
Suspected Outliers: 1.5 × IQR Rule (Cont..)
Individual #25 has a value of 7.9 years, which is 3.55 years
above the third quartile. This is more than 1.5 × IQR = 3.225
years. Thus, individual #25 is a suspected outlier.
Suspected Outliers: 1.5 × IQR Rule (Cont..)
 Modified boxplots plot suspected outliers individually.
 The 8 largest call lengths are
438, 465, 479, 700, 700, 951, 1148, 2631
 They are plotted as individual points, though 2 of them are
identical and so do not appear separately.
Measuring Spread: The Standard Deviation
The most common measure of spread looks at how far each observation
is from the mean. This measure is called the standard deviation.
 The standard deviation s measures the average distance of the
observations from their mean.
 It is calculated by finding an average of the squared distances and then
taking the square root.
 This average squared distance is called the variance.
Calculating The Standard Deviation
1. Calculate the mean.
2. Calculate each deviation: deviation = observation − mean.
3. Square each deviation.
4. Calculate the sum of the squared deviations.
5. Divide by the degrees of freedom, df = n − 1. The result is the variance.
6. Calculate the square root of the variance. This is the standard deviation.
Example with n = 9 observations:
xi        (xi − mean)     (xi − mean)²
1         1 − 5 = −4      (−4)² = 16
3         3 − 5 = −2      (−2)² = 4
4         4 − 5 = −1      (−1)² = 1
4         4 − 5 = −1      (−1)² = 1
4         4 − 5 = −1      (−1)² = 1
5         5 − 5 = 0       (0)² = 0
7         7 − 5 = 2       (2)² = 4
8         8 − 5 = 3       (3)² = 9
9         9 − 5 = 4       (4)² = 16
Mean = 5  Sum = 0         Sum = 52
The variance = 52/(9 − 1) = 6.5
Standard deviation = √6.5 ≈ 2.55
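The six steps can be followed line by line in Python for the nine observations in the table. This is a minimal sketch; the variable names are chosen here for illustration. Python's `statistics.stdev` uses the same n − 1 divisor and can serve as a cross-check.

```python
import math

data = [1, 3, 4, 4, 4, 5, 7, 8, 9]
n = len(data)

mean = sum(data) / n                           # step 1: mean = 5.0
deviations = [x - mean for x in data]          # step 2: deviations (they sum to 0)
squared = [d ** 2 for d in deviations]         # step 3: squared deviations
sum_squared = sum(squared)                     # step 4: sum = 52.0
variance = sum_squared / (n - 1)               # step 5: 52 / 8 = 6.5
std_dev = math.sqrt(variance)                  # step 6: square root of the variance

print(mean, sum_squared, variance, round(std_dev, 2))  # 5.0 52.0 6.5 2.55
```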
Properties of The Standard Deviation
 s measures spread about the mean and should be used only when
the mean is the measure of center.
 s = 0 only when all observations have the same value and there
is no spread. Otherwise, s > 0.
 s is not resistant to outliers.
 s has the same units of measurement as the original observations.
Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
 Mean and Standard Deviation
 Median and Interquartile Range
Choosing Measures of Center and Spread
 The median and IQR are usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with
outliers.
 Use mean and standard deviation only for reasonably symmetric
distributions that don’t have outliers.
 NOTE: Numerical summaries do not fully describe the shape
of a distribution. ALWAYS PLOT YOUR DATA!
Changing the Unit of Measurement
 Variables can be recorded in different units of measurement.
 Most often, one measurement unit is a linear transformation of
another measurement unit: xnew = a + bx.
Example 1: If a distance x is measured in kilometers, the same
distance in miles is xnew = 0.62x.
This transformation changes the units without changing the
origin: a distance of 0 kilometers is the same as a distance of 0 miles.
Example 2: A temperature x measured in degrees Fahrenheit can be
expressed in degrees Celsius by the transformation
xnew = (5/9)(x − 32) = −160/9 + (5/9)x.
This transformation changes both the unit size and the origin of
the measurements: the origin in the Celsius scale (0 °C, the
temperature at which water freezes) is 32 °F in the Fahrenheit scale.
Changing the Unit of Measurement (Cont…)
 Linear transformations do not change the basic shape of a
distribution (skew, symmetry, multimodal).
 But they do change the measures of center and spread:
 Multiplying each observation by a positive number b multiplies
both measures of center (mean, median) and spread (IQR, s) by b.
 Adding the same number a (positive or negative) to each
observation adds a to measures of center and to quartiles but it
does not change measures of spread (IQR, s).
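These effects can be checked numerically with the Fahrenheit-to-Celsius transformation, which has a = −160/9 and b = 5/9. The five temperatures below are hypothetical values made up for this sketch, and the helper name `linear_transform` is an illustrative choice.

```python
from statistics import mean, median, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]

# Hypothetical temperatures in degrees Fahrenheit (not from the slides).
temps_f = [50.0, 59.0, 68.0, 77.0, 86.0]
a, b = -160 / 9, 5 / 9
temps_c = linear_transform(temps_f, a, b)

# Measures of center are multiplied by b and shifted by a.
print(mean(temps_f), round(mean(temps_c), 6))      # 68.0 20.0
print(median(temps_f), round(median(temps_c), 6))  # 68.0 20.0

# The standard deviation is multiplied by b; the shift a has no effect on it.
print(round(stdev(temps_f), 3), round(stdev(temps_c), 3))  # 14.23 7.906
```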