CHAPTER 2
Descriptive Statistics
We begin with the set of data below, with the measurements indicating total
protein, measured in µg/ml. This is an example of raw data.
 76.33   77.63  149.49   54.38   55.47   51.70
 78.15   85.40   41.98   69.91  128.40   88.17
 58.50   84.70   44.40   57.73   88.78   86.24
 54.07   95.06  114.79   53.07   72.30   59.36
 59.20   67.10  109.30   82.60   62.80   61.90
 74.78   77.40   57.90   91.47   71.50   61.70
106.00   61.10   63.96   54.41   83.82   79.55
153.56   70.17   55.05  100.36   51.16   72.10
 62.32   73.53   47.23   35.90   72.20   66.60
 59.76   95.33   73.50   62.20   67.20   44.73
 57.68
This is a sample of size n = 61. What can we tell about this data in its current
form?
Not much, actually.
Ordered Array – data arranged from smallest to largest (usually).
So we arrange our data into an ordered array.
 35.90   41.98   44.40   44.73   47.23   51.16
 51.70   53.07   54.07   54.38   54.41   55.05
 55.47   57.68   57.73   57.90   58.50   59.20
 59.36   59.76   61.10   61.70   61.90   62.20
 62.32   62.80   63.96   66.60   67.10   67.20
 69.91   70.17   71.50   72.10   72.20   72.30
 73.50   73.53   74.78   76.33   77.40   77.63
 78.15   79.55   82.60   83.82   84.70   85.40
 86.24   88.17   88.78   91.47   95.06   95.33
100.36  106.00  109.30  114.79  128.40  149.49
153.56
Now what can we say about our data?
– the minimum value is 35.90 and the maximum is 153.56.
– the middle of the data is in the 60’s or 70’s.
Even ordering this data does not give us a good picture of what is happening.
Grouped Data – the Frequency Distribution
We need to select a set of contiguous, non-overlapping intervals such that each
value in the set of observations can be placed in exactly one interval, referred
to as the class intervals. Generally,
5 ≤ k ≤ 15, where k is the number of intervals.
One can use Sturges' rule as a guide:
$$k = 1 + 3.322 \log_{10} n,$$
where n is the number of observations. In our example, n = 61, giving us
$$k = 1 + 3.322 \log_{10} 61 \approx 6.93.$$
So, rounding off, k = 7. From our data, the range is
$$R = \text{maximum} - \text{minimum} = 153.56 - 35.90 = 117.66,$$
so
$$\text{interval width} = w = \frac{R}{k} = \frac{117.66}{7} \approx 16.81.$$
Now, when the nature of the data makes them appropriate, class interval widths
of 5, 10, or multiples of 10 units make the summarization more comprehensible.
Here we will choose class intervals of 20 µg/ml, with the first interval beginning
at 30. We will label the intervals by their midpoints.
Class intervals    Frequency   Midpoint
 30 ≤ x <  50          5           40
 50 ≤ x <  70         26           60
 70 ≤ x <  90         20           80
 90 ≤ x < 110          6          100
110 ≤ x < 130          2          120
130 ≤ x < 150          1          140
150 ≤ x < 170          1          160
                      ——
                      61
For computations with grouped data, each element in an interval is given the
value of the midpoint of the interval. Thus each of the 26 values in the interval
50 ≤ x < 70 is treated as though it is 60.
Note. A value falling on the interval boundary is placed in the higher valued
interval (to the right on a number line).
Although we can now see where the majority of the data lies and how it is
spread (and graphs add to this), the data items lose their individual values to
the midpoint value of the interval in which they lie.
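A minimal Python sketch of this grouping (the figures in these notes come from Maple; Python is used here only for illustration, and the names protein, start, width, and freq are arbitrary). The floor division places a value that falls on a boundary into the higher-valued interval, as in the note above.

```python
import math

# The 61 raw total-protein measurements (µg/ml) listed at the start of the chapter.
protein = [
     76.33,  77.63, 149.49,  54.38,  55.47,  51.70,
     78.15,  85.40,  41.98,  69.91, 128.40,  88.17,
     58.50,  84.70,  44.40,  57.73,  88.78,  86.24,
     54.07,  95.06, 114.79,  53.07,  72.30,  59.36,
     59.20,  67.10, 109.30,  82.60,  62.80,  61.90,
     74.78,  77.40,  57.90,  91.47,  71.50,  61.70,
    106.00,  61.10,  63.96,  54.41,  83.82,  79.55,
    153.56,  70.17,  55.05, 100.36,  51.16,  72.10,
     62.32,  73.53,  47.23,  35.90,  72.20,  66.60,
     59.76,  95.33,  73.50,  62.20,  67.20,  44.73,
     57.68,
]

n = len(protein)                                  # 61
k = 1 + 3.322 * math.log10(n)                     # Sturges' rule: about 6.93
R = max(protein) - min(protein)                   # range: 153.56 - 35.90 = 117.66
print(f"Sturges' k = {k:.2f}, suggested width = {R / round(k):.2f}")

# As in the notes, use the more convenient width of 20 starting at 30.
start, width, n_classes = 30, 20, 7
freq = [0] * n_classes
for x in protein:
    freq[int((x - start) // width)] += 1          # 30 <= x < 50 -> class 0, etc.

for i, f in enumerate(freq):
    lo = start + i * width
    print(f"{lo:3d} <= x < {lo + width:3d}   midpoint {lo + width // 2:3d}   frequency {f:2d}")
# Expected frequencies: 5, 26, 20, 6, 2, 1, 1 (total 61).
```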
Relative Frequencies — the proportion of values falling into a class interval.
We divide the number of values in each category by the total number of values.
There are times when we will interpret the relative frequencies as the probability
of occurrence within a given interval, called the experimental probability or the
empirical probability. In the following table we also incorporate cumulative
frequencies and cumulative relative frequencies.
                                         Cumulative    Relative    Cumulative Relative
Class intervals   Midpoint   Frequency   Frequency     Frequency       Frequency
 30 ≤ x <  50        40          5            5          .0820           .0820
 50 ≤ x <  70        60         26           31          .4262           .5082
 70 ≤ x <  90        80         20           51          .3279           .8361
 90 ≤ x < 110       100          6           57          .0984           .9344
110 ≤ x < 130       120          2           59          .0328           .9672
130 ≤ x < 150       140          1           60          .0164           .9836
150 ≤ x < 170       160          1           61          .0164          1.0000
                                ——                       ——
                                61                      1.0001
Except for round-off error, the total of the Relative Frequency column (here
1.0001) should always be 1.0000.
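The relative and cumulative columns can be generated the same way; a short Python sketch (again only illustrative) also shows where the 1.0001 comes from.

```python
freq = [5, 26, 20, 6, 2, 1, 1]         # class frequencies from the table above
n = sum(freq)                          # 61

cum = 0
for f in freq:
    cum += f                           # cumulative frequency
    rel = round(f / n, 4)              # relative frequency (empirical probability)
    cum_rel = round(cum / n, 4)        # cumulative relative frequency
    print(f"{f:2d}  {cum:2d}  {rel:.4f}  {cum_rel:.4f}")

print("total of rounded relative frequencies:",
      round(sum(round(f / n, 4) for f in freq), 4))
# Prints 1.0001, illustrating the round-off remark above.
```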
Frequency Histogram and Frequency Polygon – special types of bar and line
graphs. Here we show the frequency polygon superimposed over the frequency
histogram, as created in Maple. They are commonly separate graphs.
In this case, the bars of the histogram are labeled by their midpoints on the
horizontal axis. The points on the horizontal axis where the bars meet are called
cut points, which may be used instead of the midpoints to label the horizontal
axis. The frequency polygon is always labeled by the midpoints.
The area under the histogram is 61 × 20 = 1220 (n × interval width). With
the lines of the frequency polygon joining the midpoints of the bars along with
the midpoints of the adjoining intervals, the area of the frequency polygon is
the same as that of the frequency histogram.
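A rough Python/matplotlib version of the superimposed graphs, drawn directly from the frequency table (the original figure is from Maple; matplotlib is an assumption made only for this sketch). The polygon is extended to the midpoints of the empty adjoining intervals, 20 and 180, so that it encloses the same area as the histogram.

```python
import matplotlib.pyplot as plt

midpoints = [40, 60, 80, 100, 120, 140, 160]
freq = [5, 26, 20, 6, 2, 1, 1]

# Frequency histogram: one bar of width 20 centred on each class midpoint.
plt.bar(midpoints, freq, width=20, edgecolor="black", color="lightgray")

# Frequency polygon: join the bar midpoints, plus the empty intervals on each side.
poly_x = [20] + midpoints + [180]
poly_y = [0] + freq + [0]
plt.plot(poly_x, poly_y, marker="o")

plt.xticks(midpoints)                  # label the bars by their midpoints
plt.xlabel("total protein (µg/ml)")
plt.ylabel("frequency")
plt.show()
```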
Suppose we look at the same data with class intervals of width 10. The following
table is also from Maple.
The class intervals are not labeled; the columns are class width, frequency, relative frequency, cumulative frequency, and relative cumulative frequency. The frequency
histogram follows.
With this histogram, the two values over 130 appear to be outliers (somewhat
disjoint from the rest of the data).
Relative Frequency Histogram and Relative Frequency Polygon
Maple. See hist.mw or hist.pdf.
Stem-and-Leaf Displays – bears a strong resemblance to the histogram and
serves the same purpose.
Here are the ages of 48 students in a statistics course:
1) Use the first part of each data value as a stem – write the stems vertically.
2) Use the last part as a leaf – we sometimes truncate or round – leaves are one
digit only.
3) The last step is to put the leaves in increasing order.
We can split stems to show more detail: 0–4 and 5–9.
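The 48 ages themselves are not reproduced in these notes, so the following Python sketch builds a stem-and-leaf display from a small made-up list of ages (purely illustrative); stems are tens digits, leaves are units digits, listed in increasing order.

```python
from collections import defaultdict

# Hypothetical ages, only to illustrate the construction (not the class data).
ages = [18, 19, 19, 20, 21, 21, 22, 23, 25, 28, 31, 34, 34, 47]

leaves = defaultdict(list)
for a in sorted(ages):                 # sorting puts each stem's leaves in increasing order
    leaves[a // 10].append(a % 10)     # stem = tens digit, leaf = units digit

for stem in range(min(leaves), max(leaves) + 1):
    print(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
# 1 | 899
# 2 | 0112358
# 3 | 144
# 4 | 7
```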
Advantages
– quick visual picture of the data.
– see the actual values
Disadvantages
– best for small data sets (n ≤ 100)
– can give a poor picture of the data
Statistic – a descriptive measure computed from a sample
Parameter – a descriptive measure computed from a population
Measures of Central Tendency – mean, median, and mode. We want a
single value that is typical of the data as a whole.
(Arithmetic) Mean – average.
X = random variable (RV)
xi = specific values of X
N = number of values in a finite population
n = number of values in a sample
For ungrouped data:

population: $$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

sample: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example (Protein).
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{4473.08}{61} = 73.32918033$$
For grouped data:

Class intervals   Midpoint = x_i   Frequency = f_i    x_i f_i
 30 ≤ x <  50           40                5             200
 50 ≤ x <  70           60               26            1560
 70 ≤ x <  90           80               20            1600
 90 ≤ x < 110          100                6             600
110 ≤ x < 130          120                2             240
130 ≤ x < 150          140                1             140
150 ≤ x < 170          160                1             160
                                         ——            ——
                                         61            4500
$$\bar{x} = \frac{\sum_{i=1}^{7} x_i f_i}{n} = \frac{\sum_{i=1}^{7} x_i f_i}{61} = \frac{4500}{61} = 73.7704918$$
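A quick Python check of the grouped mean (a sketch; the two lists simply transcribe the table above).

```python
midpoints = [40, 60, 80, 100, 120, 140, 160]       # x_i
freq      = [5, 26, 20, 6, 2, 1, 1]                # f_i

n = sum(freq)                                      # 61
grouped_mean = sum(x * f for x, f in zip(midpoints, freq)) / n
print(grouped_mean)                                # 4500/61 = 73.77...

# For the ungrouped mean one can simply call statistics.mean on the raw protein
# list; that reproduces the value of about 73.33 computed above.
```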
Properties of the Mean
(1) Uniqueness – for a given set of data, there is exactly one arithmetic mean.
(2) Simplicity – the arithmetic mean is easily understood and easy to compute.
(3) Since each and every value in a set of data enters into the computation
of the mean, it is affected by each value. Extreme values, therefore, have
an influence on the mean and, in some cases, can so distort it that it
becomes undesirable as a measure of central tendency.
Outliers (Extreme Values) – values that deviate appreciably from most of the
measurements in a data set.
Robust Estimators – estimators that are insensitive to outliers.
Trimmed Mean – a robust estimator of central tendency. For a set of sample
data containing n measurements we calculate the 100α percent trimmed mean
as follows:
(1) Order the measurements.
(2) Discard the smallest 100α percent and the largest 100α percent of the
measurements. The recommended value of α is something between .1
and .2.
(3) Compute the arithmetic mean of the remaining measurements.
Example (Protein).
(1) The 5% trimmed mean (removing 3 elements from each end of the data) is
71.2609090909090952.
(2) The 10% trimmed mean (removing 6 elements from each end of the data) is
70.3293877551020472.
(3) The 20% trimmed mean (removing 12 elements from each end of the data) is
69.3051351351351315.
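A sketch of the trimming procedure in Python. Using int(n * alpha) to decide how many observations to drop from each end matches the element counts used above (3, 6, and 12 for the 61 protein values); the demo list here is made up.

```python
def trimmed_mean(data, alpha):
    """100*alpha percent trimmed mean: order, drop the extremes, average the rest."""
    xs = sorted(data)                      # step (1): order the measurements
    k = int(len(xs) * alpha)               # number of values to drop from each end
    kept = xs[k:len(xs) - k]               # step (2): discard smallest/largest 100*alpha %
    return sum(kept) / len(kept)           # step (3): mean of what remains

# Illustrative data with one extreme value; the trimmed mean is pulled far less
# by the outlier than the ordinary mean is.
data = [51, 54, 56, 58, 60, 61, 63, 67, 72, 154]
print(sum(data) / len(data))               # ordinary mean: 69.6
print(trimmed_mean(data, 0.10))            # 10% trimmed mean: 61.375
```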
Median – a value that divides the ordered array into two equal parts. We order
the data points from smallest to largest and then take item $\frac{n+1}{2}$ in order.
Example.
(1) For the data 1, 3, 8, 13, 2000 we have $\frac{n+1}{2} = \frac{6}{2} = 3$, so the median is the
3rd ordered item: median = 8.
(2) For the data 1, 5, 8, 11, 13, 21 we have $\frac{n+1}{2} = \frac{7}{2} = 3.5$, so the median is halfway
between the 3rd and 4th ordered items: median = (8 + 11)/2 = 9.5.
(3) For our data set with n = 61, the median of the ungrouped data is 69.91.
(4) For the grouped data above, the median is 60, the 31st element of the
set where each data point takes on the value of the midpoint of its class
interval. Does this seem like a good measure of central tendency in this
case?
Obviously not! When your only source is grouped data, don’t put too much
confidence in mean and median.
Properties of the Median
(1) Uniqueness – as was true with the mean, there is a unique median for a
given set of data.
(2) Simplicity – the median is easy to calculate.
(3) Robustness – it is not as drastically affected by extreme values as is the
mean.
Mode – the value that occurs most frequently. If all the data items are different,
there is no mode. A set of data may have more than one mode (this is common
for grouped data). A data set with two modes is called bimodal.
Skewness – classification of data distributions on the basis of whether they are
symmetric or asymmetric.
(1) Symmetric – the left half of its graph (histogram or frequency polygon)
will be a mirror image of its right half.
(2) Asymmetric – not symmetric.
Definition. If the graph (histogram or frequency polygon) of a distribution
is asymmetric, the distribution is said to be skewed. If a distribution is not
symmetric because its graph extends further to the right than to the left, that
is, if it has a long tail to the right, we say that the distribution is skewed to the
right or positively skewed. If a distribution is not symmetric because its graph
extends further to the left than to the right, that is, if it has a long tail to the
left, we say that the distribution is skewed to the left or negatively skewed.
The Skewness Statistic
$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{(n-1)\sqrt{n-1}\, s^3}.$$
The skewness statistic is 0 for a perfectly symmetric distribution, positive for a
positively skewed distribution (skewed to the right), and negative for a negatively
skewed distribution (skewed to the left).
Typically, for unimodal distributions, if it is skewed to the left,
mean < median < mode,
and if it is skewed to the right,
mode < median < mean.
If you set a distribution on a fulcrum, the mean is where it balances. The
median is the point that divides the area in half, and the mode is the highest
point.
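The skewness statistic transcribed directly into Python (a sketch; skewness and data are illustrative names). A made-up list with a long right tail gives a positive value, and its mirror image a negative one, as described above.

```python
import math

def skewness(data):
    n = len(data)
    xbar = sum(data) / n
    m3 = sum((x - xbar) ** 3 for x in data)        # third moment about the mean
    m2 = sum((x - xbar) ** 2 for x in data)        # second moment about the mean
    return math.sqrt(n) * m3 / m2 ** 1.5

# Long right tail -> positive skewness; reflecting the data flips the sign.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
print(skewness(data))                              # positive (skewed right)
print(skewness([-x for x in data]))                # negative (skewed left)
```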
Measures of Dispersion – describe the variation, spread, and scatter of the
distribution.
Range – the difference between the largest and smallest values in a set of observations:
$$\text{Range} = x_L - x_S.$$
This conveys minimal information and is a poor measure for large samples.
Variance - measures dispersion based on how the data points are scattered
about the mean.
Sample Variance (ungrouped)
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$$
Problem (Page 53 #2.5.2). x̄ = 540

  x_i     x_i − x̄    (x_i − x̄)²      x_i²
  500       -40        1600         250000
  570        30         900         324900
  560        20         400         313600
  570        30         900         324900
  450       -90        8100         202500
  560        20         400         313600
  570        30         900         324900
 ————      ————        ————         ————
 3780         0*       13200        2054400

* – except for rounding errors, this column total is always 0.

$$s^2 = \frac{13200}{6} = 2200 \quad\text{or}\quad s^2 = \frac{7(2054400) - 3780^2}{7(6)} = 2200.$$
Example (Protein). $s^2 = 548.942654316940$.
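Both forms of the ungrouped sample variance checked in Python on the data of Problem 2.5.2 (a sketch; statistics.variance is the standard-library equivalent).

```python
import statistics

x = [500, 570, 560, 570, 450, 560, 570]
n = len(x)
xbar = sum(x) / n                                   # 540

# Definitional form: sum of squared deviations over n - 1.
s2_dev = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# Computational form: (n * sum of squares - (sum)^2) / (n(n-1)).
s2_comp = (n * sum(xi ** 2 for xi in x) - sum(x) ** 2) / (n * (n - 1))

print(s2_dev, s2_comp, statistics.variance(x))      # all three give 2200
```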
Sample Variance (grouped)
$$s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}$$
Example (Protein). x̄ = 73.77

Class intervals    x_i    x_i − x̄    (x_i − x̄)²    f_i    (x_i − x̄)² f_i
 30 ≤ x <  50       40    -33.77     1140.4129       5      5702.0645
 50 ≤ x <  70       60    -13.77      189.6129      26      4929.9354
 70 ≤ x <  90       80      6.23       38.8129      20       776.2580
 90 ≤ x < 110      100     26.23      688.0129       6      4128.0774
110 ≤ x < 130      120     46.23     2137.2129       2      4274.4258
130 ≤ x < 150      140     66.23     4386.4129       1      4386.4129
150 ≤ x < 170      160     86.23     7435.6129       1      7435.6129
                                                    ——      ——————
                                                    61     31632.7869

$$s^2 = \frac{31632.7869}{60} = 527.213115$$
Notice how the variance changes with the grouping. We divide by n − 1 instead
of n, and by (Σ f_i) − 1 instead of Σ f_i, in order to use the sample variance in
inference procedures discussed later. This is because dividing by n − 1 better
approximates the population variance (it makes the sample variance an unbiased
estimator of it). Also, we say we have n − 1 degrees of freedom, i.e., once we
have made n − 1 choices, the last choice is determined.
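The grouped sample variance computed the same way in Python (a sketch transcribing the table above; note the divisor (Σ f_i) − 1 = 60).

```python
midpoints = [40, 60, 80, 100, 120, 140, 160]                  # x_i
freq      = [5, 26, 20, 6, 2, 1, 1]                           # f_i

n = sum(freq)                                                 # 61
xbar = sum(x * f for x, f in zip(midpoints, freq)) / n        # grouped mean, 73.77...

s2 = sum(f * (x - xbar) ** 2 for x, f in zip(midpoints, freq)) / (n - 1)
print(s2)                                                     # about 527.2, as in the table
```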
Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
Problem – the variance units are the square of the data units.
Standard Deviation (SD) – the square root of the variance – has the same units
as the data.
Sample SD: $s = \sqrt{s^2}$
Problem (Page 53 #2.5.2). $s = \sqrt{2200} = 46.9042$
Example (Protein). $s = \sqrt{527.213115} = 22.9611$
Population SD: $\sigma = \sqrt{\sigma^2}$
Coefficient of Variation – used for comparing the variation of two or more distributions. This would seem to require ratio scales. The coefficient of variation
expresses the SD as a percentage of the mean:
$$CV = \frac{s}{\bar{x}}\,(100)$$
Example.
$\bar{x} = 10$, $s = 5$: $CV = \frac{5}{10}(100) = 50\%$
vs.
$\bar{x} = 100$, $s = 5$: $CV = \frac{5}{100}(100) = 5\%$
Five Number Summary
Definition. Given a set of n observations x1, x2, . . . , xn, the pth percentile
P is the value of X such that p percent or less of the observations are less than
P and (100 − p) percent or less of the observations are greater than P.
Notation. P10 denotes the 10th percentile, etc. P25 is called the first
quartile (Q1). P50, the median, is the middle or second quartile (Q2). P75 is
the third quartile (Q3).
1st quartile: $Q_1 =$ the $\frac{n+1}{4}$th ordered observation.
Example (Protein).
$$\frac{n+1}{4} = \frac{61+1}{4} = \frac{62}{4} = 15.5$$
Thus take the number 1/2 of the way from the 15th to the 16th observation:
$$Q_1 = \underbrace{57.73}_{15\text{th}} + .5(\underbrace{57.90}_{16\text{th}} - \underbrace{57.73}_{15\text{th}}) = 57.815$$

2nd quartile: $Q_2 =$ the $\frac{2(n+1)}{4} = \frac{n+1}{2}$th ordered observation.
Example (Protein).
$$\frac{n+1}{2} = \frac{61+1}{2} = \frac{62}{2} = 31$$
Thus take the 31st observation:
$$Q_2 = 69.91$$

3rd quartile: $Q_3 =$ the $\frac{3(n+1)}{4}$th ordered observation.
Example (Protein).
$$\frac{3(n+1)}{4} = \frac{3(61+1)}{4} = \frac{3(62)}{4} = \frac{186}{4} = 46.5$$
Thus take the number 1/2 of the way from the 46th to the 47th observation:
$$Q_3 = \underbrace{83.82}_{46\text{th}} + .5(\underbrace{84.70}_{47\text{th}} - \underbrace{83.82}_{46\text{th}}) = 84.26$$

The five-number summary is then
minimum – Q1 – median – Q3 – maximum.
Example (Protein). The five-number summary is
35.90 – 57.815 – 69.91 – 84.26 – 153.56
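A Python sketch that reproduces this five-number summary with the (n+1)p rule used above. The list is the ordered protein data, and quantile_np1 is just an illustrative helper; built-in routines such as statistics.quantiles use slightly different interpolation rules and need not agree exactly.

```python
protein_sorted = [
     35.90,  41.98,  44.40,  44.73,  47.23,  51.16,
     51.70,  53.07,  54.07,  54.38,  54.41,  55.05,
     55.47,  57.68,  57.73,  57.90,  58.50,  59.20,
     59.36,  59.76,  61.10,  61.70,  61.90,  62.20,
     62.32,  62.80,  63.96,  66.60,  67.10,  67.20,
     69.91,  70.17,  71.50,  72.10,  72.20,  72.30,
     73.50,  73.53,  74.78,  76.33,  77.40,  77.63,
     78.15,  79.55,  82.60,  83.82,  84.70,  85.40,
     86.24,  88.17,  88.78,  91.47,  95.06,  95.33,
    100.36, 106.00, 109.30, 114.79, 128.40, 149.49,
    153.56,
]

def quantile_np1(xs, p):
    """p-th quantile of an ordered list via the (n+1)p rule, interpolating as needed."""
    pos = (len(xs) + 1) * p              # e.g. 15.5 for Q1 when n = 61
    lo = int(pos)                        # ordered position just below (1-based)
    frac = pos - lo
    below = xs[lo - 1]                   # convert 1-based position to 0-based index
    above = xs[lo] if frac else below
    return below + frac * (above - below)

q1 = quantile_np1(protein_sorted, 0.25)   # 57.815
q2 = quantile_np1(protein_sorted, 0.50)   # 69.91
q3 = quantile_np1(protein_sorted, 0.75)   # 84.26
print(min(protein_sorted), q1, q2, q3, max(protein_sorted))
```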
Definition. The interquartile range (IQR) is the difference between the
third and first quartiles:
$$IQR = Q_3 - Q_1.$$
Box-and-Whisker Plots (or Boxplots) – This is a graphical representation of
the five-number summary.
It can be drawn vertically (left) or horizontally (right). The box shows the
interquartile range, extending from Q1 to Q3. The width of the box is arbitrary.
The line through the box shows the median. The whiskers extend from the box
to the minimum and maximum values.
It is different in SPSS.
In SPSS, the whiskers extend to a maximum of 1.5(IQR) beyond the box. Values
between 1.5(IQR) and 3(IQR) beyond the box are labeled with a circle and are
termed outliers. Values beyond 3(IQR) are labeled with an asterisk (∗) and are
termed extremes.
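For comparison, a matplotlib boxplot sketch (matplotlib is an assumption here; the figure in the notes is from SPSS). Its default whis=1.5 also stops the whiskers at 1.5(IQR) beyond the box and plots anything further out as individual points, though it does not distinguish outliers from extremes the way SPSS does. The data list is made up so that one extreme point appears.

```python
import matplotlib.pyplot as plt

# Small illustrative data set with one extreme value.
data = [51, 54, 56, 58, 60, 61, 63, 67, 72, 154]

plt.boxplot(data, vert=True, whis=1.5)   # box = IQR, line = median, whiskers <= 1.5*IQR
plt.ylabel("value")
plt.show()
```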
Kurtosis – a measure of the degree to which a distribution is "peaked" or flat
in comparison to a normal distribution, whose graph is characterized by a bell-shaped
curve. The names of 3 basic types of curves are given below.
$$\text{Kurtosis} = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}} - 3 = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^4}{(n-1)^2 s^4} - 3.$$
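The kurtosis statistic transcribed into Python (a sketch with made-up comparison lists). A peaked distribution with long tails gives a positive value, and a flat, two-ended one a negative value, matching the description above.

```python
def kurtosis(data):
    n = len(data)
    xbar = sum(data) / n
    m4 = sum((x - xbar) ** 4 for x in data)        # fourth moment about the mean
    m2 = sum((x - xbar) ** 2 for x in data)        # second moment about the mean
    return n * m4 / m2 ** 2 - 3

# Heavy centre with a pair of long tails -> positive; values piled at the ends -> negative.
print(kurtosis([5, 5, 5, 5, 5, 5, 1, 9, 5, 5]))    # 2.0 (more peaked than normal)
print(kurtosis([1, 1, 1, 1, 9, 9, 9, 9]))          # -2.0 (flatter than normal)
```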
Summary
In describing the center and dispersion of a data distribution, one usually either
provides the mean and standard deviation or the five-number summary, the
choice depending on the shape of the distribution – mean and standard deviation
for symmetric data and the five-number summary for non-symmetric data.
Maple. See centdist.mw and centdist.pdf.