Statistics 601
6 Random Sampling and Data Description
Parameters are numerical characteristics of a population.
Statistics are numerical quantities calculated from the sample.
We will now use statistics to gain an understanding of the sample data. We will develop both
graphical and numerical methods of summarizing data.
Before we can use statistics to draw inferences about population parameters, we will need to learn
about statistical models which have probability theory as their basis.
Chapter 6: Random Sampling and Data Description
Copyright © 2007 by Thomas E. Wehrly
6.1 Pictorial and Tabular Methods in Descriptive Statistics
Consider the Following Data Set:
The concentration of suspended solids in river water is an important environmental characteristic.
The paper “Water Quality in Agricultural Watershed: Impact of Riparian Vegetation During Base
Flow” (Water Resources Bull., 1981, pp. 233-239) reported on concentrations (in parts per million,
or ppm) for several different rivers. Suppose the following 50 observations had been obtained for a
particular river.
55.8  45.9  83.2  75.3  60.7  60.9  39.1  40.0  71.4  77.1
37.0  35.5  31.7  65.2  59.1  91.3  56.0  36.7  52.6  49.5
65.8  44.6  62.3  58.2  69.3  42.3  71.7  47.3  48.0  69.8
33.8  61.2  94.6  61.8  64.9  60.6  61.5  56.3  78.8  27.1
76.0  47.2  30.0  39.8  87.1  69.0  74.5  68.2  65.0  66.3
Question: What do these data tell us about the concentration of suspended solids?
First few steps in analyzing a data set:
1. Organize and summarize the data.
2. Find the center of the data.
3. Examine the spread of the data.
6.2 Stem and Leaf Display
A compact and descriptive method of organizing data without losing any information in the data.
• Leading digits are stems.
• Trailing digits are leaves.
• Indicate units somewhere on the display.
• Option: Sort the leaves.
• Comparative stem & leaf.
• Repeat stems if need be.
Advantages:
• No loss of information.
• Easy to do for small data sets.
Disadvantages:
• Time consuming for large data sets (by hand)
• Cannot be used for categorical data.
• Very space consuming for large data sets.
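The construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf`, the `leaf_unit` parameter, and the small sample are hypothetical, and the sketch obtains leaves by truncation, so its output may differ from a display produced by software that rounds.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: sorted leaf digits as a string}.

    leaf_unit is the place value of a leaf; digits below it are dropped
    (truncated, not rounded). Intended for non-negative data.
    """
    stems = defaultdict(list)
    for v in values:
        scaled = int(v // leaf_unit)          # drop digits below the leaf unit
        stems[scaled // 10].append(scaled % 10)
    lo, hi = min(stems), max(stems)
    # Keep empty stems so that gaps in the data stay visible.
    return {s: "".join(str(d) for d in sorted(stems.get(s, [])))
            for s in range(lo, hi + 1)}

# Hypothetical small sample, in ppm.
sample = [27, 31, 38, 38, 42, 45, 60, 63, 64]
display = stem_and_leaf(sample)
for stem, leaves in display.items():
    print(f"{stem} : {leaves}")
print("units: ppm")
```

Keeping the empty stem (5 in this sample) is a design choice: it preserves the shape of the distribution, which is the main reason to draw the display.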
Stem-and-leaf display of the solids data set with sorted leaves:
2 : 7
3 : 0245779
4 : 002567789
5 : 366689
6 : 111112255566899
7 : 01245679
8 : 37
9 : 15
units: ppm
Stem-and-leaf display with two stems per tens place:
2*: 7
3 : 024
3*: 5779
4 : 002
4*: 567789
5 : 3
5*: 66689
6 : 1111122
6*: 55566899
7 : 0124
7*: 5679
8 : 3
8*: 7
9 : 1
9*: 5
units: ppm
Comparative stem-and-leaf display on the solids data set taken two years earlier:
 Two Years Ago :   : Current
------------------------------------
             8 : 1 :
          9851 : 2 : 7
       9887640 : 3 : 0245779
 9997765322111 : 4 : 002567789
     877554200 : 5 : 366689
    9887653221 : 6 : 111112255566899
         72210 : 7 : 01245679
            95 : 8 : 37
               : 9 : 15
units: ppm
Sometimes we redefine the leaves for low-numbered or “narrow” data sets:
58, 58, 57, 54, 54, 54, 57, 57, 56, 56, 57, 51, 58, 54, 52, . . . , 52, 54
60 : 0
59 : 00
58 : 00000000000
57 : 0000000000
56 : 0000000000
55 : 0000000000000
54 : 0000000000000
53 : 0000
52 : 000
51 : 0
6.3 Frequency Distributions for Quantitative Data
A very popular way to summarize data is with a frequency distribution. A frequency distribution is
a compact summary of a data set using a table with 3 or 4 columns:
Class interval (or category): disjoint intervals containing all observations in the data set
Frequency: number of obs. in a class interval = $f$
Relative frequency: proportion of obs. in interval = $f/n$
Cumulative frequency: sum of the relative frequencies up to and including the class, $\sum_{i=1}^{\text{class}} f_i/n$
Having too many intervals leads to a very jagged histogram.
Having too few intervals smooths away important features.
The number of classes is usually 5 to 20.
Use at least $(2n)^{1/3}$ for a rough idea.
We will form a frequency distribution for the solids data set from Section 6.1.
50 observations. Approximate number of classes: $\sqrt{50} = 7.07$.
Class Interval   [Tally]   Frequency   Relative f   Cumulative f
20–29.9                        1          .02           .02
30–39.9                        8          .16           .18
40–49.9                        8          .16           .34
50–59.9                        6          .12           .46
60–69.9                       16          .32           .78
70–79.9                        7          .14           .92
80–89.9                        2          .04           .96
90–99.9                        2          .04          1.0
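The frequency distribution for the solids data can be reproduced with a short Python sketch. The function name `frequency_table` and its parameters are hypothetical; the sketch assumes equal-width classes of the form [lower, upper) and data that all fall inside the stated range.

```python
solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]

def frequency_table(data, start, width, n_classes):
    """Return rows (lower, upper, f, f/n, cumulative f/n) for classes [lower, upper)."""
    n = len(data)
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)   # index of the class containing x
        counts[k] += 1                  # raises IndexError if x lies above the last class
    rows, cumulative = [], 0
    for k, f in enumerate(counts):
        cumulative += f
        lower = start + k * width
        rows.append((lower, lower + width, f, f / n, cumulative / n))
    return rows

table = frequency_table(solids, start=20, width=10, n_classes=8)
for lower, upper, f, rel, cum in table:
    print(f"{lower}-{upper - 0.1:.1f}  {f:3d}  {rel:.2f}  {cum:.2f}")
```

Accumulating the integer counts and dividing once per row avoids the small rounding drift that can appear when relative frequencies are added one by one.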
6.4 Histogram
A histogram is a pictorial representation of a frequency distribution.
1. Draw an x-axis and mark class intervals.
2. Draw a rectangle whose area is proportional to the frequency of that interval.
[Figure: frequency histogram of the solids data; x-axis: solids, 20 to 100; y-axis: frequency, 0 to 15]
A true histogram, drawn on a density scale, has total area equal to 1.0. In that case we set:
$$\text{Rectangle Height} = \frac{\text{Relative Frequency}}{\text{Base Length}}$$
When all the intervals are of equal length, all we need to do is add the appropriately labeled y-axis.
[Figure: density histogram of the solids data; x-axis: solids, 20 to 100; y-axis: density, 0.0 to 0.030]
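As a quick check of the density scale, the sketch below converts class frequencies into rectangle heights and confirms that the total area is 1. It assumes the eight class frequencies of the solids frequency distribution and a common class width of 10 ppm; the variable names are illustrative.

```python
# Class frequencies for the solids data (classes 20-29.9, ..., 90-99.9).
frequencies = [1, 8, 8, 6, 16, 7, 2, 2]
width = 10                      # every class interval is 10 ppm wide
n = sum(frequencies)            # total number of observations

# Rectangle height = relative frequency / base length.
heights = [(f / n) / width for f in frequencies]

# Area of each rectangle is height * width; the areas should sum to 1.
total_area = sum(h * width for h in heights)

print([round(h, 3) for h in heights])
print(round(total_area, 10))
```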
Histograms often exhibit particular shapes:
• unimodal
• bimodal
• multimodal
• symmetric
• positively skewed
• negatively skewed
6.5 Measures of Location
Another step in gaining understanding of our data is to find the “center” of our data. What is the
center?
6.6 Mean / Average
We calculate the sample mean or average as follows:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$x_i$: The ith observation in the sample.
$n$: Sample size.
Example: Calculate the average concentration of solids.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{50} x_i = \frac{1}{50} \times 2927 = 58.54$$
6.7 Median
Median: The middle observation of the sorted data set.
Sample Median = $\tilde{x}$
We calculate the median:
$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \left(x_{(n/2)} + x_{((n+2)/2)}\right)/2 & n \text{ even} \end{cases}$$
Example: Calculate the median of the solid concentrations.
Sorted data (ascending, read across each row):
27.1  30.0  31.7  33.8  35.5  36.7  37.0  39.1  39.8  40.0
42.3  44.6  45.9  47.2  47.3  48.0  49.5  52.6  55.8  56.0
56.3  58.2  59.1  60.6  60.7  60.9  61.2  61.5  61.8  62.3
64.9  65.0  65.2  65.8  66.3  68.2  69.0  69.3  69.8  71.4
71.7  74.5  75.3  76.0  77.1  78.8  83.2  87.1  91.3  94.6
n/2 = 25, so $\tilde{x} = (x_{(25)} + x_{(26)})/2 = (60.7 + 60.9)/2 = 60.8$
Discussion: How do outliers affect the mean and median?
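One way to explore the discussion question is to change a single observation and recompute both statistics. The sketch below implements the odd/even median rule directly; the data sets `clean` and `spoiled` are hypothetical, with `spoiled` imitating a recording error in the largest value.

```python
def mean(xs):
    """Sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Sample median using the odd/even rule on the sorted data."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # x((n+1)/2), shifted to 0-based indexing
    return (s[n // 2 - 1] + s[n // 2]) / 2    # average of x(n/2) and x((n+2)/2)

clean = [20, 30, 40, 50, 60, 70]
spoiled = [20, 30, 40, 50, 60, 700]   # hypothetical recording error in the largest value

print(mean(clean), median(clean))       # 45.0 45.0
print(mean(spoiled), median(spoiled))   # 150.0 45.0
```

In this example the single extreme value moves the mean from 45 to 150 while the median does not change.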
6.8 Other Measures of Location
6.8.1 Trimmed Mean
A trimmed mean is a compromise between x̄ and x̃ in that outliers will have some effect on the
trimmed mean but not as much as they have on the mean. It is calculated by eliminating a certain
percentage of the observations from both ends and calculating the average of the remaining data.
For example, a 10% trimmed mean would eliminate 10% of the observations from each end of the
data (20% total) and average the remaining 80% of the observations.
Example: Calculate the 10% trimmed mean for the solid concentrations.
We have n = 50 observations. 10% of this is 50 × .10 = 5. Therefore we eliminate 5
observations from each end of the sorted data, for a total of 10 observations:
$$\bar{x}_{tr} = \frac{1}{40}\sum_{i=6}^{45} x_{(i)} = \frac{1}{40} \times 2333.9 \approx 58.35$$
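The trimming procedure can be sketched as follows. The function name `trimmed_mean` is hypothetical, and rounding the number of trimmed observations down when n × proportion is not an integer is an assumption; the slides only cover the case where it is an integer.

```python
solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]

def trimmed_mean(data, proportion):
    """Average the data after dropping `proportion` of the observations from each end."""
    s = sorted(data)
    # Number dropped from each end; the small tolerance guards against
    # floating-point products such as 4.999999999 being truncated to 4.
    k = int(len(s) * proportion + 1e-9)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

print(round(trimmed_mean(solids, 0.10), 2))   # 10% trimmed mean, about 58.35
print(round(trimmed_mean(solids, 0.0), 2))    # no trimming gives the ordinary mean
```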
6.8.2 Percentiles and Quartiles
The 100pth percentile is the observation in our data set such that 100p% of the observations are
equal to or less than it. The median is the 50th percentile.
The following is a general approach to calculate the 100pth percentile $x_{[p]}$:
1. Let $x_{(i)}$, $i = 1, \ldots, n$, refer to our data set in ascending order.
2. Let $i_p = np$.
3. Find the first index $i$ such that $i > i_p$.
4. The 100pth percentile is then:
$$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } i - 1 = i_p \\ x_{(i)} & \text{otherwise} \end{cases}$$
In short: If $i_p$ is an integer we average the $i_p$th and $(i_p + 1)$th observations. Otherwise we
round $i_p$ up and take the $\lceil i_p \rceil$th observation.
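The percentile rule above translates almost line by line into Python. The function name `percentile` and the restriction to 0 < p < 1 are choices made for this sketch, and the integer test uses a small tolerance because n × p is computed in floating point.

```python
import math

def percentile(data, p):
    """100p-th percentile: average two neighbours when n*p is an integer,
    otherwise round n*p up and take that observation."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    s = sorted(data)
    ip = len(s) * p
    nearest = round(ip)
    if math.isclose(ip, nearest, abs_tol=1e-9):       # i_p is an integer
        return (s[nearest - 1] + s[nearest]) / 2      # average x(i_p) and x(i_p + 1)
    return s[math.ceil(ip) - 1]                       # x(ceil(i_p)), 0-based index

values = [2, 4, 9, 17, 22, 43, 65, 88, 103]
print(percentile(values, 0.25))   # 9
print(percentile(values, 0.50))   # 22
print(percentile(values, 0.75))   # 65
```

Other definitions of sample percentiles exist, so statistical software may return slightly different values for the same data.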
Q1: First Quartile = 25th percentile
Q2: Second Quartile = 50th percentile
Q3: Third Quartile = 75th percentile
IQR = Q3 − Q1 = “Interquartile Range”
We can calculate quartiles by using our rules for finding the median. We consider two cases:
• n even:
– To obtain Q1 , obtain the median of x(1) , . . . , x(n/2) .
– To obtain Q3 , find the median of x((n/2)+1) , . . . , x(n)
• n odd:
– To obtain Q1 , obtain the median of x(1) , . . . , x((n+1)/2) .
– To obtain Q3 , find the median of x((n+1)/2) , . . . , x(n)
Example: Calculate Q1 and Q3 for the solid concentrations, using the sorted data from the median
example (n = 50).
Q1 = median of {x(1), . . . , x(25)} = x(13) = 45.9
Q3 = median of {x(26), . . . , x(50)} = x(38) = 69.3
Example: Calculate Q1 and Q3 for the values {2, 4, 9, 17, 22, 43, 65, 88, 103}.
n = 9
Q1 = median of {x(1), . . . , x(5)} = x(3) = 9
Q3 = median of {x(5), . . . , x(9)} = x(7) = 65
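The median-of-halves rule for quartiles can be sketched as below. The function name `quartiles` is hypothetical; for odd n the middle observation is included in both halves, as in the rule stated earlier.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the sorted data."""
    s = sorted(data)
    n = len(s)
    if n % 2 == 0:
        lower, upper = s[: n // 2], s[n // 2:]
    else:
        mid = (n + 1) // 2                      # 1-based position of the median
        lower, upper = s[:mid], s[mid - 1:]     # the median belongs to both halves
    return median(lower), median(upper)

print(quartiles([2, 4, 9, 17, 22, 43, 65, 88, 103]))   # (9, 65)
print(quartiles([20, 30, 40, 50, 60, 70]))              # (30, 60)
```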
6.8.3 Boxplots
Box plots are useful in summarizing various aspects of the data. Side-by-side box plots provide
useful comparisons of two or more sets of data.
1. Form an axis that includes all possible values of the data.
2. Draw a box extending from Q1 to Q3 .
3. Draw a vertical bar at the median.
4. Draw whiskers (horizontal lines) to the most extreme observation within 1.5 IQR from
each end of the box.
5. Indicate mild outliers with a “◦”
6. Indicate extreme outliers with a “∗”
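The whisker and outlier steps depend on how far an observation lies from the box. The sketch below computes the 1.5 IQR limits and classifies individual readings. The slides do not state where mild outliers end and extreme outliers begin, so the 3 IQR cutoff used here is an assumption (a common textbook convention); the function names and the readings 110.0 and 150.0 are hypothetical.

```python
def fences(q1, q3):
    """Return the limits used for whiskers (inner) and extreme outliers (outer)."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),   # assumed cutoff for extreme outliers
    }

def classify(x, q1, q3):
    """Label one observation relative to the box with quartiles q1 and q3."""
    f = fences(q1, q3)
    if x < f["outer"][0] or x > f["outer"][1]:
        return "extreme outlier"
    if x < f["inner"][0] or x > f["inner"][1]:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 45.9, 69.3   # quartiles of the solids data
print(fences(q1, q3)["inner"])
for reading in (27.1, 94.6, 110.0, 150.0):   # the last two readings are hypothetical
    print(reading, classify(reading, q1, q3))
```

With these quartiles the smallest and largest solids readings (27.1 and 94.6) fall inside the 1.5 IQR limits, so the whiskers extend to the minimum and maximum.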
Example: Calculate the summary statistics x̄, x̃, Q1, Q3 for the water quality data set. Then
construct a box plot.
x̄ = 58.54
Min = 27.10
Q1 = 45.9
x̃ = 60.80
Q3 = 69.3
Max = 94.60
IQR = 69.3 − 45.9 = 23.4
[Figure: box plot titled "Particulate Matter"; axis: solids (ppm)]
6.9 Measures of Variability
The mean, median, etc. do not give us a complete overview (summary) of our data.
For example, consider the following three data sets:
     Data                        Measures of Spread
                                 Range   IQR    s²       s
1:   20  30  40  50  60  70       50     30    350    18.71
2:   20  43  44  46  47  70       50      4    252    15.87
3:   40  43  44  46  47  50       10      4     12     3.46
– The mean and median are 45 for all three data sets.
– These data sets have very different spreads.
Ways to measure spread:
Range: range = maximum observation – minimum observation
Interquartile Range: IQR = Q3 − Q1
Average Deviation from the Mean: We define the ith deviation to be $x_i - \bar{x}$.
Intuitive: We average the deviations:
$$\frac{1}{n}\sum (x_i - \bar{x})$$
Problem: this does not give us anything useful!
$$\frac{1}{n}\sum (x_i - \bar{x}) = \frac{1}{n}\sum x_i - \frac{1}{n}\sum \bar{x} = \frac{1}{n}\sum x_i - \frac{1}{n}\, n\bar{x} = \bar{x} - \bar{x} = 0$$
The result is always equal to zero!
Variance: We sum the squared deviations from the mean and divide by n − 1
instead of n to get a measure of spread called the sample variance:
$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$$
Calculation formula:
$$s^2 = \frac{1}{n-1}\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right)$$
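The two variance formulas can be compared numerically on the three small data sets from the start of this section. The function names below are illustrative; both are direct transcriptions of the definition and of the calculation formula.

```python
def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Calculation formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data_sets = {
    1: [20, 30, 40, 50, 60, 70],
    2: [20, 43, 44, 46, 47, 70],
    3: [40, 43, 44, 46, 47, 50],
}
for label, xs in data_sets.items():
    print(label, variance_definition(xs), variance_shortcut(xs))
```

The calculation formula needs only one pass over the data, but it subtracts two large numbers, so in floating-point arithmetic the definition is usually the more stable choice when the mean is large relative to the spread.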
Standard Deviation: The units of the variance are units of the data squared. To make
the units the same as that of the data set we take the square root of the variance. This
is called the sample standard deviation:
$$s = \sqrt{s^2}$$
Note:
s is translation invariant:
s(x1 , ..., xn ) = s(x1 + a, ..., xn + a) for all a.
s is scale equivariant:
s(ax1 , ..., axn ) = |a|s(x1 , ..., xn ) for all a.
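Both properties can be checked numerically. The sketch below uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation with the n − 1 divisor; the constant `a` is an arbitrary value chosen for illustration.

```python
import math
from statistics import stdev

xs = [20, 43, 44, 46, 47, 70]
a = -3.5   # arbitrary constant, negative to exercise the absolute value

shifted = [x + a for x in xs]   # translate every observation by a
scaled = [a * x for x in xs]    # multiply every observation by a

translation_ok = math.isclose(stdev(shifted), stdev(xs))
scale_ok = math.isclose(stdev(scaled), abs(a) * stdev(xs))
print(translation_ok, scale_ok)
```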
Example: Calculate the range, variance and standard deviation of the particulate solid data.
Range = Maximum − Minimum = 94.6 − 27.1 = 67.5
$s^2 = 270.8469$
$s = \sqrt{270.8469} = 16.4574$
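The range, variance, and standard deviation of the solids data can be recomputed with the Python standard library. `statistics.variance` and `statistics.stdev` both use the n − 1 divisor, matching the sample variance defined above; the variable names are illustrative.

```python
from statistics import stdev, variance

solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]

data_range = max(solids) - min(solids)   # maximum minus minimum
s2 = variance(solids)                    # sample variance
s = stdev(solids)                        # sample standard deviation

print(round(data_range, 1), round(s2, 4), round(s, 4))
```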