Unit 2 Chapter 2 Summary
Section 2.1
Constructing a frequency table
1. Determine the number of classes and determine the class width.
\[ \text{Class width} = \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}} \]
Increase this value to the next whole number.
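2. Create lower class limits and upper class limits using the class width. To set up
each interval properly, take the lower class limit, add the class width, and then
subtract 1 to get the upper class limit. The next lower class limit is the previous
lower class limit plus the class width.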
3. Tally the data with tick marks as you classify the data into its respective class.
Then complete a table column entitled Frequency with the actual count for each
class.
4. Compute the midpoint (class mark) for each class. Do not round this value.
\[ \text{Midpoint} = \frac{\text{lower class limit} + \text{upper class limit}}{2} \]
5. Determine class boundaries. To find the upper class boundary, add 0.5 to the
upper class limits. To find the lower class boundary, subtract 0.5 from the lower
class limits.
6. Compute the relative frequency for each class.
\[ \text{Relative frequency} = \frac{\text{class frequency}}{\text{total of all frequencies}} \]
Follow the directions in docsharing to construct a histogram and relative frequency
histogram.
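The six steps above can be sketched in Python. This is only an illustration, not the method from docsharing: the function name `frequency_table` and the choice of 5 classes are assumptions, the data values are borrowed from the stem-and-leaf example in Section 2.2, and the sketch assumes whole-number data. It also assumes that "increase to the next whole number" applies even when the quotient is already whole.

```python
import math

def frequency_table(data, num_classes):
    """Build a frequency table for whole-number data, following steps 1-6."""
    lowest, highest = min(data), max(data)
    # Step 1: class width, increased to the next whole number
    # (assumption: an exact whole-number quotient is also bumped up by one).
    width = math.floor((highest - lowest) / num_classes) + 1
    total = len(data)
    rows = []
    for i in range(num_classes):
        # Step 2: lower limit, then upper limit = lower limit + width - 1.
        lower = lowest + i * width
        upper = lower + width - 1
        # Step 3: count the values that fall in this class.
        freq = sum(1 for v in data if lower <= v <= upper)
        rows.append({
            "limits": (lower, upper),
            "frequency": freq,
            "midpoint": (lower + upper) / 2,           # Step 4
            "boundaries": (lower - 0.5, upper + 0.5),  # Step 5
            "relative_frequency": freq / total,        # Step 6
        })
    return width, rows

# Illustrative data (the values from the Section 2.2 example).
data = [12, 23, 56, 25, 15, 45, 35, 25, 32, 14, 10, 18, 29, 43, 75, 71, 13]
width, table = frequency_table(data, 5)
for row in table:
    print(row)
```

With these values the range is 65, so 65 / 5 = 13 is increased to a class width of 14, and the first class runs from 10 to 23.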
Section 2.2
Other types of graphs
- Stem-and-leaf plots
- Dot plots
- Pareto charts
- Circle graphs (pie charts)
- Time-series graphs
A stem-and-leaf display is a method of displaying data that rank-orders the values and
arranges them into groups. Be sure to include any stems that are empty, as shown by the
“6” stem in the example.
Example: for the data shown below
12 23 56 25 15 45 35 25 32 14 10 18 29 43 75 71 13
Stems   Leaves
1       0 2 3 4 5 8
2       3 5 5 9
3       2 5
4       3 5
5       6
6       (note empty stem)
7       1 5
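As a rough sketch, the display above can be reproduced in Python. The function name `stem_and_leaf` is hypothetical, and the code assumes non-negative whole numbers, with the tens digit (and above) as the stem and the ones digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, including stems that have no leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)  # stem = tens, leaf = ones digit
    # Walk every stem from smallest to largest so empty stems are kept.
    stems = range(min(leaves), max(leaves) + 1)
    return {stem: leaves.get(stem, []) for stem in stems}

data = [12, 23, 56, 25, 15, 45, 35, 25, 32, 14, 10, 18, 29, 43, 75, 71, 13]
plot = stem_and_leaf(data)
for stem, leaf_list in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_list))
```

Stem 6 is printed with no leaves, which matches the empty stem noted in the example.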
Section 2.3
Measures of Central Tendency: Mode, Median, Mean
The two data sets that will be used are shown below.
Data set x = {19, 18, 23, 19, 25, 27}
Data set y = {2, 3, 4, 3, 2, 5, 6}
MODE of a data set is the value that occurs most frequently.
For data set x, the MODE is 19. For data set y, the MODES are 2 and 3. This last situation is
called BIMODAL.
MEDIAN is the central value of an ordered distribution. The concept of order indicates that the
measure is positional. So the first step to take is to rewrite the data set in ascending order.
Data set x = {18, 19, 19, 23, 25, 27}
Data set y = {2, 2, 3, 3, 4, 5, 6}
Data set x has 6 items. Data set y has 7 items.
To find the median,
1. If the number of items is odd, the median is the middle item.
Data set y has an odd number of items, 7. Therefore the MEDIAN = 3
2. If the number of items is even, the median is the average of the middle two items.
Data set x has an even number of items, 6. The middle 2 values are 19 and 23.
Therefore
\[ \text{MEDIAN} = \frac{19 + 23}{2} = 21 \]
MEAN is the arithmetic average of all the data values.
\[ \text{Mean} = \frac{\text{sum of all values}}{\text{number of values}} \]
\[ \text{Mean of data set } x = \frac{19 + 18 + 23 + 19 + 25 + 27}{6} = 21.8 \]
(rounded to 1 decimal place)
\[ \text{Mean of data set } y = \frac{2 + 2 + 3 + 3 + 4 + 5 + 6}{7} = 3.6 \]
(rounded to 1 decimal place)
The proper formulas use the summation notation, Σ.
Sample Mean:
\[ \bar{x} = \frac{\sum x}{n} \]
x̄ is pronounced “x-bar”. n is the number of values in the sample.
Population Mean:
\[ \mu = \frac{\sum x}{N} \]
μ is pronounced “mu”. N is the number of values in the population.
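The mode, median, and mean for data sets x and y can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

x = [19, 18, 23, 19, 25, 27]
y = [2, 3, 4, 3, 2, 5, 6]

# multimode returns every value tied for the highest count,
# so a bimodal data set gives a list of two values.
modes_x = statistics.multimode(x)
modes_y = statistics.multimode(y)

# median sorts the data internally and averages the two middle
# items when the number of items is even.
median_x = statistics.median(x)
median_y = statistics.median(y)

# Means rounded to 1 decimal place, as in the notes.
mean_x = round(statistics.mean(x), 1)
mean_y = round(statistics.mean(y), 1)

print("modes:", modes_x, modes_y)
print("medians:", median_x, median_y)
print("means:", mean_x, mean_y)
```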
Section 2.4
Measures of Variation: Range, Standard Deviation, Variance
Measures of variation show the spread of data or the spread of data about the mean.
RANGE is the difference between the largest and smallest values of a data set.
Data set x = {18, 19, 19, 23, 25, 27}
Data set y = {2, 2, 3, 3, 4, 5, 6}
Data set x has 6 items. Data set y has 7 items.
Mean of x = 21.8 Mean of y = 3.6
Range of x = 27 – 18 = 9
Range of y = 6 – 2 = 4
The range shows the spread of the data but not how the data are related to the mean. The
standard deviation and variance show the spread of the data about the mean. As with the mean,
there is a sample standard deviation and a population standard deviation.
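Method 1
SAMPLE VARIANCE:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
SAMPLE STANDARD DEVIATION:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \quad \text{or} \quad s = \sqrt{s^2} \]
Method 2
SAMPLE VARIANCE:
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]
SAMPLE STANDARD DEVIATION:
\[ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \quad \text{or} \quad s = \sqrt{s^2} \]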
Method 1 for data set x:
n = 6, Mean of x = 21.8
x       x - mean    (x - mean)^2
18      -3.8        14.7
19      -2.8        8.0
19      -2.8        8.0
23      1.2         1.4
25      3.2         10.0
27      5.2         26.7
Sum                 68.8
s = 3.7, s^2 = 13.8
Data set x:
\[ s = \sqrt{\frac{68.8}{6 - 1}} = 3.7 \quad \text{and} \quad s^2 = 13.8 \]
Method 1 for data set y:
n = 7, Mean of y = 3.6
y       y - mean    (y - mean)^2
2       -1.6        2.5
2       -1.6        2.5
3       -0.6        0.3
3       -0.6        0.3
4       0.4         0.2
5       1.4         2.0
6       2.4         5.9
Sum                 13.7
s = 1.5, s^2 = 2.3
Data set y:
\[ s = \sqrt{\frac{13.7}{7 - 1}} = 1.5 \quad \text{and} \quad s^2 = 2.3 \]
Method 2 for data set x:
x       x^2
18      324
19      361
19      361
23      529
25      625
27      729
Sum     131     2929
\[ s^2 = \frac{2929 - \frac{131^2}{6}}{6 - 1} = \frac{2929 - 2860.167}{5} = \frac{68.833}{5} = 13.8 \]
\[ s = \sqrt{13.8} = 3.7 \]
Method 2 for data set y:
y       y^2
2       4
2       4
3       9
3       9
4       16
5       25
6       36
Sum     25      103
\[ s^2 = \frac{103 - \frac{25^2}{7}}{7 - 1} = \frac{103 - 89.29}{6} = \frac{13.71}{6} = 2.3 \]
\[ s = \sqrt{2.3} = 1.5 \]
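Both methods should give the same sample variance. The short Python sketch below checks this for data sets x and y; the function names are illustrative, and results are rounded to 1 decimal place only for display.

```python
import math

def sample_variance_method1(data):
    """Method 1: sum of squared deviations from the mean, over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((v - mean) ** 2 for v in data) / (n - 1)

def sample_variance_method2(data):
    """Method 2: (sum of x^2 - (sum of x)^2 / n), over n - 1."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(v * v for v in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

x = [18, 19, 19, 23, 25, 27]
y = [2, 2, 3, 3, 4, 5, 6]

for name, data in (("x", x), ("y", y)):
    var1 = sample_variance_method1(data)
    var2 = sample_variance_method2(data)
    # The two methods agree up to floating-point rounding.
    print(name, round(var1, 1), round(var2, 1), round(math.sqrt(var1), 1))
```

Method 2 avoids computing each deviation, which is convenient by hand, but the unrounded mean should be used in Method 1 if the two results are to match.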
There is a corresponding standard deviation and variance for the population. The population
standard deviation is denoted σ, pronounced sigma. The population variance is σ^2, called sigma
squared. The formulas for these are found on page 88.
Data sets x and y are quite different and would be difficult to compare using the measures that we
have produced so far. To compare different data sets, we can use the coefficient of
variation.
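The notes do not spell out the formula. The sketch below assumes the usual definition for a sample, CV = (s / x̄) · 100%, which expresses the standard deviation as a percentage of the mean. The function name is hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    """Sample CV in percent: standard deviation relative to the mean.

    Assumes the standard definition CV = s / x-bar * 100.
    """
    return statistics.stdev(data) / statistics.mean(data) * 100

x = [18, 19, 19, 23, 25, 27]
y = [2, 2, 3, 3, 4, 5, 6]

cv_x = coefficient_of_variation(x)
cv_y = coefficient_of_variation(y)
print(f"CV of x: {cv_x:.1f}%")
print(f"CV of y: {cv_y:.1f}%")
```

Under this definition, data set y varies more relative to its mean than data set x does, even though its standard deviation is smaller.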
Chebyshev’s Theorem
This theorem is used to show how the data spread about the mean.
Results of Chebyshev’s Theorem: For any set of data,
- At least 75% of the data fall in the interval from μ - 2σ to μ + 2σ
- At least 88.9% of the data fall in the interval from μ - 3σ to μ + 3σ
- At least 93.8% of the data fall in the interval from μ - 4σ to μ + 4σ
Using x̄ to estimate μ and s to estimate σ, we can draw some conclusions about data set x.
At least 75% of the data must fall within 2 standard deviations of the mean:
x̄ - 2s to x̄ + 2s
21.8 – 2(3.7) to 21.8 + 2(3.7)
14.4 to 29.2
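The percentages above are consistent with the general statement of Chebyshev's Theorem, which is not written out in these notes: at least 1 − 1/k² of the data lie within k standard deviations of the mean, for k > 1. A minimal Python sketch under that assumption, with illustrative function names:

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

for k in (2, 3, 4):
    print(k, f"{chebyshev_bound(k) * 100:.2f}%")

# Data set x, using the rounded values from the notes.
low, high = chebyshev_interval(21.8, 3.7, 2)
print(round(low, 1), "to", round(high, 1))
```

For data set x, all six values (18 through 27) fall inside the 2-standard-deviation interval, which is consistent with the "at least 75%" guarantee.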
Section 2.5
Percentiles and Five-Number Summary
For a whole number P, where 1 ≤ P ≤ 99, the Pth percentile of the distribution is a value such that
P% of the data fall at or below it and (100 – P)% of the data fall at or above it.
Quartiles are special percentiles. The 25th percentile is the first quartile Q1, the 50th percentile is
the second quartile Q2 , and the 75th percentile is the third quartile Q3. The second quartile Q2 is
the same as the Median.
Data set D is shown. It has been arranged in ascending order.
2 5 7 8 8 11 12 14 20 23
23 25 26 27 28 29 31 36 36 42
1. Find the median, which is Q2. Data set D has 20 items, so the median falls between the 10th and
11th items.
\[ \text{Median} = \frac{23 + 23}{2} = 23 \]
2. Find Q1. This is the median of the lower half of the data (the 1st through 10th items).
\[ Q_1 = \frac{8 + 11}{2} = 9.5 \]
3. Find Q3. This is the median of the upper half of the data (the 11th through 20th items).
\[ Q_3 = \frac{28 + 29}{2} = 28.5 \]
Now the five-number summary can be given.
Lowest value 2
Q1 = 9.5
Median or Q2 = 23
Q3 = 28.5
Highest value 42
Interquartile Range IQR = Q3 – Q1
The interquartile range for data set D is IQR = 28.5 – 9.5 = 19
The interquartile range is used to examine the data to evaluate if any extremely large or small
value may produce too much influence on the data analysis.
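The quartile steps for data set D can be sketched in Python using the median-of-halves approach described above. The function name is illustrative. The notes only show an even number of items, so dropping the middle item from both halves when the count is odd is an assumption here, and library routines that interpolate percentiles may give slightly different quartiles.

```python
import statistics

def five_number_summary(data):
    """Return (lowest, Q1, median, Q3, highest) using median-of-halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle item belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )

D = [2, 5, 7, 8, 8, 11, 12, 14, 20, 23,
     23, 25, 26, 27, 28, 29, 31, 36, 36, 42]

summary = five_number_summary(D)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1
print(summary, "IQR =", iqr)
```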