Math Analysis: Statistics
Class Notes (Page 1)
Name:
Statistics: The mathematics of the collection, organization, and interpretation of
numerical data, especially the analysis of population characteristics by inference
from sampling.
The Arithmetic Mean is the most widely used measure of location.
It is calculated by summing the values and dividing by the number of values (the average):
$\bar{X} = \frac{\sum X}{n}$
The Median is the midpoint of the values after they have been ordered from the smallest to the
largest. There are as many values above the median as below it in the data array. For an even set
of values, the median will be the arithmetic average of the two middle numbers.
Example: The ages for a sample of five college students are: 21, 25, 19, 20, 22
Arranging the data in ascending order gives: 19, 20, 21, 22, 25. Thus the median is 21.
Example: Given the data set 8, 12, 14, 28, 36, 40, 42, 48, which is already in order, the median is
(28+36)/2 = 32.
The Mode is the value of the observation that appears most frequently.
There can be more than one mode per set of data, and if all of the data elements occur the same
number of times, then there is ‘no mode’.
Example: Given the data set: 450, 330, 260, 390, 330, 400, 450, 260, 410, 260, 180
the mode is 260, because it occurs the most times.
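The three measures of location above can be checked with a short Python sketch. It uses the standard `statistics` module for the mean and median; the helper `modes` and the variable names are illustrative assumptions, not part of the notes. The helper returns an empty list for the 'no mode' case described above.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value occurs equally often, the notes call this 'no mode'.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Data sets taken from the examples in the notes.
ages = [21, 25, 19, 20, 22]
even_set = [8, 12, 14, 28, 36, 40, 42, 48]
mode_example = [450, 330, 260, 390, 330, 400, 450, 260, 410, 260, 180]

print(mean(ages))           # sum of the values divided by the number of values
print(median(ages))         # middle value of an odd-sized set
print(median(even_set))     # average of the two middle values
print(modes(mode_example))  # most frequent value(s)
print(modes(ages))          # every age occurs once, so there is no mode
```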
Practice 1: Determine the mean, median, and mode for the following data set: 22, 30, 42, 33, 25,
30, 40, 38, 29.
The Range is the difference between the largest (maximum) and the smallest (minimum) value.
Example: The range for the data set from the mode example above is 450 - 180 = 270.
Stem-and-leaf display: A statistical technique for displaying a set of data. Each numerical value is
divided into two parts: the leading digits become the stem and the trailing digits the leaf.
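As a rough sketch of the technique, the Python below builds a stem-and-leaf display for the data set 8, 12, 14, 28, 36, 40, 42, 48 used in the median example. It assumes non-negative whole numbers with the last digit as the leaf; the function names `stem_and_leaf` and `format_plot` are illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Map each stem (leading digits) to its sorted leaves (trailing digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, leaf_unit)
        plot[stem].append(leaf)
    return dict(plot)


def format_plot(plot):
    """Render one 'stem | leaves' row per stem, keeping empty stems visible."""
    rows = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(rows)


data = [8, 12, 14, 28, 36, 40, 42, 48]
print(format_plot(stem_and_leaf(data)))
```

Each printed row shows one stem followed by its leaves in order, so the row for stem 4 lists the leaves 0, 2, and 8.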
Practice 2: Colin achieved the following scores on his twelve Accounting quizzes this semester:
86, 79, 92, 84, 49, 88, 91, 83, 96, 78, 82, 85. Construct a stem-and-leaf chart below.
The Percentile gives us the location, or ranking, of a data point in relation to the data set.
For instance, the 9th percentile is the value that is above exactly 9% of all the data points.
A special type of percentile is called the Quartile.
The first quartile, Q1, is the value that is above one quarter, or 25% of the data values.
The third quartile, Q3, is the value that is above three quarters, or 75% of the data values.
The first quartile, Q1, is essentially the median for the first half of the data.
The third quartile, Q3, is essentially the median for the second half of the data.
The Inter-quartile range is the distance between the third quartile Q3 and the first quartile Q1.
This distance will include the middle 50 percent of the observations.
Inter-quartile range = Q3 - Q1
Example: Given the following set of data: 52, 26, 33, 40, 35, 29, 26, 37, 28
Arranging the data in ascending order gives: 26, 26, 28, 29, 33, 35, 37, 40, 52. Thus the median is
33, Q1 is 27, and Q3 is 38.5. The inter-quartile range is Q3 - Q1 = 38.5 – 27 = 11.5.
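The quartile calculation in this example can be sketched in Python using the "median of each half" approach described above. The function name `quartiles` is an illustrative assumption, and the sketch leaves the median out of both halves when the number of values is odd, which is what the example does. Other textbooks and software use slightly different conventions and may give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values before the middle position
    upper = ordered[half + n % 2:]   # skip the median itself when n is odd
    return median(lower), median(ordered), median(upper)


data = [52, 26, 33, 40, 35, 29, 26, 37, 28]
q1, med, q3 = quartiles(data)
iqr = q3 - q1  # inter-quartile range = Q3 - Q1
print(q1, med, q3, iqr)
```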
A Box Plot (sometimes called a ‘box and whisker plot’) is a graphical display, based on quartiles,
that helps to picture a set of data.
Five pieces of data are needed to construct a box plot: the Minimum Value, the First Quartile, the
Median, the Third Quartile, and the Maximum Value.
Practice 3: Construct a box plot for the data set from the previous example:
A box plot sometimes includes an Outlier.
An outlier is an extreme value that is more than 1.5 times the inter-quartile range beyond the upper
or lower quartiles. If an outlier exists, it is marked by a single point, and each whisker is extended
to the last value of the data that is not an outlier.
Example: In the data from the previous example, the inter-quartile range was 11.5.
(1.5)(11.5) = 17.25. Q3 + 17.25 = 38.5+17.25 = 55.75. Since the highest value in the set was 52,
there is no outlier on that side. Q1 – 17.25 = 27-17.25 = 9.75. Since the lowest value in the set
was 26, there are no outliers on that side. So there are no outliers for this set of data.
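The outlier check above can be written as a small Python sketch. The function names `outlier_fences` and `split_outliers` are illustrative assumptions; the quartiles are passed in as arguments, using the values Q1 = 27 and Q3 = 38.5 from the example.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3):
    """Return the outliers and the whisker ends (last values that are not outliers)."""
    low, high = outlier_fences(q1, q3)
    outliers = sorted(v for v in values if v < low or v > high)
    inside = [v for v in values if low <= v <= high]
    whiskers = (min(inside), max(inside))
    return outliers, whiskers


data = [52, 26, 33, 40, 35, 29, 26, 37, 28]
print(outlier_fences(27, 38.5))        # limits Q1 - 17.25 and Q3 + 17.25
print(split_outliers(data, 27, 38.5))  # no outliers; whiskers reach 26 and 52
```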
Practice 4: Determine any outliers for this set of data: 20, 45, 46, 75, 40, 48, 47, 43.
To find the location of the percentile, p, in a data set containing n data points, first order the data
from smallest to largest. Then, to find the location in the ordered set, use the following formula.
$L_p = (n + 1)\frac{P}{100}$
If the location falls between two data points, you will find a value between those data points.
Example: Here is the process for finding the 18th percentile for the following data set:
30, 32, 37, 39, 41, 43, 44, 46, 48, 48, 53
In this problem, n = 11. Therefore the location of the 18th percentile is $L_p = (11 + 1)\frac{18}{100} = 2.16$,
which is between the 2nd and 3rd data points (32 and 37).
With a difference of 5 between those two points, the 18th percentile is 32 + 0.16(5), or $p_{18} = 32.80$.
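The location formula and the interpolation step can be sketched in Python as follows. The function name `percentile` is an illustrative assumption, and so is the handling of locations that fall before the first or after the last data point (the sketch simply returns the nearest end value, a case the notes do not cover).

```python
def percentile(values, p):
    """Value at the p-th percentile using the location (n + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100  # 1-based position in the ordered data
    # Assumption: clamp to the ends when the location falls outside the data.
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    lower_pos = int(location)            # whole part: data point just below
    fraction = location - lower_pos      # decimal part: how far to the next point
    lower = ordered[lower_pos - 1]
    upper = ordered[lower_pos]
    return lower + fraction * (upper - lower)


data = [30, 32, 37, 39, 41, 43, 44, 46, 48, 48, 53]
print(round(percentile(data, 18), 2))  # location 2.16, between 32 and 37
```

Rounding is applied only when printing, because the decimal part of the location is not stored exactly in floating point.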
Practice 5: Find the 67th percentile for the following data set: 45, 16, 99, 54, 31, 84, 86, 39, 60
Practice 6: For the given data set: 48, 40, 42, 6, 37, 43, 31, 44, 39
a) Create a stem-and-leaf plot.
b) Determine the mean, median, mode, maximum, and minimum.
c) Determine the first quartile and third quartile.
d) Determine the range, inter-quartile range, and any outliers.
e) Draw a box plot for this data.
f) Determine the 46th percentile.
Standard Deviation and Variance:
Just as the median is one measure of the middle of a set of data, so is the mean. The mean is defined as the
sum of the data divided by the number of elements.
The formula is:
$\bar{x} = \frac{\sum x}{n}$
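Just as the quartiles measure the dispersion or spread around the median, variance and standard deviation
measure the spread around the mean.
The formulas are:
Variance: $\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$
Standard deviation: $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$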
Example: Consider these test scores: 70, 98, 88, 98, 84, 75, 100, 76
Here is the process for finding the Standard Deviation:
1. Find the mean.
2. Find the difference between each data point and the mean. This is the deviation from the mean.
The sum of all these differences equals zero (some are positive and some are negative).
3. To do away with the problem of negative numbers, square each difference. Find the sum and
divide by the number of data points. This is the VARIANCE.
4. The variance is hard to compare with the data, as it is in squared units rather than on the scale of the
original data. Take the square root of the variance. This is the STANDARD DEVIATION.
Find the mean first: $\bar{x}$ =
Fill in the chart below to find the variance and standard deviation.
Chart columns: data value $x$; deviation $x - \bar{x}$; squared deviation $(x - \bar{x})^2$
Variance =
Standard Deviation =
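The four steps can also be sketched in Python. To leave the chart above for you to complete, this sketch runs on a different data set: the five student ages 21, 25, 19, 20, 22 from the median example. The function names are illustrative assumptions, and the variance divides by n, as in the steps above.

```python
from math import sqrt


def deviation_table(values):
    """Rows of (x, deviation from the mean, squared deviation)."""
    m = sum(values) / len(values)                      # step 1: the mean
    return [(x, x - m, (x - m) ** 2) for x in values]  # steps 2 and 3


def population_variance(values):
    """Sum of the squared deviations divided by the number of data points."""
    rows = deviation_table(values)
    return sum(squared for _, _, squared in rows) / len(rows)


def population_std_dev(values):
    """Step 4: the square root of the variance."""
    return sqrt(population_variance(values))


ages = [21, 25, 19, 20, 22]
for x, deviation, squared in deviation_table(ages):
    print(x, round(deviation, 2), round(squared, 2))
print(round(population_variance(ages), 2))
print(round(population_std_dev(ages), 2))
```

For these ages the mean is 21.4, the squared deviations add up to 21.2, the variance is 21.2 / 5 = 4.24, and the standard deviation is about 2.06.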
Practice 7: Consider the following heights of students in inches:
52, 72, 62, 67, 65, 73, 55, 60, 62, 56
Find the variance and standard deviation for this set of data.
A Frequency Distribution is a grouping of data into mutually exclusive categories showing the
number of observations in each class.
Some concepts associated with Frequency Distributions are the following:
Class frequency: The number of observations in each class.
Class interval: The class interval is obtained by subtracting the lower limit of a class from the
lower limit of the next class. The class intervals used in the frequency distribution should be
equal.
Class Mark: The midpoint of a class interval.
Number of Classes: Use at least k classes, where $2^k > n$ (n is the number of data points).
(This is the $2^k$ rule.)
Determine a suggested class interval (i) by using the formula:
$i \geq \frac{\text{Highest value} - \text{Lowest value}}{\text{Number of classes}}$
Note: this is a suggested class interval; if the computed class interval is '97', for instance, it may be better to use '100'.
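A short Python sketch of these two calculations is given here, run on the eleven values from the mode example (450, 330, 260, ...). The function names `number_of_classes` and `suggested_interval` are illustrative assumptions.

```python
def number_of_classes(n):
    """Smallest k for which 2**k is greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def suggested_interval(values, k):
    """(Highest value - Lowest value) / Number of classes."""
    return (max(values) - min(values)) / k


data = [450, 330, 260, 390, 330, 400, 450, 260, 410, 260, 180]
k = number_of_classes(len(data))
print(k)                            # 2**3 = 8 is too small for n = 11; 2**4 = 16 works
print(suggested_interval(data, k))  # (450 - 180) / 4
```

The result, 67.5, is only a suggestion; following the note above, a rounder width such as 70 would probably be easier to work with.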
A Relative Frequency Distribution shows the percent of observations in each class.
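The sketch below tallies a frequency distribution and its relative frequencies in Python for the mode example data. The class layout is an assumption made for illustration: 4 classes of width 70 starting at 180 (the suggested interval (450 - 180) / 4 = 67.5, rounded up), with each class including its lower limit but not its upper limit. The function names are also illustrative.

```python
def frequency_distribution(values, lower_limit, interval, k):
    """Count the observations falling in each of k equal-width classes."""
    counts = [0] * k
    for v in values:
        index = int((v - lower_limit) // interval)
        if not 0 <= index < k:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return counts


def relative_frequencies(counts):
    """Percent of all observations in each class."""
    total = sum(counts)
    return [100 * c / total for c in counts]


data = [450, 330, 260, 390, 330, 400, 450, 260, 410, 260, 180]
counts = frequency_distribution(data, lower_limit=180, interval=70, k=4)
for i, (c, pct) in enumerate(zip(counts, relative_frequencies(counts))):
    lower = 180 + 70 * i
    print(f"{lower} up to {lower + 70}: frequency {c}, {pct:.1f}%")
```

Because the classes are mutually exclusive, the class frequencies add up to the number of data points and the percents add up to 100.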
A Histogram is a graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis. It is a visual representation of the Frequency Distribution. The
class frequencies are represented by the heights of the bars and the bars are drawn adjacent to each
other. An example is drawn below:
[Figure: histogram of hours spent studying. Horizontal axis: classes 7.5-12.5, 12.5-17.5, 17.5-22.5, 22.5-27.5, 27.5-32.5, 32.5-37.5. Vertical axis: Frequency, scaled from 0 to 14.]
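A histogram can be roughed out in plain text with a few lines of Python. This sketch is an illustration only: it draws the bars sideways (one row per class) rather than upright, the function name `text_histogram` is an assumption, and the classes and frequencies are not those of the figure above. They come from grouping the mode example data (450, 330, 260, ...) into assumed classes of width 70 starting at 180.

```python
def text_histogram(labels, frequencies, symbol="#"):
    """One row per class; the bar length equals the class frequency."""
    width = max(len(label) for label in labels)
    rows = [
        f"{label.rjust(width)} | {symbol * freq}"
        for label, freq in zip(labels, frequencies)
    ]
    return "\n".join(rows)


# Assumed classes and their frequencies for the mode example data.
labels = ["180-250", "250-320", "320-390", "390-460"]
frequencies = [1, 3, 2, 5]
print(text_histogram(labels, frequencies))
```

The rows are printed with no gaps between them, which mirrors the way the bars of a histogram are drawn adjacent to each other.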
Practice 8: Construct a frequency distribution, as well as a histogram, for the following data set
(representing number of people per household): 5, 3, 6, 7, 4, 4, 2, 5, 3, 2.