Chapter 9
Statistics
Frequency Distributions; Measures
of Central Tendency

Frequency Distributions
Three types of frequency distributions:
- Categorical: primarily for nominal and ordinal level data (FYI)
- Grouped: the range of the data is large
- Ungrouped: the range of the data is small; a single data value for each class (FYI)

Grouped Frequency Distributions
- Step 1: Order the data from smallest to largest
- Step 2: Determine the number of classes (i.e., class intervals) using Sturges' Rule, $k = 1 + 3.322\,\log_{10} n$, where n is the number of observations (data values). *Always round up

Class intervals are contiguous, nonoverlapping intervals
selected in such a way that they are mutually exclusive and
exhaustive. That is, each and every value in the set of data
can be placed in one, and only one, of the intervals.

- Step 3: Determine the width of the class intervals

$$W = \frac{R}{k}$$

where W is the class width, the range R = largest value - smallest value, and k is the number of classes from Sturges' Rule
- Step 4: Assign observations to class intervals
The count in each class interval represents the frequency for that interval.
The smallest observation serves as the first lower class limit (LCL). Add the 'width minus one' to the LCL to get the upper class limit (UCL).

NOTE: Technically, class limits (e.g., 0-5, 6-11, 12-17 and so on) are not adjacent. However, class boundaries account for the space between the class limit intervals (e.g., -0.5 to 5.5, 5.5 to 11.5, 11.5 to 17.5 and so on). Boundaries are written for convenience but understood to mean all values up to but not including the upper boundary.
- Step 5: Calculate cumulative and relative frequencies
Cumulative Frequency: add the number of observations from the first interval through the interval in question, inclusive.
Relative Frequency: divide the number of observations in each class interval by the total number of observations.
Cumulative Relative Frequency: same calculation as cumulative frequency, but using the relative frequencies.

A Frequency Distribution Table
Class Int. (LCL - UCL) | Freq. | Cum. Freq. | Rel. Freq. | Cum. Rel. Freq.
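
Steps 1 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are integers, the width from Step 3 is rounded up to a whole number (the slides only say to round k up), classes are added until the largest value is covered, and the function name `grouped_frequency_table` and the `sample` data are made up for the example.

```python
import math

def grouped_frequency_table(data):
    """Build a grouped frequency distribution for integer data (Steps 1-5)."""
    values = sorted(data)                              # Step 1: order the data
    n = len(values)
    k = math.ceil(1 + 3.322 * math.log10(n))           # Step 2: Sturges' Rule, rounded up
    value_range = values[-1] - values[0]
    width = max(1, math.ceil(value_range / k))         # Step 3: W = R / k (rounding up is an assumption)

    rows = []
    lcl = values[0]                                    # Step 4: smallest value is the first LCL
    cum_freq = 0
    while lcl <= values[-1]:                           # keep adding classes until the maximum is covered
        ucl = lcl + width - 1                          # UCL = LCL + (width - 1)
        freq = sum(1 for v in values if lcl <= v <= ucl)
        cum_freq += freq                               # Step 5: cumulative and relative frequencies
        rows.append({
            "lcl": lcl,
            "ucl": ucl,
            "freq": freq,
            "cum_freq": cum_freq,
            "rel_freq": freq / n,
            "cum_rel_freq": cum_freq / n,
        })
        lcl = ucl + 1
    return rows

# Hypothetical data set of 20 integer observations (not from the slides).
sample = [12, 15, 7, 22, 30, 9, 18, 25, 14, 11,
          27, 19, 8, 16, 21, 13, 24, 10, 17, 29]

for row in grouped_frequency_table(sample):
    print(f'{row["lcl"]:>3} - {row["ucl"]:<3} {row["freq"]:>3} {row["cum_freq"]:>3} '
          f'{row["rel_freq"]:.2f} {row["cum_rel_freq"]:.2f}')
```

For the 20 sample values, Sturges' Rule gives k = 6 and the width works out to 4, so the first class is 7 - 10.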

Measures of Central Tendency: the value(s) the data tend to center around
- Arithmetic mean (average)
- Mode
- Median

Arithmetic mean (sample mean or sample average), "x-bar"

Ungrouped data (individual data such as 5, 6, 10, 14, etc.)

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

where $x_i$ is each data value (observation) in the data set and n is the number of observations in the data set

Calculate the sample mean for ungrouped data:
- Step 1: add all values in the data set
- Step 2: divide the total by the number of values summed

Example
7.0   6.2   7.7   8.0   6.4   6.2
7.2   5.4   6.4   6.5   7.2   5.4
*This is ungrouped data
n = 12

$$\bar{x} = \frac{7.0+6.2+7.7+8.0+6.4+6.2+7.2+5.4+6.4+6.5+7.2+5.4}{12} = \frac{79.6}{12} \approx 6.63$$
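
The same two steps can be checked in Python. This is a minimal sketch; the helper name `sample_mean` is chosen here for illustration and is not from the slides.

```python
# The twelve ungrouped observations from the example.
values = [7.0, 6.2, 7.7, 8.0, 6.4, 6.2, 7.2, 5.4, 6.4, 6.5, 7.2, 5.4]

def sample_mean(xs):
    """Step 1: add all values; Step 2: divide by the number of values."""
    return sum(xs) / len(xs)

x_bar = sample_mean(values)
print(round(x_bar, 2))  # 6.63
```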

Grouped data (assumes each value (observation) falling within a given class interval is equal to the value of the midpoint of that interval)

$$\bar{x} = \frac{\sum f_i x_i}{n}$$

where $x_i$ represents each class interval midpoint (class mark)* and $f_i$ represents the frequency of that class interval
*an easy way to determine the class mark is to add the upper class limit (boundary) to the lower class limit (boundary), then divide by 2.

Calculate the sample mean for grouped data:
- Step 1: multiply each class mark by its corresponding frequency
- Step 2: add the resulting products
- Step 3: divide the total by the number of observations

Example
Class Limits | Frequency (f_i) | Class Mark (x_i) | x_i f_i
90-98 | 6 (see note below) | 94 | 564
99-107 | 22 | 103 | 2266
108-116 | 43 | 112 | 4816
117-125 | 28 | 121 | 3388
126-134 | 9 | 130 | 1170
Total | 108 | | 12204

$$\bar{x} = \frac{12204}{108} = 113$$

Note: Where did the number 6 come from? There are 6 data values (observations) in the data set that fall in the range 90-98 (inclusive).
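
The grouped-mean calculation in this example can be reproduced as follows. It is a small sketch using the class limits and frequencies from the table; the variable names are illustrative only.

```python
# Class limits and frequencies from the example table.
class_limits = [(90, 98), (99, 107), (108, 116), (117, 125), (126, 134)]
frequencies = [6, 22, 43, 28, 9]

# Class mark = (lower class limit + upper class limit) / 2
class_marks = [(lower + upper) / 2 for lower, upper in class_limits]

n = sum(frequencies)                                              # total observations
products = [f * x for f, x in zip(frequencies, class_marks)]      # Step 1
grouped_mean = sum(products) / n                                  # Steps 2 and 3

print(class_marks)    # [94.0, 103.0, 112.0, 121.0, 130.0]
print(grouped_mean)   # 113.0
```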

Mode: the value that occurs most frequently

Ungrouped data
- Step 1: identify the data value that occurs most frequently
- Bi-modal: two values occurring at the same (highest) frequency
- No mode: all values different (not the same as mode = 0)

Grouped data
- Step 1: specify the modal class (i.e., the class interval containing the largest number of observations)

Mode for ungrouped data
7.0   6.2   7.7   8.0   6.4   6.2
7.2   5.4   6.4   6.5   7.2   5.4
- There are four numbers that appear two times each: 5.4, 6.2, 6.4, 7.2
- Therefore there are four modes
- The data set is quad-modal

Modal class for grouped data
- The modal class: 108-116, the 3rd class (the class with the largest number of data values)
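
Both mode examples can be checked with a short Python sketch. The function name `modes` is hypothetical; it returns an empty list when every value occurs only once, matching the "no mode" case described earlier.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency (empty list if no mode)."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # all values different: no mode
    return sorted(v for v, c in counts.items() if c == top)

# Ungrouped example from the slides.
values = [7.0, 6.2, 7.7, 8.0, 6.4, 6.2, 7.2, 5.4, 6.4, 6.5, 7.2, 5.4]
print(modes(values))                   # [5.4, 6.2, 6.4, 7.2]

# Grouped example: the modal class is the class with the largest frequency.
class_limits = [(90, 98), (99, 107), (108, 116), (117, 125), (126, 134)]
frequencies = [6, 22, 43, 28, 9]
modal_class = class_limits[frequencies.index(max(frequencies))]
print(modal_class)                     # (108, 116)
```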

Median: the value above which half the values in a data set lie and below which the other half lie (the middle value)

Ungrouped data
- Step 1: arrange the values in order of magnitude (smallest to largest)
- Step 2: locate the middle value

Median for ungrouped data
5.4   5.4   6.2   6.2   6.4   6.4   6.5   7.0   7.2   7.2   7.7   8.0
- There is an even number of values (n = 12), therefore we must take the average of the middle two values

$$\frac{6.4 + 6.5}{2} = 6.45$$
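
A minimal Python sketch of the median procedure, covering both the odd and the even case. The function name `median` is chosen for illustration.

```python
def median(values):
    """Sort the values, then take the middle one (or the average of the middle two)."""
    ordered = sorted(values)           # Step 1: arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle two

values = [7.0, 6.2, 7.7, 8.0, 6.4, 6.2, 7.2, 5.4, 6.4, 6.5, 7.2, 5.4]
print(round(median(values), 2))        # 6.45
```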
Measures of Variation (Dispersion)

Range (R) (for ungrouped data only)

Ungrouped data
- Step 1: Take the difference between the largest and smallest values in a data set. For example, the data set 5, 6, 10, 14 has a range of 9 because 14 (the largest value) minus 5 (the smallest value) is 9.

Deviations from the Mean
- Differences found by subtracting the mean from each number in a sample
- Given 3, 5, 2, 6, the mean ($\bar{x}$) is 4
- The deviations from the mean would be -1, 1, -2, 2

Variance ($s^2$): an average of the squares of the deviations of the individual values from their mean.

Ungrouped data

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Standard deviation (s)
- Step 1: Calculate the sample standard deviation for grouped or ungrouped data by taking the square root of the variance

$$s = \sqrt{s^2}$$

Example
8   6   3   0   0   5   9   2   1   3   7   10   0   3   6
*This is ungrouped data
$\bar{x} = 4.2$, n = 15

(a) Range (R) = 10 - 0 = 10

(b) Variance:

$$s^2 = \frac{(8-4.2)^2 + (6-4.2)^2 + (3-4.2)^2 + (0-4.2)^2 + (0-4.2)^2 + (5-4.2)^2 + (9-4.2)^2 + (2-4.2)^2 + (1-4.2)^2 + (3-4.2)^2 + (7-4.2)^2 + (10-4.2)^2 + (0-4.2)^2 + (3-4.2)^2 + (6-4.2)^2}{15-1} = \frac{158.40}{14} \approx 11.31$$

(c) Standard deviation: $s = \sqrt{11.31} \approx 3.36$
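
Parts (a) through (c) can be verified with a short Python sketch that follows the definitions directly (deviations from the mean, squared, summed, divided by n - 1). Variable names are illustrative.

```python
import math

# The fifteen ungrouped observations from the example.
data = [8, 6, 3, 0, 0, 5, 9, 2, 1, 3, 7, 10, 0, 3, 6]

n = len(data)
mean = sum(data) / n                                   # 4.2
data_range = max(data) - min(data)                     # (a) range
deviations = [x - mean for x in data]                  # deviations from the mean
variance = sum(d ** 2 for d in deviations) / (n - 1)   # (b) sample variance
std_dev = math.sqrt(variance)                          # (c) sample standard deviation

print(data_range)            # 10
print(round(variance, 2))    # 11.31
print(round(std_dev, 2))     # 3.36
```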

Grouped data

$$s^2 = \frac{n \sum \left(x_i^2 f_i\right) - \left(\sum x_i f_i\right)^2}{n(n-1)}$$

where $x_i$ represents each class boundary (or limit) midpoint (class mark)*
where $f_i$ represents each class frequency
*an easy way to determine the class mark is to add the upper class limit (boundary) to the lower class limit (boundary), then divide by 2.

Calculate the sample variance for grouped data:
- Step 1: multiply each squared class mark by its corresponding frequency
- Step 2: add the resulting products
- Step 3: multiply the sum by n  [A]
- Step 4: multiply each class mark by its corresponding frequency
- Step 5: add the resulting products
- Step 6: square the sum  [B]
- Step 7: perform the subtraction  [C] = [A] - [B]
- Step 8: divide [C] by n(n-1)


Example
Class limits | freq (f_i) | x_i | x_i f_i | x_i^2 f_i
90-98 | 6 | 94 | 564 (= 94 × 6) | 53,016 (= 94^2 × 6)
99-107 | 22 | 103 | 2266 | 233,398
108-116 | 43 | 112 | 4816 | 539,392
117-125 | 28 | 121 | 3388 | 409,948
126-134 | 9 | 130 | 1170 | 152,100
Total | 108 | | 12204 | 1,387,854

Refer to the formula for variance of grouped data below and see if you can fill in the formula using values from the table on the previous slide.

$$s^2 = \frac{n \sum \left(x_i^2 f_i\right) - \left(\sum x_i f_i\right)^2}{n(n-1)}$$

$$s^2 = \frac{108(1{,}387{,}854) - (12{,}204)^2}{108(107)} = \frac{149{,}888{,}232 - 148{,}937{,}616}{11{,}556} = \frac{950{,}616}{11{,}556} \approx 82.26$$

Therefore $s = \sqrt{82.26} \approx 9.07$
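
The eight-step grouped variance calculation can be reproduced in Python. This sketch uses the class marks and frequencies from the example table; the bracketed labels in the comments refer to the [A], [B], [C] quantities in the steps.

```python
import math

class_marks = [94, 103, 112, 121, 130]     # x_i
frequencies = [6, 22, 43, 28, 9]           # f_i
n = sum(frequencies)                       # 108

sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))  # Steps 1-2
a = n * sum_x2f                                                      # Step 3: [A]
sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))        # Steps 4-5
b = sum_xf ** 2                                                      # Step 6: [B]
c = a - b                                                            # Step 7: [C]
variance = c / (n * (n - 1))                                         # Step 8
std_dev = math.sqrt(variance)

print(sum_xf, sum_x2f)       # 12204 1387854
print(round(variance, 2))    # 82.26
print(round(std_dev, 2))     # 9.07
```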
The Normal Distribution

[Figure: bell-shaped curve, f(x) plotted against x]
- Also known as the "bell-shaped" curve
- Some statisticians say it is the most important distribution in statistics
- Most popular distribution in statistics

The normal density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\pi \approx 3.142$ and $e \approx 2.718$

Properties of the Normal Distribution
- symmetrical about the mean
- mean = median = mode
- area under the curve = 1
- each different μ and σ specifies a different normal distribution, thus the normal distribution is really a family of distributions
- a very important member of the family is the standard normal distribution

The Standard Normal Distribution
- has mean (μ) = 0
- has standard deviation (σ) = 1
- the normal density function reduces to

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$

The probability that z lies between any two points on the z-axis is determined by the area bounded by perpendiculars erected at each of the points, the curve, and the horizontal axis.
[Figure: standard normal curve f(z) with points a and b marked on the z-axis; the area between them is P(a < z < b)]

Generally we find the area under the curve for a continuous distribution via calculus by integrating the function between a and b.

$$P(a < z < b) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz$$

However, we don't have to integrate because we have a table that has calculated this area.
- See TABLE 1 of Appendix A-2

Exercises 6-3 #7 p. 282
Find the area under the normal distribution curve between z = 0 and z = 0.56
- So, we want P(0 < z < 0.56)
- From the standard normal table we find that P(0 < z < 0.56) = 0.2123
[Figure: standard normal curve f(z) with a and b marked on the z-axis, where a = 0 and b = 0.56]

Exercises 6-3 #16 p. 283
Find the area under the normal distribution curve between z = -0.87 and z = -0.21
- So we want P(-0.87 < z < -0.21)
[Figure: standard normal curve with a, b, and 0 marked on the z-axis, where a = -0.87 and b = -0.21]

The table gives a probability of 0.3078 at z = 0.87 (note that the area is the same for negative or positive z since the distribution is symmetrical). This area covers values of z from 0 out to -0.87. Since we don't want that entire area, we subtract the area from 0 out to -0.21. That is, we subtract 0.0832, which is the table area at z = 0.21.

So 0.3078 - 0.0832 = 0.2246

Exercises 6-3 #25 p. 283
Find the area under the normal distribution curve to the right of z = 1.92 and to the left of z = -0.44
- So we want P(z > 1.92) + P(z < -0.44) = 0.3574
[Figure: standard normal curve with a, 0, and b marked on the z-axis, where a = -0.44 and b = 1.92]

Since the table area at z = 0.44 is 0.1700, which is the area under the curve from 0 out to 0.44, the remaining area of interest in the left tail has to be 0.5 - 0.1700 = 0.3300.
AND
Since the table area at z = 1.92 is 0.4726, which is the area under the curve from 0 out to 1.92, the remaining area of interest in the right tail has to be 0.5 - 0.4726 = 0.0274.
So the combined areas of interest are 0.3300 + 0.0274 = 0.3574
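
The table lookups in Exercises #7, #16, and #25 can be cross-checked numerically. The sketch below is not the table method from the slides: it evaluates the standard normal integral through the error function, using the identity Φ(z) = ½[1 + erf(z/√2)]. The helper names `phi` and `area_between` are made up for this illustration.

```python
import math

def phi(z):
    """Standard normal CDF: area under the curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_between(a, b):
    """P(a < z < b) for the standard normal distribution."""
    return phi(b) - phi(a)

ex7 = area_between(0, 0.56)               # Exercise #7
ex16 = area_between(-0.87, -0.21)         # Exercise #16
ex25 = (1 - phi(1.92)) + phi(-0.44)       # Exercise #25: right tail plus left tail

print(f"{ex7:.4f} {ex16:.4f} {ex25:.4f}")
```

The results agree with the table answers (0.2123, 0.2246, 0.3574) to within the rounding of a four-decimal table.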

Exercises 6-3 #45
Given that the shaded area is 0.8962, what would be the value of z?
[Figure: standard normal curve with shaded area 0.8962; z = ? is marked to the left of 0]

z has to be equal to -1.26. Recall that one-half of the area under the curve is 0.5, so the area from 0 out to z is equal to 0.3962 (0.8962 - 0.5000). If we look in the body of the standard normal table for an area of 0.3962, we find that value at the intersection of the 13th row and 7th column, which corresponds to a z value of 1.26. Since z is located to the left of 0 it has to be negative, hence -1.26.
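
The reverse lookup (area to z) can also be done in Python. This sketch assumes the shaded area of 0.8962 lies to the right of z, which is consistent with the solution's reasoning that z is to the left of 0; it uses `statistics.NormalDist` from the standard library (Python 3.8+) instead of the printed table.

```python
from statistics import NormalDist

shaded_area = 0.8962                     # assumed to be the area to the right of z
area_left_of_z = 1 - shaded_area         # area in the left tail, 0.1038

# inv_cdf returns the z value with the given area to its left.
z = NormalDist().inv_cdf(area_left_of_z)
print(round(z, 2))                       # -1.26
```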

Section 6-4
Applications of the Normal Distribution
- To solve problems for a normally distributed variable with μ ≠ 0 or σ ≠ 1 we MUST transform the variable to a standard normal variable, that is, P(x1 < X < x2) becomes P(z1 < Z < z2), which allows us to use the standard normal table.
- Using

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$

Example
A survey found that people keep their television sets an average of 4.8 years. The standard deviation is 0.89 year. If a person decides to buy a new TV set, find the probability that he or she has owned the set for the following amount of time. Assume the variable is normally distributed.
(a) Less than 2.5 years
(b) Between 3 and 4 years
(c) More than 4.2 years
μ = 4.8, σ = 0.89

(a) P(X < 2.5) becomes P(z < -2.58) because z = (2.5 - 4.8)/0.89 = -2.58. The table area at z = 2.58 is 0.4951, therefore P(z < -2.58) = 0.5 - 0.4951 = 0.0049.
[Figure: standard normal curve with -2.58 and 0 marked on the z-axis]

(b) P(3 < X < 4) becomes P(-2.02 < z < -0.90) because z = (3 - 4.8)/0.89 = -2.02 and z = (4 - 4.8)/0.89 = -0.90. From the standard normal table at a z of 2.02 we get 0.4783 and at a z of 0.90 we get 0.3159, so P(-2.02 < z < -0.90) = 0.4783 - 0.3159 = 0.1624.
[Figure: standard normal curve with -2.02, -0.90, and 0 marked on the z-axis]

(c) P(X > 4.2) becomes P(z > -0.67) because z = (4.2 - 4.8)/0.89 = -0.67. From the standard normal table at a z of 0.67 we get 0.2486, so P(z > -0.67) = 0.2486 + 0.5 = 0.7486.
[Figure: standard normal curve with -0.67 and 0 marked on the z-axis]
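
The three parts of the television example can be checked with a short sketch. To mirror the table method, each z value is rounded to two decimals before the area is computed; that rounding, the helper name `z_score`, and the use of `statistics.NormalDist` (Python 3.8+) are choices made for this illustration.

```python
from statistics import NormalDist

mu, sigma = 4.8, 0.89          # mean and standard deviation from the example
std_normal = NormalDist()      # standard normal: mean 0, standard deviation 1

def z_score(x):
    """Transform x to a standard normal value, rounded as for a table lookup."""
    return round((x - mu) / sigma, 2)

p_a = std_normal.cdf(z_score(2.5))                               # (a) P(X < 2.5)
p_b = std_normal.cdf(z_score(4)) - std_normal.cdf(z_score(3))    # (b) P(3 < X < 4)
p_c = 1 - std_normal.cdf(z_score(4.2))                           # (c) P(X > 4.2)

print(z_score(2.5), z_score(3), z_score(4), z_score(4.2))
print(f"{p_a:.4f} {p_b:.4f} {p_c:.4f}")
```

The printed probabilities match the table answers 0.0049, 0.1624, and 0.7486 to four decimal places.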

Review Exercises #9
- Area (percentage) = 0.5
- μ = 100
- σ = 15

An area of 0.5 centered on the mean leaves about 0.25 on each side of the mean, which the table places at z = -0.67 and z = 0.67. We can find the X values that correspond to the z values by using the same transformation equation.

-0.67 = (x - 100)/15, so 15(-0.67) = x - 100 and x = 89.95
0.67 = (x - 100)/15, so 15(0.67) = x - 100 and x = 110.05

Therefore the lowest and highest scores are in the range 89.95 < x < 110.05.
[Figure: normal curve with -0.67, 0, and 0.67 marked on the z-axis]
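
The back-transformation from z to x can be sketched as below. The exercise statement itself is not reproduced in the slides, so this only mirrors the arithmetic shown: it takes z = ±0.67 as given and solves z = (x - μ)/σ for x, then confirms that the area between the two scores is close to 0.5.

```python
from statistics import NormalDist

mu, sigma = 100, 15
z = 0.67                         # table value used in the solution

# Solve z = (x - mu) / sigma for x at -z and +z.
low = mu - z * sigma
high = mu + z * sigma
print(round(low, 2), round(high, 2))     # 89.95 110.05

# Area between the two scores under the normal curve with this mean and sd.
scores = NormalDist(mu, sigma)
middle_area = scores.cdf(high) - scores.cdf(low)
print(round(middle_area, 2))             # 0.5
```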