Download mean

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

World Values Survey wikipedia , lookup

Time series wikipedia , lookup

Transcript
CHAPTET 3
Data Description
OUTLINE:
Introduction.
3-1 Measures of Central Tendency.
3-2 Measures of Variation.
3-3 Measures of Position.
3-4 Exploratory Data Analysis

3-1 MEASURES OF CENTRAL
TENDENCY:
Mean, Median, Mode, Midrange,
Weighted Mean.
Statistic: 
Is a characteristic or
measure obtained by using
all the data values from a
sample.
Parameter: 
Is a characteristic or measure
obtained by using all the
data values from a specific
population.
THE MEAN:
 The
mean is the sum of the values, divided by the
total number of values.

x
The symbol for the
sample mean:
X
X
n

X 1  X 2  X 3  .............  X n
n
Where n: no. of val. In
sample.
The symbol for the
population mean:

X
N

X 1  X 2  X 3  .............  X n
N
Where N: no. of val. In
population.
EXAMPLE 3-1:
The data represent the number of days off per year for
a sample of individuals selected from nine different
countries. Find the mean
20, 26, 40, 36, 23, 42, 35, 24, 30
PROPERTIES OF THE MEAN






Uses all data values.
Varies less than the median or mode
Used in computing other statistics, such as the
variance
Unique, usually not one of the data values
Cannot be used with open-ended classes
Affected by extremely high or low values, called
outliers
THE MEDIAN:
 The
median (MD) is the midpoint of the
data array.
*** The median is found by arranging the data
in order and selecting the middle number ***
Odd number of values
The median will be one
of the data values in the
middle.
Even number of values
The median will be the
average of two data
values.
EXAMPLE 3-4:
The number of rooms in the seven hotels in
downtown Pittsburgh is:
713, 300, 618, 595, 311, 401, and 292..
Find the median
EXAMPLE 3-6:
The number of tornadoes that have occurred in
the United States over an 8-year period follows.
684, 764, 656, 702, 856, 1133, 1132, 1303
Find the median
PROPERTIES OF THE MEDIAN
Gives the midpoint
 Used when it is necessary to find out whether the
data values fall into the upper half or lower half of
the distribution.
 Can be used for an open-ended distribution.
 Affected less than the mean by extremely high or
extremely low values.

•The
cost of four toys in a certain toy shop is given:
15$, 40$, 30$, 1350$
measure of central tendency should be used:
a)
b)
c)
d)
mean
median
mode
range
THE MODE:
 The
value that occurs most often in a data set is called
the mode.
No mode
When no
data value
occurs more
than once.
Unimodal
A data set
that has only
one mode.
Multimodal
Bimodal
A data set
that has two
mode.
A data set that
has more than
two mode.
EXAMPLE 3-9:
Find the mode of the signing bonuses of eight
NFL players for a specific year. The bonuses in
millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
EXAMPLE 3-10:
Find the mode for the number of coal employees per
county for 10 selected counties in southwestern
Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
PROPERTIES OF THE MODE
 Used
when the most typical case is desired
 Easiest average to compute
 Can be used with nominal data (Qualitative data)
 Not always unique or may not exist
* The ………. Is a measure of central tendency should be used when
the data are qualitative.
a)
b)
c)
d)
median
mean
mode
range
----------------------------------------------------------------------* When the data are categorical (red, blue, black, white) the measure
of central tendency can be used:
a)
b)
c)
d)
median
mean
mode
range
•If
the number of books in a sample of 5-box are as
follow : 11, 18, 2, 2, 7, 7, 2, 5
the set is said to have ……….
a)
b)
c)
d)
Multimodal.
Unimodal.
Zero.
Bimodal.
THE MIDRANGE:
 The
midrange is defined as the sum of the
lowest and highest values in the data set,
divided by 2.
 The
symbol of midrange is MR.
lowest .value  highest .value L.v  H .v
MR 

2
2
EXAMPLE 3-15:
In the last two winter seasons, the city of
Brownsville, Minnesota, reported these
numbers of water-line breaks per month..
2, 3, 6, 8, 4, 1
Find the midrange
PROPERTIES OF THE MIDRANGE
 Easy
to compute.
 Gives the midpoint.
 Affected by extremely high or low values in a
data set.
THE WEIGHTED MEAN:
 Find
the Weighted Mean is used when the values
in a data set are not equally represented.

w1 X 1  w2 X 2  .......  wn X n  wX
Xw 

w1  w2  .......  wn
w
w1 ,w2 , …., wn are the weights
 And x1 , x2 , ….., xn are the values.
 Where
EXAMPLE 3-17:
A student received the following grades. Find
the corresponding GPA.
Course
Credits,
Grade,
English Composition
3
A (4 points)
Introduction to Psychology
3
C (2 points)
Biology
4
B (3 points)
Physical Education
2
D (1 point)
SUMMARY OF MEASURES OF CENTRAL
TENDENCY.
Measure
Mean
Definition
Sum of values, divided by
total no. of values
Median
Middle point in data array
Mode
Most frequent data value
Midrange L.V plus H.V ,divided by 2
Symbol
, x
MD
MR
y
Positively skewed
Mode Median
Mean
Mean > MD> Mode
y
x
Negatively skewed
Mean
Mode
Median
Mean < MD< Mode
x
Symmetric distribution
Mode
Median
Mean
= MD = D
3-2 MEASURES OF
VARIATION:
How Can We Measure Variability?
oRange
oVariance
oStandard
Deviation
oCoefficient
of Variation
RANGE

The range is the difference between the highest and
lowest values in a data set.
R  Highest  Lowest
EXAMPLE 3-18/19:
Two experimental brands of outdoor paint are
tested to see how long each will last before fading.
Six cans of each brand constitute a small
population. The results (in months) are shown.
Find the mean and range of each group.
Brand A
Brand B
10
35
60
45
50
30
30
35
40
40
20
25
VARIANCE & STANDARD DEVIATION



The variance is the average of the squares of the
distance each value is from the mean.
The standard deviation is the square root of the
variance.
The standard deviation is a measure of how spread
out your data are.

The population variance is


2
The population standard deviation is


X X



n 1
2
s
 X  X 
n 1
2
N
n X    X 
2
s 
2
The sample standard deviation is
2
N
 X  
The sample variance is
s2

X  


2
n  n  1
2
USES OF THE VARIANCE AND
STANDARD DEVIATION
To determine the spread of the data.
 To determine the consistency of a variable.
 To determine the number of data values that fall
within a specified interval in a distribution
(Chebyshev’s Theorem).
 Used in inferential statistics.

EXAMPLE 3-21:
Find the variance and standard deviation for the
data set for Brand A paint. 10, 60, 50, 30, 40, 20
Months,

X-µ
10
60
50
30
40
20
2
X  


n
1750

6
 291.7
1750
2
MEASURES OF VARIATION:
VARIANCE & STANDARD DEVIATION
(SAMPLE COMPUTATIONAL MODEL)

Is mathematically equivalent to the theoretical
formula.

Saves time when calculating by hand

Does not use the mean

Is more accurate when the mean has been rounded.
EXAMPLE 3-23:
Find the variance and standard deviation for the
amount of European auto sales for a sample of 6
years. The data are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
X
11.2
11.9
12.0
12.8
13.4
14.3
75.6
X
2
958.94
n X    X 
2
s 
2
n  n  1
2
COEFFICIENT OF VARIATION
The coefficient of variation is the
standard deviation divided by the mean,
expressed as a percentage.
s
CVAR  100%
X
Use CVAR to compare standard deviations
when the units are different.
EXAMPLE 3-25:
The mean of the number of sales of cars over
a 3-month period is 87, and the standard
deviation is 5. The mean of the commissions
is $5225, and the standard deviation is $773.
Compare the variations of the two.
•If
the Cvar for the English final exam was 6.9%
and Cvar for History final exam was 4.9%.
Compare the variations:
a) The English class was more variable.
b) The History class was more variable.
c) Cannote determine that.
d) Both of classes has the same variation.
For any bell shaped distribution.
Approximately 68% of the data values will fall within one
standard deviation of the mean .
Approximately 95% of the data values will fall within two
standard deviation of the mean .
Approximately 99.7% of the data values will fall within
three standard deviation of the mean .

= 480 , S = 90 , approximately 68%
Then the data fall
between 570 and 390

= 480 , S = 90 , approximately 95%
Then the data fall
between 660 and 300

= 480 , S = 90 , approximately 99.7%
Then the data fall
between 750 and 210
When a distribution is bell-shaped, approximately what
percentage of data values will fall within 1standard deviation
of the mean?
a) 90%
b) 68%
c) 86%
d) 99%
--------------------------------------------------------------------------------------- Math exam scores have a bell-shaped distribution with a
mean of 100 and standard deviation of 15 . use the empirical
rule to find the percentage of students with scores between
70 and 130?

MEASURE OF POSITION
•
•
•
Standard score or z score
Quartile
Outlier
z score or standard score (relative position)
For Population
For Sample
If z score is negative  the score is below the mean
If z score is positive  the score is above the mean
Test A
Test B
X=38
X=94
= 40
= 100
S=5
S=10
Quartiles
Smallest
data
value
25%
Q1
25%
Q2
25%
Q3
25%
largest
data
value
25%
50%
75%
Quartiles divide the data set into 4 equal groups .
Deciles are denoted Q1 , Q2 and Q3 with the corresponding
percentiles being P25 , P50 and P75 .
The median is the same as P50 or Q2 .
PROCEDURE TABLE
Finding Data Values Corresponding to Q1,Q2and Q3
.
Step 1: Arrange the data in order from lowest to highest .
Step 2: Find the median of the data values .This is the value
for Q2 .
Step 3: Find the median of the data values that fall below Q2.This is the
value for Q1 .
Step 4: Find the median of the data values that fall above Q2.This is the
value for Q3.
Outliers
 An outlier is an extremely high or an extremely low
data value when compare with the rest of the data values .
Step 1: Arrange the data in order and find Q1 and Q3.
Step 2: Find the interquartile range IQR= Q3 - Q1
Step 3: Multiply the IQR by 1.5 .
Step 4: Subtract the value obtained in step 3 form Q1 and add the value
to Q3.
Step 5: any value: smaller than Q1-1.5(IQR) or larger than
Q3+1.5(IQR).  outliers
•Use
the data below to find the outliers:
8, 4, 0, 2, 10, 15, 12, 22
a)
15
b)
22
c)
No outlier
d)
0
---------------------------------------------------------------------------------------------------------•If
a)
b)
c)
d)
Q1=4, IQR=40, Find Q3:
10
44
40
-4
* All the values in a dataset are between 60 and 88, except for
one value of 97. That value 97 is likely to be -----•An outlier
•The mean
•The box plot
•The range
EXPLORATORY DATA ANALYSIS
(EDA)
 A Box plot can be used to graphically represent the data set .
 For the box plot, The five –Number Summary :
1-lowest value of the data set . (min)
2-Q1.
3-the median(MD) Q2.
4-Q3.
5-the highest value of the data set . (max)
PROCEDURE FOR CONSTRUCTING A BOXPLOT
1.
2.
3.
4.
Find five -Number summary .
Draw a horizontal axis with a scale such that it includes
the maximum and minimum data value .
Draw a box whose vertical sides go through Q1 and Q3,and
draw a vertical line though the median .
Draw a line from the minimum data value to the left side
of the box and line from the maximum data value to the
Q2
Q3
right side of the box. Q1
maximum
minimum
Real cheese
310
420
220
240
45
180
40
90
Cheese substitute
270
180
250
130
260
340
290
310
Compare the distributions using Box Plots?
Step1: Find Q1,MD,Q3 for the Real cheese data
40 , 45 , 90 , 180 , 220 , 240 , 310 , 420
Q1
MD
Q3
45  90
180  220
240  310
,
,
Q1 
 67.5
Q2 
 200
Q3 
 275
2
2
2
Step2:Find Q1 , MD and Q3 for the cheese substitute data.
130 , 180 , 250 , 260 , 270 , 290 , 310 , 340
Q1
MD
Q3
260  270
180  250
Q1 
 215 , Q2 
=265
2
2
290  310
 300
, Q3 
2
67.5
200
275
40
420
215
265 300
130
0
100
340
200
300
400
500
Right skewed
(+ve skewed)
Left skewed
(-ve skewed)
INFORMATION OBTAINED FROM A BOX PLOT
From Box plot:
The median(MD) is near the center. The distribution is symmetric.
The median falls to left of the center The distribution is positively
skewed (right skewed).
The median falls to right of the
center .
The distribution is negatively
skewed (left skewed).
From Line:
The lines are the same length.
The distribution is symmetric.
The right line is larger than the
left line .
The left line is larger than the
right line .
The distribution is positively
skewed (Right skewed). .
The distribution is negatively
skewed (Left skewed).
•The
2
3
4
5
6
0
4
2
2
1
following stem & leaf represent the scores for students on an statistics exam:
1 3 3
5 7 7 7 7 8
3 3 5 5
4 6
2
* Find the mode:
a) 37 b) 23 c) 40 d) 62
* The range:
a) 40 b) 42 c) 30 d)33
* The mean:
a) 41.19 b) 43 c) 42.5 d) 20
* The skewed in this graph:
a) Symmetric skewed
b) positive skewed c) negative skewed
•The
2
3
4
5
following stem & leaf represent the scores for students on an statistics exam:
0
1 4
0 1 2 2
2
* Find the mode:
a) 42 b) 23 c) 40 d) 62
* The range:
a) 40 b) 42 c) 30 d)32
* The mean:
a) 41.19 b) 43 c) 42.5 d) 37.75
* The skewed in this graph:
a) Symmetric skewed
b) positive skewed
c) negative skewed
0
2
4
* The IQR is ……..
a) 4 b) 5 c) 10 d)0
* The distribution shape is ……..
a) Positively skewed
b) Negative skewed
6
8
10
* A characteristic or measure obtained by using the data values from a population is
collide a ……….
a) Statistic
b) Quartile
c) Percent
d) Parameter
----------------------------------------------------------------------------------------------------------------------
* ……….. is the symbols of mean in the sample
--------------------------------------------------------------------------------------------------------------------
* Find the mean of following data 10, 15, 12, 9, 2, 6 ?
a) 12
b) 9
c) 54
d) 10.9
----------------------------------------------------------------* What is the median of following data 5, 7, 10, 3, 8?
* If the number of books in a sample of five boxes are follow
11 , 8, 2, 2, 7, 7, 2, 5 . Then the set is said to have
a) Multimodal
b) Unimodal
c) Zero
d) No mode
-------------------------------------------------------------------------------------------------
* Find the midrange (MR) for the following data 7, -5, 2, 10, 15
---------------------------------------------------------* If the mean of 5 values equals 64, then  X  ?
--------------------------------------------------------* The measures of central tendency for the following data 1,3,9,11,2 are:
Mean= …………… median =……………… mode= ……………..