CHS 221
VISUALIZING DATA
1
Week 3
Dr. Wajed Hatamleh
http://staff.ksu.edu.sa/whatamleh/en
VISUALIZING DATA
• Depict the nature or shape of the data distribution
•In a graph: Different graphs used for
different types of data
2
HISTOGRAM
• Another common graphical presentation of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
3
HISTOGRAMS
Histograms: Used for quantitative data
• Similar to a bar graph, with an X and Y axis, but adjacent values are on a continuum, so bars touch one another
• Data values on the X axis are arranged from lowest to highest
• Bars are drawn to a height that shows frequency or percentage (Y axis)

4
HISTOGRAMS (CONT’D)

Example of a histogram: Heart rate data
[Figure: histogram of heart rate in bpm (X axis, 55 to 74) against frequency f (Y axis, 0 to 12)]
5
Histogram
A bar graph in which the horizontal scale represents the classes of
data values and the vertical scale represents the frequencies.
Figure 2-1
6
Relative Frequency Histogram
Has the same shape and horizontal scale as a histogram, but the
vertical scale is marked with relative frequencies.
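As a rough illustration of how the two vertical scales relate, the sketch below groups values into class intervals and reports both the frequency and the relative frequency of each class. The heart rates, the class width of 5, and the starting value of 55 are all hypothetical choices made for this example, not the data behind the figures.

```python
from collections import Counter

# Hypothetical heart rates in bpm (illustrative only, not the slide's data).
heart_rates = [55, 58, 61, 62, 63, 64, 64, 66, 67, 68, 70, 72]
class_width = 5   # assumed class width
start = 55        # assumed lower limit of the first class

# Assign each value to a class index: 0 for 55-59, 1 for 60-64, and so on.
counts = Counter((hr - start) // class_width for hr in heart_rates)

n = len(heart_rates)
table = []
for k in sorted(counts):
    lower = start + k * class_width
    upper = lower + class_width - 1
    freq = counts[k]
    # Relative frequency is the class frequency divided by the total count.
    table.append((f"{lower}-{upper}", freq, freq / n))

for label, freq, rel in table:
    print(f"{label}: frequency = {freq}, relative frequency = {rel:.3f}")
```

Both columns describe the same bars; only the labelling of the vertical axis differs, which is why the two histograms have the same shape.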
7
Figure 2-2
Histogram and Relative Frequency Histogram
Figure 2-1 and Figure 2-2
Ogive

An ogive is a graph of a cumulative distribution.

The data values are shown on the horizontal axis.

Shown on the vertical axis are the:
• cumulative frequencies, or
• cumulative relative frequencies, or
• cumulative percent frequencies

The cumulative frequency (one of the above) of each class is
plotted as a point.

The plotted points are connected by straight lines.
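The points of an ogive can be sketched as running totals of the class frequencies. The class upper limits and frequencies below are hypothetical, and plotting each point at the upper class limit is an assumption of this sketch.

```python
from itertools import accumulate

# Hypothetical class upper limits and frequencies (illustrative only).
upper_limits = [59, 64, 69, 74]
frequencies = [2, 5, 3, 2]

# Running totals give the cumulative frequencies.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]

# The same totals expressed as cumulative percent frequencies.
cumulative_percent = [100 * c / n for c in cumulative]

# Each ogive point pairs a data value (X axis) with a cumulative frequency (Y axis).
ogive_points = list(zip(upper_limits, cumulative))

for (x, c), pct in zip(ogive_points, cumulative_percent):
    print(f"up to {x}: cumulative frequency = {c} ({pct:.1f}%)")
```

The last point always reaches the total number of values, or 100% on the percent scale.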
9
Ogive
A line graph that depicts cumulative frequencies
Figure 2-4
10
BAR GRAPHS
Bar graphs: Used for qualitative data.
• Bar graphs have a horizontal dimension (X axis) that specifies categories (i.e., data values)
• The vertical dimension (Y axis) specifies either frequencies or percentages
• The bar for each category is drawn to the height that indicates the frequency or %

11
BAR GRAPHS
Example of a bar graph
• Note that the bars do not touch each other

PIE CHART
Pie Charts: Also used for qualitative data.
• Circle is divided into pie-shaped wedges corresponding to percentages for a given category or data value
• All pieces add up to 100%
• Place wedges in order, with biggest wedge starting at “12 o’clock”
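The wedge sizes can be worked out by converting category counts to percentages and sorting them from largest to smallest. The marital-status counts below are hypothetical and are used only to show the arithmetic.

```python
# Hypothetical marital-status counts (illustrative only, not the slide's data).
counts = {"Married": 24, "Single": 12, "Divorced": 8, "Widowed": 4}

total = sum(counts.values())

# Convert each count to a percentage of the whole and sort largest first,
# so the biggest wedge can be placed starting at "12 o'clock".
wedges = sorted(
    ((category, 100 * count / total) for category, count in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

for category, percent in wedges:
    print(f"{category}: {percent:.1f}%")
```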

13
PIE CHART

Example of a pie chart, for the same marital status data
Recap
In this Section we have discussed graphs
that are pictures of distributions.
Keep in mind that the object of this section is
not just to construct graphs, but to learn
something about the data sets – that is, to
understand the nature of their distributions.
15
CHARACTERISTICS OF A DATA
DISTRIBUTION
• Central tendency
• Variability

Both central tendency and variability can be expressed by indexes that are descriptive statistics
16
CENTRAL TENDENCY



Indexes of central tendency provide a single number
to characterize a distribution
Measures of central tendency come from the center
of the distribution of data values, indicating what is
“typical,” and where data values tend to cluster
Popularly called an “average”
17
CENTRAL TENDENCY INDEXES

Three alternative indexes:
• The mode
• The median
• The mean

18
THE MODE
• The mode is the score value with the highest frequency; the most “popular” score
Age: 26 27 27 28 29 30 31
• Mode = 27
[Figure: The mode. Histogram of AGE (26.0 to 31.0), N = 7]
19
THE MODE: ADVANTAGES

Can be used with data measured on any
measurement level (including nominal level)

Easy to “compute”

Reflects an actual value in the distribution, so it is
easy to understand

Useful when there are 2+ “popular” scores (i.e., in
multimodal distributions)
20
Mode
• Denoted by M
• The only measure of central tendency that can be used with qualitative data
A data set may be:
• Bimodal
• Multimodal
• No Mode
21
Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10
Mode is 1.10
b. 27 27 27 55 55 55 88 88 99
Bimodal - 27 & 55
c. 1 2 3 6 7 8 9 10
No Mode
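Examples a, b, and c can be checked with a small counting routine. This is a minimal sketch: the function name `modes` is made up for this example, and returning an empty list when every value occurs equally often is one way to encode the "No Mode" convention used on these slides.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value occurs equally often, no value stands out,
    # so the data set is treated as having no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]))   # example a: one mode
print(modes([27, 27, 27, 55, 55, 55, 88, 88, 99]))   # example b: bimodal
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))              # example c: no mode
```

Note that Python's built-in `statistics.multimode` would return every value for example c, so it does not follow the "No Mode" convention without an extra check like the one above.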
22
THE MODE: DISADVANTAGES



Ignores most information in the distribution
Tends to be unstable (i.e., value varies a lot from one
sample to the next)
Some distributions may not have a mode (e.g., 10, 10,
11, 11, 12, 12)
23
THE MEDIAN
• The median is the score that divides the distribution into two equal halves
• 50% are below the median, 50% above
Age: 26 27 27 28 29 30 31
• Median (Mdn) = 28
[Figure: The median. Histogram of AGE (26.0 to 31.0), N = 7]
24
Example 1 (even number of values):
5.40 1.10 0.42 0.73 0.48 1.10
In order: 0.42 0.48 0.73 1.10 1.10 5.40
No exact middle; the middle is shared by two numbers.
MEDIAN is (0.73 + 1.10) ÷ 2 = 0.915

Example 2 (odd number of values):
5.40 1.10 0.42 0.73 0.48 1.10 0.66
In order: 0.42 0.48 0.66 0.73 1.10 1.10 5.40
Exact middle value.
MEDIAN is 0.73
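The two cases above (even and odd number of values) follow the same recipe: sort the values, then take the middle one or the average of the two middle ones. A minimal sketch, with variable names chosen for this example:

```python
def median(values):
    """Return the median: the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: there is an exact middle.
        return ordered[middle]
    # Even number of values: average the two values that share the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2

even_case = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10]
odd_case = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10, 0.66]

print(f"even number of values: median = {median(even_case):.3f}")
print(f"odd number of values:  median = {median(odd_case):.2f}")
```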
25
THE MEDIAN: ADVANTAGES



Not influenced by outliers
Particularly good index of what is “typical” when
distribution is skewed
Easy to “compute”
26
THE MEDIAN: DISADVANTAGES


Does not take actual data values into account—only
an index of position
Value of median not necessarily an actual data
value, so it is more difficult to understand than
mode
27
THE MEAN
• The mean is the arithmetic average
• Data values are summed and divided by N
Age: 26 27 27 28 29 30 31
• Mean = 28.3
[Figure: The mean. Histogram of AGE (26.0 to 31.0), N = 7]
28
THE MEAN (CONT’D)

Most frequently used measure of central tendency

Equation:
M = ΣX ÷ N

Where:
M = sample mean
Σ = the sum of
X = actual data values
N = number of people
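Applied to the Age data from the earlier slide, the equation can be traced step by step. The variable names are chosen for this sketch only.

```python
ages = [26, 27, 27, 28, 29, 30, 31]

total = sum(ages)   # ΣX: the sum of the actual data values
n = len(ages)       # N: the number of people
mean = total / n    # M = ΣX ÷ N

print(f"ΣX = {total}, N = {n}, M = {mean:.1f}")
```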
29
THE MEAN: ADVANTAGES

The balance point in the distribution:

Sum of deviations above the mean always exactly
balances those below it

Does not ignore any information

The most stable index of central tendency

Many inferential statistics are based on the mean
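The balance-point property can be checked on the Age data from the earlier slide. The sketch below uses exact fractions so that rounding does not blur the result; that choice, and the variable names, are assumptions of this example.

```python
from fractions import Fraction

ages = [26, 27, 27, 28, 29, 30, 31]
mean = Fraction(sum(ages), len(ages))   # exact mean, 198/7

# Deviation of each value from the mean.
deviations = [x - mean for x in ages]

above = sum(d for d in deviations if d > 0)   # total deviation above the mean
below = sum(d for d in deviations if d < 0)   # total deviation below the mean

print(f"above the mean: {above}, below the mean: {below}")
print(f"sum of all deviations: {sum(deviations)}")
```

The two totals have the same size and opposite signs, so the deviations sum to zero.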
30
THE MEAN: DISADVANTAGES



Sensitive to outliers
Gives a distorted view of what is “typical” when
data are skewed
Value of mean is often not an actual data value
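To see the sensitivity to outliers, the sketch below compares the mean and the median of the Age data before and after one value is replaced by an extreme one. The value 91 is a hypothetical outlier introduced only for this illustration.

```python
from statistics import mean, median

ages = [26, 27, 27, 28, 29, 30, 31]
# Hypothetical: the last value (31) is replaced by an outlier (91).
with_outlier = [26, 27, 27, 28, 29, 30, 91]

print(f"original:     mean = {mean(ages):.1f}, median = {median(ages)}")
print(f"with outlier: mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The mean is pulled toward the outlier while the median stays where it was.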
31
THE MEAN: SYMBOLS

Sample means:
• In reports, usually symbolized as M
• In statistical formulas, usually symbolized as $\bar{x}$ (pronounced “x bar”)

Population means:
• The Greek letter μ (mu)
32
Notation
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values:
$$\bar{x} = \frac{\sum x}{n}$$
µ is pronounced ‘mu’ and denotes the mean of all values in a population:
$$\mu = \frac{\sum x}{N}$$
33
Best Measure of Center
34
Definitions

Symmetric
Data is symmetric if the left half of its histogram is roughly a mirror image of its right half.

Skewed
Data is skewed if it is not symmetric and if it extends more to one side than the other.
35
Skewness
Figure 2-11
36
Recap
In this section we have discussed:
 Types of Measures of Center
Mean
Median
Mode
 Mean from a frequency distribution
 Best Measures of Center
 Skewness
37
MEASURES OF VARIATION
Because this section introduces the concept
of variation, this is one of the most important
sections in the entire book
38
DEFINITION
The range of a set of data is the difference between the highest value and the lowest value:
range = (highest value) − (lowest value)
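For the Age data used on the earlier slides, the range works out as follows (variable names are chosen for this sketch):

```python
ages = [26, 27, 27, 28, 29, 30, 31]

highest = max(ages)
lowest = min(ages)
data_range = highest - lowest   # range = highest value - lowest value

print(f"range = {highest} - {lowest} = {data_range}")
```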
39
DEFINITION
The standard deviation of a set of
sample values is a measure of
variation of values about the mean
40
SAMPLE STANDARD
DEVIATION FORMULA
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
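The sample standard deviation formula can be traced term by term on the Age data from the earlier slides. This is a minimal sketch; the variable names are chosen for this example.

```python
import math

ages = [26, 27, 27, 28, 29, 30, 31]

n = len(ages)
mean = sum(ages) / n                                  # x-bar

# Squared deviation of each value from the mean: (x - x-bar)^2
squared_deviations = [(x - mean) ** 2 for x in ages]

# Divide the sum of squared deviations by n - 1, then take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

print(f"s = {s:.2f}")
```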
41
SAMPLE STANDARD DEVIATION
(SHORTCUT FORMULA)
$$s = \sqrt{\frac{n\left(\sum x^2\right) - \left(\sum x\right)^2}{n(n - 1)}}$$
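The shortcut formula needs only the sum of the values and the sum of their squares. The sketch below applies it to the Age data from the earlier slides and compares the result with Python's `statistics.stdev`, which also computes the sample standard deviation; the variable names are chosen for this example.

```python
import math
from statistics import stdev

ages = [26, 27, 27, 28, 29, 30, 31]

n = len(ages)
sum_x = sum(ages)                    # Σx
sum_x2 = sum(x * x for x in ages)    # Σx²

# Shortcut formula: s = sqrt( (n·Σx² − (Σx)²) / (n·(n − 1)) )
s_shortcut = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

# Library value for comparison (sample standard deviation, n − 1 divisor).
s_library = stdev(ages)

print(f"Σx = {sum_x}, Σx² = {sum_x2}")
print(f"shortcut: {s_shortcut:.4f}, library: {s_library:.4f}")
```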
42
Standard Deviation Key Points
• The standard deviation is a measure of variation of all values from the mean
• The value of the standard deviation s is usually positive
• The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others)
• The units of the standard deviation s are the same as the units of the original data values
43
Definition
Empirical (68-95-99.7) Rule For data sets
having a distribution that is approximately bell
shaped, the following properties apply:
• About 68% of all values fall within 1 standard deviation of the mean
• About 95% of all values fall within 2 standard deviations of the mean
• About 99.7% of all values fall within 3 standard deviations of the mean
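The three percentages can be checked against an exactly bell-shaped (normal) distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; treating "approximately bell shaped" as exactly normal is an assumption of this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of values within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} standard deviation(s): {100 * proportion:.1f}%")
```

Real data sets only approximate these proportions, which is why the rule is stated with "about".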
44
The Empirical Rule
45
FIGURE 2-13
ARE YOU READY?
Post-test Time
48
Which measure of center is the only one that can be used with data at the categorical level of measurement?
A. Mean
B. Median
C. Mode

Which of the following measures of center is not affected by outliers?
A. Mean
B. Median
C. Mode

Find the mode(s) for the given sample data.
79, 25, 79, 13, 25, 29, 56, 79
A. 79
B. 48.1
C. 42.5
D. 25

Which is not true about the variance?
A. It is the square of the standard deviation.
B. It is a measure of the spread of data.
C. The units of the variance are different from the units of the original data set.
D. It is not affected by outliers.
EXERCISE TIME
57
EXERCISE 1
1. The following 10 data values are diastolic blood pressure readings. Compute the mean, the range, and the SD for these data.
130 110 160 120 170
120 150 140 160 140
58
EXERCISE 2
The following are the fasting blood glucose levels of 10 children:
1. 56
2. 62
3. 63
4. 65
5. 65
6. 56
7. 65
8. 68
9. 70
10. 72
Compute the: a. range b. standard deviation

59
EXERCISE 3
3. The fifteen patients making initial visits to a rural health department travelled these distances. Find: a. Range, b. Standard Deviation
Patient    Distance (Miles)
1          5
2          9
3          11
4          3
5          12
6          13
7          12
8          6
9          13
10         7
11         3
12         15
13         12
14         15
15         5
60
ANSWER
1. Range = 60 ; SD = 20
2. Range = 16 ; SD = 4.4
3. Range = 12 ; SD = 4.2

61