Exploring Data
Descriptive Data
1
Content
• Types of Variables
• Describing data using graphical summaries
• Describing the Centre of Quantitative Data
• Describing the Spread of Quantitative Data
• How Measures of Position Describe Spread
• How can Graphical Summaries be Misused
2
Variable
• A variable is any characteristic
that is recorded for the subjects
in a study
• Examples: Marital status, Height,
Weight, IQ
• A variable can be classified as
either
– Categorical or
– Quantitative
• Discrete or
• Continuous
3
Categorical Variable
• A variable is categorical if each observation
belongs to one of a set of categories.
• Examples:
1. Gender (Male or Female)
2. Religion (Catholic, Jewish, …)
3. Type of residence (Apartment, House, …)
4. Belief in life after death (Yes or No)
4
Quantitative Variable
• A variable is called quantitative if observations
take numerical values for different magnitudes
of the variable.
• Examples:
1. Age
2. Number of siblings
3. Annual Income
5
Quantitative vs. Categorical
• For Quantitative variables, key features are the
center (a representative value) and spread
(variability).
• For Categorical variables, a key feature is the
percentage of observations in each of the
categories.
6
Types of Data
• Basically, there are two types of data:
– qualitative and quantitative.
• Qualitative data cannot be measured numerically.
• Quantitative data can be measured numerically.
• Most statistical analysis is based on quantitative data
using appropriate measurement of their variables.
• Quantitative variables are also classified into two
types:
– discrete and continuous.
7
Types of Data
• A discrete variable can take only certain distinct or
isolated values in a given range,
– for example, number of siblings 0, 1, 2, …, 10.
• A continuous variable can take any value in a given
range,
– for example, age from 0 years to 100 years.
• To take another example,
– if one would like to know what factors are associated with
a sales representative’s performance,
– a number of measures might be used to indicate success.
8
Discrete Quantitative Variable
• A quantitative variable is discrete if
its possible values form a set of
separate numbers: 0,1,2,3,….
• Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign languages
spoken by an individual
9
Continuous Quantitative Variable
• A quantitative variable
is continuous if its
possible values form
an interval
• Measurements
• Examples:
1. Height/Weight
2. Age
3. Blood pressure
10
Measuring Using Data
• Measures of a salesperson’s success
– Dollar or unit sales volume, or share of accounts lost, could be used.
• For ease of understanding, quantitative variables are usually
measured on various scales.
• A scale may be defined as a measuring tool for appropriate
quantification of variables.
• In other words, a scale is a continuous spectrum or series of
categories.
• As in other fields of research, four types of scales are used in
business research.
– These include nominal, ordinal, interval and ratio scales.
11
Nominal Scale
• A nominal scale is the simplest type of scale. The
numbers or letters assigned to objects serve as labels
for identification or classification.
• For example, names and gender are categorical
variables;
– and one can use the label ‘M’ for male and ‘F’ for female,
– or ‘1’ for male and ‘2’ for female,
– or ‘1’ for female and ‘2’ for male.
• Other examples include marital status, religion, race,
colour and employment status, and so forth.
12
Ordinal Scale
• When a nominal scale follows an order then it becomes an
ordinal scale.
• In other words, an ordinal scale arranges objects or
categorical variables according to an ordered relationship.
• So, ranking of nominal scales is an essential prior criterion for
ordinal scales.
• A typical ordinal scale in business research asks respondents
to rate career opportunities and company brands as
‘excellent’, ‘good’, ‘fair’ or ‘poor’.
• Other examples would be:
– (i) result of examination: first, second, third classes and fail;
– (ii) quality of products; and
– (iii) social class.
13
Interval Scale
• The interval scale indicates the distance or difference in units
between two events.
• In other words, such scales not only indicate order, they also
measure the order or distance in units of equal intervals.
• It is important to note that the location of the zero point is
arbitrary.
• For example, in a price index, the value for the base year
is usually set to 100.
• Another classic example of an interval scale is temperature in
degrees Celsius or Fahrenheit, where the zero point is arbitrary.
14
Ratio Scale
• Ratio scales have absolute rather than relative
quantities.
• In other words, if an interval scale has an absolute
zero then it can be classified as a ratio scale.
• The absolute zero represents a point on the scale
where there is an absence of the given attribute.
• For example,
– age, money and weights are ratio scales
– because they possess an absolute zero and interval
properties.
15
Proportion & Percentage (Rel. Freq.)
Proportions and percentages are also
called relative frequencies.
16
Frequency Table
• A frequency table is a listing of possible values for
a variable
– together with the number of observations
– or relative frequencies for each value.
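A frequency table of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the list of responses are hypothetical and do not come from the slides.

```python
from collections import Counter

def frequency_table(observations):
    """Return {value: (count, proportion, percentage)} for each distinct value."""
    counts = Counter(observations)
    n = len(observations)
    return {
        value: (count, count / n, 100 * count / n)
        for value, count in counts.items()
    }

# Hypothetical answers to "Belief in life after death" (not real survey data).
responses = ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"]
table = frequency_table(responses)

for value, (count, proportion, percent) in table.items():
    print(f"{value:>3}  count={count}  proportion={proportion:.2f}  percent={percent:.0f}%")
```

The proportions always add up to 1 and the percentages to 100, which is a quick check on any relative frequency table.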
17
Describing data using graphical
summaries
18
Graphs for Categorical Variables
• Use pie charts and bar
graphs to summarize
categorical variables
1. Pie Chart: A circle
having a “slice of pie”
for each category
2. Bar Graph: A graph
that displays a vertical
bar for each category
19
Pie Charts
• Summarize
categorical
variable
• Drawn as circle
where each
category is a slice
• The size of each
slice is
proportional to
the percentage in
that category
20
Bar Graphs
• Summarizes categorical
variable
• Vertical bars for each category
• Height of each bar represents
either counts or percentages
• Easier to compare categories
with bar graph than with pie
chart
• Called Pareto Charts when
ordered from tallest to
shortest
21
Graphs for Quantitative Data
1. Dot Plot: shows a
dot for each
observation placed
above its value on a
number line
2. Stem-and-Leaf Plot:
portrays the
individual
observations
3. Histogram: uses bars
to portray the data
22
Which Graph?
• Dot-plot and stem-and-leaf
plot:
– More useful for small data sets
– Data values are retained
• Histogram
– More useful for large data sets
– Most compact display
– More flexibility in defining
intervals
23
Dot Plots
To construct a dot plot
1. Draw and label horizontal line
2. Mark regular values
3. Place a dot above each value
on the number line
[Figure: Sodium in Cereals]
24
Histograms
• Graph that uses bars to
portray frequencies or relative
frequencies of possible
outcomes for a quantitative
variable
26
Constructing a Histogram
1. Divide the range of the data into intervals of equal width
2. Count # of observations in each interval
[Figure: Sodium in Cereals]
27
Constructing a Histogram
3. Label endpoints
of intervals on
horizontal axis
4. Draw a bar over
each value or
interval with
height equal to
its frequency (or
percentage)
5. Label and title
[Figure: Sodium in Cereals]
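The counting part of the construction (steps 1 and 2) can be sketched as follows. The sodium values, the starting point and the interval width are all hypothetical, chosen only to make the example run; they are not the data behind the slide's figure.

```python
def histogram_counts(values, start, width, num_intervals):
    """Count observations in equal-width intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * num_intervals
    for v in values:
        index = int((v - start) // width)
        # Values outside the chosen intervals are ignored in this sketch.
        if 0 <= index < num_intervals:
            counts[index] += 1
    return counts

# Hypothetical sodium contents in mg (illustrative only).
sodium = [0, 70, 125, 130, 140, 150, 170, 180, 200, 200, 210, 220, 290, 340]
start, width, num_intervals = 0, 100, 4
counts = histogram_counts(sodium, start, width, num_intervals)

# Text version of the histogram: one row per interval, one '#' per observation.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:>3} to <{low + width:<3} {'#' * count}")
```

Each observation lands in exactly one interval because the left endpoint is included and the right endpoint is excluded.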
28
Interpreting Histograms
• Assess where a
distribution is
centered by
finding the median
• Assess the spread
of a distribution
• Shape of a
distribution:
roughly symmetric
(left and right sides
are mirror images),
skewed to the
right, or skewed to
the left
29
Examples of Skewness
30
Shape and Skewness
• Consider a data set
containing IQ scores
for the general public.
What shape?
a. Symmetric
b. Skewed to the left
c. Skewed to the right
d. Bimodal
31
Shape and Skewness
• Consider a data set of
the scores of students
on an easy exam in
which most score very
well but a few score
poorly. What shape?
a. Symmetric
b. Skewed to the left
c. Skewed to the right
d. Bimodal
32
Shape: Type of Mound
33
Outlier
An outlier falls far from the rest of the data
34
Time Plots
• Display a time series,
data collected over
time
• Plots observation on
the vertical against
time on the
horizontal
• Points are usually
connected
• Common patterns
should be noted
[Figure: time plot, 1995 to 2001, of the number of people
worldwide who use the Internet]
35
Describing the Centre of
Quantitative Data
36
Mean
• The mean is the sum
of the observations
divided by the
number of
observations
• It is the center of
mass
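As a small sketch, the mean can be computed directly from its definition. The nine values are the ones listed on the Median slide; reusing them here is a choice made for illustration. The "center of mass" idea shows up as the deviations from the mean cancelling out.

```python
def mean(observations):
    """Sum of the observations divided by the number of observations."""
    return sum(observations) / len(observations)

# The nine ordered values from the Median slide, reused for illustration.
data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
x_bar = mean(data)

# Balance-point check: deviations above and below the mean cancel.
deviations = [x - x_bar for x in data]
print(round(x_bar, 2))
print(abs(sum(deviations)) < 1e-9)
```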
37
Median
Order  1   2   3   4   5   6    7    8    9
Data   78  91  94  98  99  101  103  105  114

Order  1   2   3   4   5   6    7    8    9    10
Data   78  91  94  98  99  101  103  105  114  121
• Midpoint of the observations
when ordered from least to
greatest
1. Order observations
2. If the number of observations
is:
a) Odd, the median is the
middle observation
b) Even, the median is the
average of the two middle
observations
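The two cases can be sketched in Python using the nine-value and ten-value data sets from this slide. The function below is a minimal illustration of the rule, not a replacement for a statistics library.

```python
def median(observations):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(observations)           # 1. order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                           # 2a. odd: the middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # 2b. even: average of the two middle

nine = [78, 91, 94, 98, 99, 101, 103, 105, 114]
ten = nine + [121]

print(median(nine))  # 99
print(median(ten))   # 100.0
```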
38
Comparing the Mean and Median
• Mean and median of a symmetric distribution are close
– Mean is often preferred because it uses all the observations
• In a skewed distribution, the mean is farther out in the
skewed tail than is the median
– Median is preferred because it better
represents a typical observation
39
Resistant Measures
• A measure is resistant if
extreme observations
(outliers) have little, if any,
influence on its value
– Median is resistant to
outliers
– Mean is not resistant to
outliers
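A quick way to see resistance is to change one value into an extreme one and compare both measures. The sketch below reuses the nine values from the Median slide and assumes a hypothetical data-entry error in which 114 is recorded as 1140.

```python
from statistics import mean, median

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
# Hypothetical data-entry error: the largest value 114 recorded as 1140.
with_outlier = data[:-1] + [1140]

print(median(data), median(with_outlier))                  # the median does not move
print(round(mean(data), 1), round(mean(with_outlier), 1))  # the mean is pulled toward the outlier
```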
40
Mode
• Value that occurs most often
• Highest bar in the histogram
• Mode is most often used with categorical data
41
Describing the Spread of
Quantitative Data
42
Range
Range = max - min
The range is strongly affected by outliers.
43
Standard Deviation
• Each data value has an associated deviation
from the mean, $x - \bar{x}$
• A deviation is positive if it falls above the
mean and negative if it falls below the mean
• The sum of the deviations is always zero

44
Standard Deviation
• Standard deviation gives a measure of variation by
summarizing the deviations of each observation from the
mean and calculating an adjusted average of these
deviations:
1. Find mean
2. Find each deviation
3. Square deviations
4. Sum squared deviations
5. Divide sum by n-1
6. Take square root
45
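The six steps for the sample standard deviation can be followed line by line in a short sketch. The data are the nine values from the Median slide, reused here as an assumption for illustration; the metabolic-rate data are not reproduced in this transcript.

```python
import math

def sample_standard_deviation(observations):
    """Sample standard deviation s, computed with the six steps listed on the slide."""
    n = len(observations)
    x_bar = sum(observations) / n                    # 1. find the mean
    deviations = [x - x_bar for x in observations]   # 2. find each deviation
    squared = [d ** 2 for d in deviations]           # 3. square the deviations
    total = sum(squared)                             # 4. sum the squared deviations
    variance = total / (n - 1)                       # 5. divide the sum by n - 1
    return math.sqrt(variance)                       # 6. take the square root

# Nine values from the Median slide, reused for illustration.
data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
s = sample_standard_deviation(data)
print(round(s, 2))
```

Python's standard library function `statistics.stdev` uses the same n - 1 divisor, so it can serve as a cross-check.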
Standard Deviation
Metabolic rates of 7 men (calories/24 hours)
46
Properties of Sample Standard Deviation
1. Measures spread of data
2. Only zero when all observations are same; otherwise, s > 0
3. As the spread increases, s gets larger
4. Same units as observations
5. Not resistant
6. Strong skewness or outliers greatly increase s
47
Empirical Rule: Magnitude of s
48
How Measures of Position
Describe Spread
49
Percentile
The pth percentile is a value
such that p percent of the
observations fall below or at
that value
50
Finding Quartiles
• Splits the data into four parts
1. Arrange data in order
2. The median is the second
quartile, Q2
3. Q1 is the median of the lower
half of the observations
4. Q3 is the median of the upper
half of the observations
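The procedure can be sketched as below. One detail the slide leaves open is what to do with the median itself when the number of observations is odd; this sketch assumes it is left out of both halves, which is a common textbook convention. Statistical software may use other rules and give slightly different quartiles.

```python
def median(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(observations):
    """Return (Q1, Q2, Q3); assumes at least two observations."""
    ordered = sorted(observations)               # 1. arrange data in order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the median is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Nine values from the Median slide, reused for illustration.
data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```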
51
Measure of Spread: Quartiles
• Quartiles divide a ranked
data set into four equal parts:
1. 25% of the data at or below Q1 and 75% above
2. 50% of the obs are above the median and 50% are below
3. 75% of the data at or below Q3 and 25% above
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
52
Calculating Interquartile Range
• The interquartile range is the distance between
the third and first quartiles, giving the spread of the middle
50% of the data: IQR = Q3 - Q1
53
Criteria for Identifying an Outlier
• An observation is a potential outlier if it falls
more than 1.5 x IQR below the first quartile or more
than 1.5 x IQR above the third quartile.
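The 1.5 x IQR criterion can be written as a pair of cut-off values, sometimes called fences. The sketch below plugs in the quartiles quoted on the quartiles slide (Q1 = 2.2, Q3 = 4.35); the function names and the two test observations are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper cut-offs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_potential_outlier(x, q1, q3):
    """True if x lies more than 1.5*IQR below Q1 or above Q3."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper

q1, q3 = 2.2, 4.35            # quartiles quoted on the quartiles slide
lower, upper = outlier_fences(q1, q3)
print(round(q3 - q1, 3), round(lower, 3), round(upper, 3))

# Hypothetical observations, used only to exercise the rule.
print(is_potential_outlier(8.0, q1, q3))
print(is_potential_outlier(6.1, q1, q3))
```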
54
5 Number Summary
• The five-number
summary of a
dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
55
Boxplot
1. Box goes from the Q1 to Q3
2. Line is drawn inside the box at
the median
3. Line goes from lower end of
box to smallest observation
not a potential outlier and
from upper end of box to
largest observation not a
potential outlier
4. Potential outliers are shown
separately, often with * or +
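The numbers behind a boxplot (five-number summary, whisker ends and potential outliers) can be sketched as follows. The quartile rule assumed here is the median-of-halves method with the median excluded for odd n; plotting libraries often use other quartile rules. The data are the nine values from the Median slide plus one hypothetical extreme value, 150.

```python
def median(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_stats(observations):
    """Five-number summary, whisker ends and potential outliers (1.5 x IQR rule)."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations that are not potential outliers.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "five_number_summary": (ordered[0], q1, q2, q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "potential_outliers": outliers,
    }

# Nine values from the Median slide plus a hypothetical extreme value of 150.
data = [78, 91, 94, 98, 99, 101, 103, 105, 114, 150]
stats = boxplot_stats(data)
for name, value in stats.items():
    print(name, value)
```

Note that the maximum in the five-number summary is 150, while the upper whisker stops at 114 because 150 is flagged as a potential outlier.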
56
Comparing Distributions
• Boxplots do not display the shape of the distribution as
clearly as histograms, but are useful for making
graphical comparisons of two or more distributions
57
Z-Score
• An observation from a bell-shaped distribution is a
potential outlier if its z-score < -3 or > +3
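The slide states the cut-off without the formula. The standard definition is z = (observation - mean) / standard deviation, that is, the number of standard deviations an observation lies from the mean. A minimal sketch, with a hypothetical mean of 100 and standard deviation of 15 chosen only for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def is_potential_outlier(x, mean, sd):
    """Rule from the slide for bell-shaped distributions: z < -3 or z > +3."""
    z = z_score(x, mean, sd)
    return z < -3 or z > 3

# Hypothetical bell-shaped distribution (mean and sd are assumptions, not from the slides).
mean, sd = 100, 15
print(z_score(130, mean, sd), is_potential_outlier(130, mean, sd))
print(round(z_score(150, mean, sd), 2), is_potential_outlier(150, mean, sd))
```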
58
How can Graphical Summaries
be Misused
59
Misleading Data Displays
60
Guidelines for Constructing Effective Graphs
1. Label axes and give
proper headings
2. Vertical axis should
start at zero
3. Use bars, lines, or
points
4. Consider using
separate graphs or
ratios when variable
values differ
61