Chapter 2
Methods for Describing Sets of Data
Objectives
Describe Data using Graphs
Describe Data using Charts
Describing Qualitative Data
• Qualitative data are nonnumeric in nature
• Best described by using Classes
• 2 descriptive measures
class frequency – number of data points in a class
class relative frequency = (class frequency) / (total number of data points in data set)
class percentage – class relative frequency × 100
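The three measures above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `summarize_classes` and the blood-type observations are hypothetical and not taken from the slides.

```python
from collections import Counter

def summarize_classes(observations):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(observations)
    counts = Counter(observations)  # class frequency for each class
    return {
        label: (freq, freq / n, 100 * freq / n)
        for label, freq in counts.items()
    }

# Hypothetical qualitative data, invented for illustration
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O", "A", "O"]
summary = summarize_classes(blood_types)

for label, (freq, rel, pct) in sorted(summary.items()):
    print(f"{label:>2}  freq={freq}  rel={rel:.2f}  pct={pct:.0f}%")
```

The relative frequencies always sum to 1, and the percentages to 100, which is a quick check on any summary table.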
Describing Qualitative Data –
Displaying Descriptive Measures
Summary Table
Class
Frequency
Class percentage – class relative frequency x 100
Describing Qualitative Data –
Qualitative Data Displays
Bar Graph
Describing Qualitative Data –
Qualitative Data Displays
Pie chart
Describing Qualitative Data –
Qualitative Data Displays
Pareto Diagram
Graphical Methods for Describing
Quantitative Data
The Data
Percentage of Revenues Spent on Research and Development
Company  Percentage    Company  Percentage    Company  Percentage    Company  Percentage
1        13.5          14       9.5           27       8.2           39       6.5
2        8.4           15       8.1           28       6.9           40       7.5
3        10.5          16       13.5          29       7.2           41       7.1
4        9.0           17       9.9           30       8.2           42       13.2
5        9.2           18       6.9           31       9.6           43       7.7
6        9.7           19       7.5           32       7.2           44       5.9
7        6.6           20       11.1          33       8.8           45       5.2
8        10.6          21       8.2           34       11.3          46       5.6
9        10.1          22       8.0           35       8.5           47       11.7
10       7.1           23       7.7           36       9.4           48       6.0
11       8.0           24       7.4           37       10.5          49       7.8
12       7.9           25       6.5           38       6.9           50       6.5
13       6.8           26       9.5
Graphical Methods for Describing
Quantitative Data
For describing, summarizing, and detecting
patterns in such data, we can use three
graphical methods:
• dot plots
• stem-and-leaf displays
• histograms
Graphical Methods for Describing
Quantitative Data
Dot Plot
Graphical Methods for Describing
Quantitative Data
Stem-and-Leaf Display
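As a rough sketch of how a stem-and-leaf display is built, the code below splits each value into a stem (the whole-number part) and a leaf (the tenths digit). It uses the percentages for companies 1 to 13 from the data table; the function name `stem_and_leaf` and the choice of stem and leaf units are assumptions made for this illustration.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group one-decimal values by stem (units) with leaves (tenths)."""
    rows = defaultdict(list)
    for x in sorted(data):
        tenths = round(x * 10)          # e.g. 13.5 -> 135
        stem, leaf = divmod(tenths, 10)  # 135 -> stem 13, leaf 5
        rows[stem].append(leaf)
    return dict(rows)

# R&D percentages for companies 1-13 from the data table
sample = [13.5, 8.4, 10.5, 9.0, 9.2, 9.7, 6.6, 10.6, 10.1, 7.1, 8.0, 7.9, 6.8]
display = stem_and_leaf(sample)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem:>2} | {leaves}")
```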
Graphical Methods for Describing
Quantitative Data
Histogram
Graphical Methods for Describing
Quantitative Data
More on Histograms
Number of Observations in Data Set    Number of Classes
Less than 25                          5-6
25-50                                 7-14
More than 50                          15-20
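The class counts behind a histogram can be sketched as follows for the 50 R&D percentages. With 50 observations the table above suggests 7 to 14 classes; 8 equal-width classes are used here as one arbitrary choice within that range. The function name `class_frequencies` and the equal-width rule are assumptions for illustration, not the method used to draw the slide's histogram.

```python
# Percentage of revenues spent on R&D, companies 1-50 (from the data table)
rd_percentages = [
    13.5, 8.4, 10.5, 9.0, 9.2, 9.7, 6.6, 10.6, 10.1, 7.1, 8.0, 7.9, 6.8,
    9.5, 8.1, 13.5, 9.9, 6.9, 7.5, 11.1, 8.2, 8.0, 7.7, 7.4, 6.5, 9.5,
    8.2, 6.9, 7.2, 8.2, 9.6, 7.2, 8.8, 11.3, 8.5, 9.4, 10.5, 6.9,
    6.5, 7.5, 7.1, 13.2, 7.7, 5.9, 5.2, 5.6, 11.7, 6.0, 7.8, 6.5,
]

def class_frequencies(data, num_classes):
    """Count observations in equal-width classes spanning min..max."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes
    counts = [0] * num_classes
    for x in data:
        # The largest value would index one past the end, so clamp it
        k = min(int((x - lo) / width), num_classes - 1)
        counts[k] += 1
    return lo, width, counts

lo, width, counts = class_frequencies(rd_percentages, 8)
for k, freq in enumerate(counts):
    start = lo + k * width
    print(f"{start:5.2f} to {start + width:5.2f}: {'*' * freq}")
```

Values that fall exactly on a class boundary may land in either neighboring class because of floating-point rounding, which is one reason textbooks often pick boundaries with an extra decimal place.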
Summation Notation
Used to simplify summation instructions
Each observation in a data set is identified
by a subscript
$x_1, x_2, x_3, x_4, x_5, \dots, x_n$
Notation used to sum the above numbers together is
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + \dots + x_n$
Summation Notation
Data set of 1, 2, 3, 4
Are these the same? $\sum_{i=1}^{4} x_i^2$ and $\left(\sum_{i=1}^{4} x_i\right)^2$
$\sum_{i=1}^{4} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 1 + 4 + 9 + 16 = 30$
$\left(\sum_{i=1}^{4} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4)^2 = (1 + 2 + 3 + 4)^2 = 10^2 = 100$
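The same comparison can be checked in Python. The variable names are chosen here for illustration.

```python
x = [1, 2, 3, 4]

# Sum of the squares: square each observation, then add
sum_of_squares = sum(xi ** 2 for xi in x)

# Square of the sum: add the observations, then square the total
square_of_sum = sum(x) ** 2

print(sum_of_squares, square_of_sum)  # 30 100
```

The order of squaring and summing matters, so the two expressions are not interchangeable.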
Numerical Measures of Central
Tendency
• Central Tendency – tendency of data to
center about certain numerical values
• 3 commonly used measures of Central
Tendency:
Mean
Median
Mode
Numerical Measures of Central
Tendency
The Mean
• Arithmetic average of the elements of the
data set
• Sample mean denoted by $\bar{x}$
• Population mean denoted by $\mu$
• Calculated as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Numerical Measures of Central
Tendency
The Median
• Middle number when observations are
arranged in order
• Median denoted by m
• Identified as the $\frac{n}{2} + 0.5$ observation if n is odd, and the mean of the $\frac{n}{2}$ and $\frac{n}{2} + 1$ observations if n is even
Numerical Measures of Central
Tendency
The Mode
• The most frequently occurring value in the
data set
• Data set can be multi-modal – have more
than one mode
• Data displayed in a histogram will have a
modal class – the class with the largest
frequency
Numerical Measures of Central
Tendency
The Data set
1 3 5 6 8 8 9 11 12
Mean $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 3 + 5 + 6 + 8 + 8 + 9 + 11 + 12}{9} = \frac{63}{9} = 7$
Median is the $\frac{n}{2} + 0.5$ or 5th observation, 8
Mode is 8
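These three values can be reproduced with Python's standard `statistics` module. The manual median lookup is a sketch of the odd-n rule from the median slide; the variable names are chosen for illustration.

```python
import statistics

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]
n = len(data)

mean = statistics.mean(data)        # sum of observations divided by n
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list, since data can be multi-modal

# Odd-n rule: the (n/2 + 0.5)th observation, counting from 1
position = int(n / 2 + 0.5)
median_by_rule = sorted(data)[position - 1]

print(mean, median, modes, position, median_by_rule)
```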
Numerical Measures of Variability
• Variability – the spread of the data across
possible values
• 3 commonly used measures of Variability:
Range
Variance
Standard Deviation
Numerical Measures of Variability
The Range
• Largest measurement minus the smallest
measurement
• Loses sensitivity when data sets are large
These 2 distributions
have the same range.
How much does the
range tell you about
the data variability?
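A small sketch of the point being made: the two data sets below are hypothetical, invented so that they share the same range while differing in spread. The standard deviation is used here only as a second measure for comparison.

```python
import statistics

def data_range(values):
    """Largest measurement minus the smallest measurement."""
    return max(values) - min(values)

# Hypothetical data sets with the same smallest and largest values
clustered = [1, 5, 5, 5, 5, 5, 9]
spread_out = [1, 2, 3, 5, 7, 8, 9]

print(data_range(clustered), data_range(spread_out))
print(statistics.stdev(clustered), statistics.stdev(spread_out))
```

Both ranges are 8, yet the second data set has the larger standard deviation, so the range alone says little about how the values between the extremes are distributed.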
Numerical Measures of Variability
The Sample Variance ($s^2$)
• The sum of the squared deviations from
the mean divided by (n-1). Expressed as
units squared
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Why square the deviations? The sum of
the deviations from the mean is zero
Numerical Measures of Variability
The Sample Standard Deviation (s)
• The positive square root of the sample
variance
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{s^2}$
• Expressed in the original units of
measurement
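As a worked sketch, the sample variance and standard deviation can be computed step by step for the data set 1, 3, 5, 6, 8, 8, 9, 11, 12 used in the central tendency example. The variable names are chosen for illustration.

```python
import math

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]
n = len(data)
mean = sum(data) / n  # 7.0

# Deviations from the mean always sum to zero,
# which is why they are squared before summing
deviations = [x - mean for x in data]

sample_variance = sum(d ** 2 for d in deviations) / (n - 1)  # units squared
sample_std = math.sqrt(sample_variance)                      # original units

print(sum(deviations), sample_variance, sample_std)
```

Here the squared deviations sum to 104, so the sample variance is 104 / 8 = 13 and the sample standard deviation is the square root of 13, about 3.61.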
Numerical Measures of Variability
Samples and Populations - Notation
                      Sample    Population
Variance              $s^2$     $\sigma^2$
Standard Deviation    $s$       $\sigma$
Numerical Measures of Relative
Standing
Descriptive measures of relationship of a
measurement to the rest of the data
Common measures:
• percentile ranking
• z-score
Numerical Measures of Relative
Standing
Percentile rankings make use of the pth
percentile
The median is an example of a percentile.
The median is the 50th percentile: 50% of
observations lie above it, and 50% lie below
it
For any p, the pth percentile has p% of the
measures lying below it, and (100-p)%
above it
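One simple way to turn this definition into a percentile ranking is to count the share of measurements that lie below a given value. This is a sketch under that assumption only; statistical software uses several different interpolation conventions for percentiles. The function name `percent_below` and the scores are hypothetical.

```python
def percent_below(data, value):
    """Percentage of measurements lying strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical test scores, invented for illustration
scores = [52, 58, 61, 67, 70, 74, 79, 83, 88, 95]
rank = percent_below(scores, 83)

print(rank)  # 70.0
```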
Numerical Measures of Relative
Standing
z-score – the distance between a
measurement x and the mean, expressed in
standard units
Use of standard units allows comparison
across data sets
Population: $z = \frac{x - \mu}{\sigma}$
Sample: $z = \frac{x - \bar{x}}{s}$
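A minimal sketch of the sample z-score, using the data set 1, 3, 5, 6, 8, 8, 9, 11, 12 from the central tendency example. The function name `z_score` is chosen for illustration.

```python
import statistics

def z_score(x, mean, std):
    """Distance between x and the mean, in standard deviation units."""
    return (x - mean) / std

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]
x_bar = statistics.mean(data)  # sample mean, 7
s = statistics.stdev(data)     # sample standard deviation, sqrt(13)

z = z_score(12, x_bar, s)
print(round(z, 2))  # 1.39
```

The largest observation, 12, lies about 1.39 standard deviations above the mean.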
Numerical Measures of Relative
Standing
More on z-scores
Z-scores follow the empirical rule for
mounded distributions
Methods for Detecting Outliers
Outlier – an observation that is unusually large or
small relative to the data values being described
Causes:
• Invalid measurement
• Misclassified measurement
• A rare (chance) event
2 detection methods:
• Box Plots
• z-scores
Methods for Detecting Outliers
Box Plots
• based on quartiles, values that divide the
dataset into 4 groups
• Lower Quartile QL – 25th percentile
• Middle Quartile - median
• Upper Quartile QU – 75th percentile
• Interquartile Range (IQR) = QU - QL
Methods for Detecting Outliers
Box Plots
[Figure: box plot with labels for a potential outlier, the whiskers, QU (hinge), the median, and QL (hinge)]
Not on plot – inner and outer fences, which
determine potential outliers
Methods for Detecting Outliers
Rules of thumb
• Box Plots
– measurements between inner and outer
fences are suspect
– measurements beyond outer fences are
highly suspect
• Z-scores
– Scores of 3 in mounded distributions (2 in
highly skewed distributions) are considered
outliers
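Both rules of thumb can be sketched in Python. The slides do not give the fence formulas, so this example assumes the common convention of inner fences at 1.5 × IQR and outer fences at 3 × IQR beyond the quartiles. The quartiles come from `statistics.quantiles`, whose default method may differ slightly from a textbook's hand calculation. The function names and the data set are hypothetical.

```python
import statistics

def box_plot_flags(data):
    """Return (suspect, highly_suspect) values using inner/outer fences."""
    q_l, _, q_u = statistics.quantiles(data, n=4)  # QL, median, QU
    iqr = q_u - q_l
    # Assumed convention: inner fences at 1.5*IQR, outer fences at 3*IQR
    inner_lo, inner_hi = q_l - 1.5 * iqr, q_u + 1.5 * iqr
    outer_lo, outer_hi = q_l - 3.0 * iqr, q_u + 3.0 * iqr
    suspect = [x for x in data
               if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    highly_suspect = [x for x in data if x < outer_lo or x > outer_hi]
    return suspect, highly_suspect

def z_score_outliers(data, cutoff=3.0):
    """Return values whose sample z-score exceeds cutoff in absolute value."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / s > cutoff]

# Hypothetical measurements with one unusually large value
data = [8, 9, 9, 10, 10, 10, 10, 11, 11, 12,
        8, 9, 9, 10, 10, 10, 11, 11, 12, 40]

suspect, highly_suspect = box_plot_flags(data)
z_outliers = z_score_outliers(data)
print(suspect, highly_suspect, z_outliers)
```

For highly skewed data, `z_score_outliers(data, cutoff=2.0)` applies the lower cutoff mentioned above. In very small samples a single extreme value inflates the standard deviation enough that its z-score may never reach 3, so the box plot rule can be the more sensitive of the two.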
Graphing Bivariate Relationships
Bivariate relationship – the relationship between
two quantitative variables
Graphically represented with the scattergram
The Time Series Plot
Time Series Data – data produced and monitored
over time
Graphically represented with the time series plot
Time on x axis
Order on x axis
Summary
• Graphical methods for Qualitative Data
– Pie chart
– Bar graph
– Pareto diagram
• Graphical methods for Quantitative Data
– Dot plot
– Stem-and-leaf display
– Histogram
Summary
• Numerical measures of central tendency
– Mean
– Median
– Mode
• Numerical measures of variation
– Range
– Variance
– Standard Deviation
Summary
• Measures of relative standing
– Percentile ranking
– z-scores
• Methods for detecting Outliers
– Box plots
– z-scores
• Method for graphing the relationship
between two quantitative variables
– Scattergram