Graphical Descriptive Techniques
2.1 Introduction
Descriptive statistics involves the arrangement, summary, and presentation of data, to enable meaningful interpretation and to support decision making.
Descriptive statistics methods make use of
• graphical techniques
• numerical descriptive measures.
The methods presented apply to both
• the entire population
• a sample drawn from the population
2.2 Types of data and information
A variable is a characteristic of a population or sample that is of interest to us.
• Cereal choice
• Capital expenditure
• The waiting time for medical services
Data are the actual values of variables.
• Interval data are numerical observations
• Nominal data are categorical observations
• Ordinal data are ordered categorical observations
Types of data - examples
Interval data
Age    Income
55     75000
42     68000
.      .
Weight gain
+10
+5
.
Nominal data
Person    Marital status
1         married
2         single
3         single
.         .
Computer  Brand
1         IBM
2         Dell
3         IBM
.         .
With nominal data, all we can do is calculate the proportion of data that falls into each category.
Brand     Count   Proportion
IBM       25      50%
Dell      11      22%
Compaq    8       16%
Other     6       12%
Total     50
Types of data – analysis
Knowing the type of data is necessary to properly select the technique to be used when analyzing data.
Type of analysis allowed for each type of data
• Interval data: arithmetic calculations
• Nominal data: counting the number of observations in each category
• Ordinal data: computations based on an ordering process
Cross-Sectional/Time-Series Data
Cross-sectional data are collected at a certain point in time
• Marketing survey (observe preferences by gender, age)
• Test score in a statistics course
• Starting salaries of an MBA program's graduates
Time-series data are collected over successive points in time
• Weekly closing price of gold
• Amount of crude oil imported monthly
2.3 Graphical Techniques for Interval Data
Example 2.1: Providing information concerning the monthly bills of new subscribers in the first month after signing on with a telephone company.
• Collect data
• Prepare a frequency distribution
• Draw a histogram
Example 2.1: Providing information
Collect data
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05, ... (There are 200 data points.)
Prepare a frequency distribution
How many classes to use?
Number of observations   Number of classes
Less than 50             5-7
50 - 200                 7-9
200 - 500                9-10
500 - 1,000              10-11
1,000 - 5,000            11-13
5,000 - 50,000           13-17
More than 50,000         17-20
Class width = [Range] / [# of classes] = [Largest observation - Smallest observation] / [# of classes]
[119.63 - 0] / [8] = 14.95, rounded up to 15
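The class-width calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `class_width` is hypothetical, and rounding the width up to a whole number is an assumption based on the bins of 15 used in the histogram.

```python
import math


def class_width(largest, smallest, n_classes):
    """Return the raw class width: range divided by the number of classes."""
    return (largest - smallest) / n_classes


# Values taken from Example 2.1: largest bill 119.63, smallest bill 0, 8 classes.
raw_width = class_width(119.63, 0.0, 8)

# Assumption: round up to a convenient whole number so the classes cover the range.
rounded_width = math.ceil(raw_width)

print(round(raw_width, 2), rounded_width)
```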
Example 2.1: Providing information
Draw a histogram
[Figure: histogram of Bills; horizontal axis: Bills (bins 15 to 120), vertical axis: Frequency (0 to 80)]
Bin    Frequency
15     71
30     37
45     13
60     9
75     10
90     18
105    28
120    14
Example 2.1: Providing information
What information can we extract from this histogram?
• About half of all the bills are small: 71+37=108
• A few bills are in the middle range: 13+9+10=32
• A relatively large number of bills are large: 18+28+14=60
Relative frequency
It is often preferable to show the relative frequency (proportion) of observations falling into each class, rather than the frequency itself.
Class relative frequency = Class frequency / Total number of observations
Relative frequencies should be used when
• the population relative frequencies are studied
• comparing two or more histograms
• the number of observations of the samples studied are different
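As a rough illustration of the relative-frequency formula, the sketch below applies it to the class frequencies of Example 2.1. The dictionary layout and variable names are illustrative assumptions, not part of the original slides.

```python
# Class frequencies from the Example 2.1 histogram, keyed by the upper bin limit.
frequencies = {15: 71, 30: 37, 45: 13, 60: 9, 75: 10, 90: 18, 105: 28, 120: 14}

total = sum(frequencies.values())

# Class relative frequency = class frequency / total number of observations.
relative = {upper: count / total for upper, count in frequencies.items()}

for upper, share in relative.items():
    print(f"up to {upper}: {share:.3f}")
```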
Class width
It is generally best to use equal class widths, but sometimes unequal class widths are called for.
Unequal class widths are used when the frequency associated with some classes is too low. Then,
• several classes are combined together to form a wider and “more populated” class.
• it is possible to form an open-ended class at the higher end or lower end of the histogram.
Shapes of histograms
There are four typical shape characteristics
[Figures: a negatively skewed histogram and a positively skewed histogram]
Modal classes
A modal class is the one with the largest number of observations.
A unimodal histogram has a single modal class.
A bimodal histogram has two modal classes.
Bell shaped histograms
• Many statistical techniques require that the population be bell shaped.
• Drawing the histogram helps verify the shape of the population in question.
Interpreting histograms
Example 2.2: Selecting an investment
• An investor is considering investing in one of two investments.
• The returns on these investments were recorded.
• From the two histograms, how can the investor interpret
  • the expected returns
  • the spread of the returns (the risk involved with each investment)?
Example 2.2 - Histograms
[Figures: histograms of the return on investment A and the return on investment B, with the center of each marked; horizontal axes: -15 to 75, vertical axes: 0 to 18]
Interpretation: The center of the returns of Investment A is slightly lower than that for Investment B.
Example 2.2 - Histograms
[Figures: histograms of the return on investment A and the return on investment B; sample size = 50 for each]
Interpretation: The spread of returns for Investment A is less than that for Investment B.
Example 2.2 - Histograms
[Figures: histograms of the return on investment A and the return on investment B]
Interpretation: Both histograms are slightly positively skewed. There is a possibility of large returns.
Providing information
Example 2.2: Conclusion
It seems that investment A is better, because:
• Its expected return is only slightly below that of investment B.
• The risk from investing in A is smaller.
• The possibility of having a high rate of return exists for both investments.
Example 2.3: Comparing students’ performance
• Students’ performance in two statistics classes was compared.
• The two classes differed in their teaching emphasis:
  • Class A: mathematical analysis and development of theory.
  • Class B: applications and computer based analysis.
• The final mark for each student in each course was recorded.
• Draw histograms and interpret the results.
[Figures: histogram of Marks (Manual) and histogram of Marks (Computer); horizontal axes: marks from 50 to 100, vertical axes: Frequency]
The mathematical emphasis creates two groups, and a larger spread.
2.5 Describing the Relationship Between Two Variables
We are interested in the relationship between two interval variables.
Example 2.7
• A real estate agent wants to study the relationship between house price and house size.
• Twelve houses recently sold are sampled and their size and price recorded.
• Use a graphical technique to describe the relationship between size and price.
Size   Price
23     315
18     229
26     335
20     261
...    ...
Solution
• The size (independent variable, X) affects the price (dependent variable, Y).
• We use Excel to create a scatter diagram.
[Figure: scatter diagram; horizontal axis: X from 0 to 40, vertical axis: Y from 0 to 400]
Typical Patterns of Scatter Diagrams
Positive linear relationship
No relationship
Negative nonlinear relationship
Negative linear relationship
Nonlinear (concave) relationship
This is a weak linear relationship. A nonlinear relationship seems to fit the data better.
2.6 Describing Time-Series Data
Data can be classified according to the time at which they are collected.
• Cross-sectional data are all collected at the same time.
• Time-series data are collected at successive points in time.
Time-series data are often depicted on a line chart (a plot of the variable over time).
Line Chart
Example 2.9
• The total amounts of income tax paid by individuals in 1987 through 1999 are listed below.
• Draw a graph of these data and describe the information produced.
[Figure: line chart of total income tax paid; horizontal axis: years 87 to 99, vertical axis: 0 to 1,200,000]
For the first five years, total tax was relatively flat.
From 1993 there was a rapid increase in tax revenues.
Line charts can be used to describe nominal data time series.
Numerical Descriptive Techniques
4.2 Measures of Central Location
Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread
The measure of central location reflects the locations of all the actual data points.
How does the measure of central location reflect the locations of all the actual data points?
With one data point, clearly the central location is at the point itself.
With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
This is the most popular and useful measure of central location.
Mean = Sum of the observations / Number of observations
Sample mean (n is the sample size):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Population mean (N is the population size):
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0 \]
• Example 4.2
Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
\[ \mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59 \]
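The calculation in Example 4.1 can be checked with a short Python sketch. The variable names are illustrative; only the ten observations come from the example.

```python
# Hours on the Internet reported by the 10 adults of Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Sample mean = sum of the observations / number of observations.
sample_mean = sum(hours) / len(hours)

print(sample_mean)
```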
The Median
The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the midpoint of the two middle values, (8 + 9)/2 = 8.5.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.
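A minimal sketch of the two median cases, using Python's standard `statistics.median`. The variable names are illustrative; the data are those of Example 4.1.

```python
from statistics import median

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Even number of observations: the median is the midpoint of the two middle values.
median_all = median(hours)

# Odd number of observations: drop the longest time (33), as in the comment above.
nine_adults = sorted(hours)[:-1]
median_nine = median(nine_adults)

print(median_all, median_nine)
```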
The Mode
The mode of a set of observations is the value that occurs most frequently.
A set of data may have one mode (or modal class), or two or more modes.
For large data sets the modal class is much more relevant than a single-value mode.
Example 4.5
Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
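One way to find the mode of the Example 4.1 data is to count how often each value occurs. The sketch below uses `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Count occurrences of each value and take the most frequent one.
counts = Counter(hours)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```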
Relationship among Mean, Median, and Mode
If a distribution is symmetrical, the mean, median and mode coincide.
If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
A positively skewed distribution (“skewed to the right”): the mode is the smallest of the three, the median lies in the middle, and the mean is the largest.
A negatively skewed distribution (“skewed to the left”): the mean is the smallest of the three, the median lies in the middle, and the mode is the largest.
The Geometric Mean
This is a measure of the average growth rate.
Let \( R_i \) denote the rate of return in period i (i = 1, 2, ..., n). The geometric mean of the returns \( R_1, R_2, \dots, R_n \) is the constant \( R_g \) that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
For the given series of rates of return, the nth period return is calculated by
\[ (1 + R_1)(1 + R_2) \cdots (1 + R_n) \]
If the rate of return was \( R_g \) in every period, the nth period return would be calculated by
\[ (1 + R_g)^n \]
\( R_g \) is selected such that
\[ (1 + R_1)(1 + R_2) \cdots (1 + R_n) = (1 + R_g)^n \]
which gives
\[ R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1 \]
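The geometric mean formula can be sketched as follows. The two-period return series used here (a 100% gain followed by a 50% loss) is a hypothetical illustration and does not come from the slides; the function name is also an assumption. It requires Python 3.8 or later for `math.prod`.

```python
import math


def geometric_mean_return(returns):
    """Constant per-period rate giving the same terminal wealth as the given returns."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: +100% in period 1, -50% in period 2.
returns = [1.0, -0.5]

geometric = geometric_mean_return(returns)
arithmetic = sum(returns) / len(returns)

# The investment ends where it started, so the geometric mean is 0,
# while the arithmetic mean of the two rates is 0.25.
print(geometric, arithmetic)
```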
4.3 Measures of variability
Measures of central location fail to tell the whole story about the distribution.
A question of interest still remains unanswered: How much are the observations spread out around the mean value?
Observe two hypothetical data sets:
• Small variability: The average value provides a good representation of the observations in the data set.
• Larger variability: The same average value does not provide as good a representation of the observations in the data set as before.
The range
The range of a set of observations is the difference between the largest and smallest observations.
Range = Largest observation - Smallest observation
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
But, how do all the observations spread out? The range cannot assist in answering this question.
The Variance
This measure reflects the dispersion of all the observations.
The variance of a population of size N, \( x_1, x_2, \dots, x_N \), whose mean is \( \mu \), is defined as
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
The variance of a sample of n observations \( x_1, x_2, \dots, x_n \), whose mean is \( \bar{x} \), is defined as
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Why not use the sum of deviations?
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
A: 8-10 = -2, 9-10 = -1, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero for both populations and therefore is not a good measure of dispersion.
The Variance
Let us calculate the variance of the two populations
\[ \sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2 \]
\[ \sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18 \]
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
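The comparison of populations A and B can be reproduced with a small Python sketch. The helper names are illustrative; the data are the two populations used above.

```python
def sum_of_deviations(values):
    """Sum of (x - mean); always zero up to rounding, so useless as a spread measure."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

print(sum_of_deviations(population_a), sum_of_deviations(population_b))
print(population_variance(population_a), population_variance(population_b))
```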
The Variance
Let us calculate the sum of squared deviations for both data sets.
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
[Figure: data set A, with observations at 1 and 3 and mean 2; data set B, with observations at 1 and 5 and mean 3]
Sum_A = (1-2)^2 + ... + (1-2)^2 + (3-2)^2 + ... + (3-2)^2 = 10
Sum_B = (1-3)^2 + (5-3)^2 = 8
Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked. Each squared deviation in set A equals 1, so a sum of 10 means set A has ten observations.
\( \sigma_A^2 \) = Sum_A/N = 10/10 = 1
\( \sigma_B^2 \) = Sum_B/N = 8/2 = 4
The Variance
Example 4.7
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs} \]
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[ (17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2 \right] = 33.2 \text{ jobs}^2 \]
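Example 4.7 can be verified with a few lines of Python that follow the sample variance definition (divisor n - 1). The variable names are illustrative.

```python
# Number of jobs the six students applied for (Example 4.7).
jobs = [17, 15, 23, 7, 9, 13]

n = len(jobs)
sample_mean = sum(jobs) / n

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum((x - sample_mean) ** 2 for x in jobs) / (n - 1)

print(sample_mean, round(sample_variance, 1))
```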
Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: \( s = \sqrt{s^2} \)
Population standard deviation: \( \sigma = \sqrt{\sigma^2} \)
Example 4.8
• To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
• The distances were recorded.
• Which 7-iron is more consistent?
Example 4.8 - solution
Excel printout, from the “Descriptive Statistics” submenu.
                     Current     Innovation
Mean                 150.5467    150.1467
Standard Error       0.668815    0.357011
Median               151         150
Mode                 150         149
Standard Deviation   5.792104    3.091808
Sample Variance      33.54847    9.559279
Kurtosis             0.12674     -0.88542
Skewness             -0.42989    0.177338
Range                28          12
Minimum              134         144
Maximum              162         156
Sum                  11291       11261
Count                75          75
The innovation club is more consistent and, because the means are close, is considered a better club.
Interpreting Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
• \( (\bar{x} - s, \bar{x} + s) \) contains approximately 68% of the measurements
• \( (\bar{x} - 2s, \bar{x} + 2s) \) contains approximately 95% of the measurements
• \( (\bar{x} - 3s, \bar{x} + 3s) \) contains approximately 99.7% of the measurements
Example 4.9
A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.
Example 4.9 - solution
The empirical rule can be applied (bell shaped histogram).
Describing the return distribution
• Approximately 68% of the returns lie between 2% and 18% [10 - 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 - 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 - 3(8), 10 + 3(8)]
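The empirical-rule intervals of Example 4.9 can be generated with a short sketch. The function name and the dictionary of approximate coverages are illustrative assumptions; the rule itself only applies to mound-shaped distributions.

```python
def empirical_rule_intervals(mean, std):
    """Return {k: (lower, upper, approximate coverage)} for k = 1, 2, 3."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * std, mean + k * std, share)
            for k, share in coverage.items()}


# Example 4.9: mean return 10%, standard deviation 8%.
intervals = empirical_rule_intervals(10, 8)

for k, (lower, upper, share) in intervals.items():
    print(f"about {share:.1%} of returns lie between {lower}% and {upper}%")
```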
Chebysheff’s Theorem
The proportion of observations in any sample that lie within k standard deviations of the mean is at least \( 1 - 1/k^2 \) for k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!
k   Interval                                  Chebysheff                            Empirical Rule
1   \( (\bar{x} - s, \bar{x} + s) \)          at least 0% \( (1 - 1/1^2) \)         approximately 68%
2   \( (\bar{x} - 2s, \bar{x} + 2s) \)        at least 75% \( (1 - 1/2^2) \)        approximately 95%
3   \( (\bar{x} - 3s, \bar{x} + 3s) \)        at least 88.9% \( (1 - 1/3^2) \)      approximately 99.7%
Example 4.10
The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
Solution
At least 75% of the salaries lie between $22,000 and $34,000 [28000 - 2(3000), 28000 + 2(3000)]
At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 - 3(3000), 28000 + 3(3000)]
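A minimal sketch of the Chebysheff bound applied to Example 4.10. The function names are hypothetical, and clamping the bound at zero is an illustrative choice; the theorem is only informative for k > 1.

```python
def chebysheff_bound(k):
    """Lower bound on the proportion within k standard deviations of the mean."""
    return max(0.0, 1 - 1 / k ** 2)


def k_sigma_interval(mean, std, k):
    """Interval reaching k standard deviations either side of the mean."""
    return (mean - k * std, mean + k * std)


# Example 4.10: mean salary $28,000, standard deviation $3,000.
for k in (2, 3):
    lower, upper = k_sigma_interval(28000, 3000, k)
    print(f"at least {chebysheff_bound(k):.1%} of salaries lie between {lower} and {upper}")
```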
The Coefficient of Variation
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: \( cv = s / \bar{x} \)
Population coefficient of variation: \( CV = \sigma / \mu \)
This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
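The comparison in the last sentence can be made concrete with a tiny sketch. The function name is an illustrative assumption.

```python
def coefficient_of_variation(std, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return std / mean


# The same standard deviation of 10, relative to means of 100 and 500.
cv_small_mean = coefficient_of_variation(10, 100)
cv_large_mean = coefficient_of_variation(10, 500)

print(cv_small_mean, cv_large_mean)
```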
4.4 Measures of Relative Standing and Box Plots
Percentile
The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100 - p) percent of all the observations are greater than that value.
Example
Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score and 40% lie above it.
Quartiles
Commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
Quartiles
Example
Find the quartiles of the following set of
measurements 7, 8, 12, 17, 29, 18, 4,
27, 30, 2, 4, 10, 21, 5, 8
Solution
Sort the observations
2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27,
29, 30
The first quartile
15 observations
At most (.25)(15) = 3.75 observations
should appear below the first quartile.
Check the first 3 observations on the
left hand side.
At most (.75)(15)=11.25 observations
should appear above the first quartile.
Check 11 observations on the
right hand side.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
4.5 Measures of Linear Relationship
The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• Covariance: is there any pattern to the way two variables move together?
• Coefficient of correlation: how strong is the linear relationship between two variables?
Covariance
Population covariance:
\[ COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \]
\( \mu_x \) (\( \mu_y \)) is the population mean of the variable X (Y). N is the population size.
Sample covariance:
\[ cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
\( \bar{x} \) (\( \bar{y} \)) is the sample mean of the variable X (Y). n is the sample size.
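Covariance
Compare the following three sets. In each set, \( \bar{x} = 5 \) and \( \bar{y} = 20 \).
Set 1
x_i   y_i   (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
2     13    -3        -7        21
6     20    1         0         0
7     27    2         7         14
Cov(x,y) = 17.5
Set 2
x_i   y_i   (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
2     27    -3        7         -21
6     20    1         0         0
7     13    2         -7        -14
Cov(x,y) = -17.5
Set 3
x_i   y_i   (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
2     20    -3        0         0
6     27    1         7         7
7     13    2         -7        -14
Cov(x,y) = -3.5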
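The three covariances can be recomputed directly from the sample covariance definition (divisor n - 1). The function name is an illustrative assumption; the data are the three sets above.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


xs = [2, 6, 7]
set_1 = sample_covariance(xs, [13, 20, 27])  # y rises with x
set_2 = sample_covariance(xs, [27, 20, 13])  # y falls as x rises
set_3 = sample_covariance(xs, [20, 27, 13])  # no clear pattern

print(set_1, set_2, set_3)
```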
Covariance
If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
If the two variables are unrelated, the covariance will be close to zero.
The coefficient of correlation
Population coefficient of correlation:
\[ \rho = \frac{COV(X, Y)}{\sigma_x \sigma_y} \]
Sample coefficient of correlation:
\[ r = \frac{cov(X, Y)}{s_x s_y} \]
This coefficient answers the question: How strong is the association between X and Y?
\( \rho \) or r = +1: strong positive linear relationship; COV(X,Y) > 0
\( \rho \) or r = 0: no linear relationship; COV(X,Y) = 0
\( \rho \) or r = -1: strong negative linear relationship; COV(X,Y) < 0
If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
No straight line relationship is indicated by a coefficient close to zero.
The coefficient of correlation and the covariance – Example 4.16
Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
Solution
We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y
Student   x       y       x^2         y^2      xy
1         599     9.6     358801      92.16    5750.4
2         689     8.8     474721      77.44    6063.2
3         584     7.4     341056      54.76    4321.6
4         631     10      398161      100      6310
...
11        593     8.8     351649      77.44    5218.4
12        683     8       466489      64       5464
Total     7,587   106.4   4,817,755   957.2    67,559.2
cov(x,y) = (1/(12-1))[67,559.2 - (7587)(106.4)/12] = 26.16
s_x = {(1/(12-1))[4,817,755 - (7587)^2/12]}^.5 = 43.56
s_y = 1.12 (computed in the same way as s_x)
r = cov(x,y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362
Shortcut Formulas
\[ cov(x, y) = \frac{1}{n - 1}\left[ \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} \right] \]
\[ s^2 = \frac{1}{n - 1}\left[ \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right] \]
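The shortcut formulas can be applied to the column totals of Example 4.16 as in the sketch below. The variable names are illustrative; only the totals and n = 12 come from the table. Because the code does not round the intermediate values, r comes out near .5365 rather than the .5362 obtained from the rounded figures.

```python
import math

# Column totals from the Example 4.16 table (12 students).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2 = 4817755, 957.2
sum_xy = 67559.2

# Shortcut formulas for the sample covariance and sample variances.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

s_x = math.sqrt(var_x)
s_y = math.sqrt(var_y)

# Sample coefficient of correlation.
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```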
The coefficient of correlation and the covariance – Example 4.16 – Excel
Use the Covariance option in Data Analysis.
If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.
Use the Correlation option to produce the correlation matrix.
Variance-Covariance Matrix
Population values
        GPA     GMAT
GPA     1.15
GMAT    23.98   1739.52
Sample values (population values multiplied by 12/(12-1))
        GPA     GMAT
GPA     1.25
GMAT    26.16   1897.66
Interpretation
• The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
• The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.