BASIC STATISTICAL TOOLS
What is Statistics
 Statistics refers to the collection, presentation, analysis, and utilization of numerical data to make inferences and reach decisions in the face of uncertainty in economics, business, and other social and physical sciences.
Statistics is subdivided into descriptive and inferential.
Descriptive statistics
 Descriptive statistics: Methods of organizing, summarizing, and presenting data in an informative way.
 EXAMPLE : According to Consumer Reports,
Whirlpool washing machine owners reported 9
problems per 100 machines during 1995. The
statistic 9 describes the number of problems out of
every 100 machines.
Inferential statistics
 Inferential statistics is the process of reaching
generalizations about the whole (called the
population) by examining a portion (called the
sample).
 A population is a collection of all possible
individuals, objects, or measurements of interest.
 A sample is a portion, or part, of the population of
interest.
EXAMPLE
 Suppose that we have data on the incomes of 1000 U.S.
families. This body of data can be summarized by finding
the average family income and the spread of these family
incomes above and below the average. The data also can
be described by constructing a table, chart, or graph of
the number or proportion of families in each income
class. This is descriptive statistics.
 If these 1000 families are representative of all U.S.
families, we can then estimate and test hypotheses about
the average family income in the United States as a
whole. This is statistical inference.
TYPES OF DATA
 There are three types of data that are generally
available for empirical analysis.
 1. Time series
 2. Cross-sectional
 3. Pooled (a combination of time series and cross-sectional)
TIME SERIES DATA
 Collected over a period of time, such as the data on GDP, employment, unemployment, money supply, and government deficit.
 Such data may be collected at regular intervals:
 Daily (e.g. stock prices)
 Weekly (e.g. money supply)
 Monthly (e.g. unemployment rate)
 Quarterly (e.g. GDP)
 Annually (e.g. government budget)
 This is called the frequency of the data.
 These data may be quantitative in nature (e.g. Prices,
income, money supply)
 Or qualitative in nature (e.g. Male or female,
employed or unemployed, married or unmarried,
white or black)
 Qualitative variables are also called dummy or
categorical variables.
CROSS-SECTIONAL DATA
 These are data on one or more variables collected at
one point in time
 For example GDP of European Union Countries in
2010.
 Government budget deficit of BRIC countries.
POOLED DATA
 In pooled data we have elements of both time series and cross-sectional data.
 For example:
 Unemployment rate for 10 countries for a period of 20 years (pooled data)
 Data on the unemployment rate for each country for the 20-year period (time series)
 Data on the unemployment rate for the 10 countries for any single year (cross-sectional)
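The relationship between the three data types can be sketched in a few lines of Python. The country codes, years, and unemployment rates below are hypothetical placeholders, not real data; the point is only that a pooled data set can be sliced into a time series (one country, all years) or a cross-section (all countries, one year).

```python
# Hypothetical unemployment rates (%), keyed by (country, year).
# Countries, years, and values are made up for illustration only.
pooled = {
    ("A", 2001): 5.1, ("A", 2002): 5.4, ("A", 2003): 5.0,
    ("B", 2001): 7.8, ("B", 2002): 7.5, ("B", 2003): 7.9,
}


def time_series(data, country):
    """Return one country's observations across all years."""
    return {year: v for (c, year), v in sorted(data.items()) if c == country}


def cross_section(data, year):
    """Return all countries' observations for a single year."""
    return {c: v for (c, y), v in sorted(data.items()) if y == year}


print(time_series(pooled, "A"))     # country A over 2001-2003
print(cross_section(pooled, 2002))  # both countries in 2002
```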
Frequency Distribution
 Frequency distribution: A grouping of data into
categories showing the number of observations in
each category.
 The number of classes is usually between 5 and 15.
 Class mark (midpoint): A point that divides a
class into two equal parts. This is the average
between the upper and lower class limits.
 Class interval: For a frequency distribution
having classes of the same size, the class
interval is obtained by subtracting the lower
limit of a class from the lower limit of the next
class.
EXAMPLE 1
 Dr. Tillman is the dean of the school of business and
wishes to determine the amount of studying business
school students do. He selects a random sample of 30
students and determines the number of hours each
student studies per week:
 15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5, 20.7,
17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8, 17.1, 18.9,
10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2, 12.9, 27.1, 16.6.
 Organize the data into a frequency distribution.
EXAMPLE 1 continued
Consider the classes 8-12 and 13-17. The class marks are 10 and 15. The class interval is 5 (13 - 8).
Hours studying    Frequency, f
8-12              1
13-17             12
18-22             10
23-27             5
28-32             1
33-37             1
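The tally in Example 1 can be reproduced with a short Python sketch. One assumption is needed that the slides do not state: because the observations have one decimal place, each class is taken to run from half a unit below its lower limit to half a unit above its upper limit (so 8-12 covers 7.5 up to, but not including, 12.5). With that convention the counts match the table.

```python
# Hours studied per week by the 30 sampled students (from Example 1).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5, 20.7,
    17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8, 17.1, 18.9,
    10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]

labels = ["8-12", "13-17", "18-22", "23-27", "28-32", "33-37"]
# Assumed class boundaries: halfway between adjacent class limits.
lower_bounds = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5]
width = 5  # class interval

frequency = {label: 0 for label in labels}
for h in hours:
    for label, low in zip(labels, lower_bounds):
        if low <= h < low + width:
            frequency[label] += 1
            break

for label in labels:
    print(f"{label:>6}  {frequency[label]:>2}")
```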
Suggestions on Constructing a Frequency Distribution
 The class intervals used in the frequency distribution should be equal.
 Determine a suggested class interval by using the formula: i = (highest value - lowest value) / number of classes.
 Use the computed suggested class interval to construct the frequency distribution. Note: this is a suggested class interval; if the computed class interval is 97, it may be better to use 100.
 Count the number of values in each class.
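As a rough illustration of the formula, the sketch below applies it to the study-hours data from Example 1 with six classes. The function name `suggested_interval` is introduced here for illustration. The computed value is only a starting point; Example 1 rounds up to the more convenient width of 5.

```python
# Hours studied per week by the 30 sampled students (from Example 1).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5, 20.7,
    17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8, 17.1, 18.9,
    10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]


def suggested_interval(values, number_of_classes):
    """i = (highest value - lowest value) / number of classes."""
    return (max(values) - min(values)) / number_of_classes


i = suggested_interval(hours, 6)
print(round(i, 2))  # about 3.92, before rounding to a convenient width
```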
Relative Frequency Distribution
 The relative frequency of a class is obtained by dividing the class frequency by the total frequency.
 The sum of the relative frequencies equals 1.
Hours    Frequency, f    Relative frequency
8-12     1               1/30 = .0333
13-17    12              12/30 = .4000
18-22    10              10/30 = .3333
23-27    5               5/30 = .1667
28-32    1               1/30 = .0333
33-37    1               1/30 = .0333
TOTAL    30              30/30 = 1
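The relative frequencies for the study-hours example can be checked with a minimal Python sketch that divides each class frequency by the total. The variable names are illustrative.

```python
# Class frequencies from the study-hours frequency distribution.
frequency = {
    "8-12": 1, "13-17": 12, "18-22": 10,
    "23-27": 5, "28-32": 1, "33-37": 1,
}

total = sum(frequency.values())
relative = {label: f / total for label, f in frequency.items()}

for label, share in relative.items():
    print(f"{label:>6}  {frequency[label]:>2}  {share:.4f}")

# The relative frequencies add up to 1 (apart from floating-point rounding).
print(round(sum(relative.values()), 10))
```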
EXAMPLE 2
 The cans in a sample of 20 cans of fruit contain net weights of fruit ranging from 19.3 to 20.9 oz, as given in the Table. If we want to group these data into 6 classes, we get class intervals of 0.3 oz: (21.0 - 19.2)/6 = 0.3 oz.
 The weights given in the Table can be arranged into the frequency distribution given in the next Table.
Frequency Distribution of Weights
Stem-and-Leaf Displays
 Stem-and-Leaf Display: A statistical technique for displaying a set of data. Each numerical value is divided into two parts: the leading digits become the stem and the trailing digits the leaf.
 Note: An advantage of the stem-and-leaf display over a frequency distribution is that we do not lose the identity of each observation.
EXAMPLE 3
 Colin achieved the following scores on his twelve accounting quizzes this semester: 86, 79, 92, 84, 69, 88, 91, 83, 96, 78, 82, 85. Construct a stem-and-leaf chart for the data.
stem    leaf
6       9
7       8 9
8       2 3 4 5 6 8
9       1 2 6
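One way to build the display for Colin's scores is sketched below in Python. It assumes two-digit values, so the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is illustrative.

```python
scores = [86, 79, 92, 84, 69, 88, 91, 83, 96, 78, 82, 85]


def stem_and_leaf(values):
    """Group sorted values by tens digit (stem); units digits are the leaves."""
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    return stems


display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Every original score can be read back from the display (stem 8 with leaf 6 is 86), which is the advantage over a frequency distribution noted earlier.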
Graphic Representation of a Frequency Distribution
 The three commonly used graphic forms are
 histograms,
 frequency polygons, and
 cumulative frequency distributions.
 Histogram: A graph in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the bars, and the bars are drawn adjacent to each other.
[Figure: Histogram for Hours Spent Studying. Horizontal axis: hours spent studying; vertical axis: frequency.]
Histogram of Weights
Frequency Polygon
 A frequency polygon consists of line segments
connecting the points formed by the class midpoint
and the class frequency.
[Figure: Frequency Polygon for Hours Spent Studying. Horizontal axis: hours spent studying; vertical axis: frequency.]
Cumulative Frequency Distribution
 A cumulative frequency distribution is used to
determine how many or what proportion of the data
values are below or above a certain value.
[Figure: Cumulative Frequency Distribution for Hours Studying. Horizontal axis: hours spent studying; vertical axis: frequency.]
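The cumulative frequencies for the study-hours example can be obtained by keeping a running total of the class frequencies. This Python sketch builds a "less than" cumulative distribution, which is an assumption about the direction of accumulation; an "or more" version would accumulate from the last class instead.

```python
from itertools import accumulate

labels = ["8-12", "13-17", "18-22", "23-27", "28-32", "33-37"]
frequency = [1, 12, 10, 5, 1, 1]

# Running total: number of students at or below each class.
cumulative = list(accumulate(frequency))
total = cumulative[-1]

for label, cum in zip(labels, cumulative):
    print(f"up to {label:>5}: {cum:>2} students ({cum / total:.1%})")
```

For instance, 23 of the 30 students fall in the classes up to 18-22.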
Bar Chart
 A bar chart can be used to depict any of the levels of
measurement (nominal, ordinal, interval, or ratio).
 EXAMPLE 3: Construct a bar chart for the number
of unemployed people per 100,000 population for
selected cities.
EXAMPLE continued
City                Number of unemployed per 100,000 population
Atlanta, GA         7300
Boston, MA          5400
Chicago, IL         6700
Los Angeles, CA     8900
New York, NY        8200
Washington, D.C.    8900
[Figure: Bar Chart for the Unemployment Data. Horizontal axis: cities; vertical axis: number of unemployed per 100,000 population.]
Pie Chart
 A pie chart is especially useful in displaying a
relative frequency distribution. A circle is
divided proportionally to the relative
frequency and portions of the circle are
allocated for the different groups.
 EXAMPLE 4: A sample of 200 runners were
asked to indicate their favorite type of running
shoe.
EXAMPLE continued
 Draw a pie chart based on the following information.
Type of shoe    # of runners
Nike            92
Adidas          49
Reebok          37
Asics           13
Other           9
[Figure: Pie Chart for Running Shoes, with slices for Nike, Adidas, Reebok, Asics, and Other.]
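To divide the circle proportionally, each group's relative frequency is multiplied by 360 degrees. The sketch below does this in Python for the running-shoe data; the variable names are illustrative, and the slides themselves do not show the angle calculation.

```python
runners = {"Nike": 92, "Adidas": 49, "Reebok": 37, "Asics": 13, "Other": 9}

total = sum(runners.values())

# For each brand: (relative frequency, slice angle in degrees).
slices = {brand: (n / total, 360 * n / total) for brand, n in runners.items()}

for brand, (share, angle) in slices.items():
    print(f"{brand:<7} {share:.3f}  {angle:6.1f} degrees")
```

Nike, with 92 of the 200 runners, takes a relative frequency of 0.46 and so a slice of about 165.6 degrees.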
MEASURES OF CENTRAL TENDENCY
 Central tendency refers to the location of a distribution. The most important measures of central tendency are
(1) the mean,
(2) the median, and
(3) the mode.
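The Mean
 The arithmetic mean for ungrouped data is the sum of the values divided by the number of items: $\mu = \frac{\sum X}{N}$ for a population and $\bar{X} = \frac{\sum X}{n}$ for a sample.
The Median
 The median for ungrouped data is the value of the middle item when all the items are arranged in either ascending or descending order in terms of values: Median = the $\frac{N+1}{2}$th item in the data array,
 where N refers to the number of items in the population (n for a sample).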
The Mode
 The mode is the value that occurs most frequently in
the data set.
 The mean is the most commonly used measure of
central tendency. The mean, however, is affected by
extreme values in the data set, while the median and
the mode are not.
 Other measures of central tendency are the weighted
mean, the geometric mean, and the harmonic mean
EXAMPLE
 A student received the following grades (measured
from 0 to 10) on the 10 quizzes he took during a
semester: 6, 7, 6, 8, 5, 7, 6, 9, 10, and 6.
 Find the mean, median and mode for the population
on the 10 quizzes.
EXAMPLE
 To find the median for the ungrouped data, we first arrange the 10 grades in ascending order: 5, 6, 6, 6, 6, 7, 7, 8, 9, 10.
 Then we find the grade of the (N+1)/2 or (10+1)/2 = 5.5th item. Thus the median is the average of the 5th and 6th items in the array, or (6+7)/2 = 6.5.
 The mode for the ungrouped data is 6 (the value that
occurs most frequently in the data set).
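The three measures for the quiz grades can be verified with Python's standard `statistics` module. This is a quick check of the example, not part of the original slides.

```python
import statistics

# Quiz grades from the example (treated as the whole population).
grades = [6, 7, 6, 8, 5, 7, 6, 9, 10, 6]

mean = statistics.mean(grades)      # sum of grades / number of grades = 70 / 10
median = statistics.median(grades)  # average of the 5th and 6th ordered items
mode = statistics.mode(grades)      # most frequent value

print(mean, median, mode)
```

The mean (7) lies above the median (6.5), which lies above the mode (6): the two high grades, 9 and 10, pull the mean upward.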
Example: Mean for Grouped Data
 Estimate the mean for the grouped data given in the Table below.
MEASURES OF DISPERSION
 Dispersion refers to the variability or spread in the data. The most important measures of dispersion are
(1) the average deviation,
(2) the variance, and
(3) the standard deviation.
 We will measure these for populations and samples.
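Average deviation
 The average deviation (AD), also called the mean absolute deviation (MAD), is given by $AD = \frac{\sum |X - \mu|}{N}$ for a population with mean $\mu$ and $AD = \frac{\sum |X - \bar{X}|}{n}$ for a sample with mean $\bar{X}$,
 where the two vertical bars indicate the absolute value, or the values omitting the sign.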
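Variance
 The population variance $\sigma^2$ (the Greek letter sigma squared) and the sample variance $s^2$ for ungrouped data are given by $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ and $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$, where $\mu$ and $\bar{X}$ are the population and sample means.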
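Standard deviation
 The population standard deviation $\sigma$ and sample standard deviation $s$ are the positive square roots of their respective variances. For ungrouped data, $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$ and $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$.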
EXAMPLE
 Calculate the average deviation, variance, and standard deviation using the data for the quiz grades.
EXAMPLE continued
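The worked solution is not preserved in this transcript, so the following Python sketch recomputes the three measures for the quiz grades. It assumes the usual population formulas (dividing by N), since the earlier example treats the 10 quizzes as a population; the sample variance, which divides by n - 1, is shown for comparison.

```python
import math

# Quiz grades from the earlier example, treated as a population.
grades = [6, 7, 6, 8, 5, 7, 6, 9, 10, 6]

n = len(grades)
mu = sum(grades) / n  # population mean = 7

# Average (mean absolute) deviation: mean of |X - mu|.
average_deviation = sum(abs(x - mu) for x in grades) / n

# Population variance and standard deviation (divide by N).
sum_sq = sum((x - mu) ** 2 for x in grades)
variance = sum_sq / n
std_dev = math.sqrt(variance)

# Sample variance for comparison (divide by n - 1).
sample_variance = sum_sq / (n - 1)

print(average_deviation, variance, round(std_dev, 4), round(sample_variance, 4))
```

The absolute deviations sum to 12 and the squared deviations to 22, giving AD = 1.2, a population variance of 2.2, and a standard deviation of about 1.48.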
SHAPE OF FREQUENCY DISTRIBUTIONS
 The shape of a distribution refers to
 (1) its symmetry or lack of it (skewness) and
 (2) its peakedness (kurtosis).
Skewness
 A distribution has zero skewness if it is symmetrical
about its mean. For a symmetrical (unimodal)
distribution, the mean, median, and mode are equal.
 A distribution is positively skewed if the right tail is
longer. Then, mean > median > mode.
 A distribution is negatively skewed if the left tail is
longer. Then, mode > median > mean
Kurtosis
 A peaked curve is called leptokurtic, as opposed to a
flat one (platykurtic), relative to one that is
mesokurtic.
 The kurtosis for a mesokurtic curve is 3.
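The slides do not give formulas for skewness or kurtosis. A common choice, assumed here, is the moment-based pair: skewness as the third central moment divided by the cube of the standard deviation, and kurtosis as the fourth central moment divided by the squared variance, which equals 3 for a normal (mesokurtic) curve. The sketch applies both to the quiz grades from the earlier example.

```python
# Quiz grades from the earlier example, treated as a population.
grades = [6, 7, 6, 8, 5, 7, 6, 9, 10, 6]

n = len(grades)
mu = sum(grades) / n


def central_moment(values, mean, k):
    """k-th central moment: average of (X - mean) ** k."""
    return sum((x - mean) ** k for x in values) / len(values)


m2 = central_moment(grades, mu, 2)  # population variance
m3 = central_moment(grades, mu, 3)
m4 = central_moment(grades, mu, 4)

skewness = m3 / m2 ** 1.5  # assumed moment-based definition
kurtosis = m4 / m2 ** 2    # equals 3 for a normal distribution

print(round(skewness, 4), round(kurtosis, 4))
```

The positive skewness (about 0.74) agrees with the ordering mean > median > mode found for these grades, and the kurtosis (about 2.44) is below 3, so the distribution is somewhat flatter than a mesokurtic one.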
[Figure: Histogram of series X2. Sample: 1960 to 1982; observations: 23.]
Descriptive statistics for X2
Mean           1035.065
Median         843.3000
Maximum        2478.700
Minimum        397.5000
Std. Dev.      617.8470
Skewness       0.962455
Kurtosis       2.818835
Jarque-Bera    3.582342
Probability    0.166765
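The output for X2 reports a Jarque-Bera statistic and its probability without defining them. As a hedged check, the sketch below uses the standard textbook formula JB = (n/6)(S^2 + (K - 3)^2 / 4), where S is skewness and K is kurtosis, and compares it with a chi-square distribution with 2 degrees of freedom. Neither the formula nor the distribution is stated in the slides, but together they reproduce the reported figures.

```python
import math

# Values reported for series X2.
n = 23
skewness = 0.962455
kurtosis = 2.818835

# Jarque-Bera statistic (standard formula, assumed here).
jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)

# Upper-tail probability of a chi-square with 2 degrees of freedom,
# which has the closed form exp(-x / 2).
p_value = math.exp(-jb / 2)

print(round(jb, 4), round(p_value, 4))
```

With a probability of about 0.17, the hypothesis that X2 is normally distributed would not be rejected at conventional significance levels.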
Correlation coefficient
 A correlation coefficient is a number that summarizes the degree to which two variables move together.
 Correlations range in value from -1 to +1. When the coefficient is 1 in absolute value (either -1 or +1), the two variables are perfectly "in sync" with each other: every change in one is accompanied by an exactly proportional change in the other.
 If the variables are moving in opposite directions
(one increases as the other decreases), it is a negative
relationship.
 We indicate a negative relationship by using a minus
sign before the coefficient.
 If the variables are moving in the same direction
(both are increasing or both are decreasing
together), we denote that by reporting the coefficient
as a positive number.
 When the coefficient is 0, there is no linear relationship between the two variables.
 Typically, coefficients fall somewhere between no relationship (0) and a perfect relationship (+1 or -1).
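The slides do not give a formula for the coefficient. The sketch below assumes the usual Pearson product-moment definition and uses small made-up series to show the three cases described above: a perfect positive relationship, a perfect negative relationship, and no linear relationship. The function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-movement of x and y scaled to the range -1..+1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
same_direction = [2 * v + 1 for v in x]  # rises as x rises
opposite = [10 - 3 * v for v in x]       # falls as x rises
unrelated = [1, -1, 0, -1, 1]            # no linear pattern with x

print(pearson_r(x, same_direction))
print(pearson_r(x, opposite))
print(pearson_r(x, unrelated))
```

The first two series are exact linear functions of x with slopes 2 and -3, so the coefficients are +1 and -1 even though the variables do not change one-for-one; the third series gives 0.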
Correlation Matrix
The scatterplot
 The scatterplot is the visual complement for the
correlation coefficient. It visually displays whether
there's any connection between the movements of
two variables.
 One variable is displayed on the X axis while the
other variable is displayed on the Y axis.
 The values on either axis might be expressed in
absolute numbers, percentages, rates, or scores.
[Figure: Scatterplot of X3 vs. X2. Horizontal axis: X2; vertical axis: X3.]
[Figure: Time Series Graph of series X4 and X5. Horizontal axis ticks run from 60 to 82; vertical axis ticks run from 40 to 240.]