Lecture 3, January 12, 2004

A point from the previous lecture

Last Friday, I told you that when you
create a histogram with classes such as
10-20, 20-30, and so on, the value “20”
should be tallied in the class “10-20”. That is
not right. You should tally 20 in the class
“20-30”. If you have any questions, please
ask me.
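A minimal Python sketch of this tallying rule, assuming classes of equal width 10 that include their lower boundary and exclude their upper boundary. The helper name class_label and the data values are made up for illustration; they are not from the lecture.

```python
def class_label(value, width=10):
    """Return the class 'lower-upper' that contains value.

    Each class includes its lower boundary and excludes its upper
    boundary, so 20 is tallied in 20-30, not in 10-20.
    """
    lower = (value // width) * width
    return f"{lower}-{lower + width}"

values = [12, 20, 25, 30, 19]  # hypothetical observations

tally = {}
for v in values:
    label = class_label(v)
    tally[label] = tally.get(label, 0) + 1

print(class_label(20))  # 20-30
print(tally)            # {'10-20': 2, '20-30': 2, '30-40': 1}
```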
Statistical Measures


Although the frequency distribution
arranges the raw data into a meaningful
pattern, that summary cannot by itself
answer many important statistical
questions.
For example, an industrial engineer
wishing to select the faster of two
production methods might obtain sample
completion times from pilot runs and then
try to reach a decision comparing the two
resulting sample frequency distributions.
Statistical Measures



The faster procedure ought to be more
clearly indicated by the “average”
completion times under the two
production methods.
Averages are one class of statistical
measures.
These quantities (statistical measures)
express various properties of the statistical
data.
Statistical Measures

The first kind of measure in this discussion is the
measure of location.



There are two types of location measures.
One group expresses central tendency.
The other group measures variability or
dispersion.
Statistics and Parameter

Summary data measures fall into two
major groupings, depending on whether
the observations they describe are a
population or a sample.
Population Parameter


When the data constitute a population,
each summary measure is referred to as a
population parameter.
But, ordinarily not all possible population
observations are made.
Sample Statistic



A measure that summarizes sample data is
called a sample statistic.
It is the statistic that is computed from
those observations actually made.
Important population parameters have
counterpart sample statistics that measure
the same characteristic.
NUMERICAL DESCRIPTIVE MEASURES





Numerical descriptive measures are numbers computed
from a data set to help us create a mental image of its relative
frequency histogram.
Measures of Central Location
• Mean, median, mode
Relative Standing
• Percentile, box plots
Measures of Variability
• Range, variance, standard deviation
Measures of Association
• Covariance, coefficient of correlation
MEASURES OF CENTRAL LOCATION
MEAN




The arithmetic mean is the most commonly used and best
understood measure of central tendency.
Mean is defined as follows:
$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$
In the following, the sample mean and the population mean are
discussed separately.
Note the difference in notation: the sample mean is denoted by
$\bar{x}$ and the population mean is denoted by $\mu$. The number of
values in a sample is denoted by n and the number of values
in the population is denoted by N.
MEASURES OF CENTRAL LOCATION
MEAN
Mean of a data set:
• If the data set is a sample, compute the sample mean.
• If the data set is a population, compute the population mean.
MEASURES OF CENTRAL LOCATION
SAMPLE MEAN

The sample mean is the sum of all the sample values
divided by the number of sample values:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $\bar{x}$ stands for the sample mean,
n is the total number of values in the sample,
$x_i$ is the value of the i-th observation, and
$\Sigma$ represents a summation.
MEASURES OF CENTRAL LOCATION
SAMPLE MEAN


A sample of five executives received the following
amounts of bonus last year: $14,000, $15,000,
$17,000, $16,000, and $15,000. Find the average
bonus for these five executives.
Since these values represent a sample size of 5, the
sample mean is (14,000 + 15,000 +17,000 + 16,000
+15,000)/5 = $15,400.
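The same calculation can be sketched in a few lines of Python. The variable names are chosen here for illustration only.

```python
# Bonuses of the five sampled executives, in dollars.
bonuses = [14_000, 15_000, 17_000, 16_000, 15_000]

# Sample mean: sum of the sample values divided by the sample size n.
n = len(bonuses)
sample_mean = sum(bonuses) / n

print(sample_mean)  # 15400.0
```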
MEASURES OF CENTRAL LOCATION
POPULATION MEAN

The population mean is the sum of all the population
values divided by the number of population values:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $\mu$ stands for the population mean,
N is the total number of values in the population,
$x_i$ is the value of the i-th observation, and
$\Sigma$ represents a summation.
MEASURES OF CENTRAL LOCATION
POPULATION MEAN


The Keller family owns four cars. The following is the
mileage attained by each car: 56,000, 23,000,
42,000, and 73,000. Find the average miles covered
by each car.
The mean is (56,000 + 23,000 + 42,000 + 73,000)/4
= 48,500
MEASURES OF CENTRAL LOCATION
PROPERTIES OF MEAN




Data possessing an interval scale or a ratio scale
usually have a mean.
All the values are included in computing the mean.
A set of data has a unique mean.
The arithmetic mean is the only measure of central
tendency where the sum of the deviations of each
value from the mean is zero.
MEASURES OF CENTRAL LOCATION
PROPERTIES OF MEAN

Consider the set of values: 3, 8, and 4. The mean is
5. Illustrating the last property, (3-5) + (8-5) + (4-5) =
-2 +3 -1 = 0. In other words,
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
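A short Python check of this property for the values 3, 8, and 4. This is only a sketch: with data whose mean is not exactly representable in floating point, the sum may come out as a tiny nonzero number, so a tolerance-based comparison such as math.isclose would be the safer test in general.

```python
values = [3, 8, 4]

mean = sum(values) / len(values)

# Deviation of each value from the mean.
deviations = [x - mean for x in values]

print(mean)             # 5.0
print(deviations)       # [-2.0, 3.0, -1.0]
print(sum(deviations))  # 0.0
```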
MEASURES OF CENTRAL LOCATION
MEDIAN



Median: The midpoint of the values after they have
been ordered from the smallest to the largest, or the
largest to the smallest. There are as many values
above the median as below it in the data array.
For an even set of numbers, the median will be the
arithmetic average of the two middle numbers.
The median is the most appropriate measure of
central location to use when the data under
consideration are ranked data rather than
quantitative data. For example, if 13 universities are
ranked according to reputation, university 7 is the
one of median reputation.
MEASURES OF CENTRAL LOCATION
MEDIAN





Compute the median for the following data.
The ages of a sample of five college students are: 21,
25, 19, 20, and 22.
Arranging the data in ascending order gives: 19, 20,
21, 22, 25. Thus the median is 21.
The heights of four basketball players, in inches, are 76,
73, 80, and 75.
Arranging the data in ascending order gives: 73, 75,
76, 80. Thus the median is (75 + 76)/2 = 75.5.
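Both cases (odd and even number of values) can be sketched in Python as follows. The function name median is chosen here for illustration; Python's standard library also provides statistics.median, which follows the same rule.

```python
def median(values):
    """Middle value of the ordered data; for an even number of values,
    the arithmetic average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

ages = [21, 25, 19, 20, 22]
heights = [76, 73, 80, 75]

print(median(ages))     # 21
print(median(heights))  # 75.5
```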
MEASURES OF CENTRAL LOCATION
MODE



The mode is the value of the observation that
appears most frequently.
The mode is most useful when an important aspect of
describing the data involves determining the number
of times each value occurs. If the data are qualitative
(e.g., the number of graduates in various disciplines:
accounting, finance, etc.), then the mode is useful (e.g.,
the modal class is accounting).
EXAMPLE 6: The exam scores for ten students are:
81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Since the
score of 81 occurs the most, the modal score is 81.
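One possible way to count the occurrences in Python is with collections.Counter. This is a sketch; if several values tie for the highest count, most_common(1) reports only one of them.

```python
from collections import Counter

scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]

# Count how many times each score occurs.
counts = Counter(scores)

# The most frequent score and its frequency.
modal_score, frequency = counts.most_common(1)[0]

print(modal_score, frequency)  # 81 3
```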
MEASURES OF CENTRAL LOCATION
MEAN, MEDIAN, MODE




Mean: affected by unusually large/small data; may be
used if the data are quantitative (ratio or interval scale).
Median: most appropriate if the data are ranked (ordinal
scale).
Mode: most appropriate if the data are qualitative
(nominal scale).
Appropriate measures if the data are
• quantitative: mean, median, mode
• ranked: median, mode
• qualitative: mode
MEASURES OF CENTRAL LOCATION
RELATIVE VALUES OF MEAN, MEDIAN, MODE
Mode < Median < Mean if the distribution is positively skewed
Mode = Median = Mean if the distribution is symmetric
Mean < Median < Mode if the distribution is negatively skewed
RELATIVE STANDING: PERCENTILES





Percentiles divide the distribution into 100 groups.
The p-th percentile is defined to be that numerical value
such that at most p% of the values are smaller than that
value and at most (100 – p)% are larger than that value in
an ordered data set.
For example, if the 78th percentile of GMAT scores is 600,
then at most 78% of scores are below 600 and at most 22%
of scores are above 600 (it is also true that at least
22% are 600 or above).
A percentile gives you an idea of your relative standing in
a group.
Two questions:
• Find the percentile of a given value
• Find the value of a given percentile
RELATIVE STANDING: PERCENTILES
FIND PERCENTILE OF A GIVEN VALUE

The percentile corresponding to a given value (X)
is computed by using the formula:
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$$
RELATIVE STANDING: PERCENTILES
FIND PERCENTILE OF A GIVEN VALUE






A teacher gives a 20-point test to 10 students.
Scores are as follows: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Find the percentile rank of the score of 12.
Ordered set of scores: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
There are 6 values below 12: 2, 3, 5, 6, 8, 10
Percentile = [(6 + 0.5)/10](100%) = 65th percentile.
Student did better than 65% of the class.
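The formula can be sketched in Python as below. The function name percentile_rank is a made-up name for illustration.

```python
def percentile_rank(values, x):
    """Percentile of x: (number of values below x + 0.5),
    divided by the total number of values, times 100."""
    below = sum(1 for v in values if v < x)
    return 100 * (below + 0.5) / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(percentile_rank(scores, 12))  # 65.0
```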
RELATIVE STANDING: PERCENTILES
FIND VALUE OF A GIVEN PERCENTILE





Procedure: Let p be the percentile and n the sample size.
Step 1: Arrange the data in ascending order.
Step 2: Compute c = (np)/100.
Step 3: If c is not a whole number, round up to the next
whole number. If c is a whole number, use the value
halfway between the c-th and (c+1)-th values.
Step 4: The value found in Step 3 (the c-th value after rounding
up, or the halfway value) is the required percentile.
RELATIVE STANDING: PERCENTILES
FIND VALUE OF A GIVEN PERCENTILE






Example: Consider the data set 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Note: the data set is already ordered.
Find the value of the 25th percentile.
n = 10, p = 25, so c = (10 × 25)/100 = 2.5. Hence round up to
c = 3. Thus, the value of the 25th percentile is the 3rd value,
X = 5.
Find the value of the 80th percentile.
n = 10, p = 80, so c = (10 × 80)/100 = 8. Thus the value of
the 80th percentile is the average of the 8th and 9th values.
Thus, the 80th percentile for the data set is (15 + 18)/2 =
16.5.
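The four-step procedure can be sketched in Python as follows. This assumes p is a whole number with 0 < p < 100; the function name percentile_value is chosen here for illustration. Note that statistical software often uses other interpolation rules, so library results may differ slightly from this classroom method.

```python
def percentile_value(values, p):
    """Value of the p-th percentile by the lecture's procedure.

    Assumes p is an integer with 0 < p < 100.
    """
    ordered = sorted(values)                    # Step 1
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)       # Step 2: c = np/100
    if remainder != 0:
        # Step 3a: c is not whole, so round up to position whole + 1
        # (list index `whole`, because Python lists start at 0).
        return ordered[whole]
    # Step 3b: c is whole, so average the c-th and (c+1)-th values.
    return (ordered[whole - 1] + ordered[whole]) / 2

data = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]

print(percentile_value(data, 25))  # 5
print(percentile_value(data, 80))  # 16.5
```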
RELATIVE STANDING: PERCENTILES
DECILES AND QUARTILES





Deciles divide the data set into 10 groups.
Deciles are denoted by D1, D2, …, D9 with the
corresponding percentiles being P10, P20, …, P90
Quartiles divide the data set into 4 groups.
Quartiles are denoted by Q1, Q2, and Q3 with the
corresponding percentiles being P25, P50, and P75.
The median is the same as P50 or Q2.
RELATIVE STANDING: PERCENTILES
INTERQUARTILE RANGE AND OUTLIERS







An outlier is an extremely high or an extremely low data
value when compared with the rest of the data values.
The Interquartile Range, IQR = Q3 – Q1.
To determine whether a data value can be considered as
an outlier:
Step 1: Compute Q1 and Q3.
Step 2: Find the IQR = Q3 – Q1.
Step 3: Compute (1.5)(IQR).
Step 4: Compute Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR).
RELATIVE STANDING: PERCENTILES
INTERQUARTILE RANGE AND OUTLIERS



To determine whether a data value can be considered as
an outlier:
Step 5: Compare the data value (say X) with Q1 – (1.5)(IQR)
and Q3 + (1.5)(IQR).
If X < Q1 – (1.5)(IQR) or
if X > Q3 + (1.5)(IQR), then X is considered an outlier.
RELATIVE STANDING: PERCENTILES
INTERQUARTILE RANGE AND OUTLIERS





Given the data set 5, 6, 12, 13, 15, 18, 22, 50, can the
value of 50 be considered as an outlier?
Q1 = 9, Q3 = 20, IQR = 11. Verify.
(1.5)(IQR) = (1.5)(11) = 16.5.
9 – 16.5 = – 7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range – 7.5 to 36.5,
hence 50 is an outlier.
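A small Python sketch of Steps 2 to 5, taking Q1 = 9 and Q3 = 20 as given above. The function name outlier_fences is a made-up name for illustration.

```python
def outlier_fences(q1, q3):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

data = [5, 6, 12, 13, 15, 18, 22, 50]

lower, upper = outlier_fences(9, 20)

# A value is considered an outlier if it falls outside the fences.
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # -7.5 36.5
print(outliers)      # [50]
```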
RELATIVE STANDING
BOX PLOTS

When the data set contains a small number of values, a
box plot is used to graphically represent the data set.
These plots involve five values:
• the minimum value (S)
• the lower quartile (Q1)
• the median (Q2)
• the upper quartile (Q3)
• the maximum value (L)
RELATIVE STANDING: BOX PLOTS
EXAMPLE

Example: Construct a box plot with the following data which
shows the assets of the 15 largest North American banks,
rounded off to the nearest hundred million dollars: 111,
135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98
RELATIVE STANDING: BOX PLOTS
RANKING AND SUMMARIZING
Rank   Data
1      217
2      135
3      111
4      108
5      98
6      98
7      93
8      85
9      75
10     75
11     65
12     64
13     57
14     56
15     51

Smallest = 51
Q1 = 64
Median = 85
Q3 = 108
Largest = 217
IQR = 44
Outliers = 217
[Figure: Box Plot. Horizontal axis: Assets (in 100 million dollars), from 0 to 250.]
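The summary values for the bank data can be checked with a short Python sketch. It assumes the percentile procedure from this lecture (round c = np/100 up when it is not whole, average two neighbors when it is) and the 1.5 × IQR rule for outliers; the helper name percentile_value is chosen here for illustration.

```python
def percentile_value(values, p):
    """p-th percentile by the lecture's procedure (integer p, 0 < p < 100)."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder != 0:
        return ordered[whole]  # position whole + 1, counting from 1
    return (ordered[whole - 1] + ordered[whole]) / 2

assets = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]

summary = {
    "smallest": min(assets),
    "Q1": percentile_value(assets, 25),
    "median": percentile_value(assets, 50),
    "Q3": percentile_value(assets, 75),
    "largest": max(assets),
}
iqr = summary["Q3"] - summary["Q1"]

# Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers.
lower = summary["Q1"] - 1.5 * iqr
upper = summary["Q3"] + 1.5 * iqr
outliers = [x for x in assets if x < lower or x > upper]

print(summary)
print(iqr, outliers)  # 44 [217]
```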
RELATIVE STANDING: BOX PLOTS
INTERPRETATION






If the median is near the center of the box, the
distribution is approximately symmetric.
If the median falls to the left of the center of the box, the
distribution is positively skewed.
If the median falls to the right of the center of the box, the
distribution is negatively skewed.
If the lines are about the same length, the distribution is
approximately symmetric.
If the line segment to the right of the box is larger than
the one to the left, the distribution is positively skewed.
If the line segment to the left of the box is larger than the
one to the right, the distribution is negatively skewed.
SYMMETRIC BOX PLOT
[Figure: symmetric box plot. Horizontal axis: Number of units sold, from 0 to 300.]
POSITIVELY SKEWED BOX PLOT
[Figure: positively skewed box plot. Horizontal axis: Number of units sold, from 0 to 300.]