LESSON TWO : DESCRIPTIVE STATISTICS
2.1 Summary statistics for raw data: mean, quartiles
and mode
The mean or average value for a set of raw data
To describe the distribution of a characteristic in a sample or in
a population we can use frequency distributions and their
graphical representations. However, it is often necessary
to describe the data numerically, for example by a
value that is typical of the bulk of the data. Such a figure,
calculated from the data, is called a summary statistic. It
is usually a mean (or average value), a median or a mode.
Formula of the mean value (average) for raw data (only for
quantitative variables):
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Properties of the average.
1) The sum of the values of a variable taken by a set of
statistical units is equal to the average $\mu$ multiplied by the
number of units N:
$$\sum_{i=1}^{N} x_i = \mu N$$
2) The sum of the signed differences between the values
taken by the variable and the average is zero: positive
and negative deviations cancel. (This does not hold for
the differences taken in absolute value.)
3) The sum of the squared differences between the
values taken by the variable and a constant c is a
minimum when c is the average.
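The three properties can be checked numerically. The sketch below is only an illustration: it uses the ages of tourist group Q from section 2.3, and the names `mean` and `sum_sq` are ours, not the course text's.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their number N."""
    return sum(values) / len(values)

ages = [2, 88, 76, 31, 64, 39, 50]  # tourist group Q from section 2.3
mu = mean(ages)

# Property 1: the sum of the values equals the mean times N.
total_from_mean = mu * len(ages)

# Property 2: the signed deviations from the mean sum to zero.
signed_sum = sum(x - mu for x in ages)

# Property 3: the sum of squared deviations from a constant c
# is smallest when c is the mean.
def sum_sq(c):
    return sum((x - c) ** 2 for x in ages)
```

Here `mu` is 50, and `sum_sq(50)` is smaller than `sum_sq(49)` or `sum_sq(51)`.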
The median, quartiles and percentiles
The median
For distributions that are asymmetric or contain extreme
values (outliers), it is better to use the median as the measure
of central tendency. The median can also be calculated for
ordinal variables.
The median can be computed only if the variable is at least
ordinal, in the sense that it is possible to rank the
values of the variable in increasing or decreasing order, and
consequently the corresponding statistical units. Once this
is done, it is possible to evaluate the position of each unit
within that ordered array.
The median or second quartile, Q2, is the value of the
middle item in an ordered set of data. Q2 is the value of the
item that is at or nearest the 0.5(N+1)th position in the
ordered data.
The lower quartile Q1 is the item whose position in the
ordered list is at or nearest to the 0.25(N+1)th.
The upper quartile Q3 is the item whose position in the
ordered list is at or nearest to the 0.75(N+1)th.
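A minimal sketch of this position rule, with quartile m placed at position m(N+1)/4 in the ordered data. The function name `quartile` is ours. One assumption goes beyond the text: when the position falls exactly halfway between two items, the sketch averages them, which is the usual convention for the median of an even number of values.

```python
import math

def quartile(values, m):
    """Quartile m (1, 2 or 3) using the position rule m * (N + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no data")
    pos = m * (n + 1) / 4                      # 1-based, may be fractional
    below = min(max(math.floor(pos), 1), n)    # neighbouring positions,
    above = min(max(math.ceil(pos), 1), n)     # clamped to 1..N
    frac = pos - math.floor(pos)
    if frac == 0.5:
        # Assumption: exactly halfway, so average the two neighbours.
        return (ordered[below - 1] + ordered[above - 1]) / 2
    nearest = below if frac < 0.5 else above   # item at or nearest the position
    return ordered[nearest - 1]

ages = [2, 88, 76, 31, 64, 39, 50]  # tourist group Q from section 2.3
q1, q2, q3 = (quartile(ages, m) for m in (1, 2, 3))
```

For these seven ages the positions are 2, 4 and 6, so the quartiles are 31, 50 and 76.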
Property of the median.
For a quantitative variable X, the sum of the absolute
differences between the values xi and a constant c is a
minimum when c is equal to the median.
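This property can be illustrated numerically. The sketch uses the ages of tourist group Q from section 2.3 and a helper name, `sum_abs_dev`, of our own choosing. It checks a range of candidate constants; it does not prove the property.

```python
from statistics import median

def sum_abs_dev(values, c):
    """Sum of the absolute differences between each value and the constant c."""
    return sum(abs(x - c) for x in values)

ages = [2, 88, 76, 31, 64, 39, 50]  # tourist group Q from section 2.3
med = median(ages)
at_median = sum_abs_dev(ages, med)

# Compare with every whole-number candidate from 0 to 100.
best_c = min(range(0, 101), key=lambda c: sum_abs_dev(ages, c))
```

Among the candidates tried, the smallest sum is reached at `best_c`, which equals the median, 50.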
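Calculating the median for an odd and an even number of data
values: when N is odd the median is the single middle item;
when N is even the position (N+1)/2 falls halfway between the
two middle items, and their average is normally taken.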
Percentiles
Percentiles divide the ordered data into hundredths, so the
25th, 50th and 75th percentiles are the quartiles Q1, Q2 and Q3.
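As a rough illustration, Python's standard `statistics.quantiles` function can cut ordered data into hundredths. With `method="exclusive"` it places the p-th cut point at position p(N+1)/100 and interpolates linearly between neighbouring items. That is one common convention and not necessarily the one used in the course text. The ages are those of tourist group Q from section 2.3.

```python
from statistics import quantiles

ages = [2, 88, 76, 31, 64, 39, 50]  # tourist group Q from section 2.3

# 99 cut points that divide the ordered data into 100 parts.
percentiles = quantiles(ages, n=100, method="exclusive")

p25 = percentiles[24]  # 25th percentile
p50 = percentiles[49]  # 50th percentile (the median)
p75 = percentiles[74]  # 75th percentile
```

For these seven ages the 25th, 50th and 75th percentiles are 31, 50 and 76.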
The mode
The mode is the value that occurs most often.
a) There may be no mode, or there may be several.
b) The mode may be a value near the beginning, middle or
end of the data. In other words, it can be anywhere and
therefore may not be representative.
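Both points can be seen in a small sketch. The function name `modes` is ours, and it assumes that data in which no value repeats has no mode; other conventions exist.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), sorted; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Assumption: if no value repeats, the data has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)

no_mode = modes([48, 50, 52, 51, 49])   # every value occurs once
two_modes = modes([1, 2, 2, 3, 3])      # 2 and 3 both occur twice
edge_mode = modes([1, 1, 2, 9])         # the mode sits at the low end
```

The three calls give no mode, two modes, and a mode at the low end of the data.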
See exercise 2.1 pages 68-70 Bradley.
2.2 Summary statistics for grouped data: mean,
quartiles and mode
Grouping assumptions: It is assumed that the values within
each interval are spread uniformly between the lowest and
highest value in the interval. Hence the average value of the
data in any interval is the mid-interval value, which is used to
represent the group numerically.
Calculation of the mean for grouped data
$$\mu = \frac{\sum_i f_i x_i}{N}$$
where i is the ordered position of the interval, $x_i$ is the mid
value of the interval, $f_i$ is the frequency of the interval and
$N = \sum_i f_i$ is the total number of observations.
See example 2.2 pages 74-75 Bradley.
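A minimal sketch of the grouped mean, with each interval represented by its mid value. The frequency table is hypothetical and is not taken from Bradley; the function name `grouped_mean` is ours.

```python
def grouped_mean(intervals, freqs):
    """Mean of grouped data: sum of f_i * x_i over N, with x_i the mid value."""
    mids = [(low + high) / 2 for low, high in intervals]
    n = sum(freqs)                      # total number of observations
    return sum(f * x for f, x in zip(freqs, mids)) / n

# Hypothetical frequency table (lower limit, upper limit) -> frequency.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [2, 5, 8, 5]

mu = grouped_mean(intervals, freqs)
```

The mid values are 5, 15, 25 and 35, so the mean is (10 + 75 + 200 + 175) / 20 = 23.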
Quartiles, percentiles for grouped data
When the data are sorted into a frequency table, they are
ordered from the lowest to the highest values in blocks or
intervals. Approximate estimates of the quartiles are made
by identifying the intervals containing the items that are
25%, 50% and 75% of the way through the ordered data.
A more accurate estimate is obtained using the formula:
$$Q_m = L_{Q_m} + \frac{mN/4 - cf}{f_{Q_m}} \, w$$
where m = 1, 2, 3 is the quartile number, N is the total frequency,
$L_{Q_m}$ is the lower limit of the interval containing $Q_m$,
$f_{Q_m}$ is the frequency of the interval containing $Q_m$,
w is the width of the interval containing $Q_m$, and
cf is the cumulative frequency up to, but not including, the
$Q_m$ interval.
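The formula can be sketched as follows. The frequency table is hypothetical and the function name `grouped_quartile` is ours. The sketch takes the quartile interval to be the first one whose cumulative frequency reaches mN/4, which is an assumption about how boundary cases are resolved.

```python
def grouped_quartile(intervals, freqs, m):
    """Quartile m (1, 2 or 3) of grouped data by interpolation within an interval."""
    n = sum(freqs)
    target = m * n / 4                  # the item mN/4 of the way through the data
    cf = 0                              # cumulative frequency before the interval
    for (lower, upper), f in zip(intervals, freqs):
        if f > 0 and cf + f >= target:
            width = upper - lower
            return lower + (target - cf) * width / f
        cf += f
    raise ValueError("no observations")

# Hypothetical frequency table (lower limit, upper limit) -> frequency.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [2, 5, 8, 5]

q1 = grouped_quartile(intervals, freqs, 1)
q2 = grouped_quartile(intervals, freqs, 2)
q3 = grouped_quartile(intervals, freqs, 3)
```

With N = 20, the first quartile falls in the interval 10 to 20 with cf = 2, giving 10 + (5 - 2) × 10 / 5 = 16. The median is 23.75 and the third quartile is 30.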
Weighted average
$$\frac{\sum_i w_i x_i}{\sum_i w_i}$$
The $w_i$ are called the weights and reflect the relative importance
of each $x_i$.
The mean for grouped data is a weighted mean where the
weights are the frequencies.
Examples: in a statistics examination the overall mark is
calculated as follows: 50% for the written paper, 30% for
the practical examination and 20% for homework.
Consumer price index.
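The examination example can be sketched as a weighted average. The weights 50, 30 and 20 come from the text; the three marks are hypothetical, and the function name `weighted_average` is ours.

```python
def weighted_average(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

weights = [50, 30, 20]   # written paper, practical examination, homework
marks = [60, 70, 80]     # hypothetical marks for one student

overall = weighted_average(marks, weights)
```

The overall mark is (3000 + 2100 + 1600) / 100 = 67, lower than the simple average of 70 because the written paper carries the largest weight.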
2.3 Measures of dispersion for raw data
Range
Variance
Standard deviation
Semi-interquartile range
Quartile deviation
Examples:
Age of two groups of tourists
P: 48, 50, 52, 51, 49
Q: 2, 88, 76, 31, 64, 39, 50
The range is the difference between the highest and the
lowest value in the data set.
Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
When the variance is calculated from sample data, the
denominator is given by n-1, where n is the sample size.
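The range, variance and standard deviation of the two tourist groups can be computed with a short sketch. The function names are ours; the `sample` flag switches the denominator from N to n-1.

```python
import math

def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)

def variance(values, sample=False):
    """Mean squared deviation from the mean; denominator n - 1 for sample data."""
    n = len(values)
    mu = sum(values) / n
    squared = sum((x - mu) ** 2 for x in values)
    return squared / (n - 1 if sample else n)

def std_dev(values, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))

P = [48, 50, 52, 51, 49]
Q = [2, 88, 76, 31, 64, 39, 50]

range_p, range_q = value_range(P), value_range(Q)
var_p, var_q = variance(P), variance(Q)
sd_p, sd_q = std_dev(P), std_dev(Q)
```

Both groups have a mean age of 50, but the range is 4 for P and 86 for Q, and the variance is 2 for P against roughly 729 for Q.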
The interquartile range is the difference between the upper
and lower quartiles, Q3 - Q1.
The quartile deviation (or semi-interquartile range) is the
interquartile range divided by 2.
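A minimal sketch for tourist group Q, taking the interquartile range as Q3 - Q1 and the quartile deviation as half of it (the usual convention). The function name is ours. To stay short, the sketch only handles data sizes where the quartile positions 0.25(N+1) and 0.75(N+1) are whole numbers, as they are for N = 7.

```python
def iqr_and_quartile_deviation(values):
    """Return (interquartile range, quartile deviation) via the (N + 1) rule."""
    ordered = sorted(values)
    n = len(ordered)
    if (n + 1) % 4 != 0:
        # Limitation of this sketch: positions must be whole numbers.
        raise ValueError("this sketch needs N + 1 to be a multiple of 4")
    q1 = ordered[(n + 1) // 4 - 1]        # item at position 0.25(N+1)
    q3 = ordered[3 * (n + 1) // 4 - 1]    # item at position 0.75(N+1)
    iqr = q3 - q1
    return iqr, iqr / 2

Q = [2, 88, 76, 31, 64, 39, 50]
iqr_q, qd_q = iqr_and_quartile_deviation(Q)
```

For group Q the quartiles are 31 and 76, so the interquartile range is 45 and the quartile deviation is 22.5.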