Statistics
Introduction:
In many real-life situations, it is helpful to describe data by a single
number that is most representative of the entire collection of numbers. Three ways of
characterizing any data distribution are:

Measures of Central Tendency. Describe the center point of a data set with a
single value.

Measures of Dispersion. Describe how far individual data values have strayed from
the mean.

You need to know that some measures of central tendency and variability are inappropriate
for qualitative variables.
Mean:
The mean (or average) of a set of data values is the sum of all of the data values
divided by the number of data values.
\[ \text{Mean} = \frac{\text{Sum of all data values}}{\text{Number of data values}} \]
Mean: \[ \bar{x} = \frac{\sum x_i f_i}{N} \]
Where:
\(\bar{x}\) ('x bar') is the mean of the set of x values,
\(\sum x_i f_i\) is the sum of all the \(x_i f_i\) values, and N is the number of data values in the population.
A fruit-seller has the following daily sales (in $) for five consecutive days:
100, 120, 125, 100, 130
Determine his average daily sales.
Mean = (100 + 120 + 125 + 100 + 130) / 5 = 575 / 5 = 115
Thus, the average daily sale of the fruit-seller is $115.
We calculate the statistical mean of a list of numbers in order to find the general tendency
of the numbers in the list.
Find the mean number of minutes per day spent on Facebook: 75, 36, 0, 94, 56
Solution: Mean = (75 + 36 + 0 + 94 + 56) / 5 = 261 / 5 = 52.2 minutes
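The two worked examples above can be checked with a few lines of Python. This is only a minimal sketch: the function names `mean` and `mean_from_frequencies` are illustrative choices, not part of the original notes. The second function follows the frequency form of the formula, sum of x_i times f_i divided by N.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def mean_from_frequencies(values, frequencies):
    """Mean of a frequency table: sum(x_i * f_i) / N, with N = sum(f_i)."""
    n = sum(frequencies)
    return sum(x * f for x, f in zip(values, frequencies)) / n


daily_sales = [100, 120, 125, 100, 130]
facebook_minutes = [75, 36, 0, 94, 56]

sales_mean = mean(daily_sales)
facebook_mean = mean(facebook_minutes)

# The same sales data written as distinct values with their frequencies
# (100 occurs twice, the other values once each).
sales_mean_freq = mean_from_frequencies([100, 120, 125, 130], [2, 1, 1, 1])

print(sales_mean, facebook_mean, sales_mean_freq)
```

Both forms give the same result for the sales data, since the frequency table is just a compact way of writing the same list.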
Characteristics of the mean:

Every value in the distribution contributes to the value of the mean.

The mean is very sensitive to extreme scores. An extreme score can pull the mean in
one or the other direction and make it less representative of the set of scores and less
useful as a measure of central tendency.

The arithmetic mean is affected by a change of both origin and scale.

Its value may not actually exist in the data (e.g., for the data set 2,3,4 and 5; the
mean is 3.5).
Remember that the word average means only the one measure that best represents a set of
scores, and that there are many different types of averages. Which type of average you use
depends on the question that you are asking and the type of data you are trying to
summarize.
Exercises
The heights (cm) of the students in a class are:
195, 192, 192, 150, 174, 186, 159, 156, 189, 168, 156, 168, 150, 186, 192, 162,
183, 174, 189, 159
Their mean height is: ____ cm
Median:
The median is the middle value when the data is arranged in order of size.
In other words, the median divides the whole set of values in two parts such that half of the
observations are less than or equal to it and half are more than or equal to it.
Find the median of the following set of data: 2, 3, 5, 3, 4, 3, 6
Step 1. Rewrite the numbers in ascending order: 2, 3, 3, 3, 4, 5, 6
Step 2. There are 7 values in the data set. The median is the fourth value.
The median is 3.

If the total number of given values, n, is an odd number, then there exists only one middlemost value, namely the \((n+1)/2\)th value in the arrangement, and it represents the median of the values.
Find the median for the following set of values:
2 4 3 5 1 4 5 3 3
Step 1. Rank the data in ascending order as follows:
1 2 3 3 3 4 4 5 5
Step 2. Because the number of values in this set is odd (nine), there are four values before and four values after the median in the ranked list. Therefore, the median is the fifth value, 3.
If the total number of given values, n, is an even number, the median may not be uniquely determined. In fact, any possible value between the two middle values, namely the \((n/2)\)th and \((n/2 + 1)\)th values in the ordered arrangement, may be taken as the median. But in order to obtain a definite value, the arithmetic mean of the \((n/2)\)th and the \((n/2 + 1)\)th values is regarded as the median of the set of values, by convention.
Find the median for the following set of values:
0 2 3 5 4 4 5 3
Step 1. Rank the data in ascending order as follows:
0 2 3 3 4 4 5 5
Step 2. Because the number of values in this set is even (eight), the median is the midpoint between the fourth and the fifth values, 3 and 4: (3 + 4) / 2 = 3.5.
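The odd and even rules above can be sketched in Python as follows. The function name `median` is an illustrative choice; the even case uses the convention described in the text, the arithmetic mean of the two middle values. The last call uses the ranked list from Step 1 of the preceding example.

```python
def median(values):
    """Median by the rules in the text: sort, then take the middle value
    (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value, which is index n // 2 from zero.
        return ordered[middle]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th values.
    return (ordered[middle - 1] + ordered[middle]) / 2


median_seven = median([2, 3, 5, 3, 4, 3, 6])
median_nine = median([2, 4, 3, 5, 1, 4, 5, 3, 3])
median_eight = median([0, 2, 3, 3, 4, 4, 5, 5])

print(median_seven, median_nine, median_eight)
```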
The median for grouped data is slightly more difficult to compute. We know that the median occurs in the particular class interval in which the cumulative frequency reaches N/2, where N is the total number of observations. On observing the cumulative frequencies (of the less-than type, say), we can obtain the class interval that contains the median: it is the first interval whose cumulative frequency is more than or equal to N/2.
Marks     Number of students
0-10      2
10-20     12
20-30     22
30-40     8
40-50     6
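For the marks table above, the median class can be located by accumulating frequencies until the running total reaches N/2. The sketch below does that. The second function goes one step further and applies the usual textbook interpolation inside the median class, lower limit + (N/2 - cumulative frequency before the class) / class frequency × class width. That formula is not stated in these notes, so treat it as an assumption; the function names are also illustrative.

```python
# (lower limit, upper limit, frequency) for each class of the marks table.
marks_table = [
    (0, 10, 2),
    (10, 20, 12),
    (20, 30, 22),
    (30, 40, 8),
    (40, 50, 6),
]


def median_class(table):
    """Return (lower, upper, frequency, cumulative frequency before the class)
    for the first class whose cumulative frequency reaches N / 2."""
    total = sum(freq for _, _, freq in table)
    half = total / 2
    cumulative = 0
    for lower, upper, freq in table:
        if cumulative + freq >= half:
            return lower, upper, freq, cumulative
        cumulative += freq
    raise ValueError("the table has no observations")


def interpolated_median(table):
    """Assumed textbook interpolation inside the median class."""
    total = sum(freq for _, _, freq in table)
    lower, upper, freq, before = median_class(table)
    return lower + (total / 2 - before) / freq * (upper - lower)


lower, upper, freq, before = median_class(marks_table)
grouped_median = interpolated_median(marks_table)
print(f"median class: {lower}-{upper}, interpolated median: {grouped_median}")
```

Here N = 50, so N/2 = 25. The cumulative frequencies are 2, 14, 36, 44 and 50, and the first one to reach 25 belongs to the class 20-30.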
Advantages:

It is very easy to calculate.

The median is unaffected by extreme scores.
Disadvantages:

It may not correspond to any observed value (e.g., for the data set 2,3,4 and 5; the
median is 3.5)

Cannot be manipulated algebraically.
Exercise: Find the median for the following set of scores:
1, 8, 10, 8, 4, 10, 6, 3, 7, 3, 5, 5, 6, 1, 3, 10, 0, 7, 9
Median: ____
Mode:
The mode of a set of data is the value or values which occur most often.
Steps to determine the mode:
Step 1. Count the number of times each value in a set occurs.
Step 2. Apply the following rules:

If one value occurs more times than any other, it is the mode.

If two or more values tie for the most occurrences, they are all modes of the set.

If all values occur the same number of times, there is no mode.

Find the mode of: 2, 3, 4, 4, 2, 3, 4
Number 2 occurs 2 times,
Number 3 occurs 2 times,
Number 4 occurs 3 times,
So the number with most occurrences is 4 and is the Mode of this distribution.
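The counting procedure can be sketched with `collections.Counter` from the Python standard library. The function name `modes` is illustrative. Returning an empty list for "no mode" is a design choice made here, and so is treating a list with a single distinct value as having that value as its mode.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list when there is no mode."""
    if not values:
        return []
    counts = Counter(values)          # Step 1: count each value
    highest = max(counts.values())
    # Step 2: if several distinct values all occur equally often, no mode.
    # (Assumption: a list with one distinct value has that value as mode.)
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Otherwise every value that reaches the highest count is a mode.
    return sorted(v for v, c in counts.items() if c == highest)


example_modes = modes([2, 3, 4, 4, 2, 3, 4])
two_modes = modes([1, 1, 2, 2, 3])
no_mode = modes([1, 2, 3])

print(example_modes, two_modes, no_mode)
```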
Another method for determining the mode is to use the empirical relation between mean, median and mode, which is found to hold for unimodal distributions that do not deviate much from symmetry. The relation is:
Mean - Mode ≈ 3 (Mean - Median), that is, Mode ≈ 3 Median - 2 Mean
Mode for grouped data.

In the computation of the value of the mode for grouped data, it is necessary to
identify the class interval that contains the mode. This interval, called the modal class,
contains the highest frequency in the distribution.
This table shows the monthly income of different families in a particular locality. Find the income class earned by the largest number of families.
Income       Families
1000-2000    10
2000-3000    14
3000-4000    10
4000-5000    12
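Finding the modal class for the income table above amounts to picking the row with the highest frequency. A minimal sketch, with the table typed in as a list of (class label, frequency) pairs; the variable names are illustrative:

```python
# (income class, number of families) taken from the table above.
income_table = [
    ("1000-2000", 10),
    ("2000-3000", 14),
    ("3000-4000", 10),
    ("4000-5000", 12),
]

# The modal class is the class with the highest frequency.
# max() returns the first such row if several rows tie.
modal_class, modal_frequency = max(income_table, key=lambda row: row[1])

print(modal_class, modal_frequency)
```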
The modal class is 2000-3000.
Advantages:

It is applicable to nominal data.

It is unaffected by extreme values.
Disadvantages:

It may not be unique in a set of data.

It cannot be manipulated using the rules of algebra.
Exercises
1.-Find the mode of the following scores:
0, 10, 1, 0, 4, 0, 0, 4, 0, 9, 0, 6, 8, 4, 1, 2, 7, 6, 9, 8, 10, 0, 4, 5, 3
Mode:
2.-A farmer has 50 chickens. After weighing them all he got the following amounts (in
grams):
1800 2700 3000 2500 2900 1900 3000 3400 2900 3400 2300 1500 3100 1500 1800 1900
2700 2600 2400 2200 3500 2100 1700 2000 2500 2900 2700 1700 2700 3100 1600 3100
2000 3200 1800 1800 3200 2000 3000 1900 2500 2400 3500 3200 1500 2100 1900 2000
1800 1600
Find Median:
Range:
The range is the difference between the highest value and the lowest value of the given set of observations:
Range = maximum value - minimum value.
The heights of a sample of five people are 180, 183, 190, 179 and 180 cm. Find the range.
Maximum value = 190
Minimum value = 179
Range = 190 - 179 = 11
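The same calculation in Python, as a small sketch. The function name `value_range` is illustrative, and the extra height of 210 cm in the second call is a made-up value added only to show how a single extreme observation changes the result.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


heights_cm = [180, 183, 190, 179, 180]
height_range = value_range(heights_cm)

# Hypothetical extra observation, not from the original example.
range_with_extreme = value_range(heights_cm + [210])

print(height_range, range_with_extreme)
```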
Properties.

It is easy to understand.

It is simple to calculate.

It does not depend on all observations, and is based on only the largest and the smallest among them.

It is highly affected by extreme values.

It does not take into account the form of the distribution.
Mean Deviation and its Coefficient:
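The mean deviation (also called average deviation) of a set of N numbers X1, X2, ..., XN is abbreviated by MD and is defined by
\[ MD = \frac{\sum_{j=1}^{N} |X_j - \bar{X}|}{N} \]
Where \(\bar{X}\) is the arithmetic mean of the numbers and \(|X_j - \bar{X}|\) is the absolute value of the deviation of \(X_j\) from \(\bar{X}\).
Find the mean deviation of the set 2, 3, 4, 5, 6.
The mean is 4, so MD = (2 + 1 + 0 + 1 + 2) / 5 = 6 / 5 = 1.2.
Properties: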
The mean deviation is based on all the observations.

Shows the dispersion of values around the measure of central tendency.

It is easy to compute.

The average deviation from the mean is always zero in any data set, because the positive and negative deviations cancel. The MD avoids this problem by using absolute values to eliminate negative signs.

The mean deviation is a better measure of absolute dispersion than the range and the
quartile deviation.
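The mean deviation of the set 2, 3, 4, 5, 6 from the example above can be checked with the sketch below, which also shows why absolute values are needed: the signed deviations from the mean add up to zero. The function name `mean_deviation` is an illustrative choice.

```python
def mean_deviation(values):
    """MD = sum(|X_j - mean|) / N."""
    n = len(values)
    centre = sum(values) / n
    return sum(abs(x - centre) for x in values) / n


data = [2, 3, 4, 5, 6]
data_mean = sum(data) / len(data)

# Signed deviations cancel out, so their sum carries no information.
signed_sum = sum(x - data_mean for x in data)

md = mean_deviation(data)
print(data_mean, signed_sum, md)
```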

Variance:
The variance is a numerical index describing the dispersion of a set of scores
around the mean of the distribution. The variance is calculated as the average of the
squared deviations from the mean.
Formula for variance:
\[ s^2 = \frac{\sum x_i^2 f_i}{N} - \bar{x}^2 \]
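A couple has six children whose ages are 6, 8, 10, 12, 14 and 16. Find the variance in ages.
Solution: The mean age is (6 + 8 + 10 + 12 + 14 + 16) / 6 = 11. The sum of the squared ages is 796, so s² = 796 / 6 - 11² ≈ 132.67 - 121 = 11.67.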
The following table gives the frequency distribution of the number of computers sold per week during the past 12 weeks at a computer store.
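Computers sold    Frequency (f)
[0-4)             2
[4-8)             3
[8-12)            4
[12-16)           2
[16-20)           1
Calculate the variance.
Solution: Using the class midpoints 2, 6, 10, 14 and 18 as the x values, the frequencies sum to N = 12 and the mean is 108 / 12 = 9, so
S² = 1232 / 12 - 9² ≈ 21.67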
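Both variance examples can be reproduced with a short sketch based on the formula s² = (sum of x_i² f_i) / N - (mean)². The function name `variance` is illustrative. For the grouped table, each class is represented by its midpoint, which is the usual convention but is an assumption here, since the notes do not say which representative value to use.

```python
def variance(values, frequencies=None):
    """Population variance: sum(x_i^2 * f_i) / N - mean^2."""
    if frequencies is None:
        frequencies = [1] * len(values)   # ungrouped data: every f_i is 1
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    mean_of_squares = sum(x * x * f for x, f in zip(values, frequencies)) / n
    return mean_of_squares - mean ** 2


ages = [6, 8, 10, 12, 14, 16]
ages_variance = variance(ages)

# Class midpoints of [0-4), [4-8), [8-12), [12-16), [16-20) (assumed convention).
midpoints = [2, 6, 10, 14, 18]
frequencies = [2, 3, 4, 2, 1]
computers_variance = variance(midpoints, frequencies)

print(round(ages_variance, 2), round(computers_variance, 2))
```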
Standard deviation:
The standard deviation is simply the square root of the variance and gives the spread of the sample or population about the mean.
That is,
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum x_i^2 f_i}{N} - \bar{x}^2} \]
The standard deviation plays a dominating role for the study of variation in the data. It is a
very widely used measure of dispersion. As far as the important statistical tools are
concerned, the first important tool is the mean and the second important tool is the
standard deviation.
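A couple has six children whose ages are 6, 8, 10, 12, 14 and 16. Find the standard deviation in ages.
1. The population mean is: (6 + 8 + 10 + 12 + 14 + 16) / 6 = 11
2. The population variance is: 796 / 6 - 11² ≈ 11.67
3. Find the positive square root of the variance: √11.67 ≈ 3.42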
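The three steps for the ages example can be followed line by line in Python. This sketch computes the variance as the average squared deviation from the mean and then cross-checks the result against `statistics.pstdev`, the population standard deviation function in the Python standard library. Variable names are illustrative.

```python
import math
import statistics

ages = [6, 8, 10, 12, 14, 16]

# Step 1: population mean.
population_mean = sum(ages) / len(ages)

# Step 2: population variance, the average squared deviation from the mean.
population_variance = sum((x - population_mean) ** 2 for x in ages) / len(ages)

# Step 3: positive square root of the variance.
population_sd = math.sqrt(population_variance)

# Cross-check with the standard library.
library_sd = statistics.pstdev(ages)

print(population_mean, round(population_variance, 2), round(population_sd, 2))
```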
Properties:

The standard deviation is in the same units as the units of the original observations.

The standard deviation is independent of a change of origin but not of scale.
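The statements about change of origin and scale, both for the standard deviation here and for the mean earlier in the notes, can be illustrated numerically. This is a demonstration on one small data set rather than a proof. The shift of 10 and the scale factor of 3 are arbitrary values chosen for illustration.

```python
import statistics

scores = [2, 3, 4, 5, 6]
shift = 10   # hypothetical change of origin: add a constant
scale = 3    # hypothetical change of scale: multiply by a constant

shifted = [x + shift for x in scores]
scaled = [x * scale for x in scores]

mean_original = statistics.mean(scores)
mean_shifted = statistics.mean(shifted)   # moves by the shift
mean_scaled = statistics.mean(scaled)     # multiplied by the scale

sd_original = statistics.pstdev(scores)
sd_shifted = statistics.pstdev(shifted)   # unchanged by the shift
sd_scaled = statistics.pstdev(scaled)     # multiplied by the scale

print(mean_original, mean_shifted, mean_scaled)
print(sd_original, sd_shifted, sd_scaled)
```

Adding a constant moves the mean but leaves the standard deviation alone, while multiplying by a constant changes both by that factor.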
Coefficient of variation:
The coefficient of variation (symbol CV), also referred to as the relative standard deviation, is defined as the ratio of the standard deviation to the mean of the data set. It is used to express the standard deviation as a percentage of the mean.
Mathematically, the coefficient of variation is calculated using the following equation:
Sample: \[ CV = \frac{s}{\bar{x}} \times 100\% \]
The coefficient of variation is especially useful when comparing data sets that have different units, because the coefficient of variation is a dimensionless number.
So when comparing between data sets with different units or widely different means, one
should use the coefficient of variation for comparison instead of the standard deviation.

A national sampling of prices for new and used houses found that the mean price for a new house is $120,000 and the standard deviation is $6100 and that the mean price for a used house is $50,000 with a standard deviation equal to $3150. In terms of absolute deviation, the standard deviation of price for new houses is nearly twice that of used houses. However, in terms of relative variation, there is more relative variation in the price of used houses than in new houses.
The CV for used houses is (3150 / 50,000) × 100% = 6.3%
The CV for new houses is (6100 / 120,000) × 100% ≈ 5.08%
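The house-price comparison can be reproduced with a small helper. The function name `coefficient_of_variation` is illustrative, and raising an error for a zero mean is a design choice made here, since the ratio is undefined in that case.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100


cv_used = coefficient_of_variation(3150, 50_000)
cv_new = coefficient_of_variation(6100, 120_000)

print(round(cv_used, 2), round(cv_new, 2))
print("used houses vary more" if cv_used > cv_new else "new houses vary more")
```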
Properties:

When the mean value is near zero, the coefficient of variation is sensitive to small
changes in the mean, limiting its usefulness.

The coefficient of variation is independent of change of scale but not of origin
Exercises
The mean and standard deviation of the heights of a group of teenagers are found to be 138 and 1.25 cm, while the same measures for their parents are 189 and 7.9 cm.
The coefficient of variation of the teenagers is: ____
The coefficient of variation of their parents is: ____
There is more variability among the: teenagers / parents