Practical No : 07
PRACTICAL ON STATISTICS
Objectives :
At the end of the practical, the student should be able to,
1. List the methods of data representation.
2. Explain the Tabular and Diagrammatic methods of data representation.
3. Define Mean, Median and Mode
4. Explain the following measures of dispersion and their method of calculation.
i. Range
ii. Inter-quartile range
iii. Variance
iv. Standard deviation
5. Describe the normal distribution / Gaussian curve
6. Explain hypotheses and hypothesis testing
7. Explain and calculate the standard error of the mean.
The methods of data representation
Data Presentation (3 main methods)
- Tabular Method
    - Cross Tabulation
    - Cumulative Frequency table
- Diagrammatic Method
    - Histogram
    - Frequency polygon
    - Bar chart / Pie chart
- Numerical Method
    - Compute the average : Mean, Median, Mode
    - Compute the measure of variability : Range, Inter-quartile Range, Variance, Standard Deviation
Tabular Method
Population : A batch of medical students at faculty of medical sciences, USJP.
Sample     : A random sample – practical group A
Variable   : Height of students
Number     : 16 (9 male, 7 female)

Height (cm)
Male   : 169, 173, 171, 168, 168, 165, 169, 169, 174
Female : 155, 159, 156, 150, 160, 161, 160

Frequency Table
Variable (x)    Frequency (f)    %
150 - 154       1                6.25
155 - 159       3                18.75
160 - 164       3                18.75
165 - 169       6                37.5
170 - 174       3                18.75
Total           16               100

Cross Tabulation
Height          Male (f)    Male %    Female (f)    Female %
150 - 154       0           0         1             14.29
155 - 159       0           0         3             42.86
160 - 164       0           0         3             42.86
165 - 169       6           66.67     0             0
170 - 174       3           33.33     0             0
Total           9           100       7             100

Cumulative Frequency table
Variable (x)    (f)    Cumulative frequency    %
150 - 154       1      1                       6.25
155 - 159       3      4                       18.75
160 - 164       3      7                       18.75
165 - 169       6      13                      37.5
170 - 174       3      16                      18.75
Total           16                             100
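As a rough illustration of how the frequency and cumulative frequency tables above can be built, the following Python sketch groups the handout's sixteen heights into classes. The class width of 5 cm starting at 150 cm is read off the tables; the function names `class_lower_bound` and `frequency_table` are made up for this example.

```python
from collections import Counter

# Heights (cm) taken from the handout: 9 male and 7 female students.
male = [169, 173, 171, 168, 168, 165, 169, 169, 174]
female = [155, 159, 156, 150, 160, 161, 160]
heights = male + female

def class_lower_bound(value, start=150, width=5):
    """Return the lower bound of the class interval containing value."""
    return start + ((value - start) // width) * width

def frequency_table(values, width=5):
    """Return rows of (class label, frequency, cumulative frequency, percent)."""
    counts = Counter(class_lower_bound(v, width=width) for v in values)
    rows, running = [], 0
    for lower in sorted(counts):
        running += counts[lower]
        label = f"{lower} - {lower + width - 1}"
        percent = round(100 * counts[lower] / len(values), 2)
        rows.append((label, counts[lower], running, percent))
    return rows

table = frequency_table(heights)
for row in table:
    print(row)
```

Classes with no observations are simply omitted by this sketch; a fuller version would list them with a frequency of zero.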
Diagrammatic Method
[Figures: Frequency polygon, Histogram and Bar Chart, each plotting Frequency (f) against Height (cm) for the classes 150 - 154 to 170 - 174; Pie Chart of the sample by sex: Male 56%, Female 44%]
Numerical Method
Computing the Average :
Estimates of the average value are known as measures of central tendency.
These include the Mean, Median and the Mode.
Mean :
When calculating the mean, all the observed values are added up and divided by
their number.
The values need not be arranged in numerical order.
The mean of a group of values, however, is governed by the individual values,
i.e. any outstandingly high or low value makes a big impact on the mean.
$\bar{X}$ ("X bar") $= \frac{\sum X}{n}$
Median :
Represents the middle or central value of a group of observations.
When obtaining the median value, all the observations should first be arranged in
ascending or descending order.
In an odd number of observations, the middle observation can be directly taken as
the median of that sample.
In an even number of observations, the two middle numbers are taken, added and
divided by 2 to obtain the median value.
The median is a more robust measure of central tendency than the mean when the data
contain extreme values, as its value is not affected by any outstandingly high or low
observation within the sample.
Mode :
This is the most frequently occurring value in a set of observations.
Eg. In a set of observations as follows,
150, 155, 156, 159, 160, 160, 161, 165, 168, 168, 169, 169, 169, 170
The mode will be 169, as it is the most frequently occurring value.
There may be more than one mode in a set of observations (with two modes, it is called bimodal).
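The three measures of central tendency can be checked with a short Python sketch using the fourteen observations from the mode example. The extra value of 250 is a made-up outlier, added only to show how much more the mean moves than the median.

```python
import statistics

# Observations from the handout's mode example.
observations = [150, 155, 156, 159, 160, 160, 161,
                165, 168, 168, 169, 169, 169, 170]

mean = sum(observations) / len(observations)   # sum of X divided by n
median = statistics.median(observations)       # average of the two middle values
modes = statistics.multimode(observations)     # list, since there may be several modes

print("Mean  :", round(mean, 2))
print("Median:", median)
print("Mode  :", modes)

# Hypothetical outlier (not in the handout) to compare sensitivity.
with_outlier = observations + [250]
mean_shift = sum(with_outlier) / len(with_outlier) - mean
median_shift = statistics.median(with_outlier) - median
print("Shift in mean  :", round(mean_shift, 2))
print("Shift in median:", median_shift)
```

`statistics.multimode` needs Python 3.8 or later; it returns every value tied for the highest count, which covers the bimodal case.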
Computing the measure of variability :
Range :
Shows the minimum and maximum values in a set of observations.
It thereby establishes the boundaries for that specific set of observations.
Range $= X_{max} - X_{min}$
Interquartile Range :
Is similar to the range; however, the lowest and highest values in an
interquartile range correspond to the 25th and 75th percentile respectively,
when data is arranged in ascending order.
Eg. In the following set of 16 observations,
150, 155, 156, 159, 160, 160, 161, 165, 168, 168, 169, 169, 169, 170, 173, 174
the 4th observation (159) marks the 25% point, the 8th (165) the 50% point and
the 12th (169) the 75% point. The interquartile range therefore runs from 159 to 169.
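A minimal Python sketch of the range and interquartile range for the sixteen observations above. It assumes the simple position rule illustrated in the example (the observation at position n/4 and at position 3n/4, with n a multiple of 4); statistical software often interpolates between observations and may give slightly different quartiles.

```python
# Observations from the handout's interquartile range example.
observations = sorted([150, 155, 156, 159, 160, 160, 161, 165,
                       168, 168, 169, 169, 169, 170, 173, 174])
n = len(observations)

# Range: largest value minus smallest value.
value_range = max(observations) - min(observations)

# Position rule assumed from the example: 25% point at position n/4,
# 75% point at position 3n/4 (positions count from 1, list indexes from 0).
q1 = observations[n // 4 - 1]
q3 = observations[3 * n // 4 - 1]
iqr = q3 - q1

print("Range:", value_range)
print("25th percentile:", q1, " 75th percentile:", q3)
print("Interquartile range:", iqr)
```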
Variance :
This is the Mean Square Deviation in a set of observations.
Variance $= \frac{\sum (X - \bar{X})^2}{n - 1}$
Standard Deviation :
This is the square root of the variance. Because it is expressed in the same units as
the observations, it gives a more directly interpretable indication of the spread of
observations around the mean.
Standard Deviation $= \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
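The calculation of standard deviation is as follows,
$X$      $\bar{X}$    $X - \bar{X}$      $(X - \bar{X})^2$
$X_1$    $\bar{X}$    $X_1 - \bar{X}$    $(X_1 - \bar{X})^2$
$X_2$    $\bar{X}$    $X_2 - \bar{X}$    $(X_2 - \bar{X})^2$
$X_3$    $\bar{X}$    $X_3 - \bar{X}$    $(X_3 - \bar{X})^2$
$X_4$    $\bar{X}$    $X_4 - \bar{X}$    $(X_4 - \bar{X})^2$
$X_5$    $\bar{X}$    $X_5 - \bar{X}$    $(X_5 - \bar{X})^2$
Variance (V) $= \frac{\sum (X - \bar{X})^2}{n - 1}$
Standard Deviation $= \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
(for the five observations above, n - 1 = 4)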
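The tabulated calculation can be followed step by step in Python. This sketch uses five values chosen for illustration (the first five female heights listed in the handout) and divides by n - 1, matching the variance formula given earlier; the result is compared with the standard library's `statistics.variance`.

```python
import math
import statistics

# Five observations chosen for illustration (first five female heights).
x = [155, 159, 156, 150, 160]
n = len(x)

mean = sum(x) / n                              # X bar
deviations = [xi - mean for xi in x]           # X - X bar
squared = [d ** 2 for d in deviations]         # (X - X bar)^2

variance = sum(squared) / (n - 1)              # sum of squares over n - 1
sd = math.sqrt(variance)                       # square root of the variance

for xi, d, s in zip(x, deviations, squared):
    print(xi, mean, d, s)
print("Variance:", variance)
print("Standard deviation:", round(sd, 3))

# Cross-check against the standard library (also uses n - 1).
assert math.isclose(variance, statistics.variance(x))
```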
The normal distribution (Gaussian Curve)
[Figure: bell-shaped curve of Frequency, centred on the mean]
Mean ± 1 SD : 68% of the population
Mean ± 2 SD : 95% of the population
Mean ± 3 SD : 99.7% of the population
In a normal distribution,
The curve is symmetrically bell shaped
Mean = Mode = Median
The sum of the positive and negative deviations of the observations from the mean = 0
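The percentages quoted for the normal curve can be reproduced from the error function in Python's standard library. This is a sketch for checking the figures only; `coverage` is a name invented here.

```python
import math

def coverage(k):
    """Fraction of a normal population lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"Mean ± {k} SD : {100 * coverage(k):.2f}% of the population")
```

The exact 95% point is 1.96 SD; 2 SD covers slightly more (about 95.4%), which is why the two are often used interchangeably.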
Hypotheses and Hypothesis testing
The statistical significance of a difference
- When analyzing data that come from two different populations, it is important to know the
degree to which they are related as well as the degree by which an observation differs from or
relates to the rest of the observations in that population.
- We can thereby make a statement as to the probability of such an observation occurring within
the specified population.
- When a set of observations has a normal distribution, multiples of the standard deviation mark
certain limits on the scatter of observations.
Eg. 1.96 SD (or approx. 2 SD) above and below the mean mark the points within
which 95% of the population lie,
i.e. 5% of the population lie beyond these points.
If a certain observation falls in this 5%, we can say that the probability of such an
observation occurring is 5% or less.
Probability is expressed as 'P', and as a fraction of 1 rather than 100.
Therefore, in the above instance, P < 0.05
Hypothesis testing
A set of observations is plotted on a graph (normal distribution)
Choose the observation to be tested
Form a hypothesis (known as ‘null hypothesis’)
Calculation :
(Value of observation - Mean value of all observations) / Standard Deviation, i.e. $\frac{X - \bar{X}}{SD}$
Answer :
How many standard deviations away from the mean does the observation lie?
Compare this value with the probability table
Number of standard deviations    Probability of observation showing at least as large a deviation from the population mean
0.674                            0.50
1.0                              0.317
1.645                            0.10
1.96                             0.05
2.0                              0.046
2.576                            0.01
3.0                              0.0027
3.291                            0.001
Find out the probability of the observation occurring within the population
If P > 0.05           : No significant difference
If 0.05 > P > 0.01    : Probably significant
If 0.01 > P > 0.001   : Significant difference
If P < 0.001          : Highly significant difference
Accept or reject the null hypothesis
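The procedure above can be sketched in Python as follows. The two-sided probability is computed from the complementary error function instead of being looked up in the table, and the observation, mean and SD in the example are hypothetical numbers, not data from the handout. The function names are invented for this sketch.

```python
import math

def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

def two_tailed_p(z):
    """Probability of a deviation at least as large as z, in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

def interpret(p):
    """Apply the handout's significance bands to a probability."""
    if p > 0.05:
        return "No significant difference"
    if p > 0.01:
        return "Probably significant"
    if p > 0.001:
        return "Significant difference"
    return "Highly significant difference"

# Hypothetical example: observation 180 in a population with mean 165, SD 6.
z = z_score(180, 165, 6)
p = two_tailed_p(z)
print(f"z = {z}, P = {p:.4f}: {interpret(p)}")

# Reproduce a few rows of the probability table.
for k in (1.0, 1.96, 2.576, 3.291):
    print(k, round(two_tailed_p(k), 4))
```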
Standard Error of Mean
- In statistical analysis of a population, several samples may be drawn instead of one, and
analyses performed on each, separately.
- Even though all samples may be drawn at random, the means of these samples will not
necessarily be the same but will generally conform to a "normal distribution".
- Thus, there is a variation between samples.
- This variation will depend on the variation in the population and the size of the sample.
- We do not know the variation in the population.
- But an estimate of it can be obtained by the variation within the sample, which is its standard
deviation.
Therefore,
The Standard Error of mean $= \frac{\text{Standard Deviation}}{\sqrt{n}}$
- This series of means, thus, will also have a standard deviation.
- In this manner we can compare two samples and identify if they are from the same or
different populations.
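As an illustration of the standard error formula, this Python sketch takes the nine male heights listed in the handout as one sample, estimates the standard deviation with n - 1, and divides by the square root of the sample size.

```python
import math
import statistics

# Male heights (cm) from the handout, treated here as a single sample.
sample = [169, 173, 171, 168, 168, 165, 169, 169, 174]
n = len(sample)

mean = statistics.mean(sample)
sd = statistics.stdev(sample)      # sample standard deviation (n - 1)
sem = sd / math.sqrt(n)            # standard error of the mean

print("Mean:", round(mean, 2))
print("SD  :", round(sd, 3))
print("SEM :", round(sem, 3))

# Interval within which the population mean is expected to lie 95% of the time.
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print("Mean ± 1.96 SE:", round(lower, 2), "to", round(upper, 2))
```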
Eg. Consider two samples 1 and 2.
As we do not know the variation of the population they came from, we estimate the
standard error from the standard deviation of one of the samples using the above formula.
The mean of this sample has a 95% chance of falling within 1.96 standard errors above or
below the population mean.
If the second sample also comes from the same population there is a 95% chance that its
mean will also lie within +/- 1.96 standard errors of the population mean.
In order to assess this, we calculate the standard error of difference between the means.
SE of difference between means $= \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$
After obtaining the Standard Error, we need to find the difference between the two
sample means, i.e. $\bar{X}_1 - \bar{X}_2$
We now need to find out how many multiples of the Standard Error this represents,
i.e. $\frac{\bar{X}_1 - \bar{X}_2}{SE}$
More than 3.291 standard errors (or multiples) away from the mean represents a
probability of less than 0.001 of the two samples being from the same population.
Therefore, if $(\bar{X}_1 - \bar{X}_2) / SE$ is more than 3.291, we can say that the probability of the
two samples belonging to the same population is less than 0.001, or P < 0.001.
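The comparison of two samples can be sketched in Python using the male and female heights listed in the handout as samples 1 and 2. Treating these two groups as the samples is a choice made for this illustration; the handout does not carry out the calculation on them.

```python
import math
import statistics

# Heights (cm) from the handout, used here as sample 1 and sample 2.
sample_1 = [169, 173, 171, 168, 168, 165, 169, 169, 174]   # male
sample_2 = [155, 159, 156, 150, 160, 161, 160]             # female

mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)  # SD squared
n_1, n_2 = len(sample_1), len(sample_2)

# Standard error of the difference between the two means.
se_diff = math.sqrt(var_1 / n_1 + var_2 / n_2)

# How many standard errors apart the two means are.
difference = mean_1 - mean_2
multiples = difference / se_diff

print("Difference between means:", round(difference, 2))
print("SE of difference:", round(se_diff, 3))
print("Multiples of SE:", round(multiples, 2))
if abs(multiples) > 3.291:
    print("P < 0.001: highly significant difference")
```

The absolute value is taken before comparing with 3.291 so that the result does not depend on which sample is labelled 1.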