Module 7
Data Presentation and Statistical Methods
Trainee Manual
MODULE 7: DATA PRESENTATION AND STATISTICAL METHODS
Timetable
09h00 – 11h00    Lecture material interspersed with Excel exercises 1 and 2.
11h00 – 11h15    Tea break.
11h15 – 12h30    Lecture material interspersed with Excel exercises 3 and 4.
12h30 – 13h30    Lunch.
13h30 – 15h00    Lecture material interspersed with Excel exercises 5 and 6.
15h00 – 15h15    Tea break.
15h15 – Close    Mastery test
TABLE OF CONTENTS
1. Module Objectives
2. Introduction
3. Statistics that describe the central position in a data set
4. Describing the dispersion or spread of data about the central position statistics
5. The difference between standard deviation and standard error of the average
6. How to determine if a data series is normally distributed?
7. Using Excel to calculate descriptive statistics
7.1. EXERCISE ONE
8. Comparing the sample average against a water quality standard
8.1. EXERCISE TWO
8.2. EXERCISE THREE
9. Comparing samples from two different sites
9.1. EXERCISE FOUR
10. Comparing two samples that are not independent
10.1. EXERCISE FIVE
11. Data presentation and time series analysis
11.1. EXERCISE SIX
12. Reference Books
1. Module Objectives
In this module, you will be introduced to some basic statistics that are relevant to marine
pollution control, in particular, to the interpretation of water sample measurements in relation to
water quality guidelines and standards. The emphasis of the module is on (1) computing the
necessary statistics, (2) interpreting the statistics obtained and (3) presenting the data. The
statistics are computed with functions that are preprogrammed within the Excel spreadsheet.
The actual computational equations of the various statistics will not be given; any basic
statistics textbook can be consulted for them (see Reference Books).
2. Introduction
Good decision-makers base their decisions on the best available information.
Information is derived from data, and “best available” implies that there is some loss of
information, probably because of sampling or measurement error during collection of the data.
In other words, there is some uncertainty. All information (whether from books, journals, etc.) is
ultimately derived from data, and the above can be summarized by the following formula:
DATA = INFORMATION + ERROR
It should be obvious that good data results in good information with little error. Poor data, on
the other hand, results in a loss of information due to large errors.
Information can be extracted from data in the following ways:
• The data can be plotted, by means of graphs, to determine trends. When trends or
patterns are identified they can be used to predict future values or states, or
• The data can be analyzed, i.e. some statistic, such as the average, can be computed which
is able to summarize a large amount of data into information that is easy to understand, or
• The data can be modeled, usually using a simple mathematical model that helps explain how a
system works.
Statistics is, therefore, a discipline that attempts to increase the information obtained from data by fully
understanding how error is incorporated into data.
3. Statistics that describe the central position in a data set
When confronted with a large body of data, one is inclined to summarize the data into smaller
bits of information that are easier to understand. This also facilitates comparison with other
data sets.
As a first step, one tries to describe the central features of the data. Statistics that describe the
central position of a data set are the average and the median. There are other measures of
central location, such as the mode, but it is rarely used and is not considered further here. We will
concentrate on the average and the median.
The average should only be used when the underlying frequency distribution of the sampled data
follows a normal distribution, i.e. the frequency distribution is bell-shaped. If the distribution is
not symmetrical about the average, i.e. it is skewed either to the left or to the right
or forms some other shape (the distribution is non-normal), then the median is used.
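To illustrate why the median is preferred for skewed data, the following minimal Python sketch compares the two statistics on a small made-up sample. The counts are hypothetical and are not taken from the course workbook; Python's statistics.mean and statistics.median stand in for Excel's =AVERAGE() and =MEDIAN().

```python
import statistics

# Hypothetical E. coli counts per 100 ml (not the workbook data).
# One very high count skews the sample to the right.
counts = [12, 15, 14, 13, 16, 15, 14, 240]

average = statistics.mean(counts)    # pulled upward by the single large value
median = statistics.median(counts)   # stays close to the bulk of the data

print(f"average = {average:.1f}")
print(f"median  = {median:.1f}")
```

In this sample the single extreme count drags the average far above the typical values, while the median remains near the bulk of the observations.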
4. Describing the dispersion or spread of data about the central position statistics
Two statistics (the standard deviation and the variance) describe the spread of data points
about the average. These two statistics are related in the following way: the standard deviation
is equal to the square root of the variance.
A small standard deviation indicates that the spread (or range) about the average is small, while
a large standard deviation indicates that the spread is large. For normally distributed data, the
standard deviation is used to calculate a range about the average that contains a given
percentage of the data points. For example:
• AVERAGE ± 1 STANDARD DEVIATION -- within this range about 68% of the sample data
points are included.
• AVERAGE ± 2 STANDARD DEVIATIONS -- within this range about 95% of the sample data
points are included. Note that AVERAGE – 2 STANDARD DEVIATIONS describes the
left (lower) limit of the 95% range and AVERAGE + 2 STANDARD DEVIATIONS describes the
right (upper) limit.
• AVERAGE ± 3 STANDARD DEVIATIONS -- within this range about 99.7% of the sample data
points are included.
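The percentages quoted for these ranges can be checked numerically. The sketch below is only an illustration: it draws a simulated normal sample (the mean of 50, standard deviation of 5 and sample size are arbitrary assumptions) and counts the fraction of points inside each range. statistics.stdev and statistics.variance use the same n - 1 divisor as Excel's =STDEV() and =VAR().

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Simulated sample: 10 000 draws from a normal distribution (assumed mean 50, SD 5).
sample = [random.gauss(50, 5) for _ in range(10_000)]

average = statistics.mean(sample)
std_dev = statistics.stdev(sample)       # sample standard deviation
variance = statistics.variance(sample)   # equals std_dev squared


def fraction_within(k):
    """Fraction of data points within AVERAGE +/- k STANDARD DEVIATIONS."""
    low, high = average - k * std_dev, average + k * std_dev
    return sum(low <= x <= high for x in sample) / len(sample)


for k in (1, 2, 3):
    print(f"within +/- {k} SD: {fraction_within(k):.1%}")
```

For a skewed sample the same calculation would generally give different percentages, which is one reason the ranges are only quoted for normally distributed data.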
In the case of the median, the FIRST QUARTILE (25th percentile) and the THIRD QUARTILE
(75th percentile) describe the spread of data points around the median. The median itself is
the 50th percentile. The Xth percentile is the value below which X% of the data points lie.
For example, 25% of the sample data points lie below the first quartile while 75% lie above it.
Other percentiles can be interpreted in a similar fashion. The difference between the third and
first quartiles is referred to as the interquartile range.
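As a small illustration of the quartiles and the interquartile range, the sketch below uses Python's statistics.quantiles on a made-up sample. The method="inclusive" option is chosen here because it should correspond to the interpolation used by Excel's =QUARTILE(); that correspondence, like the data values, is an assumption of this sketch rather than something stated in the manual.

```python
import statistics

# Hypothetical sample of nine observations (not the workbook data).
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 splits the data into quarters and returns the three cut points.
first_quartile, median, third_quartile = statistics.quantiles(
    sample, n=4, method="inclusive"
)
interquartile_range = third_quartile - first_quartile

print("first quartile     :", first_quartile)
print("median             :", median)
print("third quartile     :", third_quartile)
print("interquartile range:", interquartile_range)
```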
5. The difference between standard deviation and standard error of the average
As previously explained the standard deviation describes the variability or spread of data
around the average. On the other hand, the standard error indicates how precisely the average
is estimated. If the standard error is a large value then the average is poorly estimated. The
standard deviation and the standard error are related by the following equation:
$$\text{Standard error} = \frac{\text{Standard deviation}}{\sqrt{\text{No. of observations in sample}}}$$
It should be apparent that as the number of observations in the sample increases, the
standard error becomes smaller. Hence, for a given standard deviation, a larger sample provides
a more precise estimate of the average.
Given an estimate of the average and the standard error of the average (both derived from a
sample), one can say with approximately 95% confidence that the true population average lies
within the interval AVERAGE ± 2 STANDARD ERRORS.
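The relationship between the standard deviation, the standard error and the approximate 95% interval can be sketched as follows. The mercury concentrations are hypothetical, and the helper name standard_error is introduced only for this illustration.

```python
import math
import statistics


def standard_error(std_dev, n_observations):
    """Standard error = standard deviation / sqrt(number of observations)."""
    return std_dev / math.sqrt(n_observations)


# Hypothetical mercury concentrations in micrograms per litre (not the workbook data).
sample = [0.21, 0.25, 0.19, 0.30, 0.27, 0.22, 0.24, 0.26, 0.23]

average = statistics.mean(sample)
std_dev = statistics.stdev(sample)
std_error = standard_error(std_dev, len(sample))

# Approximate 95% interval for the population average: AVERAGE +/- 2 STANDARD ERRORS.
lower = average - 2 * std_error
upper = average + 2 * std_error
print(f"average = {average:.3f}, standard error = {std_error:.4f}")
print(f"approximate 95% interval: {lower:.3f} to {upper:.3f}")

# With the same standard deviation, four times as many observations halve the standard error.
print(standard_error(2.0, 100), standard_error(2.0, 400))
```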
6. How to determine if a data series is normally distributed?
Earlier it was mentioned that the average and standard deviation statistics should only be used
to describe data sets that are normally distributed while the median and its first and third
quartiles should be used to describe non-normal distributions. How do we decide if a
distribution is normally distributed? A normal distribution has a number of characteristics that
can be tested for. In a normal distribution:
• The average and median are the same or very close together, and
• The skew statistic is close to or equal to zero. Skewness characterizes the degree of
asymmetry of a distribution around its average. Positive skewness (a large positive skew
statistic) indicates a distribution with an asymmetric tail extending toward the right.
Negative skewness (a large negative skew statistic) indicates a distribution with an
asymmetric tail extending toward the left.
• A third means of testing whether a distribution is normal or not is to draw up a frequency
histogram and to check visually if it is bell-shaped. A histogram can be drawn in
Excel by clicking on tools/data analysis and then choosing the histogram submenu.
One has to decide, prior to drawing the histogram, the number of size classes (bins) to be
used. A general rule is that the number of size classes = square root of the number of
observations. This value is then rounded up or down for convenience. The increment by
which each size class is increased = ((the maximum value in the sample – the minimum
value in the sample) / the square root of the number of observations). This value is also
rounded up or down for convenience.
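The checks described above can be sketched in Python as follows. The skew function implements the adjusted sample skewness formula that Excel's =SKEW() uses; the function names and the example samples are assumptions made for this illustration.

```python
import math
import statistics


def skew(sample):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - average) / std_dev) ** 3)."""
    n = len(sample)
    average = statistics.mean(sample)
    std_dev = statistics.stdev(sample)
    return n / ((n - 1) * (n - 2)) * sum(((x - average) / std_dev) ** 3 for x in sample)


def histogram_classes(sample):
    """Rule of thumb: sqrt(n) size classes, each (max - min) / sqrt(n) wide."""
    root_n = math.sqrt(len(sample))
    n_classes = round(root_n)  # rounded for convenience
    increment = (max(sample) - min(sample)) / root_n
    return n_classes, increment


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical, symmetric about 5
right_tail = [1, 1, 2, 2, 3, 3, 4, 20]    # hypothetical, long tail to the right

print("skew (symmetric) :", round(skew(symmetric), 3))
print("skew (right tail):", round(skew(right_tail), 3))
print("classes, increment for 100 points:", histogram_classes(list(range(100))))
```

A skew value near zero together with an average close to the median supports treating the sample as normal; a clearly positive or negative value points to a right or left tail.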
7. Using Excel to calculate descriptive statistics
The statistics described so far are usually referred to as descriptive statistics. The following
table summarizes some of them and includes the Excel functions that calculate them.
Statistic | Excel function | Uses and remarks
----------|----------------|-----------------
Average | =AVERAGE(array) [1] | Estimates central location in a normally distributed sample
Standard deviation | =STDEV(array) | Describes variability about the average and can be used to calculate a range within which a certain percentage of the samples are included
Variance | =VAR(array) | Equals the standard deviation squared
Standard error of the average | =STDEV(array)/SQRT(COUNT(array)) [2] | Used to calculate confidence range for population average
Median | =MEDIAN(array) | The 50th percentile
First Quartile | =QUARTILE(array, 1) [3] or =PERCENTILE(array, 0.25) | The 25th percentile
Third Quartile | =QUARTILE(array, 3) or =PERCENTILE(array, 0.75) | The 75th percentile
Minimum value | =MIN(array) | Calculates the minimum value in a sample
Maximum value | =MAX(array) | Calculates the maximum value in a sample
Number of size classes | =SQRT(COUNT(array)) | To determine the number of size classes to include when drawing a histogram. Can be rounded up or down
Size class increment | =(MAX(array)-MIN(array))/SQRT(COUNT(array)) | To determine the size class increment. Can be rounded up or down
Number of data points | =COUNT(array) | Counts the number of observations in a sample
Skew | =SKEW(array) | To test the symmetry of a sample

[1] Array refers to the cell address range that stores the sample data points. For example, if observations are entered
in column A in rows 1 to 100, then the array address is A1:A100.
[2] This formula consists of a number of nested Excel functions: =STDEV(), =COUNT() and =SQRT(), with the latter
function being the square root function.
[3] There are two different ways of calculating the quartiles in Excel.

7.1. EXERCISE ONE
Open the Excel file, MODULE 7 STATISTICS.XLS and go to the sheet named EXERCISE
ONE. In the column headed E. coli a hundred data points are given. Each observation is from a
water sample and each value represents the number of E. coli bacteria counted in 100ml of
water. The water samples come from a mariculture farm that is growing mussels for human
consumption. The 100 water samples were drawn randomly, i.e. 100 days in a year were
chosen at random and a single water sample was taken on each selected day. The water sample
was drawn from the same area of the pond each time.
1. Determine by means of a histogram whether the bacterial counts are normally distributed.
Indicate how you determined the number of size classes and the class increment.
2. Confirm your choice of distribution in 1 above by estimating the following statistics: the
average, the median and the skewness. Give reasons for your choice of distribution
(normal, skewed left or right).
3. Use the data set to practice entering the Excel functions given in the table above.
8. Comparing the sample average against a water quality standard
8.1. EXERCISE TWO
At times, one would like to know if a sample average is below a certain standard value, taking
into account that there is variability about the sample average. For example, according to the
South African water quality guidelines for coastal marine waters the maximum acceptable
concentration of mercury in seawater is 0.3 µg/l. In EXERCISE TWO of your Excel
spreadsheet the concentrations of 20 mercury samples are given in µg/l. The samples were
taken from a 40 km stretch of beach on the same day. One would like to know: is the average
of these samples below or equal to the standard value, at a confidence level of 95%? The
statement that the average is below or equal to the standard value is referred to as the null
hypothesis. The alternate hypothesis is that the sample
average is greater than the standard value at a confidence level of 95%.
To answer the question we make use of the Student’s t-test. Essentially the test involves
calculating a t-statistic (using the given data) and then comparing the t-statistic to a t-critical
value. The latter is calculated at a particular confidence level (usually 95%). From this
comparison we can either reject or fail to reject the null hypothesis. The t-statistic is calculated as
follows:
$$t\text{-statistic} = \frac{\text{Sample Average} - \text{Standard Value}}{\sqrt{\text{Variance of the sample} / \text{number of data points}}}$$
In Excel the t-statistic formula is written as:
=(AVERAGE(A2:A21)-0.3)/SQRT(VAR(A2:A21)/20)
where the cell range A2:A21 contains the sample mercury concentrations.
STEP 1
State the null hypothesis and the alternate hypothesis
Null Hypothesis: The sample average is less than or equal to 0.3
Alternate hypothesis: The sample average is greater than 0.3
STEP 2
Check to see if the sample is normally distributed. The Student’s t-test is only applicable to a
normally distributed sample.
STEP 3
Calculate the t-statistic in Excel. The answer should equal -1.05
STEP 4
Calculate the t-critical value. To calculate the t-critical value we use an Excel function called
=TINV(). This function requires two input values, termed the probability and the degrees of freedom.
The function is therefore written as =TINV(X, Y), where X is the probability and Y is the degrees
of freedom. The probability is calculated as follows:
Probability = (1-Confidence level as a fraction)*2
The probability at a confidence level of 95% is therefore (1-0.95)*2 = 0.1. Note that the above
probability equation ONLY APPLIES TO A ONE-TAIL TEST. If a probability for a two-tailed test
is required then the probability = (1- Confidence level as a fraction).
The degrees of freedom = (number of data points –1) i.e. df = 20-1=19.
The final formula should look like this =TINV(0.1,19) and should equal 1.73.
STEP 5
Compare the t-statistic with the t-critical value. If the t-statistic is greater than or equal to the
t-critical value, we reject the null hypothesis. In this case the t-statistic (-1.05) is less than the
t-critical value of 1.73, so we cannot reject the null hypothesis. We therefore conclude that there
is no evidence, at the 95% confidence level, that the sample average exceeds 0.3.
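The five steps can be sketched in Python as follows. This is an illustration under stated assumptions: the 20 concentrations are made up (they are not the workbook data, so the t-statistic will differ from -1.05), the function names are hypothetical, and the t-critical value of 1.73 is simply taken from Step 4 because the Python standard library has no equivalent of =TINV().

```python
import math
import statistics


def t_statistic(sample, standard_value):
    """(sample average - standard value) / sqrt(sample variance / n)."""
    n = len(sample)
    return (statistics.mean(sample) - standard_value) / math.sqrt(
        statistics.variance(sample) / n
    )


def tinv_probability(confidence_level, tails):
    """Probability argument for =TINV(): doubled for a one-tail test."""
    alpha = 1 - confidence_level
    return round(alpha * 2 if tails == 1 else alpha, 10)


# Hypothetical mercury concentrations in micrograms per litre (20 samples).
sample = [0.28, 0.31, 0.25, 0.33, 0.29, 0.27, 0.30, 0.26, 0.32, 0.24,
          0.29, 0.28, 0.31, 0.27, 0.30, 0.26, 0.29, 0.33, 0.25, 0.28]
STANDARD_VALUE = 0.3
T_CRITICAL = 1.73  # =TINV(0.1, 19), the value quoted in Step 4

t = t_statistic(sample, STANDARD_VALUE)
degrees_of_freedom = len(sample) - 1
reject_null = t >= T_CRITICAL  # one-tail test: is the average above the standard?

print("probability for TINV:", tinv_probability(0.95, tails=1))
print("degrees of freedom  :", degrees_of_freedom)
print(f"t-statistic         : {t:.2f}")
print("reject null hypothesis:", reject_null)
```

For the 99% confidence level the probability argument changes to (1 - 0.99) * 2, and the corresponding t-critical value has to be obtained from Excel or a t table.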
8.2. EXERCISE THREE
Repeat the above Student’s t-test for a confidence level of 99%.
9. Comparing samples from two different sites
Suppose one collects data from two different sites and wants to compare the two sites to see if
there is a statistical difference between the average values for these sites. Again, a Student’s
t-test is employed to make such a comparison. In this case, the test is based on a number of
assumptions that need to be tested:
• The observations in each sample must follow a normal distribution.
• The observations from each sample must be independent of each other.
Consideration must also be given to the variances of the samples. If the variances are equal,
we then apply the Student’s t-test called “t-test: two-sample assuming equal variances”. If the
variances are not equal, we apply the test called “t-test: two-sample assuming unequal
variances”. Both of these statistical tests are found in the tools/data analysis submenu in Excel.
In the following exercise, a statistical test called an F-test will be employed to check if the
variances of two samples are equal. This test is also found under the tools/data analysis
submenu in Excel. It is called the “F-test two-sample for variances” in Excel.
9.1. EXERCISE FOUR
In EXERCISE FOUR, pH values are given. They come from two rivers in KwaZulu-Natal: the
Umgeni and the Umfolozi. The pH was measured with an electronic pH meter. The pH values
were measured over the months of December 2000 to January 2001 and days were randomly
chosen over this period. Notice that there are an unequal number of samples taken from the
two rivers. The following steps check if the variances of the two samples are equal at a 95%
confidence level.
STEP 1
Check if both samples come from a normal distribution. Now state the null hypothesis and the
alternate hypothesis.
Null hypothesis: The variances are equal.
Alternate hypothesis: The variances are not equal.
STEP 2
Click on the tools/data analysis menu and move to the “F-test two-sample for variances”
submenu. Fill in the input values as follows. Note Alpha refers to the probability level, which is
equal to (1- confidence level as a fraction). Now click OK.
STEP 3
The output should look like the following diagram.
STEP 4
Now compare the F-statistic to the F-critical value. The F-statistic is less than the F-critical
value and hence we cannot reject the null hypothesis. We therefore treat the two
samples as having equal variances.
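The F-statistic itself is simply the ratio of the two sample variances, as the following sketch shows. The pH values are hypothetical (not the workbook data), the function name is introduced for this illustration, and the F-critical value is not computed here; it still has to be read from the Excel output or an F table.

```python
import statistics


def f_statistic(sample_1, sample_2):
    """Return F = variance of sample 1 / variance of sample 2, plus both degrees of freedom."""
    f = statistics.variance(sample_1) / statistics.variance(sample_2)
    return f, len(sample_1) - 1, len(sample_2) - 1


# Hypothetical pH readings with unequal sample sizes.
umgeni = [7.1, 7.4, 7.2, 7.6, 7.3, 7.5, 7.0, 7.4]
umfolozi = [7.8, 7.6, 8.0, 7.7, 7.9, 7.5]

f, df_1, df_2 = f_statistic(umgeni, umfolozi)
print(f"F-statistic = {f:.3f} with {df_1} and {df_2} degrees of freedom")
```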
We continue the exercise to see if there is a significant difference, at a confidence level of 95%,
between the sample averages.
STEP 1
State the null hypothesis and the alternate hypothesis.
Null hypothesis: The average pH of the Umgeni = the average pH of the Umfolozi.
Alternate hypothesis: The average pH of the Umgeni ≠ the average pH of the Umfolozi.
STEP 2
Open the submenu “t-test: two-sample assuming equal variances” found under the tools/data
analysis menu. Fill in the input values as follows. Now click OK.
STEP 3
The output should look like the following diagram.
STEP 4
Now compare the t-statistic with the two-tailed t-critical value. If the absolute value of the
t-statistic is greater than the t-critical value, then reject the null hypothesis.
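The calculation behind the equal-variance two-sample t-test can be sketched as follows, using the standard pooled-variance formula. The pH values are hypothetical, the function name is an assumption of this sketch, and the two-tailed t-critical value of 2.179 (95% confidence, 12 degrees of freedom, i.e. =TINV(0.05, 12)) is supplied by hand because it applies only to these made-up sample sizes.

```python
import math
import statistics


def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t-statistic assuming equal variances, with its degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_variance = (
        (n1 - 1) * statistics.variance(sample_1)
        + (n2 - 1) * statistics.variance(sample_2)
    ) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample_1) - statistics.mean(sample_2)) / standard_error
    return t, n1 + n2 - 2


# Hypothetical pH readings (not the workbook data).
umgeni = [7.1, 7.4, 7.2, 7.6, 7.3, 7.5, 7.0, 7.4]
umfolozi = [7.8, 7.6, 8.0, 7.7, 7.9, 7.5]
T_CRITICAL = 2.179  # two-tailed, 95% confidence, 12 degrees of freedom

t, degrees_of_freedom = pooled_t_statistic(umgeni, umfolozi)
# Two-tailed test: reject when the t-statistic is far from zero in either direction.
reject_null = abs(t) > T_CRITICAL

print(f"t-statistic = {t:.2f}, degrees of freedom = {degrees_of_freedom}")
print("reject null hypothesis:", reject_null)
```

Because the test is two-tailed, the sign of the t-statistic only indicates which site has the higher average; the decision depends on its absolute value.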
10. Comparing two samples that are not independent
As stated previously when comparing two averages it is assumed that the two samples are
independent. There are situations when this assumption may not be valid. A case in question
may be the measurement of a pollutant before and after a clean up. To test if there is a
significant difference between these samples a “t-test: paired two-sample test for means” is
then applied. This test is also found under the tools/data analysis menu in Excel.
10.1. EXERCISE FIVE
A sewage pipeline was damaged by a storm and raw sewage has leaked into an estuary.
During the contamination, E. coli counts were made on 10 water samples from ten different sites in the
estuary. These data are found in column A, in the sheet headed EXERCISE FIVE.
Subsequently, the pipe was repaired and after a few weeks a further ten water samples were
taken from exactly the same sites. These samples were also sent for analysis of E. coli. These
counts are given in column B. Determine at a 95% confidence level whether there is a
significant difference between these two samples.
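A paired t-test works on the site-by-site differences rather than on the two samples separately. The sketch below shows the calculation on hypothetical counts (not the workbook data); the function name is an assumption, and the two-tailed t-critical value of 2.262 (95% confidence, 9 degrees of freedom, i.e. =TINV(0.05, 9)) is entered by hand.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """t = average difference / (std dev of differences / sqrt(n)), with n - 1 degrees of freedom."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same number of observations")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    t = statistics.mean(differences) / (statistics.stdev(differences) / math.sqrt(n))
    return t, n - 1


# Hypothetical E. coli counts per 100 ml at the same ten sites.
during_leak = [520, 480, 610, 550, 700, 460, 590, 630, 510, 570]
after_repair = [120, 150, 130, 160, 180, 110, 140, 170, 125, 155]
T_CRITICAL = 2.262  # two-tailed, 95% confidence, 9 degrees of freedom

t, degrees_of_freedom = paired_t_statistic(during_leak, after_repair)
reject_null = abs(t) > T_CRITICAL

print(f"t-statistic = {t:.2f}, degrees of freedom = {degrees_of_freedom}")
print("significant difference:", reject_null)
```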
11. Data presentation and time series analysis
Water quality data can be presented either in tabular or graphical form. Tables are used to
convey quantitative information, such as the exact values of observations. Graphs on the other
hand convey the general behavior of a data set, highlighting patterns and trends. Trends and
patterns can be temporal or seasonal. In the following exercise we will use graphical means (1) to
identify seasonal patterns and (2) to present the data in a form that easily conveys this
information to the reader. It is important to note that time series analysis is a very complicated
statistical procedure and is beyond the scope of this introductory course.
11.1. EXERCISE SIX
In EXERCISE SIX, the date and concentration of dissolved oxygen (in mg/l) are given. The
readings were taken from the Umgeni River over a period of time. We suspect that these data are
strongly seasonal, but a visual examination of the raw values does not reveal this (see below).
To reveal these seasonal patterns we need to plot the data in a type of graph called an XY
chart, where on the X-axis, time is plotted and on the Y-axis, the dissolved oxygen
concentration is plotted. To plot such a graph we “block-off” the data to be plotted (see below).
Then click on insert and choose the submenu charts. It should look like the following:
The next step would look like this:
Fill in the following:
Your final chart or graph should look like this:
[Chart: Time series analysis. Y-axis: Dissolved oxygen (mg/l), 0 to 12. X-axis: Date (dd-mm-yy), 31-Jan-93 to 5-Jan-98.]
Now modify your chart until it resembles the following chart:
To modify any aspect of your chart click on the particular aspect that needs modification and
then right-click to perform the modification.
The chart shows a curve with peaks and troughs. To check if the peaks occur in a particular
season, hold the cursor over a peak data point and the X and Y co-ordinates of that point will be
shown. Do the same for the troughs. Can you draw any general conclusions?
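The same question (in which months do the peaks and troughs fall?) can also be explored numerically by averaging the readings for each calendar month. The sketch below is not part of the Excel exercise and uses synthetic readings, deliberately constructed to peak in July and dip in January, purely to show the grouping step.

```python
import math
from collections import defaultdict
from datetime import date
from statistics import mean

# Synthetic monthly readings for 1993-1995 (NOT the Umgeni data): dissolved oxygen
# in mg/l, built to peak in July and dip in January for illustration only.
readings = [
    (date(year, month, 15), 8 + 2 * math.cos(2 * math.pi * (month - 7) / 12))
    for year in (1993, 1994, 1995)
    for month in range(1, 13)
]

# Group the readings by calendar month and average each group.
by_month = defaultdict(list)
for day, oxygen in readings:
    by_month[day.month].append(oxygen)
monthly_average = {month: mean(values) for month, values in by_month.items()}

peak_month = max(monthly_average, key=monthly_average.get)
trough_month = min(monthly_average, key=monthly_average.get)
print(f"highest average in month {peak_month}, lowest in month {trough_month}")
```

With real data the monthly averages would be noisier, so the chart remains the primary tool for judging whether a seasonal pattern is present.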
Some tips when drawing graphs
Content: the chart should contain the minimum amount of detail to convey the desired
information. Complex charts obscure important features. Use separate charts for complicated
subjects.
Size of chart: try to fit a chart into a page or less. Charts that extend over two pages are
difficult to read.
Title and/or legend: Every chart must have a title. The legend must have enough detail to
interpret the chart without reading the surrounding text.
Axes numbers: these should be large and easy to read.
Axes labels: all axes must have labels with the unit of measurement.
Scale: the chosen scale must be appropriate for the size of the chart and must facilitate
interpolation if required. Gridlines are helpful when conveying quantitative information. If
multiple charts are used, they should preferably have the same scale for easy comparison.
12. Reference Books
Phillips, L.P, Jr. 1992. How to think about statistics. Revised edition.
Zar, J.H. 1984. Biostatistical Analysis. 2nd Edition.