BME1450
Group 5
Submitted by: Rosalie Wang, Allison Lever, Anushiya Sathiananthan, Elizabeth Logan, Tayyab Khan
I. Problem Statement
Given the following graph (Figure 1):
Figure 1: Hypothetical experimental data and best fit line
How confident are you that this line is an accurate representation of the true line?
Things to consider:
What is the probability that this is the true line representing the underlying process?
What is the probability that the underlying process is constant?
Does the process/function actually increase with x?
II. Graphical Approach to Error Analysis
This approach is often employed in teaching laboratories, where there is rarely enough time to complete a rigorous statistical analysis of the data. The method provides a quick visual estimate of the accuracy and reliability of the best-fit line.
Methodology:
Draw a parallelogram around the data points such that the top and bottom sides are parallel to the line of best-fit (Figure
2). These lines should pass through the most extreme upper and lower error bars. The side-lines should pass through the
most extreme horizontal error bars, if present, or the data points. Any subsequent data points collected should fall within
this area.
Figure 2: Parallelogram for error analysis
Next, to evaluate the possible deviation in your line of best fit, draw in the two diagonals of the parallelogram. These diagonals represent the least steep and most steep 'best-fit' lines that could be drawn:
y = m_min x + b_max        (1)
y = m_max x + b_min        (2)
The uncertainty in the original line can then be calculated as:
Δm = (m_max − m_min)/2        (3)
Δb = (b_max − b_min)/2        (4)
By comparing these values to those of acceptable error in your experiment, you can then pass some judgment as to
whether or not the best-fit line is acceptable and/or representative of the true solution.
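As a quick numerical illustration of equations (3) and (4), the sketch below (in Python) assumes hypothetical values of m_max, m_min, b_max and b_min read off the parallelogram diagonals by hand.

# Hypothetical extreme lines read off the parallelogram diagonals
m_max, b_min = 2.3, 0.4    # steepest diagonal:   y = m_max*x + b_min
m_min, b_max = 1.7, 1.1    # shallowest diagonal: y = m_min*x + b_max

delta_m = (m_max - m_min) / 2.0    # equation (3)
delta_b = (b_max - b_min) / 2.0    # equation (4)

print(f"slope uncertainty     Δm = {delta_m:.2f}")
print(f"intercept uncertainty Δb = {delta_b:.2f}")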
III. Hypothesis Testing
Although a linear fit to the data is the simplest and sometimes the appropriate solution, it is not always the best model for
data collected from our experiments. We therefore need to have some understanding of the underlying process that we are
studying in order to make a good assumption as to the model that may fit. We can use statistical hypothesis testing to
enable us to reject or confirm models (Introduction to Models, 2003).
For example: We may test the hypothesis of whether y (the dependent variable) varies with x (the independent variable).
Given the line y = mx + b, we would have two hypotheses: the null hypothesis H0 which claims that the slope m = 0, and
the hypothesis H1 which claims that the slope m ≠ 0 (slope is not equal to zero). They would be stated like so:
H0: m = 0
H1: m ≠ 0
It is the null hypothesis that we attempt to disprove; H1 is the alternative we accept if H0 is rejected. Each test involves a test statistic (e.g., sample mean, sample variance, etc.) and a critical region specifying which values of the test statistic will result in the rejection of H0. A Type I error occurs when we reject H0 when it is in fact true. The consequences of such an error are serious, and the probability of making it should be minimized.
To perform the test, we must determine the cutoff point for the critical region that gives a Type I error probability of, for example, 5%. To do this, we calculate the test statistic, which is analogous to the parameter in the hypothesis:
t = (Estimate − Hypothesized Value) / (Estimated Error)
where the hypothesized value is zero (m = 0).
To determine the critical value, we need to pick the significance level (e.g., α = 0.05). The p-value is the smallest significance level at which the observed value of the test statistic becomes significant. The smaller the p-value, the stronger the evidence against the null hypothesis.
Using the test statistic (t) and the degrees of freedom (df), we may look up the corresponding p-value in a t-distribution table. If the observed p-value is less than or equal to the significance level, the null hypothesis is rejected; if it is greater than the significance level, the null hypothesis is not rejected. When the null hypothesis is rejected, the outcome is said to be statistically significant.
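As an illustration, the slope test above can be carried out numerically. The following is a minimal sketch in Python, assuming made-up x and y data standing in for Figure 1; scipy.stats.linregress supplies the fitted slope and its standard error.

import numpy as np
from scipy import stats

# Hypothetical data standing in for the measurements in Figure 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

res = stats.linregress(x, y)        # least-squares fit of y = m*x + b
t = res.slope / res.stderr          # t = (estimate - 0) / estimated error
df = len(x) - 2                     # degrees of freedom for a two-parameter fit
p = 2 * stats.t.sf(abs(t), df)      # two-tailed p-value (equals res.pvalue)

alpha = 0.05
print(f"m = {res.slope:.3f} ± {res.stderr:.3f}, t = {t:.2f}, p = {p:.4f}")
print("reject H0: slope differs from zero" if p <= alpha else "fail to reject H0")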
The use of probability and statistics can help us to estimate quantitatively our level of uncertainty about the likelihood of the
events we observe. The limitation is that statistical tools do not have the power to tell us unequivocally that the model we
chose to represent what we observed is in fact the truth.
For the remaining discussion below, we assume first that a straight line fits the data presented and begin with ways of
analyzing it. Subsequently we will discuss nonlinear fits to the data.
IV. Best-fit Lines
While the above graphical analysis may be used for a rough estimate of the uncertainty in a set of measurements, more analytical approaches will provide better results. Although it is simple enough to take a ruler and draw a best-fit line through some data to observe a trend, analytical approaches appreciably increase the accuracy of the prediction of the observed function. By applying these methods, it is also easier to quantitatively evaluate the accuracy of the line.
A common approach to analyzing apparently linear data is to develop a linear-regression model for the data set. The method presented here is fairly rigorous and deals well with the variable uncertainty associated with the y values. This error (Δy_i), as shown in Figure 1, corresponds to the standard deviation (σ_i) for the set of data collected at that x value.
A regression line can be obtained by minimizing the sum of the squares of the error for each y_i. This rigorous approach (also known as the method of least squares) is as follows, adapted from MUOhio (2004). It takes into consideration the fact that each data point has its own specific σ_i by applying a weight w_i to each error term:
w_i = 1/σ_i²        (5)
So, in fitting a straight line y = m x + b to N data points, the sum of the weighted errors,
χ² = Σ_{i=1}^{N} w_i (y_i − m x_i − b)²        (6)
is minimized with respect to the slope m and the y-intercept b. This leads to the equations
∂χ²/∂m = −2 Σ w_i x_i (y_i − m x_i − b) = 0        (7)
∂χ²/∂b = −2 Σ w_i (y_i − m x_i − b) = 0        (8)
In order to solve equations 7 and 8, the additional parameters A through E are defined as follows:
A = Σ w_i x_i,   B = Σ w_i,   C = Σ w_i y_i,   D = Σ w_i x_i²,   E = Σ w_i x_i y_i        (9-13)
Therefore, by substitution of equations 9-13 into (7) and (8) we obtain:
0 = E − m D − b A        (14)
0 = C − m A − b B        (15)
respectively.
This pair of equations can be solved for m and b by Cramer’s rule, such that:
m = (B E − A C) / (B D − A²)        (16)
b = (C D − A E) / (B D − A²)        (17)
To look at the error associated with the slope and y intercept, the following two relationships are used:
Δm = sqrt( B / (B D − A²) )        (18)
Δb = sqrt( D / (B D − A²) )        (19)
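A minimal numerical sketch of equations (5)-(19) is given below in Python; the x, y and σ values are hypothetical and simply stand in for a data set with per-point uncertainties.

import numpy as np

# Hypothetical data with an individual standard deviation for each y value
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])

w = 1.0 / sigma**2                 # equation (5): w_i = 1/sigma_i^2

A = np.sum(w * x)                  # equations (9)-(13)
B = np.sum(w)
C = np.sum(w * y)
D = np.sum(w * x**2)
E = np.sum(w * x * y)

Delta = B * D - A**2               # determinant used in Cramer's rule
m = (B * E - A * C) / Delta        # equation (16)
b = (C * D - A * E) / Delta        # equation (17)
dm = np.sqrt(B / Delta)            # equation (18)
db = np.sqrt(D / Delta)            # equation (19)

print(f"y = ({m:.3f} ± {dm:.3f}) x + ({b:.3f} ± {db:.3f})")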
V. Analysis of the Best-fit Line
Once the model has been fit, it is important to evaluate whether or not the model is a 'good' fit. The analytical tool typically used to do this is the chi-square (χ²) test. “The objective of the test is to assign a probability that if the measurements were repeated, the weighted sum of the squared deviations [y_i – f(x_i)]² (where f(x_i) is the predicted line) would be larger, that is, the "miss" of the fit would be more” (MUOhio, 2004). In other words, this test assigns a probability that, if the experiment were repeated, the scatter about the chosen model would be at least as large as that observed, assuming the model is accurate. For a detailed breakdown of how to determine the χ² value for a set of data, we refer you to a statistics reference source (e.g. http://nedwww.ipac.caltech.edu/level5/Leo/Stats7_2.html).
The reduced χ² is used to determine the probability that the predicted line of best fit represents the data set, and is simply χ² divided by the degrees of freedom. The expected value of the reduced χ² is 1, which implies that the model is a good fit and that the data is described by the model with reasonable statistical uncertainties; that is, within the error bars, one appears to have a good fit to the model. If the reduced χ² value is significantly larger than 1, then the model does not fit the data (alternatively, this could also mean that there is some unaccounted-for systematic error). If the reduced χ² value is significantly less than 1, you have most likely overestimated the size of your statistical errors. However, in the case being considered here, the error expressed for each data point is the standard deviation of the data set collected at that x value. Therefore, it is necessary to check that a statistically relevant amount of data was collected. In order to properly analyze the data and apply regression and error-analysis techniques, the collected data should follow a Gaussian distribution. Should the data set (for a given x value) not follow this distribution, it is likely that not enough data points were collected, and both the data point and its error bars may not be a true reflection of the underlying process.
Using a table of reduced χ² versus degrees of freedom, one can also assign a probability for whether the data is consistent with the best-fit line.
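A sketch of this check in Python is shown below; the slope, intercept and per-point σ values are the hypothetical ones from the weighted least-squares example above, and scipy.stats.chi2.sf gives the probability of obtaining a larger χ² by chance.

import numpy as np
from scipy import stats

def reduced_chi_square(x, y, sigma, m, b):
    """Return chi^2, reduced chi^2 and the probability of a larger chi^2."""
    resid = y - (m * x + b)                 # deviations from the fitted line
    chi2 = np.sum((resid / sigma) ** 2)     # weighted sum of squared deviations
    dof = len(x) - 2                        # two fitted parameters (m and b)
    return chi2, chi2 / dof, stats.chi2.sf(chi2, dof)

# Hypothetical data and fit parameters (carried over from the earlier sketch)
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])

chi2, red_chi2, p = reduced_chi_square(x, y, sigma, m=0.96, b=1.15)
print(f"chi^2 = {chi2:.2f}, reduced chi^2 = {red_chi2:.2f}, P(larger) = {p:.2f}")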
A good way to visualize how well the line of best fit agrees with the data is to prepare a plot of residuals. Residuals are the differences between each data point and the predicted line or curve of best fit at the corresponding value of x, as shown in Figure 3 below. For a reasonable fit, about two-thirds of the data points should lie within one error bar of the horizontal line at zero. In Figure 3, the predicted line of best fit does not meet this criterion. Note, however, that meeting this criterion does not necessarily imply a good fit: if the error bars are large, two-thirds of the points could easily lie within one error bar of zero and yet still be quite spread out.
Figure 3: Example residual plot (taken from http://kossi.physics.hmc.edu/Courses/p23a/Analysis/Fitting.html)
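A residual plot of this kind can be produced directly from the fitted line; the sketch below uses matplotlib with the same hypothetical data and assumed fit parameters as in the earlier examples.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and assumed best-fit parameters from the earlier sketches
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])
m, b  = 0.96, 1.15

residuals = y - (m * x + b)          # data minus predicted line at each x

plt.errorbar(x, residuals, yerr=sigma, fmt="o", capsize=3)
plt.axhline(0.0, linestyle="--")     # horizontal reference line at zero
plt.xlabel("x")
plt.ylabel("residual  y - (m x + b)")
plt.title("Residuals: ~2/3 of points should lie within one error bar of zero")
plt.show()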
VI. Coefficient of Determination (R²)
Another statistic similar to the χ² is the coefficient of determination, or R². This method does not take into account the error associated with each point on the graph, and is more applicable to experiments with only one sample (a single measurement) at each x value. However, it too can be used to assess how well the best-fit line fits the original data. To do so, we need to know how well the independent variable, x, explains or accounts for the variation in the dependent variable, y. If x has no effect on the value of y, then the best estimate of y is the mean (ȳ); this is the baseline for comparison. There are three types of variation:
Total variation = observed − mean, or (y_i − ȳ)
Explained variation = predicted − mean, or (ŷ_i − ȳ)
Unexplained variation = observed − predicted, or (y_i − ŷ_i)
The variations at each point are squared and summed to obtain the overall total, explained and unexplained variation. Next, the coefficient of determination (R²) is calculated as the proportion of explained variation to total variation. This coefficient has a value between 0 and 1, where a high R² means a good fit and a low R² means the converse. R² values of >0.90 are considered good fits, while those <0.90 indicate that the model does not describe the data (de Paula, 2001).
The significance of R² is then determined using hypothesis testing. The test statistic for the hypothesis is F, which compares the explained variation to the unexplained variation. If a large proportion of the total variation is explained, then F will be large and we can infer that a significant proportion of the variation in y is explained by x.
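The sketch below shows how R² and the F statistic could be computed in Python for an unweighted straight-line fit; the data are hypothetical, and scipy.stats.f.sf supplies the p-value of the F test.

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

fit = stats.linregress(x, y)
y_hat = fit.slope * x + fit.intercept
y_bar = y.mean()

ss_total       = np.sum((y - y_bar) ** 2)        # total variation
ss_explained   = np.sum((y_hat - y_bar) ** 2)    # explained variation
ss_unexplained = np.sum((y - y_hat) ** 2)        # unexplained variation

r_squared = ss_explained / ss_total
n = len(x)
F = (ss_explained / 1) / (ss_unexplained / (n - 2))   # 1 and n-2 degrees of freedom
p = stats.f.sf(F, 1, n - 2)

print(f"R^2 = {r_squared:.3f}, F = {F:.1f}, p = {p:.4f}")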
VII. Confidence Level
If the R² is significant and the residuals have an acceptable pattern, it is likely that the model is reasonable. However, there will always be some error associated with the line. A confidence interval specifies a band on either side of the regression line within which the true value is expected to lie. If the confidence interval is very narrow, then the parameter has been estimated with high precision; if instead the confidence interval is wide, then the parameter has been estimated with low precision.
If the data contains mean observations compiled from multiple datasets, then the confidence interval is given as:
Ŷ ± t_{α,df} · S · sqrt( 1/n + (X_i* − X̄)² / Σ(X_i − X̄)² )        (20)
where:
t_{α,df} = the two-tailed t-value at significance level α with df = n − 2 degrees of freedom
S = the standard error of the estimate (i.e., sqrt(residual sum of squares / (n − 2)))
n = sample size
X_i* = the X observation used to predict Y
X_i = each X observation in the sample (X̄ is their mean)
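A sketch of equation (20) in Python follows, for an unweighted straight-line fit to hypothetical data; X* = 3.5 is an arbitrary point at which the mean response is predicted.

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

fit = stats.linregress(x, y)
n = len(x)
y_hat = fit.slope * x + fit.intercept
S = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # standard error of the estimate

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)        # two-tailed t-value

x_star = 3.5                                      # X* at which Y is predicted
y_star = fit.slope * x_star + fit.intercept
half_width = t_crit * S * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2
                                  / np.sum((x - x.mean()) ** 2))

print(f"Y({x_star}) = {y_star:.2f} ± {half_width:.2f} at {100*(1-alpha):.0f}% confidence")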
VIII. Fitting Non-linear Curves
It may be that a linear curve fit is not representative of the data, and a non-linear curve (for example, logarithmic, exponential, cubic, etc.) may yield more promising results under the analysis techniques described above. In such cases, a better fit of the curve can again be found using the method of least squares.
For a quadratic, cubic or higher-order polynomial equation, a curve optimizing the chi-square value can be found by solving a system of equations for the additional unknown coefficients, similar to the linear analysis. Logarithmic, exponential and trigonometric equations can also be handled in this way. An example of this method applied to a polynomial is as follows:
Assuming a quadratic equation y = m x² + b x + c, the sum of weighted errors becomes
χ² = Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c)²,
leading to the equations:
∂χ²/∂m = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) x_i² = 0
∂χ²/∂b = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) x_i = 0
∂χ²/∂c = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) = 0
This system of equations can be solved using matrices or direct substitution. From there, the same analysis techniques for
a linear equation can be used.
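A compact way to solve this system numerically is to write it in matrix form; the Python sketch below assumes hypothetical data with per-point σ values and fits y = m x² + b x + c by solving the weighted normal equations.

import numpy as np

# Hypothetical data with individual uncertainties
x     = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y     = np.array([1.2, 2.1, 3.6, 5.4, 8.1, 10.9])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4, 0.5])
w = 1.0 / sigma**2

# Design matrix with columns x^2, x, 1 so that y ≈ X @ [m, b, c]
X = np.column_stack([x**2, x, np.ones_like(x)])

# Weighted normal equations (X^T W X) p = X^T W y, equivalent to the system above
lhs = X.T @ (w[:, None] * X)
rhs = X.T @ (w * y)
m, b, c = np.linalg.solve(lhs, rhs)

print(f"y = {m:.3f} x^2 + {b:.3f} x + {c:.3f}")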
For curves whose coefficients enter non-linearly, such as y = sin(Ax) + Bx^C, the fit cannot be handled in such a straightforward manner with the least-squares method. For this type of curve, one may instead vary the coefficient values until an optimal or near-optimal chi-square value is found. This approach can also be used with other types of curves.
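As a rough sketch of this 'vary the coefficients' approach, the Python example below performs a coarse grid search for the A, B and C of y = sin(Ax) + Bx^C that minimize χ². The data and search ranges are invented; in practice a library routine such as scipy.optimize.curve_fit would usually be preferred.

import numpy as np

# Synthetic data generated from known coefficients plus noise, for illustration
rng = np.random.default_rng(0)
x = np.linspace(0.5, 5.0, 25)
y = np.sin(1.3 * x) + 0.8 * x**1.5 + rng.normal(0.0, 0.1, x.size)
sigma = np.full_like(x, 0.1)

def chi_square(A, B, C):
    model = np.sin(A * x) + B * x**C
    return np.sum(((y - model) / sigma) ** 2)

# Coarse grid search for the coefficients giving the smallest chi^2
best = None
for A in np.linspace(0.5, 2.0, 31):
    for B in np.linspace(0.1, 1.5, 29):
        for C in np.linspace(0.5, 2.5, 21):
            c2 = chi_square(A, B, C)
            if best is None or c2 < best[0]:
                best = (c2, A, B, C)

chi2, A, B, C = best
print(f"best chi^2 = {chi2:.1f} at A = {A:.2f}, B = {B:.2f}, C = {C:.2f}")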
IX. Conclusions
A variety of graphical and analytical techniques have been presented. These techniques provide a means of determining whether or not the trend line fitted to a set of data is an accurate representation of the true behavior. Knowing the true behavior of the data allows predictions and recommendations about future behavior to be made.
Using the chi-square method, one can attempt to fit different curves to the data to find the optimal reduced chi-square value. With the reduced chi-square and the degrees of freedom, one can determine a probability for how well the line fits the data. Calculating and plotting the confidence interval gives visual clues as to how well the line fits the data. Because the confidence interval defines the region in which the line falls with a given confidence, it is also a fair indicator of whether the process is constant (i.e., slope = 0) or whether the process/function increases or decreases with increasing x. For example, if any line that can be drawn within the confidence interval always has a positive slope, then it is unlikely that the process is constant.
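One simple way to make this check quantitative is to form a confidence interval on the slope itself; the sketch below assumes the (hypothetical) slope and slope uncertainty from the weighted least-squares example above.

from scipy import stats

m, dm, dof = 0.96, 0.08, 3            # assumed slope, its uncertainty, and n - 2

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
low, high = m - t_crit * dm, m + t_crit * dm

if low > 0 or high < 0:
    print(f"slope CI [{low:.2f}, {high:.2f}] excludes zero: the process is unlikely to be constant")
else:
    print(f"slope CI [{low:.2f}, {high:.2f}] includes zero: a constant process cannot be ruled out")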
With any measurement there is always some error, and statistical methods provide guidance on the uncertainty associated with experimental measurements. Although one can assign probabilities to the fit of the line through the data, these values should be used with caution. Systematic error or bias in the data will propagate through the statistical analysis and may lead the experimenter astray; it is the "garbage in, garbage out" principle. Increasing the number of datasets (or samples) reduces the influence of random error and hence increases the reliability of the methods outlined above, although it cannot remove systematic bias.
X. References
Caltech. (n.d.). http://nedwww.ipac.caltech.edu/level5/Leo/Stats7_2.html. Accessed October 7, 2004.
Introduction to models, numbers, & errors, part 1 (2003). Washington University, Department of Earth and Planetary Sciences Web Site: http://geodynamics.wustl.edu/classes/epsc109_2003/lectures/mod_num_err_2003_P1.pdf. Accessed October 19, 2004.
Miami University, Ohio. Contemporary Physics Prelab. http://www.cas.muohio.edu/~marcumsd/p293/lab0/lab0.htm. Accessed October 8, 2004.
de Paula, J.C. (2001). Experimental errors and data analysis. Haverford College, Department of Chemistry Web Site: http://www.haverford.edu/chem/302/data.pdf. Accessed October 19, 2004.
Saeta, P.N. (n.d.). Physics 23A Home Page – Fitting Data. http://kossi.physics.hmc.edu/Courses/p23a/Analysis/Fitting.html. Accessed October 10, 2004.
Stoecker, W.F. (1989). Design of Thermal Systems. 3rd Ed. McGraw-Hill: New York.
Walpole, R.E., Myers, R.H., & Myers, S.L. (1998). Probability and Statistics for Engineers and Scientists. 6th Ed. Prentice Hall: New Jersey.