BME1450
Group 5
Submitted by: Rosalie Wang, Allison Lever, Anushiya Sathiananthan, Elizabeth Logan, Tayyab Khan
I. Problem Statement
Given the following graph (Figure 1):
Figure 1: Hypothetical experimental data and best fit line
How confident are you that this line is an accurate representation of the true line?
Things to consider:
What is the probability that this is the true line representing the underlying process?
What is the probability that the underlying process is constant?
Does the process/function actually increase with x?
II. Graphical Approach to Error Analysis
This approach is often employed in teaching laboratories, where there is rarely enough time to complete a rigorous statistical analysis of the data. The method provides a quick visual estimate of the accuracy and reliability of the best-fit line.
Methodology:
Draw a parallelogram around the data points such that the top and bottom sides are parallel to the line of best-fit (Figure
2). These lines should pass through the most extreme upper and lower error bars. The side-lines should pass through the
most extreme horizontal error bars, if present, or the data points. Any subsequent data points collected should fall within
this area.
Figure 2: Parallelogram for error analysis
Next, to evaluate the possible deviation in your line of best fit, draw in the two diagonals of the parallelogram. These diagonals represent the least steep and most steep 'best-fit' lines that could be drawn:
y = m_min x + b_max        (1)
y = m_max x + b_min        (2)
The uncertainty in the original line can then be calculated as:
Δm = (m_max − m_min)/2        (3)
Δb = (b_max − b_min)/2        (4)
By comparing these values to those of acceptable error in your experiment, you can then pass some judgment as to
whether or not the best-fit line is acceptable and/or representative of the true solution.
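As a quick numerical illustration of equations (3) and (4), the sketch below (in Python) assumes hypothetical values of m_max, m_min, b_max and b_min read off the parallelogram diagonals by hand.

# Hypothetical extreme lines read off the parallelogram diagonals
m_max, b_min = 2.3, 0.4    # steepest diagonal:   y = m_max*x + b_min
m_min, b_max = 1.7, 1.1    # shallowest diagonal: y = m_min*x + b_max

delta_m = (m_max - m_min) / 2.0    # equation (3)
delta_b = (b_max - b_min) / 2.0    # equation (4)

print(f"slope uncertainty     Δm = {delta_m:.2f}")
print(f"intercept uncertainty Δb = {delta_b:.2f}")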
III. Hypothesis Testing
Although a linear fit to the data is the simplest and sometimes the appropriate solution, it is not always the best model for
data collected from our experiments. We therefore need to have some understanding of the underlying process that we are
studying in order to make a good assumption as to the model that may fit. We can use statistical hypothesis testing to
enable us to reject or confirm models (Introduction to Models, 2003).
For example: We may test the hypothesis of whether y (the dependent variable) varies with x (the independent variable).
Given the line y = mx + b, we would have two hypotheses: the null hypothesis H0 which claims that the slope m = 0, and
the hypothesis H1 which claims that the slope m ≠ 0 (slope is not equal to zero). They would be stated like so:
H0: m = 0
H1: m ≠ 0
It is the null hypothesis that we attempt to disprove; H1 is the alternative we accept if H0 is rejected. Each test involves a test statistic (e.g., sample mean, sample variance, etc.) and a critical region specifying which values of the test statistic will result in the rejection of H0. A Type I error occurs when we reject H0 when it is in fact true. The consequences of such an error are serious, and the probability of making it should be minimized.
To perform the test, we must determine the cutoff point for the critical region that gives a Type I error probability of, for example, 5%. To do this, we calculate the test statistic, which is analogous to the parameter in the hypothesis:
t = (Estimate − Hypothesized Value) / (Estimated Error)
where the hypothesized value is zero (m = 0).
To determine the critical value, we need to pick the significance level (e.g., α = 0.05). The p-value is the smallest significance level at which the observed value of the test statistic becomes significant. The smaller the p-value, the stronger the evidence against the null hypothesis.
Using the test statistic (t) and the degrees of freedom (df), we may look up the corresponding p-value in a t-distribution table. If the observed p-value is less than or equal to the significance level, the null hypothesis is rejected; if it is greater than the significance level, the null hypothesis is not rejected. When the null hypothesis is rejected, the outcome is said to be statistically significant.
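As an illustration, the slope test above can be carried out numerically. The following is a minimal sketch in Python, assuming made-up x and y data standing in for Figure 1; scipy.stats.linregress supplies the fitted slope and its standard error.

import numpy as np
from scipy import stats

# Hypothetical data standing in for the measurements in Figure 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

res = stats.linregress(x, y)        # least-squares fit of y = m*x + b
t = res.slope / res.stderr          # t = (estimate - 0) / estimated error
df = len(x) - 2                     # degrees of freedom for a two-parameter fit
p = 2 * stats.t.sf(abs(t), df)      # two-tailed p-value (equals res.pvalue)

alpha = 0.05
print(f"m = {res.slope:.3f} ± {res.stderr:.3f}, t = {t:.2f}, p = {p:.4f}")
print("reject H0: slope differs from zero" if p <= alpha else "fail to reject H0")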
The use of probability and statistics can help us to estimate quantitatively our level of uncertainty about the likelihood of the
events we observe. The limitation is that statistical tools do not have the power to tell us unequivocally that the model we
chose to represent what we observed is in fact the truth.
For the remaining discussion below, we assume first that a straight line fits the data presented and begin with ways of
analyzing it. Subsequently we will discuss nonlinear fits to the data.
IV. Best-fit Lines
While the above graphical analysis may be used for a rough estimate of the uncertainty in a set of measurements, more analytical approaches will provide better results. Although it is simple enough to take a ruler and draw a best-fit line through some data to observe a trend, analytical approaches appreciably increase the accuracy of the prediction of the observed function. By applying these methods, it is also easier to quantitatively evaluate the accuracy of the line.
A common approach to analyzing apparently linear data is to develop a linear-regression model for the data set. The method presented here is fairly rigorous and deals well with the variable uncertainty associated with the y values. This error (Δy_i), as shown in Figure 1, corresponds to the standard deviation (σ_i) for the set of data collected at that x value.
A regression line can be obtained by minimizing the sum of the squares of the error for each y_i. This rigorous approach (also known as the method of least squares) is as follows, adapted from MUOhio (2004). It takes into consideration the fact that each data point has its own specific σ_i by applying a weight w_i to each error term:
w_i = 1/σ_i²        (5)
So, in fitting a straight line y = m x + b to N data points, the sum of the weighted errors,
χ² = Σ_{i=1}^{N} w_i (y_i − m x_i − b)²        (6)
is minimized with respect to the slope m and the y-intercept b. This leads to the equations
∂χ²/∂m = −2 Σ w_i x_i (y_i − m x_i − b) = 0        (7)
∂χ²/∂b = −2 Σ w_i (y_i − m x_i − b) = 0        (8)
In order to solve equations 7 and 8, the additional parameters A through E are defined as follows:
A = Σ w_i x_i,   B = Σ w_i,   C = Σ w_i y_i,   D = Σ w_i x_i²,   E = Σ w_i x_i y_i        (9-13)
Therefore, by substitution of equations 9-13 into (7) and (8) we obtain:
0 = E − m D − b A        (14)
0 = C − m A − b B        (15)
respectively.
This pair of equations can be solved for m and b by Cramer’s rule, such that:
m = (B E − A C) / (B D − A²)        (16)
b = (C D − A E) / (B D − A²)        (17)
To look at the error associated with the slope and y intercept, the following two relationships are used:
Δm = sqrt( B / (B D − A²) )        (18)
Δb = sqrt( D / (B D − A²) )        (19)
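A minimal numerical sketch of equations (5)-(19) is given below in Python; the x, y and σ values are hypothetical and simply stand in for a data set with per-point uncertainties.

import numpy as np

# Hypothetical data with an individual standard deviation for each y value
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])

w = 1.0 / sigma**2                 # equation (5): w_i = 1/sigma_i^2

A = np.sum(w * x)                  # equations (9)-(13)
B = np.sum(w)
C = np.sum(w * y)
D = np.sum(w * x**2)
E = np.sum(w * x * y)

Delta = B * D - A**2               # determinant used in Cramer's rule
m = (B * E - A * C) / Delta        # equation (16)
b = (C * D - A * E) / Delta        # equation (17)
dm = np.sqrt(B / Delta)            # equation (18)
db = np.sqrt(D / Delta)            # equation (19)

print(f"y = ({m:.3f} ± {dm:.3f}) x + ({b:.3f} ± {db:.3f})")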
V. Analysis of the Best-fit Line
Once the model has been fit, it is important to evaluate whether or not the model is a 'good' fit. The analytical tool typically used to do this is the chi-square (χ²) test. “The objective of the test is to assign a probability that if the measurements were repeated, the weighted sum of the squared deviations [y_i – f(x_i)]² (where f(x_i) is the predicted line) would be larger, that is, the "miss" of the fit would be more” (MUOhio, 2004). In other words, this test assigns a probability that, if the experiment were repeated, the scatter about the chosen model would be at least as large as that observed, assuming the model is accurate. For a detailed breakdown of how to determine the χ² value for a set of data, we refer you to a statistics reference source (e.g. http://nedwww.ipac.caltech.edu/level5/Leo/Stats7_2.html).
The reduced χ² is used to determine the probability that the predicted line of best fit represents the data set, and is simply χ² divided by the degrees of freedom. The expected value of the reduced χ² is 1, which implies that the model is a good fit and that the data is described by the model with reasonable statistical uncertainties; that is, within the error bars, one appears to have a good fit to the model. If the reduced χ² value is significantly larger than 1, then the model does not fit the data (alternatively, this could also mean that there is some unaccounted-for systematic error). If the reduced χ² value is significantly less than 1, you have most likely overestimated the size of your statistical errors. However, in the case being considered here, the error expressed for each data point is the standard deviation of the data set collected at that x value. Therefore, it is necessary to check that a statistically relevant amount of data was collected. In order to properly analyze the data and apply regression and error-analysis techniques, the collected data should follow a Gaussian distribution. Should the data set (for a given x value) not follow this distribution, it is likely that not enough data points were collected, and both the data point and its error bars may not be a true reflection of the underlying process.
Using a table of reduced χ² versus degrees of freedom, one can also assign a probability for whether the data is consistent with the best-fit line.
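A sketch of this check in Python is shown below; the slope, intercept and per-point σ values are the hypothetical ones from the weighted least-squares example above, and scipy.stats.chi2.sf gives the probability of obtaining a larger χ² by chance.

import numpy as np
from scipy import stats

def reduced_chi_square(x, y, sigma, m, b):
    """Return chi^2, reduced chi^2 and the probability of a larger chi^2."""
    resid = y - (m * x + b)                 # deviations from the fitted line
    chi2 = np.sum((resid / sigma) ** 2)     # weighted sum of squared deviations
    dof = len(x) - 2                        # two fitted parameters (m and b)
    return chi2, chi2 / dof, stats.chi2.sf(chi2, dof)

# Hypothetical data and fit parameters (carried over from the earlier sketch)
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])

chi2, red_chi2, p = reduced_chi_square(x, y, sigma, m=0.96, b=1.15)
print(f"chi^2 = {chi2:.2f}, reduced chi^2 = {red_chi2:.2f}, P(larger) = {p:.2f}")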
A good way to visualize how well the line of best fit agrees with the data is to prepare a plot of residuals. Residuals are the differences between each data point and the predicted line or curve of best fit at the corresponding value of x, as shown in Figure 3 below. For a reasonable fit, about two-thirds of the data points should lie within one error bar of the horizontal line at zero. In Figure 3, the predicted line of best fit does not meet this criterion. Note, however, that meeting this criterion does not necessarily imply a good fit: if the error bars are large, two-thirds of the points could easily lie within one error bar of zero and yet still be quite spread out.
Figure 3: Example residual plot (taken from http://kossi.physics.hmc.edu/Courses/p23a/Analysis/Fitting.html)
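A residual plot of this kind can be produced directly from the fitted line; the sketch below uses matplotlib with the same hypothetical data and assumed fit parameters as in the earlier examples.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and assumed best-fit parameters from the earlier sketches
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5])
m, b  = 0.96, 1.15

residuals = y - (m * x + b)          # data minus predicted line at each x

plt.errorbar(x, residuals, yerr=sigma, fmt="o", capsize=3)
plt.axhline(0.0, linestyle="--")     # horizontal reference line at zero
plt.xlabel("x")
plt.ylabel("residual  y - (m x + b)")
plt.title("Residuals: ~2/3 of points should lie within one error bar of zero")
plt.show()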
VI. Coefficient of Determination (R²)
Another statistic similar to the χ² is the coefficient of determination, or R². This method does not take into account the error associated with each point on the graph, and is more applicable to experiments with only one sample (a single measurement) at each x value. However, it too can be used to assess how well the best-fit line fits the original data. To do so, we need to know how well the independent variable, x, explains or accounts for the variation in the dependent variable, y. If x has no effect on the value of y, then the best estimate of y is the mean (ȳ); this is the baseline for comparison. There are three types of variation:
Total variation = observed − mean, or (y_i − ȳ)
Explained variation = predicted − mean, or (ŷ_i − ȳ)
Unexplained variation = observed − predicted, or (y_i − ŷ_i)
The variations at each point are squared and summed to obtain the overall total, explained and unexplained variation. Next, the coefficient of determination (R²) is calculated as the proportion of explained variation to total variation. This coefficient has a value between 0 and 1, where a high R² means a good fit and a low R² means the converse. R² values of >0.90 are considered good fits, while those <0.90 indicate that the model does not describe the data (de Paula, 2001).
The significance of R² is then determined using hypothesis testing. The test statistic for the hypothesis is F, which compares the explained variation to the unexplained variation. If a large proportion of the total variation is explained, then F will be large and we can infer that a significant proportion of the variation in y is explained by x.
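The sketch below shows how R² and the F statistic could be computed in Python for an unweighted straight-line fit; the data are hypothetical, and scipy.stats.f.sf supplies the p-value of the F test.

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

fit = stats.linregress(x, y)
y_hat = fit.slope * x + fit.intercept
y_bar = y.mean()

ss_total       = np.sum((y - y_bar) ** 2)        # total variation
ss_explained   = np.sum((y_hat - y_bar) ** 2)    # explained variation
ss_unexplained = np.sum((y - y_hat) ** 2)        # unexplained variation

r_squared = ss_explained / ss_total
n = len(x)
F = (ss_explained / 1) / (ss_unexplained / (n - 2))   # 1 and n-2 degrees of freedom
p = stats.f.sf(F, 1, n - 2)

print(f"R^2 = {r_squared:.3f}, F = {F:.1f}, p = {p:.4f}")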
VII. Confidence Level
If the R² is significant and the residuals have an acceptable pattern, it is likely that the model is reasonable. However, there will always be some error associated with the line. A confidence interval specifies a band on either side of the regression line within which the true value is expected to lie. If the confidence interval is very narrow, then the parameter has been estimated with high precision; if instead the confidence interval is wide, then the parameter has been estimated with low precision.
If the data contains mean observations compiled from multiple datasets, then the confidence interval is given as:
Ŷ ± t_{α,df} · S · sqrt( 1/n + (X_i* − X̄)² / Σ(X_i − X̄)² )        (20)
where:
t_{α,df} = the two-tailed t-value at significance level α with df = n − 2 degrees of freedom
S = the standard error of the estimate (i.e., sqrt(residual sum of squares / (n − 2)))
n = sample size
X_i* = the X observation used to predict Y
X_i = each X observation in the sample (X̄ is their mean)
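A sketch of equation (20) in Python follows, for an unweighted straight-line fit to hypothetical data; X* = 3.5 is an arbitrary point at which the mean response is predicted.

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

fit = stats.linregress(x, y)
n = len(x)
y_hat = fit.slope * x + fit.intercept
S = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # standard error of the estimate

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)        # two-tailed t-value

x_star = 3.5                                      # X* at which Y is predicted
y_star = fit.slope * x_star + fit.intercept
half_width = t_crit * S * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2
                                  / np.sum((x - x.mean()) ** 2))

print(f"Y({x_star}) = {y_star:.2f} ± {half_width:.2f} at {100*(1-alpha):.0f}% confidence")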
VIII. Fitting Non-linear Curves
It may be that a linear curve fit is not representative of the data, and a non-linear curve (for example, logarithmic, exponential, cubic, etc.) may yield more promising results under the analysis techniques described above. In such cases, a better fit of the curve can again be found using the method of least squares.
For a quadratic, cubic or higher-order polynomial equation, a curve optimizing the chi-square value can be found by solving a system of equations for the additional unknown coefficients, similar to the linear analysis. Logarithmic, exponential and trigonometric equations can also be handled in this way. An example of this method applied to a polynomial is as follows:
Assuming a quadratic equation y = m x² + b x + c, the sum of weighted errors becomes
χ² = Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c)²,
leading to the equations:
∂χ²/∂m = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) x_i² = 0
∂χ²/∂b = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) x_i = 0
∂χ²/∂c = −2 Σ_{i=1}^{N} w_i (y_i − m x_i² − b x_i − c) = 0
This system of equations can be solved using matrices or direct substitution. From there, the same analysis techniques for
a linear equation can be used.
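A compact way to solve this system numerically is to write it in matrix form; the Python sketch below assumes hypothetical data with per-point σ values and fits y = m x² + b x + c by solving the weighted normal equations.

import numpy as np

# Hypothetical data with individual uncertainties
x     = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y     = np.array([1.2, 2.1, 3.6, 5.4, 8.1, 10.9])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4, 0.5])
w = 1.0 / sigma**2

# Design matrix with columns x^2, x, 1 so that y ≈ X @ [m, b, c]
X = np.column_stack([x**2, x, np.ones_like(x)])

# Weighted normal equations (X^T W X) p = X^T W y, equivalent to the system above
lhs = X.T @ (w[:, None] * X)
rhs = X.T @ (w * y)
m, b, c = np.linalg.solve(lhs, rhs)

print(f"y = {m:.3f} x^2 + {b:.3f} x + {c:.3f}")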
For curves whose coefficients enter non-linearly, such as y = sin(Ax) + Bx^C, the fit cannot be handled in such a straightforward manner with the least-squares method. For this type of curve, one may instead vary the coefficient values until an optimal or near-optimal chi-square value is found. This approach can also be used with other types of curves.
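As a rough sketch of this 'vary the coefficients' approach, the Python example below performs a coarse grid search for the A, B and C of y = sin(Ax) + Bx^C that minimize χ². The data and search ranges are invented; in practice a library routine such as scipy.optimize.curve_fit would usually be preferred.

import numpy as np

# Synthetic data generated from known coefficients plus noise, for illustration
rng = np.random.default_rng(0)
x = np.linspace(0.5, 5.0, 25)
y = np.sin(1.3 * x) + 0.8 * x**1.5 + rng.normal(0.0, 0.1, x.size)
sigma = np.full_like(x, 0.1)

def chi_square(A, B, C):
    model = np.sin(A * x) + B * x**C
    return np.sum(((y - model) / sigma) ** 2)

# Coarse grid search for the coefficients giving the smallest chi^2
best = None
for A in np.linspace(0.5, 2.0, 31):
    for B in np.linspace(0.1, 1.5, 29):
        for C in np.linspace(0.5, 2.5, 21):
            c2 = chi_square(A, B, C)
            if best is None or c2 < best[0]:
                best = (c2, A, B, C)

chi2, A, B, C = best
print(f"best chi^2 = {chi2:.1f} at A = {A:.2f}, B = {B:.2f}, C = {C:.2f}")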
IX. Conclusions
A variety of graphical and analytical techniques have been presented. These techniques provide a means of determining whether or not the trend line fitted to a set of data is an accurate representation of the true behavior. Knowing the true behavior of the data allows predictions and recommendations about future behavior to be made.
Using the chi-square method, one can attempt to fit different curves to the data to find the optimal reduced chi-square value. With the reduced chi-square and the degrees of freedom, one can determine a probability for how well the line fits the data. Calculating and plotting the confidence interval gives visual clues as to how well the line fits the data. Because the confidence interval defines the region in which the line falls with a given confidence, it is also a fair indicator of whether the process is constant (i.e., slope = 0) or whether the process/function increases or decreases with increasing x. For example, if any line that can be drawn within the confidence interval always has a positive slope, then it is unlikely that the process is constant.
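One simple way to make this check quantitative is to form a confidence interval on the slope itself; the sketch below assumes the (hypothetical) slope and slope uncertainty from the weighted least-squares example above.

from scipy import stats

m, dm, dof = 0.96, 0.08, 3            # assumed slope, its uncertainty, and n - 2

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
low, high = m - t_crit * dm, m + t_crit * dm

if low > 0 or high < 0:
    print(f"slope CI [{low:.2f}, {high:.2f}] excludes zero: the process is unlikely to be constant")
else:
    print(f"slope CI [{low:.2f}, {high:.2f}] includes zero: a constant process cannot be ruled out")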
With any measurement there is always some error, and statistical methods provide guidance on the uncertainty associated with experimental measurements. Although one can assign probabilities to the fit of the line through the data, these values should be used with caution. Systematic error or bias in the data will propagate through the statistical analysis and may lead the experimenter astray; it is the "garbage in, garbage out" principle. Increasing the number of datasets (or samples) reduces the influence of random error and hence increases the reliability of the methods outlined above, although it cannot remove systematic bias.
X. References
Caltech. (n.d.). http://nedwww.ipac.caltech.edu/level5/Leo/Stats7_2.html. Accessed October 7, 2004.
Introduction to models, numbers, & errors, part 1 (2003). Washington University, Department of Earth and Planetary Sciences Web Site: http://geodynamics.wustl.edu/classes/epsc109_2003/lectures/mod_num_err_2003_P1.pdf. Accessed October 19, 2004.
Miami University, Ohio. Contemporary Physics Prelab. http://www.cas.muohio.edu/~marcumsd/p293/lab0/lab0.htm. Accessed October 8, 2004.
de Paula, J.C. (2001). Experimental errors and data analysis. Haverford College, Department of Chemistry Web Site: http://www.haverford.edu/chem/302/data.pdf. Accessed October 19, 2004.
Saeta, P.N. (n.d.). Physics 23A Home Page – Fitting Data. http://kossi.physics.hmc.edu/Courses/p23a/Analysis/Fitting.html. Accessed October 10, 2004.
Stoecker, W.F. (1989). Design of Thermal Systems. 3rd Ed. McGraw-Hill: New York.
Walpole, R.E., Myers, R.H., & Myers, S.L. (1998). Probability and Statistics for Engineers and Scientists. 6th Ed. Prentice Hall: New Jersey.