College of Engineering and Computer Science
Computer Science Department

Computer Science 106
Computing in Engineering and Science
Spring 2006
Class number: 11672
Instructor: Larry Caretto

Programming Exercise Eight (and last)

Objective

This assignment provides an example of the use of one-dimensional arrays and introduces the concept of regression analysis, which is used to estimate a relationship between two variables.

Mathematical Background

If several measurements are made on pairs of experimental data $\{(x_i, y_i),\ i = 1, \ldots, N\}$, we can use a technique known as regression analysis to determine an approximate equation of a straight line that gives a best fit to the data. The equation of this best-fit line is written as follows.

$$\hat{y} = a + b x$$

In this equation, we use the symbol $\hat{y}$ instead of $y$ to indicate that the predicted value found from the equation $\hat{y} = a + b x$ is an approximate result. For a given data point, $(x_i, y_i)$, the value of $y_i$ represents the actual data, and we would obtain the predicted value of $y$ at the point $x = x_i$ from the equation $\hat{y}_i = a + b x_i$. The magnitude of the difference between the measured and predicted values is $|y_i - \hat{y}_i|$.

[Figure: data points scattered about the fitted line, plotted with axes x and y; dotted lines mark the coordinates $x_i$, $y_i$, and $\hat{y}_i$ for a typical point.]

In the chart, the data points are indicated by the small ellipses. The coordinates of a typical data point are shown by the dotted lines indicating the coordinates $x_i$ and $y_i$. The solid line is the fitted regression line, $\hat{y} = a + b x$. The point where the dotted line at $x = x_i$ crosses the regression line has the coordinates $(x_i, \hat{y}_i)$. In this particular example the value of $\hat{y}_i$ is less than the value of $y_i$. There is a large scatter of data points about the regression line in this example.

The example plot above might represent calibration data on an instrument. The x values would denote the instrument reading and the y values would indicate the true value of the quantity being measured.
Once the calibration tests were completed, it would be useful to have a simple equation to relate the instrument reading (x) to the actual quantity being measured (y).

In addition to finding the values of a and b that give the best-fit line, we would also like to have some measure of how well the line fits the data. Two different goodness-of-fit measures, the standard error and the coefficient of determination, are presented below in the equations section.

Jacaranda (Engineering) 3333 E-mail: [email protected] Mail Code 8348 Phone: 818.677.6448 Fax: 818.677.7062

Programming exercise eight Comp 106, L. S. Caretto, Spring 2006 Page 2

Equations used

The equations used to calculate a and b can be found by an analysis that minimizes the sum of the squared distances between the actual data points, $y_i$, and the fitted points, $\hat{y}_i = a + b x_i$. The results of this analysis are shown below. The equations to compute the intercept, a, and the slope, b, in terms of the entire set of data, $\{x_i, y_i\}$, use the following definitions of the mean values:

$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \qquad \text{and} \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

With these definitions, the slope, b, and the intercept, a, are found as follows.

$$b = \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x} \bar{y}}{\sum_{i=1}^{N} x_i^2 - N \bar{x}^2} \qquad \text{and} \qquad a = \bar{y} - b \bar{x}$$

A statistical estimate of the variability can be found from the difference between the actual data points, $y_i$, and the estimated values, $\hat{y}_i = a + b x_i$. This measure, which is called the standard error and has the symbol $s_{y|x}$, is defined as follows:

$$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - 2}}$$

Another measure, called the $R^2$ value or the coefficient of determination, is a measure of the amount of variation in the data that is explained by the regression equation. An $R^2$ value of zero means that the regression cannot explain any of the variation in y; an $R^2$ value of one means that all the variation in y can be explained by the regression equation.
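The slope, intercept, and standard-error formulas above can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the program distributed on the course web site: the names `LineFit` and `fitLine` are hypothetical, and the data are assumed to be already stored in two arrays of equal length.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the fit results (not part of the course program).
struct LineFit {
    double a;    // intercept
    double b;    // slope
    double syx;  // standard error, s_{y|x}
};

// Fits y-hat = a + b x to n data pairs using the mean-value formulas.
// Assumes n > 2 and that the x values are not all identical.
LineFit fitLine(const double x[], const double y[], int n) {
    assert(n > 2);  // the standard error divides by N - 2

    // One pass accumulates every sum the slope and intercept need.
    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
    for (int i = 0; i < n; i++) {
        sumX += x[i];
        sumY += y[i];
        sumXY += x[i] * y[i];
        sumXX += x[i] * x[i];
    }
    double xBar = sumX / n;
    double yBar = sumY / n;

    double denom = sumXX - n * xBar * xBar;
    assert(denom != 0.0);  // zero only when all x values are equal

    LineFit fit;
    fit.b = (sumXY - n * xBar * yBar) / denom;
    fit.a = yBar - fit.b * xBar;

    // A second pass sums the squared residuals (y_i - y-hat_i)^2.
    double sumSqResid = 0.0;
    for (int i = 0; i < n; i++) {
        double resid = y[i] - (fit.a + fit.b * x[i]);
        sumSqResid += resid * resid;
    }
    fit.syx = std::sqrt(sumSqResid / (n - 2));
    return fit;
}
```

Calling `fitLine(x, y, 5)` with the five-point test case given in Task One should return values close to the a = -5.77566 and b = 0.0122238 listed there.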
The value of $R^2$ is computed from the following equation:

$$R^2 = 1 - \frac{(N - 2)\, s_{y|x}^2}{\sum_{i=1}^{N} y_i^2 - N \bar{y}^2}$$

Task One

You can use a previously written program for this task. Download the program file from the exercise page on the course web site. Review that program and see how the various functions are used to enter array data and do calculations with array data in loops. Note that the program determines the number of data points (N in the equations above) by reading the data. The user is not required to count the data and input a value for N. The program has summary output to the screen and detailed output of a, b, $s_{y|x}$, $R^2$, and a table of $x_i$, $y_i$, and $\hat{y}_i$.

Prepare a data file for the test case below. Review the input statements to see how you should prepare this file. Run the program with your test data file to make sure you are using the program correctly by matching the results below.

Test Data and Results for Linear Regression

    x_i     y_i
    510     1.3
    533     0.1
    603     1.5
    670     1.8
    750     3.9

Results: a = -5.77566; b = 0.0122238; $R^2$ = 0.768457

Copy the output file from the test data set in the table above to your submission file. Do not copy the code or the full output file from the downloaded data set to the submission file.

Task Two

Download the data file for this exercise from the course web site. This data file has several pairs of $(x_i, y_i)$ data points. In this task you will obtain some overall statistics (1) for the entire data file, (2) for the $(x_i, y_i)$ data points in which $x_i \ge 1000$, and (3) for the $(x_i, y_i)$ data points in which $x_i < 1000$.

You can use some of the code from task one for this task. You do not have to keep the same function structure used for task one. However, you should be able to use the function that reads data from an input data file with no changes.
The program you write for this task should compute and print out the results listed below for the data in the data file that you download:

* The count, mean value and standard deviation of all $x_i$ data.
* The count, mean value and standard deviation of all $y_i$ data.
* The maximum and minimum values of $x_i$ and $y_i$ for the full set of data.
* The count, mean value and standard deviation of the subset of $x_i$ data for which $x_i \ge 1000$.
* The count, mean value and standard deviation of the subset of $y_i$ data for which the corresponding value of $x_i \ge 1000$.
* The maximum and minimum values of $x_i$ and $y_i$ for the subset of data in which the value of $x_i \ge 1000$.
* The count, mean value and standard deviation of the subset of $x_i$ data for which $x_i < 1000$.
* The count, mean value and standard deviation of the subset of $y_i$ data for which the corresponding value of $x_i < 1000$.
* The maximum and minimum values of $x_i$ and $y_i$ for the subset of data in which the value of $x_i < 1000$.

Copy your code and your output to the submission file for this task. You can obtain the correct answers from the course web site. Make sure that your results match these correct answers before submitting your assignment.

Equation and computational technique for task two

The standard deviation, s, is defined by the first equation below. Introductory statistics texts show that this defining equation may be converted to the equivalent computational form shown under the second square root sign.

$$s = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N - 1}} = \sqrt{\frac{\sum_{i=1}^{N} y_i^2 - N \bar{y}^2}{N - 1}}$$

Using the computational form in the second equation, it is possible to compute the sum of the $y_i$, required for the mean, and the sum of the $y_i^2$, required for the standard deviation, in the same loop.

When we have N elements in a C++ array, the subscripts for the array elements run from zero to N-1. All the elements in the array are accessed by a for loop such as the following: for ( int i = 0; i < N; i++ ).
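The single-loop technique described above can be sketched as follows. This is a minimal illustration, assuming the values are already in an array; the names `Stats` and `summarize` are hypothetical and do not come from the course program, and the sketch does not handle the subsets or the maximum and minimum values that the task also asks for.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the summary statistics.
struct Stats {
    int count;
    double mean;
    double stdDev;
};

// Computes the count, mean, and standard deviation of n values in one loop,
// using the computational form s = sqrt((sum(y_i^2) - N*ybar^2) / (N - 1)).
Stats summarize(const double y[], int n) {
    assert(n >= 2);  // the standard deviation divides by N - 1

    double sum = 0.0;    // running sum of y_i, needed for the mean
    double sumSq = 0.0;  // running sum of y_i^2, needed for s
    for (int i = 0; i < n; i++) {
        sum += y[i];
        sumSq += y[i] * y[i];
    }

    double mean = sum / n;
    double variance = (sumSq - n * mean * mean) / (n - 1);
    if (variance < 0.0) {
        variance = 0.0;  // round-off can leave a tiny negative value
    }

    Stats result;
    result.count = n;
    result.mean = mean;
    result.stdDev = std::sqrt(variance);
    return result;
}
```

The computational form subtracts two numbers that can be nearly equal when the mean is large compared with the spread of the data, so some precision may be lost to round-off; the guard against a negative variance is one cautious way to handle that case.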
Requirements for Submission

Due Date: May 4.

Submit a copy of the submission file with all the elements asked for in each task above:

* A copy of the output file from task one
* Your code for task two
* A copy of the output for task two that has the correct answers

Printed submissions are due on or before the end of the laboratory period on the due date. Alternatively, you can e-mail submissions by 11:59 pm on the due date. Only one submission, printed or e-mail, is required.