College of Engineering and Computer Science
Computer Science Department
Computer Science 106
Computing in Engineering and Science
Spring 2006 Class number: 11672 Instructor: Larry Caretto
Programming Exercise Eight (and last)
Objective
This assignment provides an example in the use of one-dimensional arrays and introduces the
concept of regression analysis, which is used to estimate a relationship between two variables.
Mathematical Background
If several measurements are made on pairs of experimental data {(xi, yi), i = 1,...,N}, we can use a
technique, known as regression analysis, to determine an approximate equation of a straight line
that gives a best fit to the data. The equation of this best-fit line is written as follows.
y^ = a + b x
In this equation, we use the symbol y^ instead of y to indicate that the predicted value found from
the equation, y^ = a + b x, is an approximate result. For a given data point, (xi, yi), the value of yi
represents the actual data, and we would obtain the predicted value of y at the point x = xi from
the equation y^i = a + b xi. The difference between the measured and predicted value is |yi - y^i|.
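[Figure: scatter plot of y versus x. Small ellipses indicate the data points, a solid line shows the fitted line, and dotted lines mark the coordinates xi, yi, and ŷi of one typical point.]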
In the chart, the data points are indicated by
the small ellipses. The coordinates of a typical
data point are shown by the dotted lines indicating the
coordinates xi and yi. The solid line is the fitted
regression line, y^ = a + b x. The point where the dotted
line at x = xi crosses the regression line has the
coordinates (xi,y^i). In this particular example the value of
y^i is less than the value of yi. There is a large scatter of
data points about the regression line in this example.
The example plot above might represent calibration data on an instrument. The x values would
denote the instrument reading and the y values would indicate the true value of the quantity being
measured. Once the calibration tests were completed, it would be useful to have a simple
equation to relate the instrument reading (x) to the actual quantity being measured (y).
In addition to finding the values of a and b that give the best-fit line, we would also like to have
some measure of how well the line fits the data. Two different goodness-of-fit measures, the
standard error and the coefficient of determination (R2), are presented below in the equations section.
Jacaranda (Engineering) 3333
E-mail: [email protected]
Mail Code: 8348
Phone: 818.677.6448
Fax: 818.677.7062
Programming exercise eight
Comp 106, L. S. Caretto, Spring 2006
Equations used
The equations used to calculate a and b can be found by an analysis which minimizes the sum of the
squared differences between the actual data points, yi, and the fitted points, y^i = a + b xi. The results of this
analysis are shown below. The equations to compute the intercept, a, and the slope, b, in terms of
the entire set of data, {xi, yi}, use the following definitions of mean values:
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \qquad \text{and} \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
With these definitions, the slope, b, and the intercept, a, are found as follows.
$$b = \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x}\,\bar{y}}{\sum_{i=1}^{N} x_i^2 - N \bar{x}^2} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}$$
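The slope and intercept formulas can be sketched in C++ as follows. This is only an illustration, not the program supplied with the exercise; the names `LineFit` and `fitLine` are hypothetical, and the sketch assumes the x values are not all equal (otherwise the denominator is zero).

```cpp
#include <cassert>

// Hypothetical helper type (not from the course program): the fitted coefficients.
struct LineFit {
    double a;  // intercept
    double b;  // slope
};

// Least-squares fit of y^ = a + b x from the sums and means defined above.
// Assumes n >= 2 and that the x values are not all identical.
LineFit fitLine(const double x[], const double y[], int n) {
    assert(n >= 2);
    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
    for (int i = 0; i < n; i++) {
        sumX += x[i];
        sumY += y[i];
        sumXY += x[i] * y[i];
        sumXX += x[i] * x[i];
    }
    double xBar = sumX / n;  // mean of x
    double yBar = sumY / n;  // mean of y
    LineFit fit;
    fit.b = (sumXY - n * xBar * yBar) / (sumXX - n * xBar * xBar);
    fit.a = yBar - fit.b * xBar;
    return fit;
}
```

All four sums are accumulated in a single pass over the arrays, so the data need to be read only once.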
A statistical estimate of the variability can be found from the difference between the actual data
points yi and the estimated values y^i = a + b xi. This measure, which is called the standard error
and has the symbol sy|x, is defined as follows:
$$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - 2}}$$
Another measure, called the R2 value or the coefficient of determination, is considered to be a measure
of the amount of variation in the data which is explained by the regression equation. An R2 value
of zero means that the regression cannot explain any of the variation in y; an R2 value of one
means that all the variation in y can be explained by the regression equation. The value of R2 is
computed from the following equation:
R 1
2
( N  2) s 2y|x
N
y
i 1
2
i

N y
2
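The R2 equation can be sketched in the same style. The function name `rSquared` is hypothetical, and the sketch assumes the standard error has already been computed and that the y values are not all equal (otherwise the denominator is zero).

```cpp
#include <cassert>

// R^2 from the standard error s_{y|x} and the y data, following the equation above.
// Assumes n > 2 and that the y values are not all identical.
double rSquared(const double y[], int n, double sYX) {
    assert(n > 2);
    double sumY = 0.0, sumYY = 0.0;
    for (int i = 0; i < n; i++) {
        sumY += y[i];
        sumYY += y[i] * y[i];
    }
    double yBar = sumY / n;
    double totalVariation = sumYY - n * yBar * yBar;  // sum of y_i^2 minus N * ybar^2
    return 1.0 - (n - 2) * sYX * sYX / totalVariation;
}
```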
Task One
You can use a previously written program for this task. Download the program file from the
exercise page on the course web site. Review that program and see how the various functions
are used to enter array data and do calculations with array data in loops. Note that the program
determines the number of data points (N in the equations above) by reading the data. The user is
not required to count the data and input a value for N. The program has summary output to the
screen and detailed output of a, b, sy|x, R2, and a table of xi, yi, and ŷi.
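One common way for a program to determine N by reading the data is to keep reading pairs until the input runs out. The sketch below illustrates that idea only; it is not the input function from the downloaded program, the name `readPairs` is hypothetical, and it assumes the file holds whitespace-separated x y pairs, which may not match the actual file layout.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the function without a file

// Reads (x, y) pairs until the input ends, a read fails, or the arrays are full.
// Returns the number of pairs stored; this count plays the role of N.
int readPairs(std::istream& in, double x[], double y[], int maxPoints) {
    assert(maxPoints > 0);
    int n = 0;
    double xValue, yValue;
    while (n < maxPoints && in >> xValue >> yValue) {
        x[n] = xValue;
        y[n] = yValue;
        n++;
    }
    return n;
}
```

Because the function takes a `std::istream&`, the same code works with a `std::ifstream` opened on a data file or with a `std::istringstream` holding a few test values.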
Prepare a data file for the test case below. Review the input statements to see how you should
prepare this file. Run the program with your test data file to make sure you are using the program
correctly by matching the results below.
Test Data and Results for Linear Regression
xi     yi
510    1.3
533    0.1
603    1.5
670    1.8
750    3.9
Results: a = -5.77566; b = 0.0122238; R2 = 0.768457
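As a check on the arithmetic, the sums for the five test points can be worked out by hand and substituted into the slope and intercept equations. The intermediate values below were computed from the test data and are rounded for display; the intercept uses the unrounded slope.

```latex
\begin{align*}
N &= 5, \quad \sum x_i = 3066, \quad \sum y_i = 8.6, \quad
     \sum x_i y_i = 5751.8, \quad \sum x_i^2 = 1919198 \\
\bar{x} &= \frac{3066}{5} = 613.2, \qquad \bar{y} = \frac{8.6}{5} = 1.72 \\
b &= \frac{5751.8 - 5(613.2)(1.72)}{1919198 - 5(613.2)^2}
   = \frac{478.28}{39126.8} \approx 0.0122238 \\
a &= \bar{y} - b\,\bar{x} = 1.72 - \frac{478.28}{39126.8}(613.2) \approx -5.77566
\end{align*}
```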
Copy the output file from the test data set in the table above to your submission file. Do not copy
the code or the full output file from the downloaded data set to the submission file.
Task Two
Download the data file for this exercise from the course web site. This data file has several pairs
of (xi, yi) data points. In this task you will obtain some overall statistics (1) for the entire data file,
(2) for the (xi, yi) data points in which xi ≥ 1000, and (3) for the (xi, yi) data points in which xi <
1000.
You can use some of the code from task one for this task. You do not have to keep the same
function structure used for task one. However, you should be able to use the function that reads
data from an input data file with no changes.
The program you write for this task should compute and print out the results listed below for the
data in the data file that you download:

• The count, mean value and standard deviation of all xi data.
• The count, mean value and standard deviation of all yi data.
• The maximum and minimum values of xi and yi for the full set of data.
• The count, mean value and standard deviation of the subset of xi data for which xi ≥
  1000.
• The count, mean value and standard deviation of the subset of yi data for which the
  corresponding value of xi ≥ 1000.
• The maximum and minimum values of xi and yi for the subset of data in which the value
  of xi ≥ 1000.
• The count, mean value and standard deviation of the set of xi data for which xi < 1000.
• The count, mean value and standard deviation of the set of yi data for which the
  corresponding value of xi < 1000.
• The maximum and minimum values of xi and yi for the set of data in which the
  corresponding values of xi < 1000.
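One possible way to organize the subset calculations is to copy the qualifying pairs into separate arrays and then apply the same statistics code to each array. The sketch below shows that approach under stated assumptions: the names `selectPairs` and `findRange` are hypothetical, and the output arrays are assumed to be at least as large as the input arrays.

```cpp
#include <cassert>

// Copies the pairs whose x value lies on the requested side of the cutoff:
// x >= cutoff when keepHigh is true, x < cutoff when keepHigh is false.
// Returns the number of pairs copied, which is the count for that subset.
int selectPairs(const double x[], const double y[], int n,
                double cutoff, bool keepHigh,
                double xOut[], double yOut[]) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        bool isHigh = (x[i] >= cutoff);
        if (isHigh == keepHigh) {
            xOut[count] = x[i];
            yOut[count] = y[i];
            count++;
        }
    }
    return count;
}

// Finds the smallest and largest of the first n values (n must be at least 1).
void findRange(const double values[], int n, double& minimum, double& maximum) {
    assert(n >= 1);
    minimum = values[0];
    maximum = values[0];
    for (int i = 1; i < n; i++) {
        if (values[i] < minimum) minimum = values[i];
        if (values[i] > maximum) maximum = values[i];
    }
}
```

Calling `selectPairs` once with `keepHigh` set to true and once with it set to false, using a cutoff of 1000, gives the two subsets; every pair lands in exactly one of them, so the two counts should add up to the full count.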
Copy your code and your output to the submission file for this task. You can obtain the correct
answers from the course web site. Make sure that your results match these correct answers
before submitting your assignment.
Equation and computational technique for task two
The standard deviation, s, is defined by the first equation below. Introductory statistics texts show
that this defining equation may be converted to the equivalent computational form shown in the
second square root sign.
$$s = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N - 1}} = \sqrt{\frac{\sum_{i=1}^{N} y_i^2 - N \bar{y}^2}{N - 1}}$$
Using the computational form in the second equation, it is possible to compute the sum of the yi,
required for the mean, and the sum of the yi², required for the standard deviation, in the same loop.
When we have N elements in a C++ array, the subscripts for the array elements run from zero to
N-1. All the elements in the array are accessed by a for loop such as the following:
for ( int i = 0; i < N; i++ ).
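The single-loop idea can be sketched as follows, using the computational form of the standard deviation. The function name `meanAndStdDev` is hypothetical, and the clamp to zero is an added safeguard (not part of the equations) against a slightly negative value caused by round-off.

```cpp
#include <cassert>
#include <cmath>

// Computes the mean and standard deviation of y[0] .. y[n-1] in one loop,
// using s = sqrt((sum of y_i^2 - N * ybar^2) / (N - 1)).
void meanAndStdDev(const double y[], int n, double& mean, double& stdDev) {
    assert(n >= 2);  // the N - 1 divisor requires at least two values
    double sum = 0.0, sumSquares = 0.0;
    for (int i = 0; i < n; i++) {
        sum += y[i];
        sumSquares += y[i] * y[i];
    }
    mean = sum / n;
    double variance = (sumSquares - n * mean * mean) / (n - 1);
    if (variance < 0.0) variance = 0.0;  // guard against round-off error
    stdDev = std::sqrt(variance);
}
```

The computational form subtracts two numbers that can be nearly equal when the mean is large compared with the spread of the data, so some precision may be lost in that situation.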
Requirements for Submission
Due Date: May 4.
Submit a copy of the submission file with all the elements asked for in each task above:
• A copy of the output file from task one
• Your code for task two
• A copy of the output for task two that has the correct answers
Printed submissions are due on or before the end of the laboratory period on the due date.
Alternatively, you can email submissions by 11:59 pm on the due date. Only one submission,
printed or email, is required.