DATA ANALYSIS
1. Error
1. A. Error is always present
Scientific experiments are carried out to measure quantities of interest and to develop and
test theories.
Error is present in all experiments and prevents one from obtaining the "true value" of
any measurable quantity.
Although the true value of a quantity is unknowable due to error, well-defined bounds can
be placed on experimental uncertainty.
1. B. Terminology about errors
Systematic Error: Reproducible inaccuracy (always the same sign and magnitude); can be discovered and corrected in principle
Random Error: Indeterminate fluctuations (positive and negative); can be reduced by averaging independent measurements
Accuracy: Nearness to "truth"; depends on how well systematic errors are controlled or compensated for
Precision: Reproducibility; depends on how well random error can be overcome
Pictorial Example (figure not reproduced): one set of measurements is imprecise (but fairly accurate on average); the other is precise (but inaccurate).
From the above definitions, it is seen that

Minimizing systematic error increases the accuracy of a measurement

Minimizing random error increases the precision of a measurement
Example: The Hubble telescope mirror was precise (flat to λ/50) but inaccurate (focal length
error of 1 mm). However, since the error was systematic, NASA was able to correct it
with compensating lenses.
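The distinction between systematic and random error can be illustrated with a small simulation. This is only a sketch: the true value, bias, and noise level are hypothetical numbers chosen for illustration, and the random error is assumed to be Gaussian.

```python
import random
import statistics


def simulate_measurements(true_value, bias, noise_sd, n, seed=0):
    """Return n simulated readings: true value + constant bias + Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Hypothetical numbers, chosen only for illustration.
TRUE_VALUE = 10.0  # the "true value" (unknowable in a real experiment)
BIAS = 0.5         # systematic error: same sign and magnitude every time
NOISE_SD = 1.0     # random error: fluctuates positive and negative

few = simulate_measurements(TRUE_VALUE, BIAS, NOISE_SD, n=10)
many = simulate_measurements(TRUE_VALUE, BIAS, NOISE_SD, n=10_000)

# Averaging more readings reduces the random scatter of the mean...
print("mean of 10 readings:   ", statistics.mean(few))
print("mean of 10000 readings:", statistics.mean(many))
# ...but the mean settles near TRUE_VALUE + BIAS, not TRUE_VALUE.
print("offset from true value:", statistics.mean(many) - TRUE_VALUE)
```

Averaging improves precision, but the offset remains close to the bias until the systematic error itself is found and corrected.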
2. One-Dimensional Measurements
One-dimensional measurements are measurements of a value of a physical property. A
data set consists of a set of repeated measurements, {x1, x2, ..., xn}. An example is the
determination of the mass of a sample by several weighings.
2. A. Distribution of one-dimensional measurements
Repeated experiments will yield a histogram of measurements centered about an average
value (mean) with a characteristic spread (standard deviation). In the limit of an infinite
number of measurements, the probability distribution is observed to be a Gaussian
distribution (or normal error distribution).
[Figure: probability distribution P(x) versus x in the limit n → ∞.]
68% of the area under a Gaussian distribution lies within μ ± σ; 95% of the area lies
within μ ± 2σ.
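The quoted areas can be checked numerically. The sketch below uses the standard identity that the fraction of a Gaussian within k standard deviations of the mean is erf(k/√2); the function name is an illustrative choice, not part of the original notes.

```python
import math


def gaussian_area_within(k):
    """Fraction of a Gaussian's area within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


print(f"within 1 sigma: {gaussian_area_within(1):.4f}")  # 0.6827
print(f"within 2 sigma: {gaussian_area_within(2):.4f}")  # 0.9545
```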
2. B. Parent distribution
The parent distribution is the “true” distribution that would be obtained if an infinite
number of measurements could be conducted.
parent mean:
$$\mu = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} x_i$$
parent standard deviation:
$$\sigma = \lim_{n \to \infty} \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
The mean, μ, of the distribution is the average value. The standard deviation, σ, is the
square root of the average squared deviation from the mean.
2. C. Sample distribution
The sample distribution is an observed distribution obtained from a finite number of
measurements.
sample mean:
$$m = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
sample standard deviation:
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
One “degree of freedom” is used to determine the mean of the distribution; hence, the
divisor is n − 1 in the sample standard deviation.
Note the use of Greek letters for the parent distribution and Roman letters for the sample
distribution.
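The sample formulas translate directly into code. The following is a minimal sketch; the function names and the list of weighings are hypothetical and serve only to illustrate the n − 1 divisor.

```python
import math


def sample_mean(xs):
    """Sample mean: sum of the measurements divided by n."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation with divisor n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two measurements")
    xbar = sample_mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


# Hypothetical repeated weighings of one sample, in grams.
weighings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print("mean:", sample_mean(weighings))
print("s:   ", sample_std(weighings))
```

Python's standard `statistics.stdev` uses the same n − 1 divisor, so it can serve as a cross-check.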
2. D. Reporting values
When reporting values, always report the mean, standard deviation, and units:
x̄ ± s units
Use two significant figures for s and match precision for x̄.
Example: l = 12.5 ± 1.3 mm. Note that the mean has three significant figures and the
standard deviation has two significant figures; however, the precision of both quantities is
0.1 mm.
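The reporting rule can be sketched as a small formatting helper. The function name `format_measurement` and the unrounded inputs are hypothetical; the sketch assumes s is positive and does not handle the edge case where rounding changes the leading digit count of s (for example 0.0996).

```python
import math


def format_measurement(mean, s, units):
    """Format 'mean ± s units' with s rounded to two significant figures."""
    if s <= 0:
        raise ValueError("standard deviation must be positive")
    # Number of decimal places that leaves two significant figures in s.
    decimals = 1 - math.floor(math.log10(s))
    s_rounded = round(s, decimals)
    mean_rounded = round(mean, decimals)  # match the precision of s
    places = max(decimals, 0)  # print no fractional digits when s >= 10
    return f"{mean_rounded:.{places}f} ± {s_rounded:.{places}f} {units}"


print(format_measurement(12.4567, 1.3456, "mm"))  # 12.5 ± 1.3 mm
```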
2. E. Significant figures
Use of standard deviations can be thought of as “advanced significant figure theory”
because the standard deviation specifies the uncertainty in a value more precisely. We
will also see that there are methods to propagate uncertainty during calculations.
Preview:
Significant figures: 130 + 2500 = 2600
Standard deviations: (132 ± 6) + (2452 ± 18) = 2584 ± 19
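The preview numbers are consistent with adding the standard deviations of independent values in quadrature, since √(6² + 18²) ≈ 19. The notes do not state the propagation rule at this point, so the sketch below assumes that rule; the function name is hypothetical.

```python
import math


def add_with_uncertainty(a, s_a, b, s_b):
    """Add two independent values; assume uncertainties combine in quadrature."""
    return a + b, math.hypot(s_a, s_b)  # hypot(x, y) = sqrt(x**2 + y**2)


total, s_total = add_with_uncertainty(132, 6, 2452, 18)
print(f"{total} ± {s_total:.0f}")  # 2584 ± 19
```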
3. Two-Dimensional Measurements
Two-dimensional measurements are measurements that describe how one physical
property depends on another. A data set consists of (x,y) pairs, {(x1,y1), (x2,y2), ...,
(xn,yn)}. For example, a set of (T,p) data points describes how pressure depends on
temperature.
3. A. Linear least squares fitting
Linear least squares fitting is a method which finds the best straight line fit to a set of
(x,y) data points, i.e., finds the slope m and intercept b of the function mx+b which best
fits the observed data. (Actually the method finds the best fit values for parameters which
appear linearly in the fitting function, but a straight line is the most common case.)
3. B. Derivation of the least squares best fit for a straight line
If 1) the two variables are linearly related, i.e., by y = mx + b, 2) the parent distribution is
Gaussian, and 3) all standard deviations are equal, then the best fit of the data {(x1,y1),
(x2,y2), ..., (xn,yn)} is obtained by minimizing the sum of squared differences between the
observed data and predicted fit
$$R = \text{residual} = \sum_{i=1}^{n} \left(y_i - y_{\text{fit},i}\right)^2$$
If the fitting function is a straight line
$$y_{\text{fit}} = mx + b$$
then the residual may be written as
$$R = \sum_{i=1}^{n} \left(y_i - m x_i - b\right)^2$$
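To make the quantity being minimized concrete, the residual R can be evaluated for any candidate line. This is a minimal sketch; the data points and the candidate slopes are hypothetical.

```python
def residual(m, b, xs, ys):
    """Sum of squared differences between observed y and the line m*x + b."""
    return sum((y - m * x - b) ** 2 for x, y in zip(xs, ys))


# Hypothetical data scattered about the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

# R is small for a line close to the data and grows as the line moves away.
for m in (1.0, 2.0, 3.0):
    print(f"m = {m}, b = 1.0, R = {residual(m, 1.0, xs, ys):.2f}")
```

The least squares best fit is the pair (m, b) for which this function takes its smallest value.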
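R is minimized with respect to variations in fitting parameters m and b by setting its
partial derivatives equal to zero
$$\frac{\partial R}{\partial m} = \frac{\partial}{\partial m} \sum \left(y_i - m x_i - b\right)^2 = 0$$
$$\frac{\partial R}{\partial b} = \frac{\partial}{\partial b} \sum \left(y_i - m x_i - b\right)^2 = 0$$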
Evaluating these derivatives yields
$$\sum 2\left(y_i - m x_i - b\right)(-x_i) = 0$$
$$\sum 2\left(y_i - m x_i - b\right)(-1) = 0$$
which can be simplified by dividing by −2, separating the summations, and recognizing
that Σ1 = n
$$\sum y_i x_i - m \sum x_i^2 - b \sum x_i = 0$$
$$\sum y_i - m \sum x_i - b n = 0$$
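This leaves two equations and two unknowns. Solving for m and b yields
$$m = \frac{1}{\Delta}\left(n \sum x_i y_i - \sum x_i \sum y_i\right)$$
$$b = \frac{1}{\Delta}\left(\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i\right)$$
where
$$\Delta = n \sum x_i^2 - \left(\sum x_i\right)^2$$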
Furthermore, the standard deviations may be shown to be
$$s = \text{std dev of fit} = \left[\frac{1}{n-2} \sum \left(y_i - m x_i - b\right)^2\right]^{1/2}$$
$$s_m = \text{std dev of slope} = \left(\frac{s^2 n}{\Delta}\right)^{1/2}$$
$$s_b = \text{std dev of intercept} = \left(\frac{s^2 \sum x_i^2}{\Delta}\right)^{1/2}$$
Observe that two “degrees of freedom” are used to determine the slope and intercept of
the fitting function; hence, the divisor is n − 2 in the standard deviation of the fit.
3. C. Using the least squares best fit formulas
In practice, one uses a computer program or spreadsheet to accumulate the summations
$$n,\quad \sum x_i,\quad \sum x_i^2,\quad \sum y_i,\quad \sum x_i y_i$$
and then calculate
$$m \pm s_m,\quad b \pm s_b,\quad s$$
The units of m and sm are the units of the slope, i.e., the y units divided by the x units.
The units of b, sb, and s are the same as the y units.
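The procedure of accumulating the summations and then evaluating the formulas can be sketched as follows. This is one possible implementation under the stated assumptions (equal standard deviations for all points, straight-line model); the function name, return order, and data are hypothetical. The standard deviation of the fit is computed here from the residuals directly rather than from accumulated sums.

```python
import math


def linear_least_squares(xs, ys):
    """Return (m, s_m, b, s_b, s) for the best straight-line fit y = m*x + b."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 3:
        raise ValueError("need at least three points so that n - 2 > 0")

    # Accumulate the summations.
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    delta = n * sum_xx - sum_x ** 2
    if delta == 0:
        raise ValueError("all x values are identical; slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / delta
    b = (sum_xx * sum_y - sum_x * sum_xy) / delta

    # Standard deviation of the fit uses the divisor n - 2.
    sum_sq_resid = sum((y - m * x - b) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sum_sq_resid / (n - 2))
    s_m = math.sqrt(s * s * n / delta)
    s_b = math.sqrt(s * s * sum_xx / delta)
    return m, s_m, b, s_b, s


# Hypothetical (x, y) data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
m, s_m, b, s_b, s = linear_least_squares(xs, ys)
print(f"m = {m:.3f} ± {s_m:.3f}")
print(f"b = {b:.3f} ± {s_b:.3f}")
print(f"s = {s:.3f}")
```

For data that lie exactly on a line, the function returns s = s_m = s_b = 0, which is a convenient sanity check.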
3. D. Intuitive definitions of s, sm, and sb
The following figure shows the best fit to a set of data points as a solid line. Two
limiting “reasonable” fits are also shown as dashed lines.
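[Figure: data points with the best fit (solid line) and two limiting fits (dashed lines); labels mark s = std dev of fit, sm = std dev of slope, and sb = std dev of intercept.]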
The standard deviation of the fit s is approximately the average difference in y between
each data point and the best fit line. The standard deviation of the slope sm is
approximately the difference in slope between the best fit line and a limiting reasonable
fit line. The standard deviation of the intercept sb is approximately the difference in
the y-intercept between the best fit line and a limiting reasonable fit line.