EART20170 Computing, Data Analysis & Communication skills
Lecturer: Dr Paul Connolly (F18 – Sackville Building), [email protected]
1. Data analysis (statistics): 3 lectures & practicals; statistics open-book test (2 hours)
2. Computing (Excel statistics/modelling): 2 lectures; assessed practical work
Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney. (1983) Statistical methods in Geology. George, Allen & Unwin

Recap – last lecture
The four measurement scales: nominal, ordinal, interval and ratio.
There are two types of errors: random errors (precision) and systematic errors (accuracy).
Basic graphs: histograms, frequency polygons, bar charts, pie charts.
Gaussian statistics describe random errors.
The central limit theorem
Central values, dispersion, symmetry
Weighted mean.

Some common problems
$X = 1, 4, 6, 3, 7, 4 = [x_1, x_2, x_3, x_4, x_5, x_6]$
$\sum_{i=1}^{N} x_i$
$\sum_{i=1}^{N} (x_i - \bar{x})^2$
Use tables:

  x      x - \bar{x}    (x - \bar{x})^2
  1      -3.1667        10.0278
  4      -0.1667         0.0278
  6       1.8333         3.3611
  3      -1.1667         1.3611
  7       2.8333         8.0278
  4      -0.1667         0.0278
  Sum:
  25      0             22.8333

Lecture 2
Correlation between two variables
Classical linear regression
Reduced major axis regression
Propagation of errors in compound quantities.

Correlation
Many real-life quantities depend on something else, e.g. the dependence of rock permeability on porosity.
How can we quantify the strength and direction of a linear relationship between X and Y variables? Correlation.

Linear correlation (Pearson's coefficient)
$r = \dfrac{\sum xy - \dfrac{\sum x \sum y}{N}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{N}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{N}\right)}}$
$\sum y$ = sum of all y-values
$\sum x$ = sum of all x-values
$\sum x^2$ = sum of all x² values
$\sum y^2$ = sum of all y² values
$\sum xy$ = sum of the x times y values
Like other numerical measures, the population correlation coefficient is denoted by $\rho$ (the Greek letter "rho") and the sample correlation coefficient is denoted by r.
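The sums in Pearson's formula above can be accumulated directly, as in the table method. The following is a minimal Python sketch, not part of the original slides; the function name `pearson_r` and the small data set are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r from the raw sums in Pearson's formula.

    Hypothetical helper for illustration; xs and ys are equal-length sequences.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")

    sum_x = sum(xs)                              # sum of all x-values
    sum_y = sum(ys)                              # sum of all y-values
    sum_x2 = sum(x * x for x in xs)              # sum of all x^2 values
    sum_y2 = sum(y * y for y in ys)              # sum of all y^2 values
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # sum of the x times y values

    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    # If all x (or all y) are identical the denominator is zero and r is undefined;
    # Python then raises ZeroDivisionError.
    return numerator / denominator


# Made-up illustrative data, not measurements from the course.
print(pearson_r([1, 2, 3], [1, 3, 2]))  # prints 0.5
```

Squaring the returned value gives r², the fraction of the variation explained by the linear relationship.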

Correlation
Values of r
[Figure: three scatter plots of y against x. r = +1: perfect positive correlation; r = -1: perfect negative correlation; r = 0: no correlation.]

Correlation
r² is the fraction of the variation in x and y that is explained by the linear relationship. It is often called the "goodness of fit".
E.g. if r = 0.97 is obtained then r² = 0.94, so 100 × 0.94 = 94% of the total variation in x and y is explained by the linear relationship, but the remaining 6% of the variation is due to "other" causes.
[Figure: r², fraction of explained variation (0.0 to 1.0), plotted against correlation coefficient, r (+1.0 to -1.0).]

Regression analysis
How can we fit an equation to a set of numerical data x, y such that it yields the best fit for all the data?

Classical linear regression
An approximate fit yields a straight line that passes through the set of points in the best possible manner without being required to pass exactly through any of the points.

Classical linear regression
Linear regression: $Y = mx + c$
[Figure: fit line on axes of y against x, showing the intercept c, the gradient m and the deviation e_i of a data point from the line.]
Where $e_i$ is the deviation of the data point from the fit line, c is the intercept and m is the gradient.
Assumes that the error is present only in y.
How do we define a good fit?
If the sum of all deviations is a minimum? $\sum e_i$
If the sum of all the absolute deviations is a minimum? $\sum |e_i|$
If the maximum deviation is a minimum? $e_{max}$
If the sum of all the squares of the deviations is a minimum? $\sum e_i^2$

Classical linear regression
The best way is to minimise the sum of the squares of the deviations. Formally this involves some mathematics.
At each value of $x_i$: $y_i = m x_i + c$
Therefore the deviations from the curve are: $e_i = (Y_i - y_i)$
The sum of the squares: $S(c, m) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - c - m x_i)^2$

Classical linear regression
How do you find the minimum of a function?
Use calculus: differentiate and set to zero.
$\dfrac{\partial S(c, m)}{\partial c} = \sum_{i=1}^{N} 2 (Y_i - c - m x_i)(-1) = 0$
$\dfrac{\partial S(c, m)}{\partial m} = \sum_{i=1}^{N} 2 (Y_i - c - m x_i)(-x_i) = 0$
Two simultaneous equations:
$c N + m \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} Y_i$
$c \sum_{i=1}^{N} x_i + m \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i Y_i$

Classical linear regression
Solving the two equations yields:
$c = \dfrac{\sum_{i=1}^{N} Y_i \sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} x_i Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}$
$m = \dfrac{N \sum_{i=1}^{N} x_i Y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}$

Classical linear regression
Table to fill in:

  x      y      xy     x^2
  ?      ?      ?      ?

Classical linear regression
Classical linear regression only considers errors in the Y values of the data.
How can we consider errors in both x and y values? Use reduced major axis regression.

Reduced major axis regression
[Figure: fit line on axes of y against x, showing the intercept c and the deviations dx and dy of a data point from the line.]
A method to quantify a linear relationship where both variables are dependent and have errors.
Instead of minimising $e^2 = (Y - y)^2$ we minimise $e^2 = dy^2 + dx^2$.

Reduced major axis regression
$m = \sqrt{\dfrac{\sum y^2 - \dfrac{(\sum y)^2}{N}}{\sum x^2 - \dfrac{(\sum x)^2}{N}}}$
$c = \bar{y} - m \bar{x}$

Reduced major axis regression
Table to fill in:

  x      y      x-x'   y-y'   (x-x')^2   (y-y')^2
  ?      ?      ?      ?      ?          ?

Error propagation
Every measurement of a variable has an error.
Often the error quoted is one standard deviation of the mean (mean ± standard deviation).
The standard deviation of the sample mean is usually our best estimate of the population standard deviation.

Error propagation
Error propagation is a way of combining two or more random errors together to get a third.
The equations assume that the errors are Gaussian in nature.
It can be used when you need to measure more than one quantity to get at your final result, for example if you wanted to predict permeability from a measured porosity and grain size.
The equations introduced here let you propagate the uncertainties on your data through the calculation and come up with an uncertainty on your results.
How then do we combine variables which have errors?

Error propagation - quoted

  Relationship               Error propagation
  z = x + y                  $\sigma_z^2 = \sigma_x^2 + \sigma_y^2$
  z = x - y                  $\sigma_z^2 = \sigma_x^2 + \sigma_y^2$
  z = xy                     $\left(\dfrac{\sigma_z}{z}\right)^2 = \left(\dfrac{\sigma_x}{x}\right)^2 + \left(\dfrac{\sigma_y}{y}\right)^2$
  z = x / y                  $\left(\dfrac{\sigma_z}{z}\right)^2 = \left(\dfrac{\sigma_x}{x}\right)^2 + \left(\dfrac{\sigma_y}{y}\right)^2$
  z = kx (k = constant)      $\sigma_z = k \sigma_x$
  z = x^n                    $\dfrac{\sigma_z}{z} = n \dfrac{\sigma_x}{x}$
  z = log_e x                $\sigma_z = \dfrac{\sigma_x}{x}$
  z = e^x                    $\dfrac{\sigma_z}{z} = \sigma_x$

Example of propagation of error
Suppose we measure the thickness of a rock bed using a tape measure. The tape measure is shorter than the bed thickness, so we have to do it in two steps, x and y.
We repeat the measurements 100 times and obtain the following mean and standard deviation values for x and y:
x = 12.1 ± 0.3 cm
y = 4.2 ± 0.2 cm
The thickness of the bed should be simply: x + y = 16.3 cm
But what about the error on the total thickness?

Example of propagation of error
It is given by propagating the individual errors as follows:
$\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2} = \sqrt{0.3^2 + 0.2^2} \approx 0.36$ cm, which rounds to 0.4 cm.
So the final answer for the total thickness of the bed is: 16.3 ± 0.4 cm
Error propagation formulae are non-intuitive, and understanding how they are derived requires some mathematical knowledge.

More complex examples
What if we have several functions of several variables?
E.g. calculating density using Archimedes' Principle:
Density = wt. in air (A) / (wt. in air (A) - wt. in water (W))
This equation contains two functions and two variables.

Density
Error propagation is best done in parts, so first work out the value and error of the denominator:
$x = A - W$
Then the value and error of:
Density = $\dfrac{A}{x}$
In a few weeks we will use a Monte Carlo method for solving more complex functions.

Reminder
Statistics practical #2
Those not taking BIOL20451: Roscoe 3.5, 1100 – 1300 Tuesday
Those taking BIOL20451: Williamson 1.12, 1400 – 1600 Tuesday

Some common problems
Weighted mean (f x)
What does adding two variables really mean?
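One way to make the question of adding two variables concrete is to rework the rock-bed example numerically. The following is a minimal Python sketch, assuming the two tape-measure readings have independent Gaussian errors; the helper name `add_with_error` is a hypothetical choice for illustration and does not come from the slides.

```python
import math


def add_with_error(x, sigma_x, y, sigma_y):
    """Return (z, sigma_z) for z = x + y.

    Assumes independent Gaussian errors, so the standard deviations
    combine in quadrature: sigma_z^2 = sigma_x^2 + sigma_y^2.
    """
    z = x + y
    sigma_z = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    return z, sigma_z


# Values quoted in the rock-bed example: x = 12.1 ± 0.3 cm, y = 4.2 ± 0.2 cm.
thickness, sigma = add_with_error(12.1, 0.3, 4.2, 0.2)
print(f"{thickness:.1f} ± {sigma:.1f} cm")  # prints 16.3 ± 0.4 cm
```

The combined error (about 0.36 cm before rounding) is smaller than the simple sum 0.3 + 0.2 = 0.5 cm, because independent random errors partly cancel.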