Tenerife Python Data Analysis Worksheet

March 2, 2015

The following instructions and exercises are intended to show you how to import data into Python and fit a model to this data, including the calculation of errors. It is split into four sections: importing your data, minimising a function, fitting a model using chi-squared minimisation, and 'bootstrapping', which is a method of estimating the errors on fit parameters. This should enable you to analyse the data you extracted in the first workshop, which is the final exercise.

1 Pandas and Pylab

1.1 Loading data, selecting specific data, plotting

The Python module 'PANDAS' (Python Data Analysis Library) is a good way to import and manipulate csv files, like the one produced by aperture photometry in AstroImageJ (despite its '.xls' file extension). The following will show you how.

1.2 Import Modules

First we need to import the modules we need:

In [80]: import numpy as np
         import pylab as plt
         import pandas as pd
         # This is just so the plots work in this notebook
         %matplotlib inline

The 'import ... as ...' notation ensures that you always use commands from the module you intended to, by forcing you to use the selected prefix every time you use a module command. For example, after importing the numpy module with 'import numpy as np', if you wanted to use the numpy 'mean' command you would have to write 'np.mean' instead of just 'mean'. This means that you can still use the pylab version of the command as 'plt.mean', and there is never any doubt about which version of 'mean' you are using.

1.3 Load Data with Pandas

Once we have defined the location and name of the file we want to use, it can be loaded as a Pandas data object with the command 'read_table', under pandas.io.parsers:

In [81]: route = "/home/sdc1g08/Dropbox/Work/Tenerife/astropy/"  # file location
         infile = "example1.dat"                                 # file name
         data = pd.io.parsers.read_table(route+infile)           # load file

1.4 Retrieve Specific Columns, Plot Data

Now that the data is loaded, we can retrieve specific columns of data by passing the column heading to the data object, in square brackets. Columns can also be accessed by index, but as the output from AstroImageJ includes so many columns, it is easier to use the headings. We can check the header names by printing 'data.columns.values'. Here we extract the x and y values and the errors on the y values, then plot them with pylab:

In [84]: print data.columns.values
         x = data['TIME']
         y = data['FROGS']
         errs = data['FROG_ERR']
         # plot with errorbars, no line, round markers
         plt.errorbar(x, y, yerr=errs, linestyle='none', marker='o')
         plt.show()

['TIME' 'FROGS' 'FROG_ERR']

We could also make a log plot if we wanted to accentuate changes across different orders of magnitude:

In [87]: plt.errorbar(x, y, yerr=errs, linestyle='none', marker='o')
         plt.yscale('log')
         plt.show()
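If you prefer to work with plain numpy arrays rather than pandas columns (for example when passing data to functions that expect arrays), the columns can be converted explicitly. The following is a minimal optional sketch, not part of the original worksheet, using the same 'example1.dat' columns as above; 'np.array' conversion and integer-position selection with '.iloc' are standard numpy/pandas features.

    # convert pandas columns to plain numpy arrays (optional)
    x_arr = np.array(data['TIME'])
    y_arr = np.array(data['FROGS'])
    print x_arr[:5]           # first five time values, as a numpy array

    # columns can also be selected by integer position with .iloc;
    # in example1.dat column 0 is 'TIME'
    first_col = data.iloc[:, 0]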
1.5 Selecting Sections of Data

You are likely to need to select specific sections of data, for example in order to fit models to only part of a data set, or for other reasons. Sections can be selected by index by specifying the start and/or end position you want to include, separated by a ':':

In [22]: start, end = 10, 25
         # data between the start and end indices
         plt.errorbar(x[start:end], y[start:end], yerr=errs[start:end],
                      linestyle='none', marker='o')
         plt.show()

In [23]: # data up to the end index
         plt.errorbar(x[:end], y[:end], yerr=errs[:end], linestyle='none', marker='o')
         plt.show()

In [24]: # data from the start index
         plt.errorbar(x[start:], y[start:], yerr=errs[start:],
                      linestyle='none', marker='o')
         plt.show()

You can also select data based on ranges of a specific value, using Boolean comparisons (>, <, ==, != etc.). A Boolean comparison in round brackets [e.g. (x >= 3)] produces a Boolean array which is True wherever the data satisfy the comparison, and this array can be used to index the data. Comparisons can be combined by multiplying them together, which acts as a logical AND:

In [27]: # x values after 10 for which the y value is below zero
         indices = (y < 0)*(x > 10)
         plt.errorbar(x[indices], y[indices], yerr=errs[indices],
                      linestyle='none', marker='o')
         plt.show()

Q1. Load 'example2.dat' and plot the parts of the 3rd peak which lie below 0.5.

2 Functions & Minimisation

2.1 Declaring a function, then finding its minimum with scipy

In this section you will learn how to minimise a given function with the command 'minimize' from scipy.optimize. In this case we only need the 'minimize' command from scipy, so we'll just import that:

In [28]: from scipy.optimize import minimize

2.2 Declaring a function

Functions are declared using the 'def' (define) command, followed by the function name and its input variables in round brackets [e.g. Function(a,b,c,d)]. If the function is going to be minimised, it needs to have a single, numerical return value calculated from the input parameters. This output value will be minimised by changing the input parameters. Here's an example of a function giving the value of a parabola at a given x value:

In [29]: def Parab(v):
             '''
             A parabola with an offset of 1.
             inputs:  v (float or array) - the x value(s)
             outputs: z (float or array) - the value of the parabola at x
             '''
             z = v**2.0 + 1
             return z

You can see what the function looks like if we plot it, so we know the minimum will be a value of 1, at x = 0:

In [30]: x = np.arange(-10, 10, 0.1)
         y = Parab(x)
         plt.plot(x, y)
         plt.show()

2.3 Minimising a Function

To find the minimum, we just pass the function and an initial guess of what the parameter(s) are at the minimum to the 'minimize' function:

In [31]: m = minimize(Parab, [3])

All of the information from the minimisation is then stored in the output object, in this case called 'm':

In [32]: print m

   status: 0
  success: True
     njev: 4
     nfev: 12
 hess_inv: array([[ 0.5]])
      fun: 1.0000000000000002
        x: array([ -1.25301488e-08])
  message: 'Optimization terminated successfully.'
      jac: array([ -1.49011612e-08])

The most important bits are the success/failure of the fit ('success'), the value of the function at the minimum ('fun') and the value of the input parameter(s) at the minimum ('x'). As expected, the minimum is at an x value of 0 and has a value of 1, though we can see that the output values are very close to, but not exactly, the true values. Specific bits of information in the output can be accessed by passing the name of the field in square brackets:

In [33]: print "Did the fit work?", m['success']    # the success/failure of the fit
         print "Value at minimum: ", m['fun']       # function value at the minimum
         print "Position of minimum:", m['x']       # the parameter(s) at the minimum

Did the fit work? True
Value at minimum:  1.0
Position of minimum: [ -1.25301488e-08]
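To see the same workflow on a function whose minimum is not at zero, here is a minimal illustrative sketch, not part of the original worksheet; the function name 'ShiftedParab' is just an example, and its minimum should be found at x = 2 with a value of 5.

    def ShiftedParab(v):
        '''A parabola with its minimum value of 5 at x = 2.'''
        return (v[0] - 2.0)**2.0 + 5.0

    m2 = minimize(ShiftedParab, [10.0])   # start the search well away from the minimum
    print m2['success']                   # should be True
    print m2['x']                         # should be close to [ 2.]
    print m2['fun']                       # should be close to 5.0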
2.4 Extra Parameters, Extra Arguments

You are likely to have to fit more complex models, however, with more input parameters, some of which you may not want to vary. For example, here is a 2-D parabola, which has two inputs (x & y) and a separately-specified offset:

In [34]: def TwoDParab(v, c):
             '''
             A 2D parabola (x squared plus y squared) with an offset of 'c'.
             inputs:  v (array) - v[0] (the first element) is the x value,
                                  v[1] is the y value
                      c (float) - the offset of the parabola
             outputs: z (float) - the calculated value of the function
             '''
             z = v[0]**2.0 + v[1]**2.0 + c
             return z

'minimize' can only handle functions which take their input parameters in a single array, so the x and y coordinates are supplied in the array 'v'. The offset, 'c', which we want to remain constant, is not in this array, but is a separate argument. The function is minimised in the same way, by supplying the function to minimise and an array of initial guesses of the parameters at the minimum. In this case the function also needs extra arguments, however, so these are passed to 'minimize' via the parameter 'args' as a tuple (in round brackets). A tuple always has to be comma-separated, so even if it only contains one number you need to put a comma after it, as below. In this case, we have also specified the minimisation method we want to use ('Nelder-Mead'). Different methods work better with different types of function, so if your minimisation is failing try switching to a different one, but this method should work well in most cases.

In [35]: m = minimize(TwoDParab, [3,2], args=(3,), method='Nelder-Mead')
         print m['success']
         print m['fun']
         print m['x']

True
3.00000000192
[  1.62039581e-05   4.07173472e-05]

Once again, the minimum is found to be at about (0, 0) and at a value of 3, our offset, as expected!

Q2. Use 'minimize' to find the minimum of the following function:

3 Fitting data

3.1 Calculating the chi squared of a fit, minimising it

In this section, you will learn how to calculate the chi squared value of a model and a data set, which tells you how well a given model fits a given data set, then how to change your model parameters to minimise the chi squared value in order to obtain the best fit parameters. Once again, we'll need scipy.optimize.minimize, plus numpy and pylab.

3.2 A Straight Line Model

As an example, we'll fit a straight line to some data, but the method works for any model. First we need to load the data, as in part 1 - the data is in 'example3.dat' - and plot it to see what it looks like:

In [61]: data = pd.io.parsers.read_table(route+'example3.dat')   # load file
         x = data['EGGS']
         y = data['JOY']
         errs = data['ERR']
         plt.errorbar(x, y, yerr=errs, linestyle='none', marker='o', color='blue')
         plt.show()

We can now see that the data should be well represented by a straight line model, with a gradient of around 2 and an intercept close to zero. Next we need to define our model function, as in the last section:

In [37]: def Model(x, m, c):
             '''
             This is my model - in this case a straight line.
             inputs:  x (array) - x values
                      m (float) - gradient of the line
                      c (float) - offset of the line
             outputs: y (array) - output model values
             '''
             x = np.array(x)   # convert to a numpy array, to make
                               # sure the function can handle arrays
             y = m*x + c
             return y
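Before fitting, it can be worth a quick sanity check that the model function does what you expect on a small input. This short check is not part of the original worksheet, but with a gradient of 2 and an intercept of 1 the output should be [1 3 5]:

    # quick sanity check of the straight-line model: y = 2*x + 1
    print Model([0, 1, 2], 2, 1)   # expect [1 3 5]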
3.3 Calculating the Chi Squared

In order to get the best fit, we need a measure of how well a given set of parameters fits the data. For this, we use the chi squared value of the fit. This is calculated by taking the difference between each data point and the model value at the same x value, dividing by the error and squaring it, then summing over all of the data points:

\chi^2 = \sum_i \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2

(where x is the data, mu is the model and sigma are the errors). We divide by the error so that data points with small errors contribute more to the fit than those with larger error bars, and we square so that all of the contributions are positive. We need to define a function that will calculate this for a set of model and data values:

In [38]: def ChiSquared(y, e, m):
             '''
             Function to calculate the chi-squared value of a set of data and a model.
             inputs:  y (array) - measured data points
                      e (array) - errors on measured data points
                      m (array) - model data points
             outputs: X (float) - chi squared value
             '''
             diff = y - m            # difference between data and model values
             weight = diff / e       # weighted difference, using errors
             X = sum(weight**2.0)    # sum of squares of each value
             return X

If we try out different model parameters, we can see that the chi-squared value is smaller the closer we are to the true parameters, i.e. the better the fit to the data:

In [49]: m1 = Model(x, 2, 0)
         print ChiSquared(y, errs, m1)

122.108194854

In [50]: m2 = Model(x, 0, 0)
         print ChiSquared(y, errs, m2)

53589.186241

In [51]: m3 = Model(x, 5, 50)
         print ChiSquared(y, errs, m3)

185689.708721

In [52]: m4 = Model(x, 2, -50)
         print ChiSquared(y, errs, m4)

10763.8906

In [53]: plt.errorbar(x, y, yerr=errs, linestyle='none', marker='o', color='blue')
         plt.plot(x, m1, color='red')
         plt.plot(x, m2, color='blue')
         plt.plot(x, m3, color='green')
         plt.plot(x, m4, color='black')
         plt.show()

3.4 Minimise Chi-Squared - Fit the Model!

We now have our chi-squared function, which calculates the chi-squared for a given set of data and model values, but we need to minimise a function that takes the model parameters as input. We therefore define a function that calculates model values from the model parameters, passes these values to the chi-squared function and returns the chi-squared value:

In [54]: def ModelChiSquared(vals, model, data):
             '''
             Calculate the chi squared value for a given set of model parameters.
             inputs:  vals (array)     - array containing the parameters needed to
                                         calculate model values for each data point
                      model (function) - the model used to calculate the model values
                      data (array)     - data to which the model will be fitted,
                                         as a list of [x, y, yerr]
             outputs: X (float) - chi squared value
             '''
             # calculate model values
             mod = model(data[0], *vals)   # the '*' means fill the rest of the
                                           # function's arguments with the
                                           # values in the array 'vals'
             # calculate chi squared for these model values and the data
             X = ChiSquared(data[1], data[2], mod)
             return X

Now we just need to pass this function to 'minimize' with some guesses at the initial parameters:

In [55]: m = minimize(ModelChiSquared, [1,0], method='Nelder-Mead',
                      args=(Model, [x,y,errs],))
         print 'Success = ', m['success']
         print 'Chi-squared = ', m['fun']
         print 'Best-fitting g = ', m['x'][0]
         print 'Best-fitting c = ', m['x'][1]

Success =  True
Chi-squared =  109.647808131
Best-fitting g =  1.97452365549
Best-fitting c =  2.86550734171

The value of the function at the minimum is now your chi-squared value. We can see that our slope, 'm', is very well recovered, but the intercept, 'c', less so, due to the random variation/errors added to the data.
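As an optional cross-check, and not part of the worksheet's own method, scipy also provides 'curve_fit' (in scipy.optimize), which performs an equivalent error-weighted least-squares fit and should recover almost identical parameter values. This is a minimal sketch assuming the same Model, x, y and errs as above:

    from scipy.optimize import curve_fit

    # curve_fit takes the model function directly (x first, then the parameters)
    popt, pcov = curve_fit(Model, x, y, p0=[1, 0], sigma=errs)
    print 'Best-fitting g = ', popt[0]
    print 'Best-fitting c = ', popt[1]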
Now that we have fit our data and have a chi-squared value, how do we know that it is actually a good fit, i.e. that the chi-squared value is low enough? The best way to find out is by dividing the chi-squared by the number of 'degrees of freedom' of the fit. This is the number of data points minus the number of free parameters in your model. In this case, there are two free parameters, 'm' and 'c', so the number of degrees of freedom is 100 - 2 = 98. The reduced chi-squared value should be as close as possible to 1, but not less than one. A reduced chi-squared of one means that all of your data are close enough to your model to be well described by it, within the random variation expected from your errors. If it is a lot less than one, it means either that your model is overcomplex, i.e. has too many free parameters, or that your errors have been overestimated and are too large. If it is much larger than one, your model is probably not a very good fit to your data, but it could also be that you have underestimated your errors. Our reduced chi-squared in this case is:

In [56]: DoF = (len(x) - 2)
         print 'Reduced chi-squared = ', m['fun']/DoF

Reduced chi-squared =  1.11885518501

We can now see that the reduced chi-squared is close to one and therefore that we have a pretty good fit to the data. The only thing that remains to be done is to plot our model alongside our data, to check visually that the model fit looks OK and that nothing went wrong:

In [57]: # create an array of model values from our best fit params
         model = Model(x, m['x'][0], m['x'][1])
         # plot the data
         plt.errorbar(x, y, yerr=errs, marker='o', linestyle='none')
         # plot the model
         plt.plot(x, model, color='red')
         plt.show()

Q3. Fit a straight line model to 'example4.dat' and find the best fitting parameters and the reduced chi squared.

4 Bootstrapping - Calculating Errors

4.1 Calculating the errors on the parameters of your model fit

In this section, you will learn how to calculate approximate (but decent!) errors on the parameters of your model fit. This will involve using a method called 'bootstrapping', which involves resampling and then refitting your data, so we will need to do chi-squared minimisation again. Bootstrapping is a statistical method which allows you to estimate the errors on the parameters of your fit by resampling your data set and then refitting it. The distribution of best-fit parameters from these samples gives you an idea of the error on each parameter, and even of their interdependence. In the (standard) method which we will use, you create a new data set by randomly drawing data points from your original sample with replacement, so that the same data point can be picked multiple times. In terms of coding, a good way of doing this is to generate sets of random integers between 0 and the length of your data minus one. You can then pass these to your data arrays to extract random samples, each of which can be fitted with your model to obtain a set of best-fit parameters, as in the function below.
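To make the resampling step concrete before looking at the full function below, here is a minimal sketch (not part of the original worksheet) of drawing a single resampled data set; np.random.randint returns integers from 0 up to, but not including, the number given, so indexing the arrays with them gives a sample drawn with replacement.

    # draw one resampled data set, with replacement (illustrative only)
    x_arr = np.array(x)
    y_arr = np.array(y)
    e_arr = np.array(errs)

    idx = np.random.randint(len(x_arr), size=len(x_arr))  # random indices, repeats allowed
    x_sample = x_arr[idx]
    y_sample = y_arr[idx]
    e_sample = e_arr[idx]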
In [66]: def Bootstrap(data, params, model, n):
             '''
             Bootstrap function - resample the data then refit multiple times.

             Inputs:
                 data (array)     - data to bootstrap - list containing three
                                    arrays of [x, y, yerr]
                 params (array)   - array of initial values for each parameter
                 model (function) - the model function to be bootstrapped
                 n (int)          - the number of random samples to fit
             Outputs:
                 Params (array)   - list of arrays containing the fitted values
                                    of the parameters for each random sample
             '''
             data = np.array(data)   # make sure the data is a numpy array

             # create array to contain params from random samples
             Params = [[] for i in range(len(params))]

             # create a list of n arrays of random indices between 0 and
             # the length of the data set (-1)
             indices = np.random.randint(len(data[0]), size=(n, len(data[0])))

             # create n randomly drawn samples
             X = data[0][indices]
             Y = data[1][indices]
             Err = data[2][indices]

             # refit each random sample with your model
             for i in range(n):
                 # chi squared fit
                 mi = minimize(ModelChiSquared, params, method='Nelder-Mead',
                               args=(model, [X[i], Y[i], Err[i]]))
                 par = mi['x']
                 # append best-fit params to 'Params' array
                 for p in range(len(Params)):
                     Params[p].append(par[p])

             return np.array(Params)

         # run the function!
         Params = Bootstrap([x, y, errs], [0,1], Model, 300)

If we plot a histogram of these parameters, we can see the kind of distribution we get for each:

In [72]: # gradient
         plt.hist(Params[0], bins=15)
         plt.show()

In [71]: # intercept
         plt.hist(Params[1], bins=15)
         plt.show()

We can now see that the values of both 'g' and 'c' that we got in our fit lie well within the distribution of each parameter that we get from resampling the data. Presenting this distribution is the best way of showing your errors, because it gives all of the information we have about them. If we want to present a value with numerical errors, however, we need to do a little more. We could estimate a rough width by eye, but ideally we want something more quantitative, so we will find the standard deviation of each distribution and use these as the errors:

In [76]: # declare a function to find the standard deviation of each parameter distribution
         def SD(x):
             '''
             Find the standard deviation of a set of data.
             inputs:  x (array)   - input values
             outputs: std (float) - standard deviation
             '''
             mean = np.mean(x)              # calculate mean
             diff = x - mean                # subtract from data
             dev = (diff**2.0) / len(x)     # square, divide by N
             std = np.sqrt(sum(dev))        # sum, square root
             return std

         print SD(Params[0])
         print SD(Params[1])

0.0179048640108
1.09089954309

Even without knowing the input model, we can therefore say with confidence that the gradient is 1.97 +/- 0.02 and the intercept is 2.86 +/- 1.09.

Q4. Run a bootstrap on your straight line fit to 'example4.dat' and find the errors on each parameter.

Q5. Plot the best-fit parameters from your bootstrap fits against one another. Is there any trend? What would it mean if there was?

Q6. Using the data you reduced in the last workshop, find the orbital period of the observed system, with errors.

5 Homework

Fit the supernova (from the last workshop's homework) with an exponential (rise), a Gaussian (peak) and a power law (tail) and find the relevant best-fitting parameters, with errors.

Exponential: where A, a, b and c are all constants.

Gaussian: where mu is the mean (centre), sigma is the width and A is a constant.

Power law: where a is a constant.

Created by Sam Connolly, March 2015.