Tenerife Python Data Analysis Worksheet
March 2, 2015
The following instructions and exercises are intended to show you how to import data into python and fit a model
to this data, including calculation of errors. It is split into four sections: importing your data, minimising a function,
fitting a model using chi-squared minimisation and 'bootstrapping', which is a method of estimating errors on fit
parameters. This should enable you to analyse the data you extracted in the first workshop, which is the
final exercise.
1 Pandas and Pylab:
1.1 Loading data, selecting specific data, plotting
The python module 'PANDAS' (Python Data Analysis Library) is a good way to import and manipulate csv files, like
the one produced by aperture photometry in AstroImageJ (despite its '.xls' file extension). The following will show
you how.
1.2 Import Modules
First we need to import the modules we need:
In [80]: import numpy as np
import pylab as plt
import pandas as pd
# This is just so the plots work in this notebook
%matplotlib inline
The 'import ... as ...' notation ensures that you always use commands from the modules you intended to, by forcing
you to use the selected prefix every time you use a module command. For example, after importing the numpy module
with 'import numpy as np', if you wanted to use the numpy 'mean' command you would have to write 'np.mean'
instead of just 'mean'. This means that you can still use the command 'plt.mean', which is a command in the pylab
module, without any ambiguity about which version of 'mean' you're using.
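For example, here is a minimal illustration of the prefixes in use (pylab re-exports many of numpy's functions, so both calls below return the same value, but the prefix always makes explicit which module's 'mean' is being called):
values = [1.0, 2.0, 3.0]
print np.mean(values)    # numpy's mean, via the 'np' prefix
print plt.mean(values)   # pylab's copy of the same function, via the 'plt' prefix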
1.3 Load Data with Pandas
Once we have defined the location and name of the file we want to use, it can be loaded as a Pandas data object with
the command ‘read_table’, under pandas.io.parsers:
In [81]: route = "/home/sdc1g08/Dropbox/Work/Tenerife/astropy/"   # file location
infile = "example1.dat"                                  # file name
data = pd.io.parsers.read_table(route+infile)            # load file
1.4 Retrieve Specific Columns, Plot Data
Now that the data is loaded, we can retrieve specific columns of data by passing the column heading to the data object,
in square brackets. Columns can also be accessed by index, but as the output from AstroImageJ includes so many
columns, it's easier to use the headings. We can check the header names by printing 'data.columns.values'. Here we
extract the x and y values and the errors on the y values, then plot them with pylab:
In [84]: print data.columns.values
x = data['TIME']
y = data['FROGS']
errs = data['FROG_ERR']
# plot with errorbars, no line, round markers
plt.errorbar(x, y, yerr=errs,linestyle='none',marker='o')
plt.show()
['TIME' 'FROGS' 'FROG_ERR']
We could also make a log plot if we wanted to accentuate changes across different orders of magnitude:
In [87]: plt.errorbar(x, y, yerr=errs,linestyle='none',marker='o')
plt.yscale('log')
plt.show()
1.5 Selecting Sections of Data
You are likely to need to select specific sections of data, for example in order to fit a model to only part of a data set.
Sections can be selected by index by specifying the start and/or end position you want to include, separated by a ':':
In [22]: start,end = 10,25
# data between the start and end indices
plt.errorbar(x[start:end],y[start:end],yerr=errs[start:end],\
             linestyle='none',marker='o')
plt.show()
In [23]: # data up to the end index
plt.errorbar(x[:end],y[:end],yerr=errs[:end],linestyle='none',marker='o')
plt.show()
In [24]: # data from the start index
plt.errorbar(x[start:],y[start:],yerr=errs[start:],\
             linestyle='none',marker='o')
plt.show()
You can also select data based on ranges of a specific value, using Boolean comparisons (>,<,==,!= etc.). A Boolean
comparison inside round brackets [i.e. (x >= 3)] will produce a Boolean array which is True wherever the data fit this
comparison, and which can be passed to the data arrays to select those points. These can be combined by multiplication:
In [27]: # x values above 10 for which the y value is below zero
indices = (y < 0)*(x > 10)
plt.errorbar(x[indices],y[indices],yerr=errs[indices],\
             linestyle='none',marker='o')
plt.show()
Q1. Load ‘example2.dat’ and plot parts of the 3rd peak which lie below 0.5.
2 Functions & Minimisation
2.1 Declaring a function, then finding its minimum with scipy
In this section you will learn how to minimise a given function with the command 'minimize' from scipy.optimize. In
this case we only need the 'minimize' command from scipy, so we'll just import that:
In [28]: from scipy.optimize import minimize
2.2 Declaring a function
Functions are declared using the 'def' (define) command, followed by the function name and its input variables in
round brackets [i.e. Function(a,b,c,d)]. If the function is going to be minimised, it needs to have a single, numerical return
value calculated from the input parameters. This output value will be minimised by changing the input parameters.
Here's an example of a function giving the value of a parabola at a given x value:
In [29]: def Parab(v):
    '''
    A parabola with an offset of 1.
    inputs:
        v (float or array) - the x value(s) at which to evaluate the parabola
    outputs:
        z (float or array) - the calculated value of the function at x
    '''
    z = v**2.0 + 1
    return z
If we plot the function we can see what it looks like, and that the minimum will be a value of 1, at x = 0:
In [30]: x = np.arange(-10,10,0.1)
y = Parab(x)
plt.plot(x,y)
plt.show()
2.3 Minimising a Function
To find the minimum, we just pass the function and an initial guess of what the parameter(s)
are at the minimum to the 'minimize' command:
In [31]: m = minimize(Parab, [3])
All of the information from the minimisation is then stored in the output object, in this case called 'm':
In [32]: print m
  status: 0
 success: True
    njev: 4
    nfev: 12
hess_inv: array([[ 0.5]])
     fun: 1.0000000000000002
       x: array([ -1.25301488e-08])
 message: 'Optimization terminated successfully.'
     jac: array([ -1.49011612e-08])
The most important bits are the success/failure of the fit ('success'), the value of the function at the minimum ('fun')
and the value of the input parameter(s) at the minimum ('x'). As expected, the minimum is at an x value of 0 and has
a value of 1, although we can see that the output values are very close to, but not exactly, these values.
Specific bits of information in the output can be accessed by passing the name of the field in square brackets:
print "Did the fit work?",m[’success’]
In [33]: print "Value at minimum: ",m[’fun’]
print "Position of minimum:", m[’x’]
# The success/failure of the fit
# Function value at the minimum
# The paramater(s) at the minimum
Did the fit work? True
Value at minimum: 1.0
Position of minimum: [ -1.25301488e-08]
2.4 Extra Parameters, Extra Arguments
You’re likely to have to fit more complex model, however, with more input paramaters, some of which you may not
want to modify. For example, here’s a 2-D parabola, which has two inputs (x & y) and a seperately-specified offset:
In [34]: def TwoDParab(v,c):
    '''
    A 2D parabola (x squared plus y squared) with an offset of 'c'.
    inputs:
        v (array) - v[0] (the first element) is the x value,
                    v[1] is the y value
        c (float) - The offset of the parabola
    outputs:
        z (float) - the calculated value of the function
    '''
    z = v[0]**2.0 + v[1]**2.0 + c
    return z
'minimize' can only handle functions which take their input parameters in a single array, so the x and y coordinates
are supplied in the array 'v'. The offset, 'c', which we want to remain constant, is not in this array, but is a separate
argument.
The function is minimised in the same way, by supplying the function to minimise and an array of initial guesses of
the parameters at the minimum. In this case the function also needs extra arguments, however, so these are passed to
the function via the parameter 'args' as a tuple (in round brackets). A tuple always has to be comma-separated, so
even if there's only one number in it you need to put a comma afterwards, as below. In this case, we've also specified
the minimisation method we want to be used ('Nelder-Mead'). Different methods work better with different types of
function, so if your minimisation is failing try switching to a different one, but this method should work well in most
cases.
In [35]: m = minimize(TwoDParab, [3,2],args=(3,), method = 'Nelder-Mead')
print m['success']
print m['fun']
print m['x']
True
3.00000000192
[  1.62039581e-05   4.07173472e-05]
Once again, the minimum is found to be at about (0, 0) and at a value of 3, our offset, as expected!
Q2. Use 'minimize' to find the minimum of the following function:
3 Fitting data
3.1 Calculating the chi squared of a fit, minimising it
In this section, you'll learn how to calculate the chi squared value of a model and a data set, which tells you how well
a given model fits a given data set, then how to change your model parameters to minimise the chi squared value in
order to obtain the best fit parameters. Once again, we'll need scipy.optimize.minimize, plus numpy and pylab.
3.2 A Straight Line Model
As an example, we’ll fit a straight line to some data, but the method works for any model. First we need to load the
data, as in part 1 - the data is in ‘example3.dat’ - and plot it to see what it looks like:
In [61]: data = pd.io.parsers.read_table(route+'example3.dat')   # load file
x = data['EGGS']
y = data['JOY']
errs = data['ERR']
plt.errorbar(x,y,yerr=errs,linestyle='none',marker='o',color='blue')
plt.show()
We can now see that the data should be well represented by a straight line model, with a gradient of around 2 and an
intercept close to zero. Next we need to define our model function, as in the last section:
In [37]: def Model(x,m,c):
    '''
    This is my model - in this case a straight line.
    inputs:
        x (array) - x values
        m (float) - gradient of the line
        c (float) - offset of the line
    outputs:
        y (array) - output model values
    '''
    x = np.array(x)  # convert to a numpy array, to make
                     # sure the function can handle arrays
    y = m*x + c
    return y
3.3 Calculating the Chi Squared
In order to get the best fit, we need a measure of how well a given set of parameters fit the data. For this, we use the
chi squared value of the fit. This is calculated by taking the difference between each data point and the model value
for the same x value, dividing by the error and squaring it all, then summing over all data points:

chi^2 = sum over i of [ (x_i - mu_i) / sigma_i ]^2

(Where x is the data, mu is the model and sigma are the errors). We divide by the error so that data points with small
errors contribute to the fit more than those with larger error bars, and square to ensure the numbers are all positive. We
need to define a function that will calculate this for a set of model and data values:
In [38]: def ChiSquared(y,e,m):
    '''
    Function to calculate the chi-squared value of a
    set of data and a model.
    inputs:
        y (array) - measured data points
        e (array) - errors on measured data points
        m (array) - model data points
    outputs:
        X (float) - chi squared value
    '''
    diff = y - m            # difference between data and model values
    weight = diff / e       # weighted difference, using errors
    X = sum(weight**2.0)    # sum of squares of each value
    return X
If we try out different model parameters, we can see that the chi-squared value is smaller the closer we are to the true
parameters, i.e. the better the fit to the data:
In [49]: m1 = Model(x,2,0)
print ChiSquared(y,errs,m1)
122.108194854
In [50]: m2 = Model(x,0,0)
print ChiSquared(y,errs,m2)
53589.186241
In [51]: m3 = Model(x,5,50)
print ChiSquared(y,errs,m3)
185689.708721
In [52]: m4 = Model(x,2,-50)
print ChiSquared(y,errs,m4)
10763.8906
In [53]: plt.errorbar(x,y,yerr=errs,linestyle='none',marker='o',color='blue')
plt.plot(x,m1,color='red')
plt.plot(x,m2,color='blue')
plt.plot(x,m3,color='green')
plt.plot(x,m4,color='black')
plt.show()
3.4 Minimise Chi-Squared - Fit the Model!
We now have our Chi-squared function, which calculates the chi-squared for a given set of data and model values, but
we need to minimise a function that takes the model parameters as input. We therefore define a function that calculates
model values from the model parameters, passes these values to the chi-squared function and returns the chi-squared
value:
In [54]: def ModelChiSquared(vals,model,data):
    '''
    Calculate the chi squared value for a given set of model parameters.
    inputs:
        vals (array)     - array containing the parameters needed to
                           calculate model values for each data point.
        model (function) - the model to calculate the model values.
        data (array)     - data to which the model will be fitted
    outputs:
        X (float)        - chi squared value
    '''
    # calculate model values.
    mod = model(data[0],*vals)   # The '*' means fill the rest of the
                                 # function's arguments with the
                                 # values in array 'vals'
    # calculate chi squared for these model values and the data
    X = ChiSquared(data[1],data[2],mod)
    return X
Now we just need to pass this function to ‘minimize’ with some guesses at the initial parameters:
In [55]: m = minimize(ModelChiSquared,[1,0], \
             method = 'Nelder-Mead',args=(Model,[x,y,errs],))
print 'Success = ',m['success']
print 'Chi-squared = ',m['fun']
print 'Best-fitting g = ',m['x'][0]
print 'Best-fitting c = ',m['x'][1]
Success = True
Chi-squared = 109.647808131
Best-fitting g = 1.97452365549
Best-fitting c = 2.86550734171
The value of the function at the minimum is now your chi-squared value. We can see that our slope, 'm', is very
well recovered, but the intercept, 'c', not so well, due to the random variation/errors we added to the data.
Now that we have fit our data and have a chi-squared value, how do we know that it's actually a good fit, or a low
enough chi-squared value? The best way to find out is by dividing the chi-squared by the number of 'degrees of
freedom' of the fit. This is the number of data points minus the number of free parameters in your model. In this case,
there are two free parameters, 'm' and 'c', so the number of degrees of freedom is 100 - 2 = 98.
The reduced chi-squared value should be as close as possible to 1, but not less than one. A reduced chi-squared of one means
that all of your data are close enough to your model to be well described by it within the expected random variation
from your errors. If it's a lot less than one, it therefore means either that your model is overcomplex, i.e. has too many
free parameters, or that your errors have been overestimated and are too large. If it's much larger than one, your model
is probably not a very good fit to your data, but it could also be that you've underestimated your errors.
Our reduced chi-squared in this case is:
In [56]: DoF = (len(x) - 2)
print 'Reduced chi-squared = ',m['fun']/DoF
Reduced chi-squared = 1.11885518501
We can now see that the reduced chi-squared is close to one and therefore that we have a pretty good fit to the data.
The only thing that remains to be done is to plot our model alongside our data to check visually that the model fit looks
ok, and that nothing went wrong:
In [57]: # create an array of model values from our best fit params
model = Model(x,m['x'][0],m['x'][1])
# plot the data
plt.errorbar(x,y,yerr=errs, marker='o',linestyle='none')
# plot the model
plt.plot(x,model,color='red')
plt.show()
Q3. Fit a straight line model to ‘example4.dat’ and find the best fitting parameters and the reduced chi squared.
4 Bootstrapping - Calculating Errors
4.1 Calculating the errors on the parameters of your model fit
In this section, you’ll learn how to calculate the approximate (but decent!) errors on the parameters of your model fit.
This will involve using a method called ‘bootstrapping’, which involves resampling and then refitting your data, so
we’ll need to do chi-squared minimisation again.
Bootstrapping is a statistical method which allows you to estimate the errors on the parameters of your fit, by resampling
your data set then refitting it. The distribution of best-fit parameters from these samples gives you an idea of the error
on each parameter, and even their interdependence. In the (standard) method which we will use, you will create a new
data set by randomly drawing data points from your original sample, with replacement, so that it is possible for the
same data point to be picked multiple times. In terms of coding, a good way of doing this is therefore to generate
sets of random integers between 0 and the length of your data (minus 1). Then you can just pass these to your data array to
extract random samples, which can be fit with your model to obtain the best fit parameters for each set.
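For example, here is a minimal illustration of this resampling-by-index idea, using a small made-up array (not one of the worksheet data sets):
a = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
# five random indices between 0 and len(a)-1; repeats are allowed
indices = np.random.randint(len(a), size=len(a))
print indices      # e.g. [3 0 3 1 4]
print a[indices]   # the corresponding resampled data set, e.g. [ 40. 10. 40. 20. 50.]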
In [66]: def Bootstrap(data,params,model,n):
    '''
    Bootstrap function - resample the data then refit multiple times.
    Inputs:
        data (array)     - data to bootstrap - list containing
                           three arrays of [x,y,yerr]
        params (array)   - array of initial values for each param
        model (function) - The model function to be bootstrapped
        n (int)          - The number of random samples to fit
    Outputs:
        Params (array)   - list of arrays containing the fitted values
                           of the parameters for each random sample
    '''
    data = np.array(data)  # make sure the data is a numpy array
    # create array to contain params from random samples
    Params = [[] for i in range(len(params))]
    # create list of n arrays of random indices between 0 and
    # the length of the data set (-1)
    indices = np.random.randint(len(data[0]),size=(n,len(data[0])))
    # create n sets of random samples
    X = data[0][indices]
    Y = data[1][indices]
    Err = data[2][indices]
    # refit each random sample with your model
    for i in range(n):
        # chi squared fit
        mi = minimize(ModelChiSquared,params, method = 'Nelder-Mead',\
                      args=(model,[X[i], Y[i], Err[i]]))
        par = mi['x']
        # append best-fit params to 'Params' array
        for p in range(len(Params)):
            Params[p].append(par[p])
    return np.array(Params)
# run the function!
Params = Bootstrap([x,y,errs],[0,1],Model,300)
If we plot a histogram of these parameters, we can see the kind of distribution we get for each:
In [72]: # gradient
plt.hist(Params[0],bins=15)
plt.show()
In [71]: # intercept
plt.hist(Params[1],bins=15)
plt.show()
We can now see that the values of both ‘g’ and ‘c’ that we got in our fit are well within the distribution of each parameter
we get from resampling the data. Presenting this distribution is the best way of showing your errors, because it gives
all of the information we have about the errors. If we want to present a value with numerical errors, we need to do
more though. We can get a rough width by eye, but ideally we want to do something a bit more numerical to obtain
our errors, so we’ll find the standard deviation of each distribution and use these as errors:
In [76]: # declare a function to find the standard deviation of each param distribution
def SD(x):
    '''
    Find the standard deviation of a set of data.
    inputs:
        x (array) - input values
    outputs:
        std (float) - standard deviation
    '''
    mean = np.mean(x)            # calculate mean
    diff = x - mean              # subtract from data
    dev = (diff **2.0) / len(x)  # square, divide by N
    std = np.sqrt(sum(dev))      # sum, square root
    return std
print SD(Params[0])
print SD(Params[1])
0.0179048640108
1.09089954309
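As a quick cross-check, numpy's built-in 'np.std' computes the same (population) standard deviation as the SD function above, so it should return essentially the same numbers:
print np.std(Params[0])   # should match SD(Params[0])
print np.std(Params[1])   # should match SD(Params[1])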
Even without knowing the input model, we can therefore say with confidence that the gradient is 1.97 +/- 0.02 and the
intercept is 2.86 +/- 1.09.
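If you want to quote the results in this form directly from your fit, one simple way (using the 'm' object from the chi-squared fit and the 'Params' array from the bootstrap above) is a formatted print, for example:
print "gradient  = %.2f +/- %.2f" % (m['x'][0], SD(Params[0]))
print "intercept = %.2f +/- %.2f" % (m['x'][1], SD(Params[1]))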
Q4. Run a bootstrap on your straight line fit to ‘example4.dat’ and find the
errors on each parameter.
Q5. Plot the best-fit parameters from your bootstrap fits against one another.
Is there any trend? What would it mean if there was?
Q6. Using the data you reduced in the last workshop, find the orbital period
of the observed system, with errors.
5 Homework
Fit the supernova (from the last workshop's homework) with an exponential
(rise), a Gaussian (peak) and a power law (tail) and find the relevant best-fitting parameters, with errors.
Exponential:
Where A, a, b and c are all constants.
Gaussian:
Where mu is the mean (centre), sigma is the width and A is a constant.
Power law:
Where a is a constant.
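As a starting point, here is a sketch of how the Gaussian model could be written as a Python function, following the pattern of the 'Model' function from section 3. The form below is the standard Gaussian with amplitude A, centre mu and width sigma; check it against the formula given above, and write the exponential and power-law models as functions in the same way:
def GaussianModel(x, A, mu, sigma):
    '''
    Gaussian peak: A * exp(-(x - mu)^2 / (2 * sigma^2))   (assumed standard form)
    inputs:
        x (array)     - x values
        A (float)     - amplitude (constant)
        mu (float)    - mean (centre) of the peak
        sigma (float) - width of the peak
    outputs:
        y (array)     - model values
    '''
    x = np.array(x)
    return A * np.exp(-(x - mu)**2.0 / (2.0 * sigma**2.0))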
Created by Sam Connolly March 2015.