Mining the MACHO dataset
Markus Hegland, Mathematical Sciences Institute, ANU
Margaret Kahn, ANU Supercomputer Facility

• The MACHO project
• Woods data set
• Data exploration and data properties
• Data preprocessing
• Feature sets
• Classification using additive models
• Training process
• Web site

The MACHO Project
• Aim: to find evidence of dark matter from the gravitational lensing effect
• Observations at Mt Stromlo, 1992 - 1999
• 6 × 10^7 observed stars
• 100000 CCD images

Woods Data Set
• 792 stars identified as long-period variables
• Chosen from the full MACHO data set
• Original data processed by SODOPHOT to give red and blue light curves
• Missing data
• Large errors
• Unequal sampling

[Figure: Stars from the Woods data set]
[Figure: Two typical long-period stars]

Data Preprocessing
• The data are not sampled uniformly, so Fourier transforms cannot be used directly.
• Periodic stars satisfy f(t+p) = f(t) for some period p.
• Long-period variable stars are not exactly periodic, e.g. f(t) = f(t+p) + g(t), where g is small compared with f.
• Use “periodic smoothing” to estimate missing data.

Periodic Smoothing
An estimate for f can be determined by minimizing the functional

$$ J(f) = \sum_{i=1}^{n} \bigl(f(t_i) - y_i\bigr)^2 + \alpha \int f'(t)^2 \, dt + \beta \int \bigl(f(t+p) - f(t)\bigr)^2 \, dt $$

where y_i is the magnitude observed at time t_i, and α and β are the multipliers of the smoothness and periodicity penalties.
The function f is modelled as a piecewise linear function.
In practice p is not known, but it can be estimated by a method such as Pisarenko’s method.
For now the second penalty multiplier (β) is much smaller than the first (α).

Feature Sets
• Features are calculated to characterize the light curves.
• Magnitudes are observed in both the red and the blue frequency ranges.
• The difference between the two magnitudes, called the colour index, is the logarithm of the ratio of the intensities of blue and red light.
• Summary features of the light curves are obtained from the colour and the magnitudes by forming the average (or median) over time, the amplitude of the fluctuations, the average frequency or time scale, and a measure of the time scale fluctuations.

Features cont’d.
• Correlation between red and blue magnitudes.
• 9 features are calculated and stored for each light curve.
• These features, NOT the original light curve data, are used as predictor variables for the classifier.

Classification using additive models
ANOVA decomposition: Friedman (MARS, 1991), Hastie & Tibshirani (GAM, 1990), Wahba (1990)

$$ f(x) = f_0 + \sum_i f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j) + \dots $$

For example, such a function could approximate a classification function

$$ f(x) = \log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} $$

to decide which of two classes (0 or 1) a particular star belongs to.

Additive Models
• In general an additive model is expressed as a sum of unknown smooth functions that have to be estimated from the data.
• The model is fitted by a local scoring algorithm, which iteratively fits weighted additive models by a back-fitting algorithm.
• Back-fitting is a Gauss-Seidel method which iteratively smooths partial residuals by weighted linear least squares.

Possible basis functions for the approximation space in 1D
• Indicator functions
• Hat functions
• Hierarchical hat functions
ADDFIT uses 1D basis functions.

Boosting
• Boosting is a machine learning procedure that can improve the accuracy of a weak learning algorithm.
• The AdaBoost procedure used in this code calls a weak learning procedure several times and maintains a distribution of weights over the training set.
• Initially all weights are equal; the weights of incorrectly classified examples are then increased so that the weak learner concentrates more on them.

Training Program
• Start with an initial training set of “acceptable” stars, that is, stars of the type of interest.
• It is helpful to also have a set of “unacceptable” stars to guide the trainer.
• Additive models are used to form a classification function from the feature set data of the initial training set.
• This function is then applied to the full data set and the stars are ordered by their function values.
• The light curves are displayed in decreasing order of function value.
Ideally the training set stars should appear first.
• Further “acceptable” and “unacceptable” stars can be chosen by clicking on the relevant button and then a new classification carried out.
• Continue the process until satisfied with the star sorting.

Web based data mining tool
http://datamining.anu.edu.au
Software link to Macho demo.

This software contains Python code to read ASCII star data files, process them by removing any with insufficient good data, then calculate several features from each star. These features are then used for the training program to select groups of like stars.
The programs have incorporated a method of caching data so that it is kept in binary form for quicker access. The caching software was written by Ole Nielsen and can be downloaded from the ANU datamining web page.

Procedure to run:
• Determine initial training set.
• When prompted enter the star numbers for acceptable stars. Stars 1 and 2 are already entered as a default.
• When the web browser appears with the top ranked 60 stars, those that have already been deemed acceptable will have the “accept” button disabled and those that have been rejected will have the “reject” button disabled.
• The user can then choose more acceptable or unacceptable stars by clicking on the relevant button. Previous decisions can be changed.
• After choosing a few stars click on the “continue” button to see the next 60 top ranked stars or go down to further pages to make more choices.
• Continue until satisfied with the initial ranked stars. Stop by clicking “quit” or “restart”.
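
Sketch of one ranking pass
The rank-and-page step of the procedure above can be illustrated with a small, self-contained sketch. It is not the demo's code: the function names (fit_stump, adaboost, score, rank_pages) are hypothetical, the feature table is synthetic, and decision stumps stand in for the additive-model (ADDFIT) weak learner, whose interface is not described here. Only the 792 stars, the 9 features per star, the 60 stars per page and the AdaBoost weighting idea are taken from the slides.

```python
import numpy as np

PAGE_SIZE = 60  # the demo shows 60 top-ranked stars per page


def fit_stump(X, y, w):
    """Return (error, feature, threshold, sign) of the best one-feature rule."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = w[pred != y].sum()  # weighted misclassification rate
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def adaboost(X, y, rounds=10):
    """AdaBoost with stumps as the (assumed) weak learner; y is +1 or -1."""
    w = np.full(len(y), 1.0 / len(y))  # all weights start out equal
    model = []
    for _ in range(rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] > thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)  # misclassified stars gain weight
        w /= w.sum()
        model.append((alpha, j, thr, sign))
    return model


def score(model, X):
    """Classification function value for every row of X."""
    total = np.zeros(len(X))
    for alpha, j, thr, sign in model:
        total += alpha * np.where(X[:, j] > thr, sign, -sign)
    return total


def rank_pages(features, accepted, rejected, page_size=PAGE_SIZE):
    """Train on the labelled stars, then page all stars by decreasing score."""
    idx = np.array(list(accepted) + list(rejected))
    y = np.array([1] * len(accepted) + [-1] * len(rejected))
    model = adaboost(features[idx], y)
    order = np.argsort(-score(model, features), kind="stable")
    return [order[k:k + page_size] for k in range(0, len(order), page_size)]


# Synthetic stand-in for the 792 x 9 feature table (not MACHO data).
rng = np.random.default_rng(0)
features = rng.normal(size=(792, 9))
features[:, 0] = np.linspace(1.0, 0.0, 792)  # one feature that orders the stars

# Zero-based indices; the slides' default stars 1 and 2 would be 0 and 1 here.
pages = rank_pages(features, accepted=[0, 1], rejected=[500, 501])
print(len(pages), "pages; first page starts with stars", pages[0][:5])
```

In an interactive session the user's new accept and reject choices would be appended to the two lists and rank_pages called again, which corresponds to clicking “continue” in the demo.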