Mining the MACHO dataset
Markus Hegland, Mathematical Sciences Institute, ANU
Margaret Kahn, ANU Supercomputer Facility
• The MACHO project
• Woods data set
• Data exploration and data properties
• Data preprocessing
• Feature sets
• Classification using additive models
• Training process
• Web site
The MACHO Project
• To find evidence of dark matter from the gravitational lensing effect
• Observations at Mt Stromlo, 1992-1999
• 6 x 10^7 observed stars
• 100,000 CCD images
Woods Data Set
• 792 stars identified as long-period variables
• Chosen from the full MACHO data set
• Original data processed by SODOPHOT to give red and blue light curves
• Missing data
• Large errors
• Unequal sampling
[Figure: stars from the Woods data set; two typical long-period stars]
Data Preprocessing
• Data sampling is not uniform, so Fourier transforms cannot be used.
• Periodic stars satisfy f(t+p) = f(t) for some period p.
• Long-period variable stars are not exactly periodic, e.g. f(t) = f(t+p) + g(t) where g is small compared with f.
• Use "periodic smoothing" to estimate missing data.
Periodic Smoothing
An estimate for f can be determined by minimizing the functional

J(f) = \sum_{i=1}^{n} (f(t_i) - y_i)^2 + \lambda_1 \int f'(t)^2 \, dt + \lambda_2 \int (f(t+p) - f(t))^2 \, dt

The function f is modeled as a piecewise linear function. In practice p is not known, but it can be estimated by a method such as Pisarenko's method. For now the second penalty multiplier \lambda_2 is taken much smaller than the first.
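A minimal sketch of one way to carry out this minimization, assuming a piecewise-linear f on a uniform grid, with first differences standing in for the derivative penalty; the function name, grid size, and the values of lam1, lam2 and period are illustrative, not the project's code:

import numpy as np

def periodic_smooth(t, y, period, m=200, lam1=1.0, lam2=0.1):
    # Minimize J(f) = sum_i (f(t_i) - y_i)^2 + lam1 * int f'(t)^2 dt
    #               + lam2 * int (f(t+p) - f(t))^2 dt
    # with f piecewise linear on a uniform grid (illustrative only).
    grid = np.linspace(t.min(), t.max(), m)
    h = grid[1] - grid[0]

    # Data term: linear interpolation matrix A with (A @ f)[i] ~ f(t_i).
    A = np.zeros((len(t), m))
    idx = np.clip(np.searchsorted(grid, t) - 1, 0, m - 2)
    w = (t - grid[idx]) / h
    A[np.arange(len(t)), idx] = 1 - w
    A[np.arange(len(t)), idx + 1] = w

    # First differences approximate the smoothness penalty int f'(t)^2 dt.
    D = (np.eye(m, k=1)[:-1] - np.eye(m)[:-1]) / h
    # Periodicity penalty: compare f at grid points one period apart.
    shift = max(1, min(m - 1, int(round(period / h))))
    P = np.eye(m, k=shift)[:m - shift] - np.eye(m)[:m - shift]

    # Stack all three terms into one linear least-squares problem.
    M = np.vstack([A, np.sqrt(lam1 * h) * D, np.sqrt(lam2 * h) * P])
    rhs = np.concatenate([y, np.zeros(D.shape[0] + P.shape[0])])
    f, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return grid, f

Because f is piecewise linear, the whole problem stays a finite-dimensional least-squares system, which is why this formulation copes with missing data and unequal sampling where a Fourier transform cannot.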
Feature Sets
• Features are calculated to characterize the light curves.
• Magnitudes are observed for both the red and blue frequency ranges.
• The difference between these is the logarithm of the ratio of the intensities of blue and red light, called the colour index.
• Summary features of the light curves are obtained from the colour and magnitudes by forming the average (or median) over time, the amplitude of the fluctuations, the average frequency or time scale, and a measure of the time-scale fluctuations.
Features cont’d.
• Correlation between red and blue magnitudes.
• 9 features are calculated and stored for each light curve.
• These features, NOT the original light curve data, are used as predictor variables for the classifier (a sketch follows).
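A sketch of this kind of per-star feature computation, assuming red and blue magnitudes sampled at common times t; the particular choices below (medians, percentile-based amplitudes, a crossing-based time scale) are illustrative and need not match the project's nine stored features:

import numpy as np

def light_curve_features(t, red, blue):
    # Summary features for one star; red/blue are magnitudes at times t.
    # Magnitudes are logarithmic, so blue - red is the log of the
    # blue/red intensity ratio, i.e. the colour index.
    colour = blue - red
    return {
        "red_median": np.median(red),
        "blue_median": np.median(blue),
        "colour_median": np.median(colour),
        # Amplitude of the fluctuations, via a robust range estimate.
        "red_amplitude": np.percentile(red, 95) - np.percentile(red, 5),
        "blue_amplitude": np.percentile(blue, 95) - np.percentile(blue, 5),
        "colour_amplitude": np.percentile(colour, 95) - np.percentile(colour, 5),
        # A crude time scale: mean spacing of mean-level crossings.
        "red_timescale": mean_crossing_time(t, red),
        "blue_timescale": mean_crossing_time(t, blue),
        # Correlation between the red and blue magnitudes.
        "red_blue_corr": np.corrcoef(red, blue)[0, 1],
    }

def mean_crossing_time(t, y):
    s = np.sign(y - np.mean(y))
    crossings = np.nonzero(np.diff(s))[0]
    if len(crossings) < 2:
        return t[-1] - t[0]
    return np.mean(np.diff(t[crossings]))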
Classification using additive models
ANOVA decomposition: Friedman (MARS, 1991), Hastie and Tibshirani (GAM, 1990), Wahba (1990)

f(x) = f_0 + \sum_i f_i(x_i) + \sum_{i<j} f_{i,j}(x_i, x_j) + \ldots

For example, such a function could approximate a classification function

f(x) = \log \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}

to decide to which of two classes (0 or 1) a particular star belongs.
Additive Models
• In general an additive model is expressed as a sum of unknown smooth functions that have to be estimated from the data.
• The model is fitted using a local scoring algorithm, which iteratively fits weighted additive models by a back-fitting algorithm.
• This is a Gauss-Seidel method which iteratively smooths partial residuals by weighted linear least squares (see the sketch below).
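A minimal back-fitting loop in that spirit; the running-mean smoother below stands in for the weighted spline or hat-function smoothers the talk describes, and the unweighted update is a simplification of local scoring:

import numpy as np

def backfit(X, y, n_iter=20, window=11):
    # Gauss-Seidel style back-fitting: cycle over the predictors,
    # smoothing the partial residuals against one coordinate at a time.
    n, d = X.shape
    f0 = np.mean(y)
    comps = [np.zeros(n) for _ in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual: remove every component except the j-th.
            r = y - f0 - sum(comps[k] for k in range(d) if k != j)
            comps[j] = smooth_1d(X[:, j], r, window)
            comps[j] -= comps[j].mean()   # keep the components centred
    return f0, comps

def smooth_1d(x, r, window):
    # Running-mean smoother of r against x; a stand-in weak smoother.
    order = np.argsort(x)
    sm = np.convolve(r[order], np.ones(window) / window, mode="same")
    out = np.empty_like(sm, dtype=float)
    out[order] = sm
    return out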
[Figure: possible basis functions for the approximation space in 1D: indicator functions, hat functions, and hierarchical hat functions. ADDFIT uses 1D basis functions.]
Boosting
• Boosting is a machine learning procedure which can improve the accuracy of a weak learning algorithm.
• The AdaBoost procedure used in this code calls a weak learning procedure several times and maintains a distribution of weights over the training set.
• Initially all weights are equal; the weights of incorrectly classified examples are then increased so that the weak learner concentrates more on them (a sketch follows).
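A compact version of the textbook AdaBoost loop, assuming a generic weak_fit(X, y, w) callable that returns a {-1, +1} classifier; this illustrates the reweighting scheme, not the project's implementation:

import numpy as np

def adaboost(X, y, weak_fit, n_rounds=50):
    # Textbook AdaBoost for labels y in {-1, +1}.  weak_fit(X, y, w)
    # must return a classifier h with h(X) in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)            # start from equal weights
    hs, alphas = [], []
    for _ in range(n_rounds):
        h = weak_fit(X, y, w)
        pred = h(X)
        err = np.sum(w * (pred != y))
        if err >= 0.5:                 # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        # Up-weight the misclassified examples for the next round.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))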
Training Program
• Start with an initial training set of "accepted" stars, that is, stars of the type of interest.
• It is helpful to also have a set of "unacceptable" stars to guide the trainer.
• Additive models are used to form a classification function using the feature set data from the initial training set.
• This function is then applied to the full data set and the stars are ordered by their function values.
• The light curves are displayed in decreasing order of function value. Ideally the training set stars should appear first.
• Further "acceptable" and "unacceptable" stars can be chosen by clicking on the relevant button, and then a new classification is carried out.
• Continue the process until satisfied with the star sorting (the loop is sketched below).
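Viewed as pseudocode, this interactive process is a relevance-feedback loop; fit_classifier and show_and_collect_labels below are placeholders for the additive-model fit and the web interface, not real functions from the software:

def training_loop(features, fit_classifier, show_and_collect_labels):
    # Relevance-feedback loop: fit, rank, ask the user, repeat.
    accepted, rejected = {1, 2}, set()        # stars 1 and 2 as defaults
    while True:
        clf = fit_classifier(features, accepted, rejected)
        scores = clf(features)
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        # Show the top-ranked 60 light curves and collect new labels.
        new_acc, new_rej, done = show_and_collect_labels(order[:60])
        accepted |= new_acc
        rejected |= new_rej
        if done:                              # the user clicked "quit"
            return order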
Web based data mining tool
http://datamining.anu.edu.au
Follow the Software link to the Macho demo.
This software contains Python code to read ASCII star data files, process them by removing any with insufficient good data, and then calculate several features from each star. These features are then used by the training program to select groups of like stars.
The programs incorporate a method of caching data so that it is kept in binary form for quicker access (the general idea is sketched below). The caching software was written by Ole Nielsen and can be downloaded from the ANU datamining web page.
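The pattern behind such binary caching, sketched with the standard pickle module; this is only an illustration of the idea, not Ole Nielsen's package or its API:

import hashlib
import os
import pickle

def cached(compute, key, cache_dir=".cache"):
    # Return compute()'s result, storing it in binary (pickle) form so
    # that a repeated run loads from disk instead of recomputing.
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.md5(key.encode()).hexdigest() + ".pkl"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result

For example, a feature table could be wrapped as cached(lambda: compute_all_features(stars), key="features-v1"), where compute_all_features is a hypothetical stand-in for the feature-calculation step above.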
Procedure to run:
• Determine an initial training set.
• When prompted, enter the star numbers of acceptable stars. Stars 1 and 2 are already entered as a default.
• When the web browser appears with the 60 top-ranked stars, those that have already been deemed acceptable will have the "accept" button disabled, and those that have been rejected will have the "reject" button disabled.
• The user can then choose more acceptable or unacceptable stars by clicking on the relevant button. Previous decisions can be changed.
• After choosing a few stars, click on the "continue" button to see the next 60 top-ranked stars, or go down to further pages to make more choices.
• Continue until satisfied with the initial ranked stars. Stop by clicking "quit" or "restart".