False Alarm Probability
based on bootstrap and extreme-value methods
for periodogram peaks
Maria Süveges
ISDC Data Centre for Astrophysics
Observatory of Geneva
7th Conference on Astronomical Data Analysis
17 May 2012, Cargèse
Thursday, May 17, 2012
Variable star analysis
Analysis of variable objects:
valuable input for several fields
• asteroseismology
• exoplanet search
• cosmic distance scale
• multiple systems
• ...
Fundamental step is the identification of the period (or periods)
Method: estimate the periodogram
Period identification
[Figure: light curve of the source; magnitude (15.5 to 17.5 mag) versus time [days] (51000 to 54500)]
Is this source periodic?
Hypothesis testing
Is the found peak significant?
H0 : the time series does not contain any periodic signal
(white noise)
H1 : the time series contains a periodic component
Hypothesis testing
Test statistic: The maximum of the periodogram
Decision: based on the False Alarm Probability (FAP): the
probability that a (white) noise sequence produces a similar or
higher maximum
Distribution of the maximum of independent random variables X1, ..., XM with marginal distribution F(x):
\[ G(x) = \Pr\left( \max_{i \in \{1,\dots,M\}} X_i \le x \right) = \Pr(X_1 \le x) \cdots \Pr(X_M \le x) = F(x)^M \]
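A minimal sketch of how the product formula turns into a FAP. It assumes the classical exponential margin F(x) = 1 − exp(−x) of the normalised Lomb-Scargle periodogram under Gaussian white noise (Scargle 1982) and a known number M of independent frequencies; the function name `fap_independent` is illustrative only.

```python
import numpy as np


def fap_independent(x_max, m):
    """FAP = 1 - F(x_max)^M for M independent periodogram values.

    Assumption: F(x) = 1 - exp(-x), the exponential margin that holds
    for Gaussian white noise; other periodograms need another F.
    """
    # log F(x) computed with log1p, and 1 - exp(.) with expm1, so that
    # small false alarm probabilities do not lose precision
    log_f = np.log1p(-np.exp(-x_max))
    return -np.expm1(m * log_f)


print(fap_independent(10.0, 100))
```

The same peak height gives a larger FAP when more independent frequencies are tested, which is why M matters as much as F(x).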
Hypothesis testing
Ingredients for F(x)^M:
• marginal distribution F(x) of the periodogram
✦ theoretical F(x) different for each method (Lomb 1976, Scargle 1982, Schwarzenberg-Czerny 1998); problems (loss of orthogonality)
✦ Gaussian assumption is not appropriate:
[Figure: left panel, magnitude deviation (−0.3 to 0.3) versus time [days] (0 to 10); right panel, normal Q−Q plot, sample quantiles versus theoretical quantiles]
H0 : the observed time series is noise ...but Gaussian?
⇒ theoretical periodogram distributions are not valid, especially not in the tails
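One way to check the Gaussian assumption on a set of magnitude deviations is sketched below. It uses a Shapiro-Wilk test and the correlation coefficient of a normal Q-Q plot from SciPy; the function name `normality_summary` and the choice of these two diagnostics are assumptions made for illustration, not the procedure used in the talk.

```python
import numpy as np
from scipy import stats


def normality_summary(residuals):
    """Return (Shapiro-Wilk p-value, normal Q-Q correlation r)."""
    residuals = np.asarray(residuals, dtype=float)
    # small p-value: evidence against Gaussianity
    _, p_value = stats.shapiro(residuals)
    # r close to 1: points lie close to the Q-Q reference line
    _, (_, _, r) = stats.probplot(residuals, dist="norm")
    return p_value, r


rng = np.random.default_rng(0)
print(normality_summary(rng.normal(size=500)))         # Gaussian sample
print(normality_summary(rng.standard_t(2, size=500)))  # heavy-tailed sample
```

Heavy tails show up as a low p-value and a visibly smaller r, which is the situation in which tail probabilities derived under Gaussianity become unreliable.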
Hypothesis testing
Ingredients for F(x)^M:
• number of independent test frequencies M:
✦ loss of the orthogonal (“independent”) Fourier frequency system, due to the irregular sampling
✦ simulations (Horne and Baliunas 1986, Frescura et al. 2008): it is not equal to the expected N/2 even in the regularly sampled Gaussian case!
M should be estimated.
Estimation
Limitations of F(x)^M
• In the irregularly sampled case, there is no independent frequency set, so the formula F(x)^M is not valid.
• In addition, we look for the distribution of the maximum of a strongly oversampled spectrum.
• Use of F̂(x)^M̂ is extremely unstable: with M = 25, for FAP = 0.01, we need
\[ \mathrm{FAP} = 1 - \hat{F}(x)^{\hat{M}} \approx 1 - 0.9996^{25} \]
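The instability can be illustrated numerically. The sketch below computes the marginal level needed for FAP = 0.01 with M = 25, namely 0.99^(1/25) ≈ 0.9996, and then shows how much the FAP moves under small errors in the estimated margin or in the estimated M. The perturbation sizes are arbitrary illustrative choices.

```python
def fap_from_marginal(f_value, m_hat):
    """FAP = 1 - F^M for a given marginal CDF value and frequency count."""
    return 1.0 - f_value ** m_hat


m = 25
fap_target = 0.01
# marginal CDF level that must be resolved to reach the target FAP
f_required = (1.0 - fap_target) ** (1.0 / m)
print("required F(x):", f_required)

# an error of 2e-4 in the estimated margin changes the FAP substantially
for delta in (-2e-4, 0.0, 2e-4):
    print("F error", delta, "-> FAP", fap_from_marginal(f_required + delta, m))

# the FAP is also roughly proportional to the estimated M
for m_hat in (15, 25, 40):
    print("M-hat", m_hat, "-> FAP", fap_from_marginal(f_required, m_hat))
```

An empirical F̂ has to be accurate far out in the tail (beyond the 0.9995 level here) before the product formula gives a usable FAP.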
Extreme-value distributions
Remedy may be:
Use generally valid limiting distributions for the maxima:
Generalized Extreme-Value (GEV) distributions (Leadbetter, Lindgren and Rootzén 1983, Coles 2001)
\[ G(x) = \Pr\left( \max\{X_1, \dots, X_M\} \le x \right) = \exp\left\{ -\left[ 1 + \xi \frac{x-\mu}{\sigma} \right]^{-1/\xi} \right\} \]
First application:
Baluev (2008), based on upcrossings of stochastic processes with chi-squared, F
and beta margins
• relies on theoretically calculated distributions and Gaussianity
• theory for stochastic processes, not continuous transformations of random variables
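The GEV distribution function on this slide can be written out directly, as in the sketch below for ξ ≠ 0. Note that `scipy.stats.genextreme` uses the opposite sign convention for the shape parameter (its `c` equals −ξ), which is easy to get wrong when fitting. The function name `gev_cdf` is illustrative.

```python
import numpy as np
from scipy.stats import genextreme


def gev_cdf(x, xi, sigma, mu):
    """G(x) = exp(-[1 + xi (x - mu) / sigma]^(-1/xi)), for xi != 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = 1.0 + xi * (x - mu) / sigma
    inside = z > 0  # points inside the support of the distribution
    # outside the support the CDF is 0 (xi > 0) or 1 (xi < 0)
    out = np.full_like(z, 0.0 if xi > 0 else 1.0)
    out[inside] = np.exp(-z[inside] ** (-1.0 / xi))
    return out


x = np.linspace(-6.0, 6.0, 13)
print(gev_cdf(x, 0.2, 1.0, 0.0))
print(genextreme.cdf(x, -0.2, loc=0.0, scale=1.0))  # same values, c = -xi
```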
Extreme-value distributions
Dependent variables X1, ..., XM:
Their maximum follows the same family of limiting GEV
distributions as if they were independent (with different parameters
though).
Conditions:
• translation invariance of the joint distributions (= stationarity when X1, ..., XM is a time series)
• decreasing dependence between Xi and Xj, when they are far apart
Extreme-value distributions
Long-range dependence: correlations do not decay ⇒ different theory
Short-range dependence: appearance of wide peaks due to spectral leakage
Extreme-value distributions
Advantages:
• the family fits the maxima of practically every continuous distribution
• valid not only for independent, but (weakly) dependent samples too
• the parameters ξ, σ, μ can be estimated: take sets of M periodogram values, select the maximum of each set, and fit a GEV to these maxima
• max-stable: like the limiting Gaussian distribution of the mean of independent variables, it tends to a stable GEV distribution when the number of tested frequencies L increases
• If we tested a large enough number L of frequencies, we can then more safely use G_whole spectrum(x) = G_fitted(x)^(n/L)
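Max-stability means that raising a GEV distribution function to a power k gives another GEV with the same shape ξ and shifted location and scale: σ_k = σ k^ξ and μ_k = μ + σ (k^ξ − 1)/ξ. The sketch below checks this identity numerically for ξ ≠ 0; the function name and the parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import genextreme


def gev_power_params(xi, sigma, mu, k):
    """Parameters (xi, sigma_k, mu_k) of the GEV equal to G(x)^k, xi != 0."""
    sigma_k = sigma * k ** xi
    mu_k = mu + sigma * (k ** xi - 1.0) / xi
    return xi, sigma_k, mu_k


xi, sigma, mu, k = 0.1, 0.5, 1.0, 4.0
x = np.linspace(0.0, 5.0, 11)
# scipy's shape parameter is c = -xi
powered = genextreme.cdf(x, -xi, loc=mu, scale=sigma) ** k
xi_k, sigma_k, mu_k = gev_power_params(xi, sigma, mu, k)
direct = genextreme.cdf(x, -xi_k, loc=mu_k, scale=sigma_k)
print(np.max(np.abs(powered - direct)))
```

This is what justifies extrapolating from a fit on part of the spectrum to the whole spectrum with an exponent, as long as the blocks can be treated as approximately independent.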
FAP from GEV
Periodogram X1, ..., Xn; number of observations N; width of peaks from leakage K; Xmax = max{X1, ..., Xn}.
Procedure:
1. Bootstrap R times the observed time series
[Figure: bootstrapped time series; magnitude deviation (−0.4 to 0.2) versus time (0 to 25)]
2. Compute the frequency spectra of each repetition around M random frequencies; M is such that n / (K M) < N/2
[Figure: periodogram values around the selected frequencies; normalised χ² (0.00 to 0.20) versus frequency (0 to 100)]
3. Take the maximum of the periodogram values from each repetition, and fit a GEV distribution G_fitted(x; ξ, σ, μ) to the sample of R maxima
4. Compute the GEV for the maximum of the whole periodogram by G_whole spectrum(x; ξ, σ, μ) = G_fitted(x; ξ, σ, μ)^(n/(KM))
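The four steps can be sketched in code as follows. This is an illustration under stated assumptions, not the implementation used in the talk: the periodogram is an unweighted floating-mean least-squares fit (fraction of variance explained), the bootstrap resamples the magnitudes with replacement at fixed observation times, and the M windows of K grid frequencies are drawn once and reused for all repetitions. The names `periodogram` and `gev_fap` and all default values are hypothetical.

```python
import numpy as np
from scipy.stats import genextreme


def periodogram(t, y, freqs):
    """Fraction of variance explained by a constant plus sinusoid at each frequency."""
    tss = np.sum((y - y.mean()) ** 2)
    power = np.empty(len(freqs))
    for j, f in enumerate(freqs):
        arg = 2.0 * np.pi * f * t
        design = np.column_stack([np.ones_like(t), np.cos(arg), np.sin(arg)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        power[j] = 1.0 - np.sum((y - design @ coef) ** 2) / tss
    return power


def gev_fap(t, y, full_grid, M=20, K=16, R=200, seed=0):
    """Return (FAP of the highest peak, height of the highest peak)."""
    rng = np.random.default_rng(seed)
    n = len(full_grid)
    x_max = periodogram(t, y, full_grid).max()

    # step 2 (frequency choice): M windows of K consecutive grid points
    starts = rng.integers(0, n - K, size=M)
    idx = np.concatenate([np.arange(s, s + K) for s in starts])
    local_freqs = full_grid[idx]

    maxima = np.empty(R)
    for r in range(R):
        # step 1: bootstrap the observed values (assumed: times kept fixed)
        y_boot = rng.choice(y, size=len(y), replace=True)
        # steps 2-3: local spectra and their maximum
        maxima[r] = periodogram(t, y_boot, local_freqs).max()

    # step 3: GEV fit; scipy's shape c corresponds to -xi
    c, loc, scale = genextreme.fit(maxima)
    # step 4: extrapolate to the whole spectrum with exponent n / (K M)
    g_whole = genextreme.cdf(x_max, c, loc=loc, scale=scale) ** (n / (K * M))
    return 1.0 - g_whole, x_max
```

A call such as `gev_fap(t, y, freq_grid, M=50, K=16, R=500)` returns the estimated FAP of the highest peak together with the peak height. Only M·K frequencies are evaluated per bootstrap repetition instead of the full grid of n.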
Simulations
Time span: [0, T] with T = 25 days
Time grid: dt = 0.005 day (~10 min)
• with even sampling: N = 5001 observations would occur
• upper frequency limit: fmax = 100 d^-1
• Fourier frequencies: fk = k / (N dt), k = 0, 1, ..., (N−1)/2
• peak width: 1/T = 0.04 d^-1
• oversampled frequency grid: df = 0.0025 d^-1
• sparse sampling in time domain: N1 = 100, N2 = 25
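The quantities on this slide follow from T, dt and df alone. The sketch below recomputes them; the variable names are illustrative, and starting the oversampled grid at df rather than at zero is an assumption.

```python
import numpy as np

T = 25.0      # time span [days]
dt = 0.005    # time grid step [days]
df = 0.0025   # oversampled frequency step [1/day]

n_even = int(round(T / dt)) + 1   # observations under even sampling
f_max = 1.0 / (2.0 * dt)          # upper frequency limit [1/day]
peak_width = 1.0 / T              # width of a spectral peak [1/day]

# Fourier frequencies f_k = k / (N dt), k = 0, ..., (N - 1) / 2
fourier = np.arange((n_even - 1) // 2 + 1) / (n_even * dt)

# oversampled grid (assumed to start at df) up to f_max
n_freq = int(round(f_max / df))
oversampled = df * np.arange(1, n_freq + 1)

# number of oversampled grid points across one peak width
points_per_peak = int(round(peak_width / df))

print(n_even, f_max, peak_width, len(fourier), len(oversampled), points_per_peak)
```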
Simulations
Sinusoidal (pulsating RRc-like): g(t) = A sin(2π F1 t)
• A = 0.15, 0.05, 0.025 mag
• F1 = 3.379865
Noise
• independent, randomly generated error bars σi (mean{σi} = 0.05);
• simulated observations: Yi = g(ti) + εi, εi ~ N(0, σi^2)
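A sketch of how such a light curve could be generated. The slides specify the signal, the mean error bar and the Gaussian noise model; the distribution of the error bars (gamma, here) and the way observation times are picked (uniformly at random from the time grid, here) are assumptions made only for this illustration, as is the function name.

```python
import numpy as np


def simulate_light_curve(amplitude, n_obs, rng, f1=3.379865,
                         total_time=25.0, dt=0.005, mean_sigma=0.05):
    """Return (t, y, sigma) for Y_i = A sin(2 pi F1 t_i) + eps_i."""
    grid = np.arange(int(round(total_time / dt)) + 1) * dt
    # assumption: sparse times drawn uniformly from the regular grid
    t = np.sort(rng.choice(grid, size=n_obs, replace=False))
    # assumption: gamma-distributed error bars with the stated mean
    sigma = rng.gamma(shape=4.0, scale=mean_sigma / 4.0, size=n_obs)
    # heteroscedastic Gaussian noise, eps_i ~ N(0, sigma_i^2)
    y = amplitude * np.sin(2.0 * np.pi * f1 * t) + rng.normal(0.0, sigma)
    return t, y, sigma


rng = np.random.default_rng(42)
for amplitude in (0.15, 0.05, 0.025):
    t, y, sigma = simulate_light_curve(amplitude, 100, rng)
    print(amplitude, t.shape, float(y.std()))
```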
Simulations
Sinusoidal, 100 random nightly observations: time series
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); magnitude deviation (−0.4 to 0.2) versus time [days]]
Simulations
Sinusoidal, 100 random nightly observations: periodograms (GLS)
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); normalised χ² (0.0 to 0.8) versus frequency (0 to 100)]
Simulations
Sinusoidal, 100 random nightly observations: 0.99 quantile
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); normalised χ² (0.0 to 0.8) versus frequency (0 to 100), with 0.99 quantile levels for M = 50, 100, 200, 300, 400, 500]
Simulations
Sinusoidal, 25 random nightly observations
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); magnitude deviation (−0.4 to 0.2) versus time [days]]
Simulations
Sinusoidal, 25 random nightly observations: periodograms (GLS)
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); normalised χ² (0.0 to 0.8) versus frequency (0 to 100)]
Simulations
Sinusoidal, 25 random nightly observations: 0.99 quantile
[Figure: three panels (A = 0.15, 0.05, 0.025 mag); normalised χ² (0.0 to 0.8) versus frequency (0 to 100), with 0.99 quantile levels for M = 50, 100, 200, 300, 400, 500]
Bias of FAP estimates
M = 500
• Short-range dependence accounted for by taking local maxima: we found the GEV distribution of 1/5 of the whole spectrum
• Neglect long-range dependence: consider the whole spectrum as composed of 5 such independent blocks.
• Then: G_whole spectrum(x) = G_fitted(x)^5

                         A = 0.15        A = 0.05        A = 0.025
p nominal                0.99   0.995    0.99   0.995    0.99   0.995
p empirical (N = 100)    0.99   0.995    0.995  0.997    0.991  0.996
p empirical (N = 25)     0.992  0.995    0.992  0.996    0.994  0.997
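The nominal levels in the table correspond to thresholds on the whole-spectrum distribution. Since G_whole spectrum(x) = G_fitted(x)^5, the level-p threshold of the whole spectrum is the level-p^(1/5) quantile of the fitted GEV. The sketch below computes it with SciPy; the GEV parameter values are hypothetical placeholders, not fitted values from the talk.

```python
from scipy.stats import genextreme


def whole_spectrum_quantile(p, xi, sigma, mu, blocks=5):
    """Threshold x with G_fitted(x)^blocks = p, for a fitted GEV(xi, sigma, mu)."""
    # G_fitted(x)^blocks = p  <=>  G_fitted(x) = p^(1 / blocks)
    # scipy's shape parameter is c = -xi
    return genextreme.ppf(p ** (1.0 / blocks), -xi, loc=mu, scale=sigma)


# hypothetical fitted parameters, for illustration only
xi, sigma, mu = -0.1, 0.03, 0.15
for p in (0.99, 0.995):
    print(p, whole_spectrum_quantile(p, xi, sigma, mu))
```

The empirical levels in the table are then the fraction of noise-only spectra whose maximum stays below such a threshold.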
Summary
Advantages:
• no need to know the underlying distribution of the periodogram: the bootstrap reproduces the empirical distribution of the observations, and the GEV is a generally valid limit distribution
• the parameters ξ, σ, μ under H0 can be estimated: no need for theoretical calculations to obtain an approximate distribution of the maxima
• valid not only for independent, but dependent samples too:
✦ no need to worry about the validity of F̂(x)^M̂;
✦ the effect of short- and long-range dependence can be taken into account heuristically, by careful selection of test frequencies;
• max-stability:
✦ avoids the instability of F̂(x)^M̂;
✦ no need to compute the whole periodogram for each bootstrap repetition of the noise process
Summary
Limitations:
• computationally demanding: bootstrap repetitions + local maxima around M selected frequencies = equivalent to computation of the complete spectrum several times;
• independence is not true anywhere in the spectrum: G_whole spectrum(x) = G_fitted(x)^L is only approximate (but less sensitive, nevertheless); long-range memory should be accounted for more carefully
Thank you!