Statistics and Data Analysis: Wk 6
Recap Wk 5
Midterm: Tue 5th May
Results: Fri 8th May
● Pearson’s χ² Statistic, DoF (k = N - n)
● Significance of χ² statistic: P-value, P-value table
● χ² example: distribution of stars in a stellar cluster
● Reduced χ²: model ‘OK’ if χ²_red ~ 1
● Penalised goodness of fit: AIC, BIC
● PDF, CDF of the χ² distribution
● Why is P = 0.05 significant? (Pearson vs Fisher)
Bayesian Information Criterion
The Bayesian Information Criterion (BIC, Schwarz 1978) is also
a penalised goodness of fit estimator.
Assuming Gaussianity, for a given χ² value, number of fitted
parameters n, and number of data points N, BIC is given by:
BIC = χ² + n ln(N)
As a general rule of thumb, where ΔBIC is the difference in BIC
between two models fitted to the same data:
Not significant: ΔBIC < 2
Significant: ΔBIC > 2
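The comparison can be sketched in a few lines of Python. This is a minimal illustration assuming the standard Gaussian-error form BIC = χ² + n ln(N); the function name `bic` and the two sets of fit results are hypothetical and not taken from the lecture.

```python
import math


def bic(chi2: float, n_params: int, n_data: int) -> float:
    """BIC for Gaussian errors: chi^2 plus a penalty of n * ln(N)."""
    return chi2 + n_params * math.log(n_data)


# Hypothetical fit results for two models of the same N = 100 data points.
bic_simple = bic(chi2=130.0, n_params=3, n_data=100)
bic_complex = bic(chi2=105.0, n_params=5, n_data=100)

# A positive difference means the complex model has the lower (better) BIC.
delta_bic = bic_simple - bic_complex

if abs(delta_bic) < 2:
    verdict = "no significant preference"
elif delta_bic > 0:
    verdict = "complex model preferred"
else:
    verdict = "simple model preferred"

print(f"dBIC = {delta_bic:.2f}: {verdict}")
```

The extra parameters of the complex model are only worth keeping if the drop in χ² outweighs the added penalty of ln(N) per parameter.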
BIC Example: Galaxy Model Fits
The light profiles of galaxies are often parameterised using the
Sérsic (1968) function, as given by:
I(r) = I0 exp[-bn (r / re)^(1/n)]
where I0 is the central intensity (flux), r is the radius, re is the
half-light radius, n is the Sérsic index, and bn is a function of n
(bn ~ 2n - ⅓).
Alongside size re, Sérsic index n, and central intensity I0, the
Sérsic model fit also includes position (x,y), position angle θ and
ellipticity e: a total of 7 fitted parameters.
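A minimal sketch of the one-dimensional profile, assuming the form I(r) = I0 exp[-bn (r/re)^(1/n)] that follows from defining I0 as the central intensity, together with the approximation bn ~ 2n - ⅓ quoted above. The function names and parameter values are illustrative only; position, position angle and ellipticity are ignored here.

```python
import math


def b_n(n: float) -> float:
    """Approximation b_n ~ 2n - 1/3 for the Sersic index n."""
    return 2.0 * n - 1.0 / 3.0


def sersic(r: float, i0: float, r_e: float, n: float) -> float:
    """Intensity at radius r for central intensity i0 and half-light radius r_e."""
    return i0 * math.exp(-b_n(n) * (r / r_e) ** (1.0 / n))


# Compare an exponential disc (n = 1) with a de Vaucouleurs profile (n = 4),
# both with hypothetical i0 = 1 and r_e = 2 (arbitrary units).
for radius in (0.0, 1.0, 2.0, 4.0):
    disc = sersic(radius, i0=1.0, r_e=2.0, n=1.0)
    bulge = sersic(radius, i0=1.0, r_e=2.0, n=4.0)
    print(f"r = {radius:3.1f}  n=1: {disc:.4f}  n=4: {bulge:.6f}")
```

Higher Sérsic indices give a steeper central cusp and more extended wings for the same half-light radius.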
BIC Example: Galaxy Model Fits
In this case, the latter model (more complex) is preferred!
Point Estimates
A single value computed from a sample to estimate a population
parameter, e.g., the mean, is known as a point estimate.
How sure are we that the estimated sample mean is truly
representative of the population mean?
What is the standard error of the sample mean, and how
may we estimate this from our sample statistics?
How may we estimate the confidence interval of our point
estimates?
Point Estimate Example: Run
The 2012 Cherry Blossom 10 mile run in Washington had
16,924 runners.
The age, gender, place of origin and time taken to complete the
run are recorded:
Point Estimate Example: Run
We randomly sample 100 runners from the race:
Can we use this information to infer global population parameters?
Point Estimate Example: Run
Perhaps the most intuitive way to estimate the population mean
is to calculate the sample mean.
The sample mean is known as a point estimate of the
population mean.
Usually, one neither knows nor is able to measure data from the
entire population; the point estimate is therefore our ‘best guess’
at the population parameter.
Running Mean
Point estimates are not exact: they vary from sample to sample,
and the size of that variation depends on your sample size.
Generally, the estimate of summary statistics becomes more
robust as the sample size becomes larger (n >~ 30).
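A running mean can be sketched as follows. The race data are not reproduced here, so the population below is a synthetic stand-in (Gaussian with a hypothetical mean of 95 and standard deviation of 16); only the population size and the sample size of 100 follow the example.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in for the 16,924 finishing times (hypothetical values).
population = [rng.gauss(95.0, 16.0) for _ in range(16924)]

# Draw 100 runners at random, without replacement.
sample = rng.sample(population, 100)

# Running mean: the point estimate after the first 1, 2, ..., 100 runners.
running_mean = []
total = 0.0
for count, value in enumerate(sample, start=1):
    total += value
    running_mean.append(total / count)

for count in (1, 5, 10, 30, 100):
    print(f"n = {count:3d}  running mean = {running_mean[count - 1]:.2f}")
print(f"population mean = {statistics.mean(population):.2f}")
```

The first few values typically wander noticeably before settling close to the population mean as more runners are included.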
Standard Error of the Mean
Whilst the standard deviation tells us about the distribution of the
values for a given data set, the point estimate (e.g., of the
mean) also has an associated error.
A simple yet effective method to calculate the standard error is
via repeated sampling.
For example: suppose we generate 1000 random samples from
our Cherry Blossom run data set, and calculate the sample mean
for each.
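The repeated-sampling procedure can be sketched as below, again using a synthetic Gaussian stand-in for the race data (hypothetical mean 95, standard deviation 16) and assuming samples of 100 runners each. The analytic value σ/√n is printed alongside for comparison.

```python
import math
import random
import statistics

rng = random.Random(1)

# Synthetic stand-in population (hypothetical values, not the real race data).
population = [rng.gauss(95.0, 16.0) for _ in range(16924)]

n = 100            # runners per sample
n_samples = 1000   # number of repeated samples

# One sample mean per random sample: together they trace the sampling distribution.
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(n_samples)]

# The spread of the sample means is the empirical standard error of the mean.
se_empirical = statistics.stdev(sample_means)
se_theory = statistics.pstdev(population) / math.sqrt(n)

print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"empirical SE = {se_empirical:.2f}, sigma / sqrt(n) = {se_theory:.2f}")
```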
Sampling Distribution
We generate 1000 random samples from our Cherry Blossom
run data set, and calculate the sample mean for each.
Our sample means describe a sampling distribution for our
population:
Central Limit Theorem
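As we increase the size of each random sample, the sampling
distribution of the mean becomes more normal in shape, whatever the
shape of the population (taking more samples only traces that
distribution more finely).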
This is known as the Central Limit Theorem.
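A minimal sketch of the Central Limit Theorem, assuming a hypothetical exponential (strongly skewed) population: the skewness of the distribution of sample means shrinks towards zero, the value for a normal distribution, as the size of each sample grows. The helper names are illustrative.

```python
import random
import statistics

rng = random.Random(7)


def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


def sample_mean_skew(n, n_samples=2000):
    """Skewness of the means of n_samples samples, each of n exponential draws."""
    means = [
        statistics.mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]
    return skewness(means)


skew_small = sample_mean_skew(2)    # tiny samples: means remain clearly skewed
skew_large = sample_mean_skew(30)   # larger samples: means are much closer to normal

print(f"skewness of sample means, n = 2:  {skew_small:.2f}")
print(f"skewness of sample means, n = 30: {skew_large:.2f}")
```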
SE of the Mean
The sampling distribution is approximately normally distributed,
with a mean centred on the true population mean (μ = 94.52).
The standard deviation of the sampling distribution, σ = 1.59,
quantifies the variability around the population mean, and
describes the typical error of the point estimate.
We call the standard deviation of the sampling distribution the
standard error of the mean.
SE of the Mean
Given n independent observations from a population with
standard deviation σ, the standard error of the sample mean is
given by:
SE = σ / √n
However, σ is often unknown. In this case, use the point
estimate of the standard deviation, s, as a proxy for σ!
Assuming:
- large enough sample size (n >~ 30)
- no strong skewness
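With a single sample in hand, the estimate SE ≈ s / √n can be computed directly. The sketch below assumes a synthetic sample of 100 values (hypothetical mean 95, standard deviation 16) in place of the real data.

```python
import math
import random
import statistics

rng = random.Random(3)

# Synthetic sample of n = 100 observations (hypothetical values).
sample = [rng.gauss(95.0, 16.0) for _ in range(100)]

n = len(sample)
x_bar = statistics.mean(sample)   # point estimate of the population mean
s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)             # standard error, with s as a proxy for sigma

print(f"sample mean = {x_bar:.2f}, s = {s:.2f}, SE = {se:.2f}")
```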
Example: Galaxy Model Parameters
Many Sérsic models are generated and thrown randomly into real data.
The magnitude, sizes, position angles and other parameters are fitted.
This process is repeated ~100 times for each model type.
The standard deviation for each fitted parameter provides an estimate
of the error.
Credit: Andreas Heimer, UIBK
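The inject-and-refit idea can be illustrated with a much simpler stand-in than a full Sérsic fit. In the sketch below, a one-dimensional exponential profile of known amplitude is injected into Gaussian noise (rather than real data), only the amplitude is refitted by linear least squares, and the scatter of the fitted values over 100 repeats is taken as the error. All names and numbers are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(11)

# Unit-amplitude template: an exponential profile sampled at 50 radii.
radii = [0.1 * i for i in range(1, 51)]
template = [math.exp(-r) for r in radii]
norm = sum(t * t for t in template)

true_amplitude = 10.0   # amplitude of the injected model
noise_sigma = 0.5       # Gaussian noise level standing in for real data


def inject_and_fit() -> float:
    """Inject the model into fresh noise and return the least-squares amplitude."""
    data = [true_amplitude * t + rng.gauss(0.0, noise_sigma) for t in template]
    return sum(d * t for d, t in zip(data, template)) / norm


# Repeat ~100 times; the spread of the fitted values estimates the error.
fitted = [inject_and_fit() for _ in range(100)]
amplitude_error = statistics.stdev(fitted)

print(f"mean fitted amplitude = {statistics.mean(fitted):.3f}")
print(f"estimated error = {amplitude_error:.3f}")
```

For this linear toy problem the scatter can be checked against the analytic value noise_sigma / √(Σ template²); for a non-linear, multi-parameter fit no such shortcut exists, which is why the simulation approach is useful.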
Confidence Intervals
A plausible range of values for the population parameter is known as
a confidence interval.
Since the standard error (SE) is the standard deviation of the
sampling distribution of the point estimate, we may say that ~95% of
our point estimates will lie within 2×SE of the population parameter.
Confidence Intervals
Exact values to use come from the Gaussian PDF, i.e.:
C-Level   50%    90%    95%    99%    99.9%
z*        0.67   1.64   1.96   2.58   3.29
Generally, the confidence interval for a given point estimate is
given by:
point estimate ± z* × SE
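A short sketch of the interval point estimate ± z* × SE, with z* taken from the inverse CDF of the standard normal distribution via Python's `statistics.NormalDist`. The sample mean of 95.6 is a hypothetical value; the SE of 1.59 is the one quoted earlier for the sampling distribution.

```python
from statistics import NormalDist


def z_star(confidence: float) -> float:
    """Two-sided critical value: the central `confidence` fraction of a Gaussian."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def confidence_interval(estimate: float, se: float, confidence: float = 0.95):
    """Return (low, high) = estimate -/+ z* x SE."""
    half_width = z_star(confidence) * se
    return estimate - half_width, estimate + half_width


# Reproduce the z* values for the confidence levels tabulated above.
for level in (0.50, 0.90, 0.95, 0.99, 0.999):
    print(f"C-level {level:.1%}: z* = {z_star(level):.2f}")

# Hypothetical sample mean of 95.6 with SE = 1.59.
low, high = confidence_interval(estimate=95.6, se=1.59, confidence=0.95)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```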