Statistics and Data Analysis: Wk 6

Recap Wk 5
Midterm: Tue 5th May
Results: Fri 8th May

● Pearson’s χ² statistic, DoF (k = N - n)
● Significance of the χ² statistic: P-value, P-value table
● χ² example: distribution of stars in a stellar cluster
● Reduced χ²: model ‘OK’ if χ²_red ~ 1
● Penalised goodness of fit: AIC, BIC
● PDF, CDF of the χ² distribution
● Why is P = 0.05 significant? (Pearson vs Fisher)

Bayesian Information Criterion

The Bayesian Information Criterion (BIC, Schwarz 1978) is also a penalised goodness-of-fit estimator. Assuming Gaussianity, for a given χ² value, number of fitted parameters n, and number of data points N, BIC is given by:

$\mathrm{BIC} = \chi^2 + n \ln N$

As a general rule of thumb:

Not significant: ΔBIC < 2
Significant: ΔBIC > 2

BIC Example: Galaxy Model Fits

The light profiles of galaxies are often parameterised using the Sérsic (1968) function, as given by:

$I(r) = I_0 \exp\left[-b_n \left(r / r_e\right)^{1/n}\right]$

where I_0 is the central intensity (flux), r is the radius, r_e is the half-light radius, n is the Sérsic index, and b_n is a function of n (b_n ~ 2n - ⅓). Note that n here denotes the Sérsic index, not the number of fitted parameters used in the BIC.

Alongside size r_e, Sérsic index n and central intensity I_0, the Sérsic model fit also solves for position (x, y), position angle θ and ellipticity e, giving a total of 7 fitted parameters.

In this case, the latter (more complex) model is preferred!

Point Estimates

A single value computed from a sample to estimate a population parameter, e.g. the mean, is known as a point estimate.

How sure are we that the sample mean is truly representative of the population mean?
What is the standard error of the sample mean, and how may we estimate it from our sample statistics?
How may we estimate the confidence interval of our point estimates?

Point Estimate Example: Run

The 2012 Cherry Blossom 10 mile run in Washington had 16,924 runners.
The age, gender, place of origin and time taken to complete the run are recorded for each runner.

We randomly sample 100 runners from the race. Can we use this information to infer global population parameters?

Perhaps the most intuitive way to estimate the population mean is to calculate the sample mean. The sample mean is known as a point estimate of the population mean. Usually, one cannot measure data from the entire population, so the point estimate is our ‘best guess’ at the population parameter.

Running Mean

Point estimates are not exact: they vary from sample to sample, and by how much depends strongly on the sample size. Generally, estimates of summary statistics become more robust as the sample size becomes larger (n ≳ 30).

Standard Error of the Mean

Whilst the standard deviation tells us about the spread of the values in a given data set, a point estimate (e.g. of the mean) also has an associated error. A simple yet effective method to estimate the standard error is repeated sampling. For example: suppose we generate 1000 random samples from our Cherry Blossom run data set, and calculate the sample mean for each.

Sampling Distribution

The resulting 1000 sample means describe a sampling distribution of the mean for our population.

Central Limit Theorem

As we increase the size of each random sample, the sampling distribution of the mean becomes more normal in shape, largely regardless of the shape of the underlying population. This is known as the Central Limit Theorem. Drawing more samples does not change that shape; it only traces it out more smoothly.

SE of the Mean

The sampling distribution is approximately normal, with a mean centred on the true population mean (μ = 94.52). The standard deviation of the sampling distribution, σ = 1.59, quantifies the variability around the population mean, and describes the typical error of the point estimate. We call the standard deviation of the sampling distribution the standard error of the mean.
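The repeat-sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real Cherry Blossom data are not reproduced here, so the `population` list is a synthetic stand-in drawn from a Gaussian with made-up parameters, and the helper name `sampling_distribution` is hypothetical. The empirical standard error is compared with the standard analytic result σ/√n.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Return the means of n_samples random samples drawn without replacement."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# ASSUMPTION: synthetic stand-in for the finishing times, NOT the real race data.
# The mean and spread used here are illustrative values only.
_rng = random.Random(42)
population = [_rng.gauss(95.0, 16.0) for _ in range(16924)]

# 1000 random samples of 100 runners each, one mean per sample.
means = sampling_distribution(population, sample_size=100, n_samples=1000)

# Standard error estimated empirically: the spread of the sample means.
se_empirical = statistics.stdev(means)

# Analytic standard error: population standard deviation over sqrt(sample size).
se_formula = statistics.pstdev(population) / 100 ** 0.5

print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"empirical SE: {se_empirical:.2f}, sigma/sqrt(n): {se_formula:.2f}")
```

The two standard-error estimates should agree to within a few percent; increasing `n_samples` tightens the agreement, while increasing `sample_size` shrinks the standard error itself.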
Given n independent observations from a population with standard deviation σ, the standard error of the sample mean is given by:

$\mathrm{SE} = \sigma / \sqrt{n}$

However, σ is often unknown. In this case, use the point estimate of the standard deviation, s, as a proxy for σ, assuming:

- a large enough sample size (n ≳ 30)
- no strong skewness

Example: Galaxy Model Parameters

Many Sérsic models are generated and inserted at random positions in real data. The magnitudes, sizes, position angles and other parameters are then fitted. This process is repeated ~100 times for each model type. The standard deviation of each fitted parameter provides an estimate of its error.

Credit: Andreas Heimer, UIBK

Confidence Intervals

A plausible range of values for the population parameter is known as a confidence interval. Since the standard error (SE) is the standard deviation of the sampling distribution of the mean, we may say that ~95% of our point estimates will lie within 2×SE of the population mean.

The exact multipliers z* come from the Gaussian distribution:

C-Level   50%    90%    95%    99%    99.9%
z*        0.67   1.64   1.96   2.58   3.29

Generally, the confidence interval for a given point estimate is given by:

$\text{point estimate} \pm z^{*} \times \mathrm{SE}$
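As a worked illustration, a confidence interval of the form point estimate ± z* × SE can be computed as in the sketch below. It uses the z* values from the table above and takes the sample standard deviation s as a proxy for σ. The function name `confidence_interval` and the synthetic `sample` are assumptions made for the example, not part of the lecture material.

```python
import random
import statistics

# z* multipliers for each confidence level, taken from the table above.
Z_STAR = {0.50: 0.67, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58, 0.999: 3.29}


def confidence_interval(sample, level=0.95):
    """Return (low, high) for the mean, using s / sqrt(n) as the standard error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Sample standard deviation s stands in for the unknown sigma (needs n >~ 30).
    se = statistics.stdev(sample) / n ** 0.5
    half_width = Z_STAR[level] * se
    return mean - half_width, mean + half_width


# ASSUMPTION: synthetic sample of 100 values, NOT the real race data.
rng = random.Random(1)
sample = [rng.gauss(95.0, 16.0) for _ in range(100)]

low, high = confidence_interval(sample, level=0.95)
print(f"95% confidence interval for the mean: [{low:.2f}, {high:.2f}]")
```

A higher confidence level gives a wider interval for the same sample, which is the price of being more certain that the interval contains the population mean.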