Statistical distributions
Lars Valter, Statistician
LARC and Unit for Health Analysis, County Council of Östergötland

The basics

A variable
• is able or likely to change or be changed
• does not always have the same value (it is not a constant)
• is able to vary between subjects

Variables
• Numerical (quantitative): discrete or continuous
• Categorical (qualitative): ordinal or nominal

Variable examples
• Number of children, visits at PHC (numerical discrete)
• Weight, blood pressure (numerical continuous)
• Disease stage, education (categorical ordinal)
• Sex, blood group, education (categorical nominal)

Example: In a sample (13000) for a population survey, n = 6997 responded. One of many questions was "What is your length (height)?"

Statistical distributions can be described in mathematical terms.

The normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}$$

The two most important measures for numerical variables are
• $\mu$: the mean (or expected value, or average), a measure of the central tendency of the distribution
• $\sigma$: the standard deviation, a measure of the dispersion of the distribution

In the example the mean height is $\mu$ = 1[??].5 and the standard deviation is $\sigma$ = 9.3 (the bracketed digits of $\mu$ are illegible).

Normal distributions
[Figure: two normal distributions with the same standard deviation but different means]
[Figure: three normal distributions with the same mean but different standard deviations]

The normal distribution is characterised by its mean and standard deviation. All normal distributions can be transformed into the same standardised normal distribution by the formula

$$z = \frac{x - \mu}{\sigma}$$

Standardised normal distribution (z): $\mu = 0$ and $\sigma = 1$
[Figure: standardised normal distribution, horizontal axis in standard deviations]

Example: The variable age in a population is normally distributed with mean $\mu$ = 5[?] and standard deviation $\sigma$ = 1[?] (second digits illegible). Find the proportion of the population above 65. Calculate

$$z = \frac{x - \mu}{\sigma} = \frac{65 - \mu}{\sigma} = 1$$

and look up the answer in a table for a standardised normally distributed variable.

Other important quantitative and continuous distributions
• Student's t-distribution is characterised by its degrees of freedom (n − 1)
• The chi-squared ($\chi^2$) distribution is used for categorical variables (both ordinal and nominal). It is characterised by its degrees of freedom
• The F-distribution is characterised by a pair of degrees of freedom

A quantitative discrete variable
X = number of doctor's appointments at a PHC centre (2013)

Number of appointments:    0      1      2      3      4      5      6      ...
Proportion of population:  0.467  0.263  0.130  0.065  0.033  0.018  0.010  ...

Mean = 1.12, standard deviation = 1.60

All types of variables (continuous, ordinal and nominal) can be dichotomized into a binary variable. Some variables are binary from the start.

Example of a binary qualitative variable in a population survey: "What is your sex?"
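The table lookup in the age example earlier (z = 1) can also be done numerically. The following is a minimal sketch, not part of the original slides: the function name `normal_upper_tail` is hypothetical, and it relies only on the identity P(Z > z) = ½·erfc(z/√2) for a standard normal variable.

```python
import math


def normal_upper_tail(z: float) -> float:
    """Return P(Z > z) for a standard normal variable Z.

    Uses the complementary error function: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    """
    return 0.5 * math.erfc(z / math.sqrt(2))


# z = 1 is the standardised value computed in the age example.
z = 1.0
proportion_above = normal_upper_tail(z)
print(f"P(Z > {z}) = {proportion_above:.4f}")  # prints P(Z > 1.0) = 0.1587
```

Under these assumptions, roughly 16% of the population is older than 65, which agrees with the usual printed table for the standardised normal distribution.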
[Figure: bar chart of sex (men, women)]

Coding a binary variable with the values 0 and 1 is very useful for statistical analysis (such a variable is also called a Bernoulli variable).
Example: 0 = male, 1 = female
The mean ($\pi$) of a binary variable coded 1 and 0 is the proportion of ones. The mean of the variable sex is 0.55.

Dichotomizing the number of doctor's appointments into a binary variable
X = 0 if no appointments
X = 1 otherwise (at least one appointment)

From population to sample
• Population: mean $\mu$, standard deviation $\sigma$
• Sample: mean $\bar{x}$, standard deviation $s$

A sample must be randomly drawn
This is the most important condition when creating a sample. There are special cases of sampling, e.g. stratified sampling, but there is still a random component.

Example: Body temperature. Sampling from a population of healthy people.
Mean: $\mu = 37.0$, standard deviation: $\sigma = 0.5$
[Figure: distribution of body temperature]

A sample (n = 25) from the population: 37.24, 37.73, 36.55, ...

The sample size = n

The sample mean:

$$\bar{x} = \frac{\sum x}{n} = \frac{37.24 + 37.73 + 36.55 + \cdots}{25} = 37.02$$

The sample standard deviation:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{(37.24 - 37.02)^2 + (37.73 - 37.02)^2 + (36.55 - 37.02)^2 + \cdots}{24}}$$

[Figures: samples from the population, with their means and standard deviations]
• $\bar{x} = 37.02$, $s = 0.38$
• $\bar{x} = 36.87$, $s = 0.50$
• $\bar{x} = 37.03$, $s = 0.60$
• $\bar{x} = 36.77$, $s = 0.50$

The sampling distribution
The theoretical distribution of sample means is a fundamental concept for statistical analysis.
The mean of the sampling distribution is $\mu$, the same as the population mean.
The standard deviation of the sampling distribution is called the standard error and is

$$se = \frac{\sigma}{\sqrt{n}}$$

Estimated from a sample,

$$se = \frac{s}{\sqrt{n}}$$

Population: mean $\mu = 37$, standard deviation $\sigma = 0.5$
The sampling distribution (n = 25): mean $\mu = 37$, standard error $= \sigma/\sqrt{n} = 0.5/\sqrt{25} = 0.1$

If the population is normally distributed, the sampling distribution is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

What if the population is not normally distributed?
• The mean and standard deviation of the sampling distribution are still $\mu$ and $\sigma/\sqrt{n}$

What about the shape of the sampling distribution?
• The fundamental central limit theorem helps us

The central limit theorem
It goes something like this:
• If the population is normally distributed, so is the sampling distribution.
• If the population is symmetrical but not normally distributed, the sampling distribution is approximately normally distributed even for a rather small sample size.
• If the population is skewed, you need a rather large sample for the sampling distribution to be approximately normally distributed.

A sample mean from a Bernoulli-distributed population will also be approximately normally distributed if both $n\pi > 5$ and $n(1 - \pi) > 5$, where $\pi$ is the proportion of ones in the population.
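The standard error in the body-temperature example (σ/√n = 0.5/√25 = 0.1) can be checked by simulation. This is a rough sketch rather than anything from the original slides: it assumes a normally distributed population, and the number of simulated samples (`N_SAMPLES`) and the random seed are arbitrary illustrative choices.

```python
import math
import random
import statistics

# Population values and sample size taken from the body-temperature example.
MU, SIGMA, N = 37.0, 0.5, 25
# Assumption: how many samples to simulate (not from the slides).
N_SAMPLES = 10_000

# Theoretical standard error of the sample mean: sigma / sqrt(n).
theoretical_se = SIGMA / math.sqrt(N)

# Draw many samples of size N and record each sample mean.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(N_SAMPLES)
]

# The standard deviation of the simulated sample means estimates the standard error.
empirical_se = statistics.stdev(sample_means)

print(f"theoretical standard error: {theoretical_se:.3f}")
print(f"simulated standard error:   {empirical_se:.3f}")
print(f"mean of sample means:       {statistics.fmean(sample_means):.3f}")
```

The simulated values should land close to 0.1 and 37.0 respectively. Replacing `rng.gauss` with a skewed generator would be one way to explore how quickly the central limit theorem takes hold.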
If you have a sample, use p, the proportion of ones in the sample, as an estimator of $\pi$.

A Bernoulli-distributed variable with $\pi = 0.2$ has mean $\mu = \pi$ and standard deviation $\sqrt{\pi(1 - \pi)}$.
[Figure: population]

The sampling distribution of the mean of a Bernoulli variable has mean $\mu$ and standard deviation (s.e.)

$$se = \sqrt{\frac{\pi(1 - \pi)}{n}}$$

• $\pi = 0.2$, n = 10: mean = 0.2, standard error = 0.13
• $\pi = 0.2$, n = 20: mean = 0.2, standard error = 0.09

Some other discrete distributions
• The binomial distribution: the sum of n Bernoulli-distributed variables. Mean = $n\pi$
• The hypergeometric distribution: limited population. Mean = $n\pi$
• The Poisson distribution: the number of events during a time period. Mean = the expected number of events per time unit.
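The Bernoulli standard errors quoted above (0.13 for n = 10 and 0.09 for n = 20 with π = 0.2) follow directly from √(π(1 − π)/n). A small sketch under that formula is given below. The helper names `bernoulli_standard_error` and `normal_approximation_ok` are hypothetical, and the second one simply encodes the rule of thumb nπ > 5 and n(1 − π) > 5 stated earlier.

```python
import math


def bernoulli_standard_error(pi: float, n: int) -> float:
    """Standard error of the sample mean of n Bernoulli(pi) observations."""
    return math.sqrt(pi * (1 - pi) / n)


def normal_approximation_ok(pi: float, n: int) -> bool:
    """Rule of thumb from the slides: both n*pi > 5 and n*(1 - pi) > 5."""
    return n * pi > 5 and n * (1 - pi) > 5


PI = 0.2  # proportion of ones in the population, as in the example
for n in (10, 20):
    se = bernoulli_standard_error(PI, n)
    ok = normal_approximation_ok(PI, n)
    print(f"n={n}: standard error = {se:.2f}, normal approximation ok: {ok}")
```

Note that with π = 0.2 neither n = 10 nor n = 20 satisfies the rule of thumb (nπ is 2 and 4), so by that rule the normal approximation should not yet be relied on at these sample sizes.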