Distributions and Confidence Intervals

Random variables
Any characteristic that can be measured or categorised is called a variable. If a variable can assume a number of different values, such that any value is determined by chance, it is called a random variable.
Examples of random variables:
• The number of boys in a 2-child family
• The age of a randomly chosen patient at a clinic
• The smoking status of a randomly chosen member of this class

Probability Distribution
The probability distribution is the breakdown of the total probability, i.e. a description of all the possibilities and their probabilities. Where the number of possibilities is not too many, we can easily find the probability distribution.
Example: The number of boys in a 2-child family. The possibilities are 0, 1 or 2.
Probability (0 boys) = .25
Probability (1 boy) = .50
Probability (2 boys) = .25
Total probability = 1 (100%)
This is the probability distribution.

Probability distribution of a continuous random variable
When there is a whole range of possibilities (e.g. weight, height, cholesterol) then we cannot itemise all the possibilities. How can we describe the probability distribution?
Remember: the probability of an event is the proportion of the time this event would occur in the long run.
Example: If we select a person at random from a population, the probability that his/her height is between 170 and 180 cm = the proportion of the population with height between 170 and 180 cm.
So if we can find a way to describe the proportion of the population in any specified range, we will have described the probability distribution.

Relative frequency histogram (histogram of proportions)
[Figure: relative frequency histogram of ALT (log scale) in a group of Irish Hepatitis C patients]
• The total shaded area = 1.
• The sample histogram is an estimate of the population distribution, which is represented by a smooth curve.
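As a small check on the discrete example earlier in this section (the number of boys in a 2-child family), the distribution can be rebuilt by listing every possible family. This is only a sketch: it assumes each child is equally likely to be a boy or a girl, independently of the other, which is the assumption that gives .25, .50 and .25. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption: each child is a boy (B) or a girl (G) with probability 1/2,
# independently of the other child.
outcomes = list(product("BG", repeat=2))  # BB, BG, GB, GG: equally likely

# Count how many of the equally likely outcomes give 0, 1 or 2 boys.
counts = Counter(outcome.count("B") for outcome in outcomes)

# Probability = (number of outcomes with that many boys) / (total outcomes).
distribution = {
    boys: Fraction(n, len(outcomes)) for boys, n in sorted(counts.items())
}

for boys, prob in distribution.items():
    print(f"Probability ({boys} boys) = {float(prob):.2f}")

# The probabilities must add up to the total probability, 1.
print("Total probability =", sum(distribution.values()))
```

Listing the outcomes works here because there are only four of them; for a continuous variable there is no such list, which is why areas under a curve are used instead.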
In the histogram (x-axis: logalt, y-axis: Fraction), each bar area = proportion of the sample with a value in that range.

Probability distribution of a continuous random variable
The probability distribution of a continuous random variable is a curve where:
• The total area = 1.
• The area between any two values = proportion of the population with a value in that range.
So, the curve shows how the total probability (= 1) is divided.
When the sample histogram is approximately normal, we infer that the population distribution is exactly (or almost!) normal.

For any normal distribution:
(a) The centre is at the population mean, denoted by μ
(b) The variation of the population is given by the standard deviation (SD), denoted by σ
(c) The total enclosed area = 1 (100%)
(d) The area between any two values equals the proportion of the population in that range
(e) The area between μ − 1.96σ and μ + 1.96σ is 95%

Exercise 1: Use Normal Tables to find:
1. The proportion of a normal population that lies within 1 SD either side of the mean
2. How many SDs each side of the mean one must go to include exactly 90% of the population

Reference Ranges ("Normal Ranges")
Diagnostic test results (especially clinical chemistry) are usually classified as normal (disease free) or abnormal (diseased) based on a cut-off value. Tests often have a range of values (i.e. an upper and lower limit) specified which should include the majority of the normal (i.e. healthy) population.
The range of values specified for a test is called the:
• Reference interval
• Normal range
• Reference range
Values outside this interval are considered abnormal.

How to Construct Reference Intervals
If we know that the biochemical marker being measured is normally distributed in the healthy population, then we can say that 95% of the healthy population have values between μ ± 1.96σ.
Or for higher specificity: 99% of the healthy population have values between μ ± 2.58σ.
Of course, we never know μ and σ, as we never have a value for everyone in the healthy population. In practice, we estimate the reference range by using the mean and SD from a sample instead of the true population values μ and σ.

Notation
We always use μ to denote the mean value of a continuous measurement in the population, and σ to denote the standard deviation. These quantities are called parameters.
In practice, we rarely know μ and σ, but estimate them by the statistics
sample mean: $m = \frac{1}{n}\sum_{i=1}^{n} x_i$
sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2}$

To construct a reference range we need:
• A representative sample of a reasonable size from the healthy population
• To check that the variable is normally distributed: for many serum constituents, the log transform is normally distributed (use a histogram)
If we are satisfied that we have a normal distribution, then we proceed to calculate the sample mean (m) and the sample standard deviation (s), from which the approximate 95% reference range follows.

Example of constructing a reference range
30 healthy male hospital staff have their level of AAP (alanine aminopeptidase) measured, giving a mean of 1.05 and a standard deviation of 0.32. Assuming a normal distribution for AAP (should check!) we would expect approx. 95% of healthy males to have AAP between 1.05 ± 1.96(0.32) = 0.42 to 1.68 ⇒ a value higher than 1.68 may suggest diabetes.
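The multipliers 1.96 and 2.58 and the AAP calculation above can be reproduced with Python's standard library. This is a minimal sketch, assuming the values really are normally distributed; `z_for_coverage` and `reference_range` are illustrative names, not part of the course material. Carried through without intermediate rounding, the AAP limits come out at about 0.42 and 1.68.

```python
from statistics import NormalDist


def z_for_coverage(coverage):
    """SDs either side of the mean that enclose `coverage` of a normal population."""
    # A central area of `coverage` leaves (1 - coverage) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + coverage / 2)


def reference_range(m, s, coverage=0.95):
    """Approximate reference range m ± z*s from a sample mean m and sample SD s."""
    z = z_for_coverage(coverage)
    return m - z * s, m + z * s


print(round(z_for_coverage(0.95), 2))  # 1.96
print(round(z_for_coverage(0.99), 2))  # 2.58

# AAP example: sample mean 1.05, sample SD 0.32
low, high = reference_range(1.05, 0.32)
print(f"95% reference range: {low:.2f} to {high:.2f}")  # 0.42 to 1.68
```

Passing `coverage=0.99` gives the wider, more specific range based on 2.58 SDs.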
In general, the approximate 95% reference range is $m \pm 1.96\,s$.

Criticisms of this example:
• 30 is not such a big sample
• Hospital staff may not be representative of the healthy population

Reference intervals: some cautionary remarks
• Must consider if the sample is representative of our population (e.g. kits manufactured in a different country?)
• Have age, sex and other differences been considered?
• Sometimes a one-sided cut-off is of interest, and sometimes an interval (i.e. we use one tail vs two tails of the normal distribution)

Estimating a population mean
Suppose we are interested in estimating the average height of Swedish citizens. We select a number of Swedish citizens and measure their height. What is our best guess of the average height? The mean of our sample: m.
But this might have been different in another sample, so how can we quantify the uncertainty this creates? Create a confidence interval!

Estimating a population proportion
Suppose now we are interested in estimating the proportion of Swedish citizens that are taller than 180 cm. Our best guess is:
p = (number of individuals > 180 cm) / (number of individuals in sample)
How do we quantify the uncertainty in this estimate? Again with a 95% confidence interval.

Confidence Intervals
Statistical theory shows that 95% of the time the sample mean will fall within $\pm 1.96\,\sigma/\sqrt{n}$ of the true population mean μ.
So, if you take your sample mean m and add and subtract $1.96\,\sigma/\sqrt{n}$, you have a 95% chance of "capturing" μ in this interval (called a 95% Confidence Interval).
But we rarely know σ, so how can we find a 95% CI? In practice the interval is computed as
$\left(m - 1.96\,\frac{s}{\sqrt{n}},\; m + 1.96\,\frac{s}{\sqrt{n}}\right)$
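A confidence interval for a mean can be computed from three summary numbers. The sketch below plugs the sample SD s in for σ and, purely as an illustration, reuses the AAP summary statistics (m = 1.05, s = 0.32, n = 30); the function name `mean_ci_95` is hypothetical. With n = 30 the large-sample approximation is only rough.

```python
from math import sqrt


def mean_ci_95(m, s, n):
    """Approximate 95% CI for a population mean: m ± 1.96 * s / sqrt(n)."""
    se = s / sqrt(n)  # standard error of the sample mean
    return m - 1.96 * se, m + 1.96 * se


# Illustration only: summary statistics from the AAP example.
low, high = mean_ci_95(1.05, 0.32, 30)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")  # 0.94 to 1.16
```

This interval is much narrower than the reference range for the same data, because it describes uncertainty about the population mean and not the spread of individual values.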
Use s to estimate σ: this gives an approximate 95% CI, which is OK if the sample is large (n > 30).

For a proportion, the 95% confidence interval is given by:
$\left(p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}}\right)$

Standard Error (SE)
$\frac{s}{\sqrt{n}}$ is called the "standard error" of the sample mean.
$\sqrt{\frac{p(1-p)}{n}}$ is called the "standard error" of the sample proportion.
95% Confidence Interval ≅ point estimate ± 2 standard errors

Interpreting confidence intervals (see course pack, page )
We are 95% confident that the overall incidence rate in M13-M21 is between 1.59% and 3.05%.
• If someone claims the overall rate is "more than 1%", we would accept (the whole interval lies above 1%)
• If they claim that the overall rate is 3%, we would accept (3% lies inside the interval)
• If they claim that the overall rate is 5%, we would reject (5% lies outside the interval)
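The proportion interval and the accept/reject reasoning can be sketched in a few lines. The sample counts below (40 of 200 people taller than 180 cm) are hypothetical and chosen only to show the arithmetic; the incidence interval 1.59% to 3.05% is the one quoted in the text. The function names are illustrative.

```python
from math import sqrt


def proportion_ci_95(successes, n):
    """Approximate 95% CI for a proportion: p ± 1.96 * sqrt(p(1 - p) / n)."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return p - 1.96 * se, p + 1.96 * se


def claim_is_plausible(claimed_value, ci):
    """A claimed value is accepted if it lies inside the confidence interval."""
    lower, upper = ci
    return lower <= claimed_value <= upper


# Hypothetical sample: 40 of 200 people are taller than 180 cm.
low, high = proportion_ci_95(40, 200)
print(f"95% CI for the proportion: {low:.3f} to {high:.3f}")  # 0.145 to 0.255

# Interval quoted in the text for the overall incidence rate.
incidence_ci = (0.0159, 0.0305)
print(claim_is_plausible(0.03, incidence_ci))  # True: 3% is inside
print(claim_is_plausible(0.05, incidence_ci))  # False: 5% is outside
print(incidence_ci[0] > 0.01)                  # True: whole interval above 1%
```

A one-sided claim such as "more than 1%" is checked against the lower limit, while a claim of a specific value is checked against both limits.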