Master class
Data, understanding it, interpreting it and using it.
Ruth Harrell
Liann Brookes-smith

Agenda
9.30am – 10.30am
10.30am break
10.45 – 11.30am
11.40 – 12.30pm
12.30 – 1.30pm lunch
1.30 – 2.30pm probability
2.30 – 2.45pm break
2.45 – 3.30pm sampling and curve
3.30 – 4.30pm confidence and risk

Introduction
Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~W.A. Wallis
“There are three kinds of lies: lies, damned lies, and statistics.” ~Disraeli (according to Mark Twain)
98% of all statistics are made up. ~Author unknown
Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~Aaron Levenstein
If you cannot measure it, it does not exist. ~Author unknown

Question to the Room
What are statistics?
Why are data important?
What do you feel about stats?
What do they tell us? E.g. 40% of children in XX area have dental caries: what does that tell us?
List the types of data you are aware of or use in your day-to-day work.

Practitioner competencies
Obtain, verify, analyse and interpret data and/or information to improve the health and wellbeing outcomes of a population / community / group, demonstrating:
a. knowledge of the importance of accurate and reliable data / information and the anomalies that might occur
b. knowledge of the main terms and concepts used in epidemiology and the routinely used methods for analysing quantitative and qualitative data
c. ability to make valid interpretations of the data and/or information and communicate these clearly to a variety of audiences

Aim for the day
The aim of the day is to improve people's understanding of the data they use, and of how to analyse and interpret it.
This session concentrates on the data rather than on things such as study design, but we are happy to discuss and answer questions on both; you can't understand what the data is telling you without understanding how it has been collected and the potential for bias.

Topics covered
1. Types of data
2. Basic probability and stats
3. Understanding how data is collected
4. Measures of odds and ratios: comparing populations and study results
5. Population sampling: good samples and bad samples
6. Understanding confidence intervals and p values: is the result reliable?
7. How I apply data to what I am doing

Types of data

Describing the data
We have a responsibility to present data in a way that can be easily understood, and which does not misrepresent the true meaning of the data.
Key decisions are made based on the data (or, more accurately, on people's impression of the data), so this has an impact on the use of resources and eventually on patient care.
Accurate analysis and presentation of the data saves lives!

Quantitative vs. Qualitative
Quantitative data measures quantity, i.e. it is numerical.
Qualitative data is usually more descriptive and not measured in numbers.
However, data originally obtained as qualitative information about individual items may give rise to quantitative data if they are summarised by means of counts.

Discrete – Continuous
Discrete data can only take certain particular values.
Continuous data falls on a scale.
For example, height is continuous, but the number of siblings is discrete.

Nominal – Ordinal
Nominal comes from the Latin nomen, meaning 'name', and is used to describe categorical data. There is no quantitative relationship between the different categories (though sometimes a number may be assigned for ease of analysis). An example is ethnicity.
Ordinal data again describes categories, but there is some order to them, though the relationship between them may not be well defined. An example is Agenda for Change pay scales: they are ordered and can therefore be put in sequence, but there is no numerical relationship between them.

Transforming the data
Sometimes the data you have isn't in the most effective form for display.
E.g. you have data on weight in kilograms.
A list of continuous weights is not intuitive, so you convert it (together with height) to BMI categories, i.e. those who are underweight, healthy weight, obese and morbidly obese.
This is a transformation from continuous to ordinal.

Transforming the data (2)
With this you can display more meaningful data, BUT:
You lose the detail, such as the number of people on the edge of each category (borderline).
You can't transform it back.
What you transform it to may not be the best use of the data.
You can also transform data using more complex calculations, such as taking the log of each number; this will sometimes convert skewed data to normally distributed data (discussed later).

Exercise
Exercise 1 and 2

Displaying the data
What are the options?
Tables: simple descriptive, cross tab… (mention pivot table)
Graphs: bar, line, x-y or scatter, pie chart…

Basic statistics and probability
Having looked at the raw data and carried out any transformations you felt necessary, you now want to describe the features of this data.
Distributions: plotting the data is the first step in this. You need to consider the shape of the graph before you know how best to analyse the data.

Types of graph: Normal

Types of graph: Skewed

Types of graph: Bimodal

Types of graph: Uniform

15 minute Break!

Data measures
Definitions:
Range: the difference between the highest and the lowest values in a set
Mean: the sum of all the measured values divided by the number of measures
Median: the middle measure when the values are put in order
Mode: the measure found most often
Interquartile range: a measure of statistical dispersion, equal to the difference between the upper and lower quartiles
Standard deviation: a measure of how spread out the numbers are
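
As a rough illustration of the definitions above, the sketch below computes each measure with Python's standard `statistics` module. The `visits` data set is invented for the example, and quartiles depend on the convention used, so other software may give a slightly different interquartile range.

```python
import statistics

# Hypothetical sample (an assumption for illustration):
# number of GP visits in a year for nine patients.
visits = [2, 3, 3, 4, 5, 5, 5, 8, 10]

value_range = max(visits) - min(visits)   # highest minus lowest value
mean = statistics.mean(visits)            # sum of values / number of values
median = statistics.median(visits)        # middle value once sorted
mode = statistics.mode(visits)            # most frequent value

# Quartiles cut the sorted data into four parts; IQR = upper - lower quartile.
# statistics.quantiles uses the "exclusive" convention by default.
q1, q2, q3 = statistics.quantiles(visits, n=4)
iqr = q3 - q1

# Sample standard deviation (divides by n - 1 before taking the square root).
std_dev = statistics.stdev(visits)

print(f"range={value_range}, mean={mean}, median={median}, mode={mode}")
print(f"IQR={iqr}, standard deviation={std_dev:.2f}")
```

Here the mean, median and mode all equal 5, which is what you would expect for roughly symmetric data; in skewed data they drift apart.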

Mean, median and mode
Mean = (sum of observations) / (number of observations)
Mode = the most common observation
Median = the number where 50% of observations are below and 50% are above

Standard Deviation and IQR
Std Dev = square root of [ sum of (squared difference between each observation and the mean) / (number of observations - 1) ]
IQR = the difference between the value at the 75th percentile and the value at the 25th percentile

Formulas
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$x_i$ is each observation
$n$ is the number of observations
$\sum$ means 'sum'

Exercise 3

Exercise 4

How reliable is my data?
Any data missing?
How old is it?
What is the denominator?
Who collected it?
How was it collected?
What are the ways to avoid making statements about inaccurate data?

Describing data

Interpret the graph
This graph shows the trend of obesity in adults from 1993 to 2007.
Percentage of what? (All adults presumed. All registered? All resident?)
What age is defined as an adult?
Is the increase due to chance, or is it an actual increase?
The data is quantitative/continuous.

Bias
When looking at data, sometimes the relationship we see is caused by the way in which we are measuring, not by what is actually there.

Fudging Rate or Number
You have 50 cases of COPD in area 1, and 150 cases of COPD in area 2. Should you do something in area 2?
Area 1 has a population of 2000; area 2 has a population of 5000.
In area 1 the rate in 50-74 year olds is 20/1000; in area 2 the rate in 50-74 year olds is 42/1000.
Area 1's data was from 2004; area 2's data was from 2005-2009.
Area 1 is 20/1000, confidence interval (12 – 48 per 1000); area 2 is 42/1000, confidence interval (18 – 56 per 1000).
Now what?

Exercise
Exercise 5
What do these data tell you? Key message?
What would you ask of these data?
What further information would you want to know?
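
The COPD example earlier ("Fudging Rate or Number") can be worked through as a short sketch. It turns the raw counts into crude rates per 1,000 and checks whether the two quoted confidence intervals overlap. The helper names `rate_per_1000` and `intervals_overlap` are made up for this illustration, and overlapping intervals are only a rough warning sign, not a formal significance test.

```python
def rate_per_1000(cases, population):
    """Crude rate: number of cases per 1,000 people in the population."""
    return 1000 * cases / population


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Raw counts and populations from the slide.
area1_rate = rate_per_1000(50, 2000)    # crude rate for area 1
area2_rate = rate_per_1000(150, 5000)   # crude rate for area 2

# Confidence intervals quoted on the slide for 50-74 year olds (per 1,000).
area1_ci = (12, 48)
area2_ci = (18, 56)
overlap = intervals_overlap(area1_ci, area2_ci)

print(f"Area 1 crude rate: {area1_rate} per 1,000")
print(f"Area 2 crude rate: {area2_rate} per 1,000")
print(f"Confidence intervals overlap: {overlap}")
```

Three times as many cases becomes 25 versus 30 per 1,000 once population size is taken into account, and the quoted intervals overlap heavily, so the case for acting in area 2 alone is much weaker than the raw numbers suggest.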

Basics of probability
Probability is a way of quantifying the judgements that we make all the time, from 'do I need an umbrella?' to 'shall I bet on that horse?'
Probability is measured on a linear scale of 0 to 1, where 0 is impossible and 1 is absolutely certain.

Probability
Why is probability relevant to public health?
Probability gives us a quantitative measurement of the chances of something happening, and there are 2 key ways in which it is used in public health:
It is another word for risk (or, if it has a positive impact, benefit). For example, the probability that someone who smokes cigarettes will get lung cancer has been shown to be much higher than for someone who doesn't smoke.
It helps us to answer the question 'how likely is it that the observed effect is due to our intervention and not just to chance?', and is used in all types of studies: testing medical treatments, evaluating the impact of public health interventions, assessing the need of one population compared to another.

Probability and risk
Odds: number of events divided by the number of non-events
Risk in exposed: number of events divided by the number of exposed
Risk in unexposed: number of events divided by the number of unexposed
Relative risk (or risk ratio): the ratio of the probability of the event occurring in the exposed group to that in the unexposed group
Absolute risk difference: the difference in risk between the exposed and the unexposed

Probability cont…
What is the probability of a 6 if you throw an unbiased dice?
What is the probability of a total of 6 if you throw two unbiased dice?

Welcome back!!
I'm not an outlier, I just haven't found my distribution yet.

Exercise
Exercise 6
Worse and early death = 0-3/10
No change = 4-5/10
Cure = 2-6/10

Population sampling (1)
In the real world we don't usually get data from everybody that we are interested in.
Why not?
The cost and resources needed may be too large
People may choose to opt in or out
We may have incomplete data (data entry problems etc.)

Population sampling (2)
So what we need to do is measure a sample of people and infer from that sample what the population looks like.
We can do this by tweaking the statistical formula used, but there are two things to consider:
If your sample size is too low you are unlikely to get a reasonable result. You can still use the formula, but you need to bear this in mind when interpreting it.
Think about who you have managed to sample: are they representative of the population? (Imagine walking into a large open plan office with a set of scales and asking people if they would mind being weighed. Who is more likely to volunteer?)

Population sampling (3)
If we have a REPRESENTATIVE sample, we can apply a statistical tweak to help us to estimate the figure for the population.
If we don't (if the sample is biased), we can still carry out the maths, but the result will always be flawed.

Population sampling (4)
Principle:
Measure your sample
Calculate the mean and standard deviation (of the sample)
Calculate the standard error = standard deviation of the sample / square root of n
To estimate the population mean, we say our best guess is that it is equal to the sample mean
Then we can use the standard error to estimate how close we think our estimate is.
First we need to talk about confidence intervals

Which one is an insult?
Darling, you are two standard deviations below the mean
Of course you're normal (mean 10, mode 7)
You are mean
Your looks are in the 80th percentile
The difference between you and her is a standard deviation

Probability, Population Sampling and the Normal Curve
Thinking about our data that fitted the normal curve: by using the mathematical model we can easily calculate probabilities.
The maths tells us that:
The total area under the normal curve is equal to 1.
The probability that any new observation will fall within one standard deviation of the mean is 68%
The probability that any new observation will fall within two standard deviations of the mean is 95%
The probability that any new observation will fall within three standard deviations of the mean is 99.7%

Examples

CERN experiments observe particle consistent with long-sought Higgs boson
Geneva, 4 July 2012.
“We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV. The outstanding performance of the LHC and ATLAS and the huge efforts of many people have brought us to this exciting stage,” said ATLAS experiment spokesperson Fabiola Gianotti, “but a little more time is needed to prepare these results for publication.”
At five-sigma there is only one chance in nearly two million that the result is wrong, i.e. the measurement seen is a random fluctuation.

Confidence intervals (1)
If we measure one individual's IQ we can be 95% sure that it would fall between 70 and 130 (IQ scores are scaled to a mean of 100 and a standard deviation of 15, so this is the mean ± 2 standard deviations).
This 'interval' is called the 95% confidence interval.
We use 95% by convention; sometimes other figures are used, such as 98%.
If we measure the heights of a class of children and we have a mean of 1.2m and a standard deviation of 0.1m, what is your estimate for the height of a child randomly selected from the sample?
1.2 ± 0.2, i.e. 95% of this sample lies between 1.0 and 1.4m

Confidence intervals (2)
Reminder: the heights of a class of children have a mean of 1.2m and a standard deviation of 0.1m.
We measure a new child and their height is 1.5m. What does this mean?
This is equal to the mean + 3 standard deviations.
This means we had less than a 0.5% chance of seeing this height in a child from this population.
That doesn't mean they are not part of the distribution (0.5% is not that rare), but it would be sensible to check a few things to be sure they are part of the same population (age!).
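
The normal-curve figures quoted above (68%, 95%, 99.7%) and the chance of meeting a 1.5m child can be checked with a short sketch. It uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) and assumes, as the slides do, that heights follow a normal curve with mean 1.2m and standard deviation 0.1m.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that an observation lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = prob_within(1)   # roughly 68%
within_2 = prob_within(2)   # roughly 95%
within_3 = prob_within(3)   # roughly 99.7%

# Class heights from the slides, assumed to be normally distributed.
heights = NormalDist(mu=1.2, sigma=0.1)

# Chance of a child being 1.5m or taller (mean + 3 standard deviations).
prob_taller = 1 - heights.cdf(1.5)

print(f"Within 1 SD: {within_1:.1%}")
print(f"Within 2 SD: {within_2:.1%}")
print(f"Within 3 SD: {within_3:.1%}")
print(f"Chance of a height of 1.5m or more: {prob_taller:.2%}")
```

The last figure comes out well under the 0.5% quoted on the slide, which is why a 1.5m child is worth a second look.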

Confidence intervals (3)
This time we are using confidence intervals to estimate our 'true' population characteristics based on a sample.
Best estimate of the population mean = measured mean of the sample
Standard error of the mean = standard deviation of the sample / square root of the number of measurements in the sample (n)
Therefore we can say that we are 95% confident that the mean of the population lies within the sample mean ± 2 × standard error
This implies that our estimate of the mean gets better as n increases, because our error gets smaller.
This is the way we usually use confidence intervals in public health, as we usually measure a sample and infer the population.
Examples: Health Survey for England, household surveys, etc.

You are a significant part of my life
P value = 9

I would never treat you differently to your sisters
Sister 1 CI 4-9
Sister 2 CI 5-11
Sister 3 CI 4-13
ME CI 2-3

Comparing two samples
The important question is: is there a difference between two populations?
This question might be asked in slightly different ways for different types of study, but is fundamentally the same:
For an RCT you compare the control group with the intervention group
For a cohort study you compare the outcomes in those exposed to a risk factor with those not exposed
For a case-control study you look at the group with the disease and compare their risk factors to those without the disease
You might look at before and after an intervention was put in place
You might compare one city or country to another

Comparing two samples (2)
The important question is: is there a difference between two populations?
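
The recipe in "Confidence intervals (3)" (sample mean, standard error, then mean ± 2 × standard error) can be sketched as follows. The `heights` sample is invented for the example and `standard_error` is a helper name chosen here; with a sample this small a slightly wider multiplier from a t table would normally be used, so treat the interval as approximate.

```python
import math
import statistics


def standard_error(sample_sd, n):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return sample_sd / math.sqrt(n)


# Hypothetical sample (an assumption for illustration): heights in metres.
heights = [1.05, 1.10, 1.15, 1.18, 1.20, 1.22, 1.25, 1.27, 1.30, 1.38]

n = len(heights)
sample_mean = statistics.mean(heights)   # best estimate of the population mean
sample_sd = statistics.stdev(heights)    # sample standard deviation (n - 1)
se = standard_error(sample_sd, n)

# Approximate 95% confidence interval: mean +/- 2 standard errors.
lower = sample_mean - 2 * se
upper = sample_mean + 2 * se

print(f"Sample mean: {sample_mean:.2f}m, standard error: {se:.3f}m")
print(f"Approximate 95% CI for the population mean: {lower:.2f}m to {upper:.2f}m")

# The estimate improves as n grows: quadrupling n halves the standard error.
se_100 = standard_error(0.1, 100)
se_400 = standard_error(0.1, 400)
print(f"SE with n=100: {se_100:.4f}, SE with n=400: {se_400:.4f}")
```

Because the standard error shrinks with the square root of n, halving the width of the interval takes four times as many measurements.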

Comparing two samples (3)
We can calculate the difference between the two populations as:
Mean difference = mean of pop 1 – mean of pop 2
Confidence interval = mean difference ± 1.96 × SE
SE (standard error) combines the standard errors of the two samples, where $s_1$ and $s_2$ are the standard deviations of each sample and $n_1$ and $n_2$ are the sample sizes:
$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
(SE can be slightly different for different situations, but this gives you an idea.)

T tests
Testing using a t test:
You need to know the mean and standard deviation of both of your samples.
You start with a hypothesis: that there is no difference between the two samples (or populations).
You then do some maths:
t = (mean of sample 1 – mean of sample 2) / SE
where $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, with $s_1$ and $s_2$ the standard deviations of the two samples

T tests (2)
So what does t mean?
t is the horizontal axis of a normal distribution with mean = 0 and standard deviation = 1 (a good approximation for large samples)
You can read the probability of the two samples coming from the same population from a table of t values
Most important value: if t > 1.96 (ignoring its sign), then the probability of the two samples coming from the same population is < 0.05 (5%). This suggests that they are fundamentally different
By convention, we reject the null hypothesis if p < 0.05
It's good practice to quote the p value, e.g. p = 0.01

T tests (3)
What do these results mean?
Mean difference = 0, with 95% confidence interval (-1.0, +1.0), p = 0.50
Mean difference = 0.5, with 95% confidence interval (0.1, 0.9), p = 0.049
Mean difference = 1, with 95% confidence interval (-0.1, +1.1), p = 0.055
Mean difference = 1, with 95% confidence interval (0.2, +1.8), p = 0.02

Risk differences
Same principle: the null hypothesis is that there is no difference
For no difference, the 95% confidence interval would include 0
If it does not include 0, then you can be 95% confident that there is a risk difference.
You can also quote a p value
Example: the risk difference for having a heart attack in the placebo group compared to the intervention group was 2%, with a 95% confidence interval of (1.5% to 2.4%), p = 0.02
Would you take the intervention?

Risk differences (2)
You can also calculate the number needed to treat (NNT) from this
NNT is the number of people you need to treat to prevent one event from occurring
Example: the risk difference for having a heart attack in the placebo group compared to the intervention group was 2%, with a 95% confidence interval of (1.5% to 2.4%), p = 0.02
If you treat 100 people you avoid 2 heart attacks.
NNT = 1 / 0.02 = 50

Risk ratio
A relative measure of risk, very commonly used
Same principle: the null hypothesis is that there is no difference IN THE RATIO OF RISKS
For no difference, the 95% confidence interval would include 1
Why 1 this time? Because if both groups had the same risk, the ratio would be 1
If it does not include 1, then you can be 95% confident that there is a difference in risk.
You can also quote a p value

Odds ratio
A relative measure of risk, very commonly used
Very similar to the risk ratio
Used for certain types of study, and as the result of some calculations
For no difference, the 95% confidence interval would include 1
If it does not include 1, then you can be 95% confident that there is a difference.
You can also quote a p value

Examples
Meta-analysis of the 5 prospective cohort studies (86,092 patients) indicated that individuals with periodontal disease had a 1.14 times higher risk of developing CHD than the controls (relative risk 1.14, 95% CI 1.074-1.213, P < .001)
The risk of VTE was 2.33 for obesity (95% CI, 1.68 to 3.24), 1.51 for hypertension (95% CI, 1.23 to 1.85), 1.42 for diabetes mellitus (95% CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95 to 1.46), and 1.16 for hypercholesterolemia (95% CI, 0.67 to 2.02).

In summary
Your boss says: “Do we need a weight loss service for kids in XXX area?”
1. You collect data: what is the definition of “kids”, is this data accurate, how was it collected, what year is it from?
2. Compare the areas: are you much different, and is there an underlying reason?
3. Is this value statistically significant?

In summary (2)
You look at a service elsewhere (from evidence)
You ask yourself: who was included in this sample, and are they different to my population?
Looking at the odds, what proportion of kids will this work on?
Look to see if the test group was biased compared to the control group
Were the results normally distributed, skewed or other?

In summary (3)
Were the results significantly different between the two groups?
Can you rely on these findings?
You have just:
Found the need
Evaluated its accuracy
Reviewed a solution
Looked at effectiveness
WELL DONE!!!

Useful websites
Basic maths and probability
http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm
Tutorials on statistics
http://www.stattrek.com/tutorials/statistics-tutorial.aspx