The Right Questions about Statistics
Maths Learning Centre

The purpose of Statistics is to ANSWER QUESTIONS USING DATA
Know the type of question and you can choose what type of statistics...

Aim: DESCRIBE
Type of question: What's going on?
Examples:
o How many chapters do novels have?
o What possibilities are there for body temperature after a meal with or without chilli?
o What sort of relationship might the amount of sleep a student gets have with their grades?
o What sorts of things might be related to whether a person does volunteer work?
Type of Statistics: Descriptive statistics: graphs and basic numbers

Aim: DECIDE
Type of question: Yes or no?
Examples:
o Is the median number of chapters in a novel 20?
o Is your body temperature higher after a meal if it has chilli in it?
o Does getting more sleep affect a student’s grades?
o Are women more likely to participate in volunteer work than men?
Type of Statistics: Hypothesis tests (p-values)

Aim: ESTIMATE
Type of question: What's this number?
Examples:
o What is the median number of chapters in a novel?
o How much higher is your body temperature after a chilli meal compared to one without?
o On average, how much of an effect does 30 minutes more sleep have on a student’s grades?
o How much more (or less) likely is a woman to participate in volunteer work than a man?
Type of Statistics: Confidence intervals

Aim: PREDICT / EXPLAIN
Type of question: What's the formula?
Examples:
o How can I explain a person’s body temperature after a meal using their temperature before and the chilli content of the meal?
o How can I calculate a student’s grade based on their number of hours of sleep during semester?
o How can I use a person’s gender, age, income and religion to predict their chances of participating in volunteer work?
Type of Statistics: Modelling and regression

The purpose of Statistics is to ANSWER QUESTIONS USING DATA
Know more about your data and you can choose what statistical method...
HOW THE DATA IS COLLECTED
o what is done to the subjects?
o when is information recorded?
o how are the subjects chosen?

HOW MUCH DATA
o lots of things recorded per subject?
o lots of subjects?
o missing data?

VARIABLES IN THE DATA
o how to measure?
o what type?
o defining groups or measurements?
o what distribution?

By Dr David Butler © 2012 The University of Adelaide

DATA ENTRY
[Diagram: details recorded separately for each numbered subject, eg gender = M, age = 18, chilli = Y, temp = 37]
BECOMES...

     gender  age  chilli  temp
1    M       18   Y       37
2    M       25   N       36
3    F       19   Y       38
4    F       21   N       35

Statisticians say: "PLEASE make it consistent!"

TYPES OF VARIABLES (things you record)
NUMERICAL: Numerical / Quantitative / Scale (numbers: how far apart has meaning)
o Continuous (measured)
o Discrete (counted)
CATEGORICAL: Categorical / Qualitative (words: how far apart has no meaning)
o Nominal (names: more or less has no meaning)
o Ordinal (ordered: more or less has meaning)

DISTRIBUTIONS OF NUMERICAL VARIABLES (how the possible values are spread out)
o Approximately normal – parametric tests will be fine
o Skewed or worse – non-parametric tests might be better

WHAT EXPLANATORY CATEGORICAL VARIABLES DEFINE:

Independent Groups
[Diagram: four subjects, two with chilli = Y and two with chilli = N, each with one temp measurement]
BECOMES...

     chilli  temp
1    Y       38
2    Y       36
3    N       37
4    N       36

OR

Repeated Measures (matched pairs)
[Diagram: four subjects, each measured twice, once with chilli = Y and once with chilli = N]
BECOMES... a table with one row per subject and two temp columns, one for chilli = Y and one for chilli = N.

HOW HYPOTHESIS TESTING WORKS
A hypothesis test is designed to DECIDE the answer to a YES OR NO question using DATA.
This is how to do a hypothesis test:
1. Have a yes-or-no question.
2. Collect data.
3. Calculate a test statistic.
4. Figure out the distribution if you assume a particular answer.
5. Calculate a p-value.
6. Decide the answer based on the p-value.

This is what a hypothesis test means:
o It tells you if your data is likely or unlikely given a particular situation (the “null hypothesis”).
o A low p-value means your data is unlikely and you don’t believe you’re in that situation.
o A high p-value means your data is likely and you do believe you could be in that situation.

HOW CONFIDENCE INTERVALS WORK
A confidence interval is designed to give a RANGE of possible answers for a “WHAT’S THE NUMBER?” question, using DATA from a sample.

This is how to find a confidence interval:
1. Have a “what’s the number?” question.
2. Collect data.
3. Choose a matching hypothesis test.
4. Work backwards to calculate two ends.
5. The confidence interval is between these two values.

This is what a confidence interval means:
o The values in the CI would be retained with a matching hypothesis test.
o The values in the CI have a high chance of producing data like yours.
o The values in the CI are those you are “happy to believe” based on your data.

HOW REGRESSION WORKS
Regression is a method designed to create a FORMULA that uses some information to PREDICT/EXPLAIN an outcome, using DATA.

This is how to perform regression:
1. Have a “what’s the formula?” question.
2. Collect data.
3. Look at the pattern – usually with a scatterplot – to choose a formula.
4. Get a computer to calculate the numbers and p-values.
5. Check the p-values.
6. Choose your final formula.

This is what regression means:
o It tells you a formula for how an outcome varies based on other information.
o It does NOT tell you if some things CAUSE others, only how to calculate them as accurately as possible.
o The computer output will tell you p-values and confidence intervals to answer other types of questions.

More details:

DESCRIBING A RELATIONSHIP:
o Scatterplot describes relationship – and helps choose a good formula.
o Correlation coefficient (r) measures how strong a linear relationship is. Ranges from -1 (perfect negative) to 0 (no relationship) to 1 (perfect positive). Ignores how steep the slope is, only says how close to a line.

FINDING AND INTERPRETING THE FORMULA:
o Computer program will use the data to find the numbers that make the formula fit best.
o The coefficient says how much the outcome changes (on average) for a change of 1 in the explanatory variable.

LOOKING AT P-VALUES:
o The p-value that goes with the F-statistic in the ANOVA table tells you whether all the variables at once have a relationship with the outcome. Low p-value means the relationship is “significant”.
o The p-value for each coefficient tells you whether that explanatory variable appears to have a relationship with the outcome. Low p-value means the effect is “significant”.

LOOKING AT CONFIDENCE INTERVALS:
o The confidence interval that goes with an explanatory variable tells you how large or small the real effect could be.

NOTE: Regression has assumptions that must be checked in order to use it properly, especially if you plan to use the p-values and confidence intervals.

Turning a research question into a statistical question

ORIGINAL QUESTION:
o ABOUT ONE CONCEPT
o ABOUT RELATIONSHIPS BETWEEN CONCEPTS

TYPE OF QUESTION:
o DESCRIBE – what’s going on?
o DECIDE – yes or no?
o ESTIMATE – what’s this number?
o PREDICT/EXPLAIN – what’s the formula?

TYPES OF VARIABLES:
Each Concept BECOMES... a Variable: CATEGORICAL OR NUMERICAL

WHAT EXPLANATORY CATEGORICAL VARIABLES DEFINE:
Independent Groups OR Repeated Measures (matched pairs)

DISTRIBUTION OF OUTCOME NUMERICAL VARIABLE:
[Diagram: three possible distribution shapes, separated by OR]
Note: This probably doesn’t matter if you have a lot of data.

STATISTICAL QUESTION:
o eg: DESCRIBE one NUMERICAL variable
o eg: DECIDE about a CATEGORICAL variable (Independent Groups) and a NUMERICAL variable

Note: In the list below, the outcome variables are usually assumed to be normal.

Statistical methods for statistical questions

Variable: NUMERICAL
DESCRIBE:
o Numbers: Mean & standard deviation (or median & IQR).
o Graphs: Histogram / Boxplot.
DECIDE:
o “Is the mean equal to #?” – one sample t-test.
o “Is the median equal to #?” – sign test.
ESTIMATE:
o “What is the mean?” – confidence interval for a mean.

Variable: CATEGORICAL
DESCRIBE:
o Numbers: Table of percentages or proportions.
o Graphs: Histogram.
DECIDE:
o “Is this percentage equal to #?” – z-test for a single proportion.
o “Are percentages distributed according to #, #, #?” – chi-squared test for goodness of fit.
ESTIMATE:
o “What is this percentage?” – confidence interval for a proportion.

Explanatory: CATEGORICAL (2 categories), Independent Groups. Outcome: NUMERICAL.
DESCRIBE:
o Numbers: Means & standard deviations for each group (or medians & IQRs for each category).
o Graphs: Histograms on same scale / side-by-side boxplots.
DECIDE:
o “Are the means equal?” – unpaired t-test (or Mann-Whitney U-test, also called the Wilcoxon rank-sum test).
ESTIMATE:
o “What is the difference between the means?” – confidence interval for the difference in means.

Explanatory: CATEGORICAL (2 categories), Repeated Measures. Outcome: NUMERICAL.
DESCRIBE:
o Numbers: Mean & standard deviation of differences between measurements.
o Graphs: Histogram of the differences between measurements.
DECIDE:
o “Is there a mean difference?” – paired t-test (or Wilcoxon signed ranks test).
ESTIMATE:
o “What is the mean difference?” – confidence interval for the mean difference.

Explanatory: CATEGORICAL (any# categories), Independent Groups. Outcome: NUMERICAL.
DESCRIBE:
o Numbers: Mean & standard deviation of each group.
o Graphs: Histograms/boxplots on the same scale. Line graph showing mean of each group.
DECIDE:
o “Are the means equal?” – one-way analysis of variance (ANOVA) with post-hoc t-tests (or Kruskal-Wallis test).
ESTIMATE:
o “What are the differences between means?” – confidence intervals for each difference in means.

Explanatory: CATEGORICAL (any# categories), Repeated Measures. Outcome: NUMERICAL.
DESCRIBE:
o Graphs: Line graph for each subject showing changing value of variable.
DECIDE:
o “On average, does the value change for each person across categories?” – repeated measures ANOVA with post-hoc paired t-tests / mixed effects regression.
ESTIMATE:
o “What are the mean differences between categories?” – confidence intervals for mean differences.

Explanatory: CATEGORICAL (2 categories), Independent Groups. Outcome: CATEGORICAL (2 categories).
DESCRIBE:
o Numbers: Two-way table of counts or %s. Odds ratios.
o Graphs: Histogram for each explanatory category.
DECIDE:
o “Is the outcome just as likely for both explanatory categories?”, “Are the two variables associated?” – chi-squared test for independence (small amount of data: Fisher’s exact test).
ESTIMATE:
o “How much more likely is the outcome in this category?” – confidence interval for difference in proportions, confidence interval for odds ratio.
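As a rough illustration of the two-category case above, the sketch below runs a chi-squared test for independence on a 2×2 table and computes an odds ratio with an approximate 95% confidence interval. The counts are hypothetical (invented for the volunteer-work example, not taken from any study), the function names are made up for this sketch, no continuity correction is applied, and the interval uses the common log-odds approximation, which is only one of several possible methods.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for independence on the table [[a, b], [c, d]].

    Returns (chi2, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom, where P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI based on the log odds ratio."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high


# Hypothetical counts: rows are women / men, columns are volunteers / non-volunteers.
chi2, p = chi_squared_2x2(30, 70, 20, 80)
ratio, low, high = odds_ratio_ci(30, 70, 20, 80)
print(f"chi-squared = {chi2:.2f}, p-value = {p:.3f}")
print(f"odds ratio = {ratio:.2f}, 95% CI from {low:.2f} to {high:.2f}")
```

With these made-up counts the p-value is above 5% and the confidence interval for the odds ratio contains 1, so the DECIDE and ESTIMATE answers agree: the data do not rule out "no association".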
Explanatory: CATEGORICAL (2 categories), Repeated Measures. Outcome: CATEGORICAL (2 categories).
DESCRIBE:
o Numbers: Two-way table of counts or %s.
o Graphs: Histogram for each explanatory category.
DECIDE:
o “Is the outcome just as likely for both explanatory categories?” – McNemar’s test.
ESTIMATE:
o “How much more likely is the outcome in one category compared to the other?” – confidence interval for difference in proportions.

Explanatory: CATEGORICAL (any# categories), Independent Groups. Outcome: CATEGORICAL (any# categories).
DESCRIBE:
o Numbers: Two-way table of counts or %.
o Graphs: Histogram for each explanatory category.
DECIDE:
o “Do the percentages in the outcome change across the explanatory categories?”, “Are the two variables associated?” – chi-squared test for independence.

Explanatory: CATEGORICAL (any# categories), Repeated Measures. Outcome: CATEGORICAL (2 categories).
DESCRIBE:
o Numbers: Two-way table of counts or %.
o Graphs: Histogram for each explanatory category.
DECIDE:
o “Do the percentages in the outcome change across the explanatory categories?”, “Are the two variables associated?” – Cochran’s Q-test.

Explanatory: NUMERICAL. Outcome: NUMERICAL.
DESCRIBE:
o Numbers: Correlation coefficient (r).
o Graphs: Scatterplot.
DECIDE:
o “Does a relationship exist?” – linear regression: t-test on coefficient.
ESTIMATE:
o “How much does the output variable change when the explanatory variable changes?” – linear regression: confidence interval for slope.
PREDICT:
o “How can you calculate the output knowing the explanatory variable?” – linear regression formula: y = β0 + β1 x.
NOTE: May need to do a nonlinear regression if the scatterplot indicates a curved sort of relationship.

Explanatory: NUMERICAL. Outcome: CATEGORICAL (2 categories).
DESCRIBE:
o Numbers: Mean & standard deviation for each category of the outcome.
o Graphs: Histograms/boxplots on the same scale.
DECIDE:
o “Does the numerical variable have an effect on the chances of the outcome?” – unpaired t-test using the outcome to define the two groups.
ESTIMATE:
o “How much does a change in the numerical variable affect the chances of the outcome?” – logistic regression: confidence interval for odds ratio.
PREDICT:
o “How can you calculate the chances of the outcome knowing the value of the explanatory variable?” – logistic regression formula: log(odds of y) = β0 + β1 x.

Explanatory: CATEGORICAL (any# categories), Independent Groups. Outcome: Time to event (NUMERICAL). Possible missing data!
DESCRIBE:
o Numbers: Proportion reaching event at certain time (eg 5-year survival), median times to reach event.
o Graphs: Kaplan-Meier curve showing survival percentages.
DECIDE:
o “Is the time to reach the event the same in all groups?” – survival analysis: log-rank test.
ESTIMATE:
o “What is the difference in proportions reaching the end point at this particular time?” – confidence interval for the difference in proportions.
o “How much more at risk of the event is this group than this group?” – Cox regression: confidence interval for relative hazard.

Explanatory: NUMERICAL and NUMERICAL. Outcome: NUMERICAL.
DESCRIBE:
o Graphs: Scatterplot for each explanatory variable with the outcome variable.
o Numbers: multiple linear regression: R² value.
DECIDE:
o “Does a relationship exist with any of the variables at all?” – multiple linear regression: F-test.
o “Does a relationship exist with this variable, taking into account the others?” – multiple linear regression: t-test on one coefficient.
ESTIMATE:
o “How much does the output variable change when this explanatory variable changes?” – multiple linear regression: confidence interval for one slope.
PREDICT:
o “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2.
NOTE: This can be done for many explanatory variables.
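To make "the numbers that make the formula fit best" concrete, here is a minimal least-squares sketch for the formula y = β0 + β1 x1 + β2 x2. It solves the normal equations in plain Python. The data are hypothetical and were generated from y = 1 + 2 x1 + 3 x2 with no noise, purely so that the answer is known in advance. The function names are invented for this sketch, and it produces only the coefficients: the p-values and confidence intervals mentioned above would still come from statistical software.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_two_variable_regression(x1, x2, y):
    """Return [b0, b1, b2] minimising the squared error of y = b0 + b1*x1 + b2*x2."""
    rows = [(1.0, u, v) for u, v in zip(x1, x2)]
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data built from y = 1 + 2*x1 + 3*x2.
x1 = [0, 1, 2, 3, 4]
x2 = [1, 0, 2, 1, 3]
y = [4, 3, 11, 10, 18]
b0, b1, b2 = fit_two_variable_regression(x1, x2, y)
print(f"y = {b0:.2f} + {b1:.2f} x1 + {b2:.2f} x2")
```

Each fitted coefficient is read as described earlier: b1 is the average change in the outcome for a change of 1 in x1 when x2 is held fixed.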
Explanatory: NUMERICAL and CATEGORICAL (any# categories), Independent Groups. Outcome: NUMERICAL.
DESCRIBE:
o Graphs: Scatterplot of both numerical variables for each category.
o Numbers: multiple regression: R² value.
DECIDE: See above for multiple regression.
ESTIMATE: See above for multiple regression.
PREDICT: See above for multiple regression.
NOTE: This can be done for many explanatory variables of both types.

Explanatory: CATEGORICAL (any# categories), Independent Groups and CATEGORICAL (any# categories), Independent Groups. Outcome: NUMERICAL.
DESCRIBE:
o Graphs: Histogram for each combination of explanatory categories. Line graph showing mean of each group.
DECIDE:
o “Does a relationship exist with any of the variables at all?” – two-way ANOVA: F-test.
o “Does a relationship exist with this variable, taking into account the others?” – two-way ANOVA: F-test for one effect.
o Note: both can also be answered with multiple regression (see above).
PREDICT:
o “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2.

Explanatory: CATEGORICAL (any# categories), Independent Groups and CATEGORICAL (any# categories), Independent Groups. Outcome: CATEGORICAL (2 categories).
DESCRIBE:
o Graphs: Histogram for each combination of explanatory categories.
DECIDE:
o “Does a relationship exist with any of the variables at all?” – multiple logistic regression: chi-squared test for covariates.
o “Does a relationship exist with this variable, taking into account the others?” – multiple logistic regression: Wald test.
ESTIMATE:
o “How much does the chance of the outcome change when this explanatory variable changes?” – multiple logistic regression: confidence interval for odds ratio.
PREDICT:
o “How can you calculate the chances of the outcome knowing the explanatory variables?” – multiple logistic regression formula: log(odds of y) = β0 + β1 x1 + β2 x2.
NOTE: This can be done with many explanatory variables – even if some of them are numerical.

Explanatory: NUMERICAL and CATEGORICAL (any# categories), Repeated Measures. Outcome: NUMERICAL.
DESCRIBE:
o Numbers: multiple linear regression: R² value.
DECIDE:
o “Does a relationship exist with any of the variables at all?” – mixed effects regression: F-test.
o “Does a relationship exist with this variable, taking into account the others?” – mixed effects linear regression: t-test on one coefficient.
ESTIMATE:
o “How much does the output variable change when this explanatory variable changes?” – mixed effects regression: confidence interval for one coefficient.
PREDICT:
o “How can you calculate the output knowing the explanatory variables?” – mixed effects regression formula.
NOTE: “mixed effects” may also be called “random effects”.
NOTE: This can be done for many explanatory variables, of both types, and with a mixture of repeated-measures and independent-groups.

Explanatory: NUMERICAL and NUMERICAL (interacting). Outcome: NUMERICAL.
DECIDE:
o “Does one variable change the way the other affects the outcome?” – multiple linear regression: t-test on the interaction effect.
ESTIMATE:
o “How much does the second variable change the effect of the first on the outcome?” – multiple linear regression: confidence interval for the interaction effect.
PREDICT:
o “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2 + β12 x1x2.

Explanatory: NUMERICAL and CATEGORICAL (any# categories), Independent Groups (interacting). Outcome: NUMERICAL.
DESCRIBE:
o Graphs: Scatterplot for each category, showing line of best fit in each case.
DECIDE:
o “Does one variable change the way the other affects the outcome?” – Analysis of Covariance (ANCOVA) / multiple linear regression: t-test on the interaction effect.
ESTIMATE:
o “How much does the second variable change the effect of the first on the outcome?” – multiple linear regression: confidence interval for the interaction effect.
PREDICT:
o “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2 + β12 x1x2.
NOTE: This can be done for many explanatory variables of both types. ANCOVA refers specifically to the case where the interaction variable is categorical.

NOTE: There are many other methods dealing with more specific and difficult questions including (but definitely not limited to):
o “Does this variable affect the variance of the outcome?” – F-test for two variances
o “Do these variables affect this categorical outcome (which has several categories)?” – Multinomial regression
o “Does the data come from a normal distribution?” – Investigate normal quantile-quantile plot; Shapiro-Wilk test
o “To what degree do these two measuring systems agree?” – Intraclass correlation coefficient
o “What is the best cut-off for this measurement in order to say someone needs medical attention?” – ROC analysis
o “Do all these measurements vary together so that they could be considered as measuring some smaller number of underlying concepts?” – Factor analysis / Principal Component Analysis
o “Can the subjects be grouped into a few similar groups based on the similarity in their measurements?” – Cluster analysis
and so on ...
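Returning to the interaction formula y = β0 + β1 x1 + β2 x2 + β12 x1x2 used in the blocks above, the short sketch below shows what "one variable changes the way the other affects the outcome" means numerically: the effect of a change of 1 in x1 is β1 + β12 x2, so it depends on the value of x2. The coefficient values and function names are hypothetical, chosen only for illustration.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=3.0, b12=0.5):
    """Outcome from the interaction formula (coefficients are made-up defaults)."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


def slope_of_x1(x2, b1=2.0, b12=0.5):
    """Change in the outcome for a change of 1 in x1, at a fixed value of x2."""
    return b1 + b12 * x2


for x2 in (0, 4):
    change = predict(3, x2) - predict(2, x2)
    print(f"x2 = {x2}: moving x1 from 2 to 3 changes y by {change}, "
          f"formula slope = {slope_of_x1(x2)}")
```

If the interaction coefficient β12 were 0, the slope would be the same for every value of x2, which is the situation the t-test on the interaction effect checks for.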
SAMPLE SIZE CALCULATIONS

FOR HYPOTHESIS TESTS:
The following five things affect the sample size you need:

1. Which hypothesis test you plan to use
o Hypothesis test based on categorical outcomes (as opposed to numerical outcomes) → BIGGER sample size
o Hypothesis test uses independent groups (as opposed to repeated measures) → BIGGER sample size

2. Size of the difference you are looking for
Most hypothesis tests concern the differences between means or percentages. The difference you would like to see is often called:
o Clinically significant difference
o Practically significant difference
Choosing how big this difference is requires KNOWLEDGE OF YOUR AREA OF RESEARCH.
Looking for a SMALL DIFFERENCE → BIGGER sample size

3. Variability of the results
HIGH VARIABILITY means many options for what could happen in a sample of a particular size.
o eg: for the CHI-SQUARED TEST: very high or very low expected percentage → low variability; medium expected percentage → high variability
o eg: for t-tests or ANOVA: large standard deviation → high variability
You usually get this information from previous research or a pilot study.
HIGH VARIABILITY → BIGGER sample size

4. Significance level
The cut-off for saying when a p-value is significant. Usually 5%. Also known as α (alpha) or the “Type I Error rate”.
LOW SIGNIFICANCE LEVEL → BIGGER sample size

5. Power
The probability of getting a significant result if in fact there IS a difference in the population. Usually you set this at 80%. Power = 100% − Type II Error rate (the Type II Error rate is also known as β (beta)).
HIGH POWER → BIGGER sample size

[ Note that a high dropout rate also increases sample size ]

FOR CONFIDENCE INTERVALS:
Confidence intervals are related to hypothesis tests, so the considerations above are used for confidence intervals too.
NOTE: Significance level = 100% - Confidence Level (so for a 95% confidence interval, the significance level is 5%)
NOTE: The “difference you are looking for” is half the width of the confidence interval. Also known as the “margin of error”.

FOR REGRESSION:
Rule of thumb: at least 10 times as many subjects as there are explanatory variables. Proper calculations are based on the t-tests involved to see if slope is significant.
o eg: 2 explanatory variables (X1, X2) for outcome Y: at least 2×10 = 20 subjects
o eg: 5 explanatory variables (X1, X2, X3, X4, X5) for outcome Y: at least 5×10 = 50 subjects

SOME TERMINOLOGY:
o Type I Error: NO difference in the population BUT there IS a difference in the sample (the chance of this is the significance level, also known as alpha α)
o Type II Error: There IS a difference in the population BUT there is NO difference in the sample (the chance of this is beta β, which is 100% − power)

PERFORMING THE CALCULATIONS:
o Russ Lenth has created a comprehensive suite of online calculators: http://homepage.stat.uiowa.edu/~rlenth/Power
  You need all the information mentioned above in order to use the calculators.
o There are also simple formulas for the t-tests and chi-squared tests in Chapter 36 of “Medical Statistics at a Glance” by Aviva Petrie and Caroline Sabin.
  You need all the information mentioned above in order to use the formulas.
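To show how the five ingredients above combine, here is a minimal sketch of a sample size calculation for comparing two means with independent groups. It uses a widely quoted normal-approximation formula, n per group = 2 × ((z for significance + z for power) × standard deviation / difference)², which is an assumption of this sketch and is not necessarily what Lenth's calculators or the Petrie and Sabin chapter use. The function name, the dropout adjustment and the example numbers (a difference of 0.5 degrees with a standard deviation of 1 degree) are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_two_means(difference, sd, alpha=0.05, power=0.80, dropout=0.0):
    """Approximate subjects needed PER GROUP to compare two means.

    difference: the smallest difference worth detecting
    sd:         standard deviation of the outcome (from a pilot or past research)
    alpha:      significance level (two-sided)
    power:      chance of a significant result if the difference is real
    dropout:    expected fraction of subjects lost (assumed simple inflation)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n / (1 - dropout))


baseline = sample_size_two_means(difference=0.5, sd=1.0)
print("baseline:", baseline)
print("smaller difference:", sample_size_two_means(0.25, 1.0))
print("higher variability:", sample_size_two_means(0.5, 2.0))
print("lower significance level:", sample_size_two_means(0.5, 1.0, alpha=0.01))
print("higher power:", sample_size_two_means(0.5, 1.0, power=0.90))
print("20% dropout:", sample_size_two_means(0.5, 1.0, dropout=0.2))
```

Every variation prints a larger number than the baseline, matching the "BIGGER sample size" statements in the list above. A proper calculation based on the t-distribution gives a slightly larger answer than this approximation.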