BMS 617 Statistical Techniques for the Biomedical Sciences
Lecture 3: Types of Variable and Scatter

Types of Variable
Understanding the type of variable with which you are working is important.
* Type of variable determines which arithmetic operations make sense.
* Helps determine which tests are appropriate for hypothesis testing.

Determining the type of variable
To determine the type of variable we are using, we ask the following questions:
* Is there an ordering for values of the variable?
* If there is an ordering, is there a scale? That is, does an increase of one unit always mean the same thing?
* If there is an ordering and a scale, does the value zero have a specific meaning?
Additionally, we ask if the variable is continuous or discrete.
* Continuous means it takes on any value, including fractional values.
* Discrete means it takes on only specific, disjoint values.

Nominal Variables
Nominal variables are those whose values have no ordering. Just qualitative categories. Cannot be continuous.
Examples:
* Gender: values are "Male", "Female".
* Race: values are "Black", "White", "Asian", "Native American", etc.

Ordinal Variables
Ordinal variables are variables with qualitative categories which have an ordering, but no scale.
Example: Economic status. Values are typically stated as "Low", "Medium", or "High", which are computed using a number of factors (income, education level, occupation, wealth).
* These are ordered because there is a natural ordering low → medium → high.
* They have no scale because the difference between low and medium is not necessarily the same as the difference between medium and high.

Interval Variables
Interval variables are variables with ordering and scale, but with no meaningful zero.
Example: Temperature in Celsius or Fahrenheit.
* There is a scale, because a difference of one degree means the same thing, no matter what the starting temperature is.
* However, the choice of a zero value is essentially arbitrary.
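
The arbitrariness of the zero point can be checked numerically. The Python sketch below is only an illustration: the helper name `c_to_f` and the two example temperatures are our own choices. It converts two Celsius readings to Fahrenheit and compares their difference and their ratio on each scale.

```python
def c_to_f(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


low_c, high_c = 50.0, 100.0
low_f, high_f = c_to_f(low_c), c_to_f(high_c)

# Differences carry over between scales: the same change in temperature
# is simply rescaled by the fixed factor 9/5.
diff_c = high_c - low_c
diff_f = high_f - low_f

# Ratios do not carry over: the value depends on where each scale puts zero.
ratio_c = high_c / low_c
ratio_f = high_f / low_f

print(f"{low_c} °C = {low_f} °F and {high_c} °C = {high_f} °F")
print(f"difference: {diff_c} °C, equivalently {diff_f} °F")
print(f"ratio: {ratio_c:.2f} on the Celsius scale, {ratio_f:.2f} on the Fahrenheit scale")
```

The difference is the same physical change on both scales, but the apparent "doubling" on the Celsius scale is not a doubling on the Fahrenheit scale, which is why ratios are not meaningful for interval variables.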

Operations on interval variables
Computing differences of values of interval variables makes sense. For example, computing a change in temperature (the difference between two temperatures) makes sense, since a change of one unit (one degree) always means the same thing.
Computing ratios of values of interval variables does not make sense, because there is no meaningful zero value.
* Ratios of values are dimensionless: they have no units, and should be the same no matter what units we start in.
* 100°C is not double 50°C. These values are equal to 212°F and 122°F respectively.

Ratio Variables
Ratio variables have order, scale, and a meaningful zero. A meaningful zero means a value of zero indicates that none of the quantity is present.
* Temperature in Kelvin is a ratio variable. 0 K means no heat is present (it's physically unobtainable). 0°C and 0°F do not mean this.
* Example: blood pressure. Zero blood pressure means the blood is not being pushed around the body.

Operations on Ratio Variables
It makes sense to compute differences and ratios of ratio variables.
* A blood pressure of 120 is double the blood pressure of 60.
Note that the difference of values of an interval variable is always a ratio variable.
* For example, elapsed time (essentially the difference between two dates) is a ratio variable.

Examples
For each of the following, determine the type of the variable (Nominal, Ordinal, Interval, Ratio). Also determine whether it is continuous or discrete.

Variable                                        Type (N/O/I/R)   Continuous or Discrete
Tumor grade
Heart Rate
Number of Heart Attacks in patient's lifetime
Color
Weight

More Examples

Variable                                        Type (N/O/I/R)   Continuous or Discrete
Disease Status (affected or not affected)
Pain Scale
Age
Genotype
CT values (from real time qPCR)

Ambiguity in variable types
Determining the type of variable can depend on context, and/or on the measurement techniques used.
In a psychological experiment, patients are exposed to flash cards of various colors and activity in specific parts of the brain is measured.
Color here is (most likely) a nominal variable.
In a cosmological experiment, the colors of stars are observed and used (along with other data) to determine their relative speeds. Color here is measured by wavelength of light, and is a ratio variable.

Is Age a continuous or discrete variable?
Age is really a continuous (ratio) variable: it's the amount of time elapsed since birth. However, it is often collected as a discrete variable, by rounding down to a whole number of years.
* The imprecision in this rounding is usually insignificant, since effects of age tend to be more noisy than this loss of precision anyway.
* However, it is usually better to collect data on a subject's date of birth and subsequent dates of important events in the study: this way ages can be calculated to the number of days if required.
In statistical analysis, it is usually fine to treat age as a continuous variable, even when the measurement is rounded to a whole number of years. All continuous data is measured to a degree of precision, and the loss of precision becomes part of the noise. This is no different with age.

Graphing Continuous Data
The next sections of the course will focus on continuous data, or data that may be treated as continuous. We will begin with a discussion of the best ways to visualize and present data.

Summarizing Data
Often, experiments will collect more data than can reasonably be presented in a poster, presentation, or manuscript.
* If this is not the case, then present all the data!
* Typically, we collect data points in the range of dozens upwards (to trillions, in the case of sequencing experiments).
Data must be summarized for presentation and interpretation.

Aims of summarizing data
Summarized data may be presented textually (in a table) or graphically. A good summary should:
* Demonstrate what a "typical" value looks like.
* Demonstrate the extent to which values deviate from the "typical" value.
* Provide as much detail as is realistically possible.
* Clearly state how the summary was made.

Measures of central tendency
"Typical" values in a data set are identified by a measure of central tendency. Choosing the right measure is important.
* Mean
* Median
* Mode
All these are kinds of "Average".

Mean
The mean is the measure of central tendency most commonly understood by the word "average".
* Sum of all the values divided by the number of values.
* Since values are summed, mean only makes sense for interval and ratio data.
* The mean can be dramatically affected by extreme outliers.

Median
The median is the "middle" value.
* Computed by ordering all values and taking the middle one.
* Mean of the middle two if there are an even number of values.
* Not affected by a small number of outliers, no matter how extreme.
* A good measure for ordinal data.

Mode
The mode is the most common value. The French word mode means fashion.
* Value that occurs most often.
* Makes no sense for continuous data: if measured with enough precision, no value could occur more than once.
* The best measure of central tendency for nominal data.
* Does not always measure the "center" of the data.

Averages do not tell the story
Merely stating an average can be extremely misleading.
* The average human being has one breast and one testicle.
Example (simulated). Two patients have blood pressure measured every two hours from 6 a.m. to 10 p.m.

Patient                        A       B
Mean systolic blood pressure   115.3   119.6

Both patients appear healthy…

Example
… however, examine all the data:

Time      Patient A systolic b.p.   Patient B systolic b.p.
6 a.m.    144                       115
8 a.m.    108                       130
10 a.m.   92                        121
Noon      122                       118
2 p.m.    67                        122
4 p.m.    142                       120
6 p.m.    131                       113
8 p.m.    99                        122
10 p.m.   133                       115

Patient A no longer appears healthy. Need some way to summarize the variability in these measurements.
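
The two means quoted above can be reproduced with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names `patient_a`, `patient_b`, and `with_outlier` are our own, and the extra reading of 400 is a made-up value used only to illustrate how an outlier affects the mean and the median.

```python
from statistics import mean, median

# Simulated systolic blood pressure readings from the table above,
# taken every two hours from 6 a.m. to 10 p.m.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]

for name, readings in (("A", patient_a), ("B", patient_b)):
    print(f"Patient {name}: mean = {mean(readings):.1f}, "
          f"median = {median(readings)}, "
          f"min = {min(readings)}, max = {max(readings)}")

# Hypothetical illustration (not part of the lecture data): a single wildly
# wrong reading moves the mean substantially but barely moves the median.
with_outlier = patient_b + [400]
print(f"Patient B plus one bad reading of 400: "
      f"mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The means of the two patients are close, while the minimum and maximum already hint at how differently the readings are spread.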

Measures of Variability
* Range: just the minimum and maximum values in the data.
* Interquartile range: the range of the "middle half" of the data.
* Variance and/or standard deviation: a measure of the average deviation from the mean.
* Coefficient of variation: the standard deviation relative to the mean.

Range
Range is the simplest measure of variability.
* Just the minimum and maximum values.
* For our simulated blood pressure data, it already gives a good clue as to what is happening.

Systolic blood pressure   Patient A   Patient B
Mean                      115.3       119.6
Range                     67-144      113-130

* Very susceptible to outliers.
* One bad reading can completely change the range.

Interquartile Range
Similar philosophy to the median.
* Order the values in the data set.
* Find the 25th percentile and the 75th percentile: the values ¼ and ¾ of the way along the ordering.
* The difference is the interquartile range.
The interquartile ranges for the patients in our blood pressure example are 34 and 7.
* Verify this!

Standard Deviation
Standard deviation is the most commonly used measure of variability.
* Intuitively, it measures the average difference between each data point and the mean.
* Gives a sense of the average spread of the data.

Computing the standard deviation
The formula for the standard deviation is given by

$$ \mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}} $$

* $Y_i$ represents each data point.
* $\bar{Y}$ is the mean.
* $n$ is the number of data points.
Motulsky (p 73) has a good discussion of why n-1 is used instead of n.

Variance
Variance is just the square of the standard deviation.
* Useful quantity for performing some statistical tests we'll see later.
* Interpretation less intuitive than standard deviation.
* Units of standard deviation are the same as the units of the measurements.
* Units of variance are the square of the units of the measurements.

Systolic blood pressure   Patient A      Patient B
Mean ± sd                 115.3 ± 25.8   119.6 ± 5.1

Coefficient of Variation
The coefficient of variation (CV) is simply the standard deviation divided by the mean.
* Only makes sense for ratio variables (why?)
* CV has no units.
* Often presented as a percentage.
* Occasionally useful for comparing scatter in variables in unrelated units.

Graphing Data
We'll look at four ways of graphing our blood pressure data:
* Column Scatter Plot
* Box and Whisker Plot
* Column or Bar Chart
* Line Chart
In all these, it is important to show both a measure of central tendency (average) and a measure of variability.

Column Scatter Plot
A Column Scatter Plot plots all the data as individual points in a column.
* Rarely used.
* But very useful, for up to around 100 data points.
* Not much software support.

[Figure: Column Scatter Plot Example]

Box and Whisker Plot
A box and whisker plot shows the range, interquartile range, and median of the data set. A good choice when the median and interquartile range are good measures of central tendency and variation for your data.
* The median is marked with a horizontal line.
* The interquartile range is marked with a box.
* "Whiskers" extend to the full range of the data.
* A variation is for the whiskers to extend to most of the range, and outliers to be marked individually as points.

[Figure: Box and Whisker Plot Example]

Bar Chart
Bar charts use horizontal or vertical bars to demonstrate the mean of the data set. "Error bars" are used to show a measure of variability.
Some important considerations for bar charts:
* It is natural to look at the relative size of the bars in order to compare the relative values of the means.
* Therefore, bar charts should only be used with ratio data and should have the base of the bar at zero.
* There are various ways the error bars can be drawn (we will see later), so always clearly state what the error bars represent.

[Figure: Bar Chart Example]

Line Chart
A line chart is useful if the data points are ordered, and the ordering is important.
* For example, if we want to track the data over time.
* Like a column scatter plot, a line chart plots all the data.

[Figure: Line Chart Example]

Conclusion
When presenting data, choose a visualization which maximizes the information available.
* At a minimum, present an average and a measure of variation.
* Use a column scatter plot if possible.
* Use a box and whisker plot for interval variables, or if the median is a better measure than the mean.
* Use a line chart when the individual data points can be ordered and that ordering is important.
* Use a bar chart only for ratio data, and base the chart at zero.
* Clearly state what the error bars represent.
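
The measures of variability covered in this lecture can be checked against the simulated blood pressure data with a short Python sketch. It uses only the standard `statistics` module; the function name `summarize` and the variable names are our own. One assumption matters for the interquartile range: percentile conventions differ between tools, and `method="inclusive"` is chosen here because it takes the values exactly ¼ and ¾ of the way along the ordered data, as described in the lecture.

```python
from statistics import mean, quantiles, stdev, variance

# Simulated systolic blood pressure readings used throughout the lecture.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def summarize(readings):
    """Return the mean and the measures of variability for one data set."""
    # Assumption: the "inclusive" percentile convention. The default
    # ("exclusive") interpolates differently and can give other quartiles.
    q1, _, q3 = quantiles(readings, n=4, method="inclusive")
    sd = stdev(readings)  # sample standard deviation (n - 1 denominator)
    return {
        "mean": mean(readings),
        "range": (min(readings), max(readings)),
        "iqr": q3 - q1,
        "sd": sd,
        "variance": variance(readings),  # square of the standard deviation
        "cv_percent": 100 * sd / mean(readings),
    }


for name, readings in (("A", patient_a), ("B", patient_b)):
    s = summarize(readings)
    print(f"Patient {name}: mean ± sd = {s['mean']:.1f} ± {s['sd']:.1f}, "
          f"range = {s['range'][0]}-{s['range'][1]}, "
          f"IQR = {s['iqr']:g}, CV = {s['cv_percent']:.1f}%")
```

Reporting the mean together with the standard deviation, range, or interquartile range makes the difference between the two patients visible even though their means are similar.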