BMS 617
Statistical Techniques for the Biomedical Sciences
Lecture 3: Types of Variable and Scatter
Types of Variable
Understanding the type of variable with which you are working is important.
Type of variable determines which arithmetic operations make sense
Helps determine which tests are appropriate for hypothesis testing
Determining the type of variable
To determine the type of variable we are using, we ask the following questions:
Is there an ordering for values of the variable?
If there is an ordering, is there a scale?
That is, does an increase of one unit always mean the same thing?
If there is an ordering and a scale, does the value zero have a specific meaning?
Additionally, we ask if the variable is continuous or discrete.
Continuous means it can take on any value within a range, including fractional values.
Discrete means it takes on only specific, disjoint, values.
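The three questions above can be read as a small decision procedure. The following is a minimal sketch in Python, not part of the original lecture: the function name `variable_type` and its boolean parameters are hypothetical, and the returned labels are the four variable types defined in the slides that follow.

```python
def variable_type(has_ordering: bool, has_scale: bool, has_meaningful_zero: bool) -> str:
    """Classify a variable by asking the three questions in order."""
    if not has_ordering:
        return "nominal"    # no ordering: just categories
    if not has_scale:
        return "ordinal"    # ordered, but steps are not comparable
    if not has_meaningful_zero:
        return "interval"   # ordered with a scale, arbitrary zero
    return "ratio"          # ordered, scaled, and zero means "none"


# Hypothetical answers for variables discussed in this lecture.
print(variable_type(False, False, False))  # nominal (e.g. gender)
print(variable_type(True, False, False))   # ordinal (e.g. economic status)
print(variable_type(True, True, False))    # interval (e.g. temperature in Celsius)
print(variable_type(True, True, True))     # ratio (e.g. blood pressure)
```

Whether the variable is continuous or discrete is a separate question and is deliberately left out of this sketch.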
Nominal Variables
Nominal variables are those whose values have no ordering.
Just qualitative categories.
Cannot be continuous.
Examples:
Gender
Values are "Male", "Female"
Race
Values are "Black", "White", "Asian", "Native American", etc…
Ordinal Variables
Ordinal variables are variables with qualitative categories which have an ordering, but no scale.
Example: Economic status
Values are typically stated as "Low", "Medium", or "High", which are computed using a number of factors (income,
education level, occupation, wealth).
These are ordered because there is a natural ordering low → medium → high.
They have no scale because the difference between low and medium is not necessarily the same as the difference between
medium and high.
Interval Variables
Interval variables are variables with ordering and scale, but with no meaningful zero
Examples: Temperature in Celsius or Fahrenheit
There is a scale, because a difference of one degree means the same thing, no matter what the starting temperature is.
However, the choice of a zero value is essentially arbitrary.
Operations on interval variables
Computing differences of values of interval variables makes sense.
For example, computing a change in temperature (difference between two temperatures) makes sense, since a change of one
unit (one degree) makes sense.
Computing ratios of values of interval variables does not make sense, because there is no meaningful zero value.
Ratios of values are dimensionless
Have no units
Should be the same no matter what units we start in.
100°C is not double 50°C
These values are equal to 212°F and 122°F respectively, and 212 is not double 122.
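The point about ratios can be checked numerically. This is a small illustrative sketch (the function name `celsius_to_fahrenheit` is our own): the ratio of two temperatures changes when the units change, while a difference of a fixed number of degrees converts the same way regardless of the starting temperature.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Standard Celsius-to-Fahrenheit conversion."""
    return c * 9 / 5 + 32


hot_c, warm_c = 100.0, 50.0
hot_f = celsius_to_fahrenheit(hot_c)    # 212.0
warm_f = celsius_to_fahrenheit(warm_c)  # 122.0

# The "ratio" depends on the units, so it is not a meaningful quantity.
ratio_c = hot_c / warm_c
ratio_f = hot_f / warm_f
print(ratio_c, round(ratio_f, 3))  # 2.0 1.738

# Differences behave consistently: 10 Celsius degrees is always
# 18 Fahrenheit degrees, whatever the starting temperature.
diff_from_0 = celsius_to_fahrenheit(10) - celsius_to_fahrenheit(0)
diff_from_50 = celsius_to_fahrenheit(60) - celsius_to_fahrenheit(50)
print(diff_from_0, diff_from_50)  # 18.0 18.0
```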
Ratio Variables
Ratio Variables have order, scale, and a meaningful zero.
A meaningful zero means a value of zero indicates that none of the quantity is present.
Temperature in Kelvin is a ratio variable
0 K means no heat is present (a temperature that is physically unattainable)
0°C and 0°F do not mean this
Example: blood pressure.
Zero blood pressure means the blood is not being pushed around the body.
Operations on Ratio Variables
It makes sense to compute differences and ratios of ratio variables.
A blood pressure of 120 is double the blood pressure of 60.
Note that the difference of values of an interval variable is always a ratio variable
For example, elapsed time (essentially the difference between two dates) is a ratio variable
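As a hedged illustration of the elapsed-time example, Python's standard `datetime` module happens to mirror this distinction. The dates and variable names below are invented for the example.

```python
from datetime import date

# Hypothetical study dates (interval values: the calendar's zero is arbitrary).
enrolled = date(2020, 1, 1)
first_visit = date(2020, 1, 11)
last_visit = date(2020, 1, 31)

time_to_first = first_visit - enrolled   # timedelta of 10 days
time_to_last = last_visit - enrolled     # timedelta of 30 days

# Ratios of elapsed times are meaningful: 30 days is three times 10 days.
elapsed_ratio = time_to_last / time_to_first
print(time_to_first.days, time_to_last.days, elapsed_ratio)  # 10 30 3.0

# Python's date type does not define a ratio of two dates at all.
try:
    last_visit / first_visit
    date_ratio_defined = True
except TypeError:
    date_ratio_defined = False
print(date_ratio_defined)  # False
```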
Examples
For each of the following, determine the type of the variable (Nominal, Ordinal, Interval, Ratio). Also determine whether it is continuous
or discrete.
Variable | Type (N/O/I/R) | Continuous or Discrete
Tumor grade
Heart Rate
Number of Heart Attacks in patient's lifetime
Color
Weight
More Examples
Variable | Type (N/O/I/R) | Continuous or Discrete
Disease Status (affected or not affected)
Pain Scale
Age
Genotype
CT values (from real time qPCR)
Ambiguity in variable types
Determining the type of variable can depend on context, and/or on the measurement techniques used.
In a psychological experiment, patients are exposed to flash cards of various colors and activity in specific parts of the brain is
measured.
Color here is (most likely) a nominal variable.
In a cosmological experiment, the colors of stars are observed and used (along with other data) to determine their relative speeds.
Color here is measured by wavelength of light, and is a ratio variable.
Is Age a continuous or discrete variable?
Age is really a continuous (ratio) variable: it's the amount of time elapsed since birth. However, it is often collected as a discrete variable,
by rounding down to a whole number of years. The imprecision in this rounding is usually insignificant, since effects of age tend to be
more noisy than this loss of precision anyway. However, it is usually better to collect data on a subject's date of birth and subsequent dates
of important events in the study: this way ages can be calculated to the number of days if required.
In statistical analysis, it is usually fine to treat age as a continuous variable, even when the measurement is rounded to a whole number of
years. All continuous data is measured to a degree of precision, and the loss of precision becomes part of the noise. This is no different
with age.
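A minimal sketch of the recommendation above: if dates are stored, age can be derived at whatever precision is needed. The function names and the example dates are hypothetical.

```python
from datetime import date


def age_in_days(date_of_birth: date, on_date: date) -> int:
    """Age as an exact number of days elapsed since birth."""
    return (on_date - date_of_birth).days


def age_in_whole_years(date_of_birth: date, on_date: date) -> int:
    """Age rounded down to whole years, as it is usually collected."""
    years = on_date.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (on_date.month, on_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


dob = date(1980, 6, 15)                 # hypothetical subject
day_before_birthday = date(2020, 6, 14)
birthday = date(2020, 6, 15)

print(age_in_whole_years(dob, day_before_birthday))  # 39
print(age_in_whole_years(dob, birthday))             # 40
# One day apart in reality, but a full year apart after rounding down.
print(age_in_days(dob, birthday) - age_in_days(dob, day_before_birthday))  # 1
```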
Graphing Continuous Data
The next sections of the course will focus on continuous data.
Or data that may be treated as continuous.
We will begin with a discussion of the best ways to visualize and present data.
Summarizing Data
Often, experiments will collect more data than can reasonably be presented in a poster, presentation, or manuscript.
If this is not the case, then present all the data!
Typically, we collect datapoints in the range of dozens upwards (to trillions, in the case of sequencing experiments)
Data must be summarized for presentation and interpretation.
Aims of summarizing data
Summarized data may be presented textually (in a table) or graphically
A good summary should:
Demonstrate what a "typical" value looks like.
Demonstrate the extent to which values deviate from the "typical" value.
Provide as much detail as is realistically possible.
Clearly state how the summary was made.
Measures of central tendency
"Typical" values in a data set are identified by a measure of central tendency
Choosing the right measure is important
Mean
Median
Mode
All these are kinds of "Average"
Mean
The mean is the measure of central tendency most commonly understood by the word "average".
Sum of all the values divided by the number of values.
Since values are summed, mean only makes sense for interval and ratio data.
The mean can be dramatically affected by extreme outliers.
Median
The median is the "middle" value.
Computed by ordering all values and taking the middle one.
Mean of the middle two if there are an even number of values.
Not affected by a small number of outliers, no matter how extreme.
A good measure for ordinal data.
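To see the different sensitivity of the mean and the median to outliers, here is a short sketch using Python's standard `statistics` module. The data (lengths of hospital stay in days) are invented for illustration.

```python
from statistics import mean, median

# Hypothetical lengths of hospital stay, in days.
stays = [4, 5, 5, 6, 7]
# The same data with the longest stay replaced by an extreme value.
stays_with_outlier = [4, 5, 5, 6, 70]

mean_before, median_before = mean(stays), median(stays)
mean_after, median_after = mean(stays_with_outlier), median(stays_with_outlier)

print(mean_before, median_before)  # mean 5.4, median 5
print(mean_after, median_after)    # mean jumps to 18, median is still 5

# With an even number of values, the median is the mean of the middle two.
even_median = median([1, 2, 3, 4])
print(even_median)  # 2.5
```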
Mode
The mode is the most common value.
The French word mode means fashion.
Value that occurs most often.
Makes no sense for continuous data
If measured with enough precision, no value could occur more than once.
The best measure of central tendency for nominal data
Does not always measure the "center" of the data
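A brief sketch of the mode, again with Python's `statistics` module and invented data. It shows the mode of a nominal variable, and why the mode is uninformative when precise continuous measurements never repeat.

```python
from statistics import mode, multimode

# Hypothetical nominal data: blood types of seven subjects.
blood_types = ["A", "O", "A", "B", "O", "A", "AB"]
most_common_type = mode(blood_types)
print(most_common_type)  # A

# Hypothetical continuous data measured precisely: no value repeats,
# so every value ties as "most common" and the mode says nothing.
weights_kg = [70.12, 68.47, 82.95, 75.30]
tied_modes = multimode(weights_kg)
print(len(tied_modes) == len(weights_kg))  # True
```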
Averages do not tell the story
Merely stating an average can be extremely misleading.
The average human being has one breast and one testicle.
Example (simulated). Two patients have blood pressure measured every two hours from 6 a.m. to 10 p.m.
Patient                        A       B
Mean systolic blood pressure   115.3   119.6
Both patients appear healthy…
Example
… however, examine all the data:
Time      Patient A systolic b.p.   Patient B systolic b.p.
6 a.m.    144                       115
8 a.m.    108                       130
10 a.m.   92                        121
Noon      122                       118
2 p.m.    67                        122
4 p.m.    142                       120
6 p.m.    131                       113
8 p.m.    99                        122
10 p.m.   133                       115
Patient A no longer appears healthy.
Need some way to summarize the variability in these measurements.
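The reported means can be reproduced from the nine readings per patient. A minimal sketch, with the readings copied from the table in this example and variable names of our own choosing:

```python
from statistics import mean

# Systolic readings from 6 a.m. to 10 p.m., every two hours.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]

mean_a = round(mean(patient_a), 1)
mean_b = round(mean(patient_b), 1)
print(mean_a, mean_b)  # 115.3 119.6
```

The two means are close even though the individual readings behave very differently, which is the point of the example.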
Measures of Variability
Range
Just the minimum and maximum values in the data
Interquartile range
The range of the "middle half" of the data
Variance and/or standard deviation
A measure of the average deviation from the mean
Coefficient of variation
The standard deviation relative to the mean.
Range
Range is the simplest measure of variability.
Just the minimum and maximum values.
For our simulated blood pressure data, already gives a good clue as to what is happening.
Systolic blood pressure   Patient A   Patient B
Mean                      115.3       119.6
Range                     67-144      113-130
Very susceptible to outliers
One bad reading can completely change the range
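A short sketch of the range and its sensitivity to a single bad reading. The readings are those of the blood pressure example; the erroneous value of 250 is hypothetical, added only to show the effect.

```python
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def value_range(values):
    """Return the (minimum, maximum) pair of a data set."""
    return min(values), max(values)


range_a = value_range(patient_a)
range_b = value_range(patient_b)
print(range_a, range_b)  # (67, 144) (113, 130)

# One hypothetical bad reading for Patient B changes the range completely.
range_b_with_error = value_range(patient_b + [250])
print(range_b_with_error)  # (113, 250)
```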
Interquartile Range
Similar philosophy to the median
Order the values in the data set
Find the 25th percentile and the 75th percentile
The values ¼ and ¾ the way along the ordering
The difference is the interquartile range
The interquartile ranges for the patients in our blood pressure example are 34 and 7
Verify this!
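One way to carry out the verification is sketched below. Note the assumption: there are several conventions for computing percentiles, and they can disagree on small samples. This sketch uses the "inclusive" method of `statistics.quantiles`, which takes the values exactly ¼ and ¾ of the way along the ordering and reproduces the figures on the slide; the module's default "exclusive" method gives different quartiles for a sample this small.

```python
from statistics import quantiles

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def interquartile_range(values):
    """75th percentile minus 25th percentile, 'inclusive' convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


iqr_a = interquartile_range(patient_a)
iqr_b = interquartile_range(patient_b)
print(iqr_a, iqr_b)  # 34.0 7.0
```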
Standard Deviation
Standard Deviation is the most commonly used measure of variability
Intuitively, it measures the average difference between each data point and the mean.
Gives a sense of the average spread of the data
Computing the standard deviation
The formula for the standard deviation is given by
$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
$Y_i$ represents each data point
$\bar{Y}$ is the mean
$n$ is the number of data points.
Motulsky (p 73) has a good discussion of why n-1 is used instead of n.
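The computation can be written out step by step. This is an illustrative sketch of the sample standard deviation with the n-1 denominator, checked against `statistics.stdev` from the Python standard library; the function name `sample_sd` is our own.

```python
from math import sqrt
from statistics import stdev

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def sample_sd(values):
    """Sample standard deviation using the n - 1 denominator."""
    n = len(values)
    y_bar = sum(values) / n                               # the mean
    squared_deviations = [(y - y_bar) ** 2 for y in values]
    return sqrt(sum(squared_deviations) / (n - 1))


sd_a = sample_sd(patient_a)
sd_b = sample_sd(patient_b)
print(round(sd_a, 1), round(sd_b, 1))  # 25.8 5.1

# The standard library uses the same n - 1 definition.
print(abs(sd_a - stdev(patient_a)) < 1e-9)  # True
```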
Variance
Variance is just the square of the standard deviation.
Useful quantity for performing some statistical tests we'll see later
Interpretation less intuitive than standard deviation
Units of standard deviation are the same as the units of the measurements.
Units of variance are the square of the units of the measurements.
Systolic blood pressure   Patient A      Patient B
Mean ± sd                 115.3 ± 25.8   119.6 ± 5.1
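A short sketch relating variance to standard deviation and producing the "mean ± sd" summary. The unit mmHg in the comments is an assumption; the slides do not state the unit of the readings.

```python
from math import sqrt
from statistics import mean, stdev, variance

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]

var_a = variance(patient_a)  # in squared units (mmHg^2, if readings are in mmHg)
var_b = variance(patient_b)
print(var_a, round(var_b, 1))  # 667 and about 26.3

# The standard deviation is the square root of the variance,
# so it is back in the units of the measurements.
print(abs(sqrt(var_a) - stdev(patient_a)) < 1e-9)  # True


def mean_pm_sd(values):
    """Format a data set as 'mean ± sd' to one decimal place."""
    return f"{mean(values):.1f} ± {stdev(values):.1f}"


summary_a = mean_pm_sd(patient_a)
summary_b = mean_pm_sd(patient_b)
print(summary_a)  # 115.3 ± 25.8
print(summary_b)  # 119.6 ± 5.1
```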
Coefficient of Variation
The coefficient of variation (CV) is simply the standard deviation divided by the mean
Only makes sense for ratio variables (why?)
CV has no units
Often presented as a percentage
Occasionally useful for comparing scatter in variables in unrelated units
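The sketch below computes the CV for the two patients, and then offers one possible way to approach the "why?" above: for an interval variable, shifting the arbitrary zero leaves the standard deviation unchanged but changes the mean, so the CV changes too. The body temperatures are hypothetical values chosen for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)


patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]

cv_a = coefficient_of_variation(patient_a)
cv_b = coefficient_of_variation(patient_b)
print(f"{cv_a:.1%} {cv_b:.1%}")  # 22.4% 4.3%

# Hypothetical body temperatures: the same measurements in Celsius
# (interval) and Kelvin (ratio) have the same scatter but different CVs.
temps_c = [36.5, 37.0, 37.5]
temps_k = [c + 273.15 for c in temps_c]

cv_c = coefficient_of_variation(temps_c)
cv_k = coefficient_of_variation(temps_k)
print(abs(stdev(temps_c) - stdev(temps_k)) < 1e-9)  # True: same spread
print(cv_c > cv_k)                                  # True: CV depends on the zero
```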
Graphing Data
We'll look at four ways of graphing our blood pressure data:
Column Scatter Plot
Box and Whisker Plot
Column or Bar Chart
Line Chart
In all these, it is important to show both a measure of central tendency (average) and a measure of variability.
Column Scatter Plot
A Column Scatter Plot plots all the data as individual points in a column.
Rarely used
But very useful, for up to around 100 data points
Not much software support
Column Scatter Plot Example
Box and Whisker Plot
A box and whisker plot shows the range, interquartile range, and median of the data set
A good choice when the median and interquartile range are good measures of central tendency and variation for your data
The median is marked with a horizontal line
The interquartile range is marked with a box
"Whiskers" extend to the full range of the data
A variation is for the whiskers to extend to most of the range, and outliers to be marked individually as points
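The quantities a basic box and whisker plot draws can be computed directly. This is a minimal sketch, assuming whiskers that extend to the full range and quartiles computed with the "inclusive" convention of `statistics.quantiles`; plotting software may use a different percentile convention.

```python
from statistics import quantiles

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def five_number_summary(values):
    """(min, Q1, median, Q3, max): whisker, box edge, line, box edge, whisker."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


summary_a = five_number_summary(patient_a)
summary_b = five_number_summary(patient_b)
print(summary_a)  # (67, 99.0, 122.0, 133.0, 144)
print(summary_b)  # (113, 115.0, 120.0, 122.0, 130)
```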
Box and Whisker Plot Example
Bar Chart
Bar charts use horizontal or vertical bars to demonstrate the mean of the data set
"Error bars" are used to show a measure of variability
Some important considerations for bar charts:
It is natural to look at the relative size of the bars in order to compare the relative values of the means.
Therefore, bar charts should only be used with ratio data and should have the base of the bar at zero
There are various ways the error bars can be drawn (we will see later), so always clearly state what the error bars represent
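A small numeric sketch of why the base of the bars should be at zero. It compares the apparent ratio of bar heights for the two patients' means when the axis starts at zero and when it starts at a hypothetical truncated baseline of 110.

```python
mean_a, mean_b = 115.3, 119.6  # mean systolic b.p. for Patients A and B


def bar_height_ratio(baseline: float) -> float:
    """Ratio of the drawn bar heights (B relative to A) for a given axis base."""
    return (mean_b - baseline) / (mean_a - baseline)


ratio_zero_base = bar_height_ratio(0)
ratio_truncated = bar_height_ratio(110)  # hypothetical truncated axis

print(round(ratio_zero_base, 2))  # 1.04: bars look nearly equal, as the means are
print(round(ratio_truncated, 2))  # 1.81: bar B looks almost twice as tall
```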
Bar Chart Example
Line Chart
A line chart is useful if the data points are ordered, and the ordering is important
For example, if we want to track the data over time
Like a column scatter plot, a line chart plots all the data
Line chart example
Conclusion
When presenting data, choose a visualization which maximizes the information available
At a minimum, present an average and a measure of variation
Use a column scatter plot if possible
Use a box and whisker plot for interval variables, or if the median is a better measure than the mean
Use a line chart when the individual data points can be ordered and that ordering is important
Use a bar chart only for ratio data, and base the chart at zero
Clearly state what the error bars represent