Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 2: Types of Variable,
measures of central tendency, and
scatter
Marshall University Genomics Core Facility
Types of Variable
• Understanding the type of variable with which
you are working is important.
– Type of variable determines which arithmetic
operations make sense
– Helps determine which tests are appropriate for
hypothesis testing
Determining the type of variable
• To determine the type of variable we are using, we ask the
following questions:
– Is there an ordering for values of the variable?
– If there is an ordering, is there a scale?
• i.e. Does an increase in one unit always mean the same thing?
– If there is an ordering and a scale, does the value zero have a specific
meaning?
• Additionally, we ask if the variable is continuous or discrete.
– Continuous means there’s always a value lying strictly between any
two distinct values
• So it must be able to take on fractional values
– Discrete means it takes on only specific, disjoint, values.
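The three ordering/scale/zero questions above amount to a short decision procedure. A minimal sketch, assuming the answers are supplied as booleans; the function name and parameters are hypothetical, and the returned names are the four types described in the slides that follow:

```python
def variable_type(ordered: bool, has_scale: bool = False,
                  meaningful_zero: bool = False) -> str:
    """Classify a variable from the answers to the three questions."""
    if not ordered:
        return "nominal"    # no ordering at all
    if not has_scale:
        return "ordinal"    # ordering, but one unit does not always mean the same thing
    if not meaningful_zero:
        return "interval"   # ordering and scale, but zero is arbitrary
    return "ratio"          # ordering, scale, and a meaningful zero


print(variable_type(ordered=False))                                      # nominal
print(variable_type(ordered=True))                                       # ordinal
print(variable_type(ordered=True, has_scale=True))                       # interval
print(variable_type(ordered=True, has_scale=True, meaningful_zero=True))  # ratio
```

Whether the variable is continuous or discrete is a separate question and is not captured by this function.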
Nominal variables
• Nominal variables are those whose values
have no ordering.
– Just qualitative categories.
– Cannot be continuous.
– Examples:
• Gender
– Values are "Male", "Female"
• Race
– Values are "Black", "White", "Asian", "Native American", etc…
Ordinal variables
• Ordinal variables are variables with qualitative
categories which have an ordering, but no scale.
– Example: Economic status
• Values are typically stated as "Low", "Medium", or "High",
which are computed using a number of factors (income,
education level, occupation, wealth).
• These are ordered because there is a natural ordering low →
medium → high.
• They have no scale because the difference between low and
medium is not necessarily the same as the difference
between medium and high.
Interval Variables
• Interval variables are variables with ordering
and scale, but with no meaningful zero
• Examples: Temperature in Celsius or
Fahrenheit
– There is a scale, because a difference in one
degree means the same thing, no matter what the
starting temperature is.
– However, the choice of a zero value is essentially
arbitrary.
Operations on interval variables
• Computing differences of values of interval
variables makes sense.
– For example, computing a change in temperature
(difference between two temperatures) makes sense,
since a change of one unit (one degree) makes sense.
– Computing ratios of values of interval variables does
not make sense, because there is no meaningful zero
value.
• Ratios of values are dimensionless
– Have no units
– Should be the same no matter what units we start in.
• 100°C is not double 50°C
• These values are equal to 212°F and 122°F respectively, and 212 is not double 122.
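The unit dependence of these ratios can be checked numerically. A small sketch using the standard Celsius-to-Fahrenheit conversion; the helper name `c_to_f` is just for illustration:

```python
def c_to_f(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


hot, warm = 100.0, 50.0

ratio_celsius = hot / warm                        # ratio of the Celsius readings
ratio_fahrenheit = c_to_f(hot) / c_to_f(warm)     # same temperatures, in Fahrenheit

print(c_to_f(hot), c_to_f(warm))      # 212.0 122.0
print(ratio_celsius)                  # 2.0
print(round(ratio_fahrenheit, 2))     # 1.74
```

A genuine ratio would give the same number in both units; here it does not, which is why "twice as hot" is not meaningful on these scales.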
Operations on Ratio Variables
• Ratio variables have an ordering, a scale, and a
meaningful zero, so it makes sense to compute both
differences and ratios of ratio variables.
– A blood pressure of 120 is double the blood
pressure of 60.
• Note that the difference of values of an
interval variable is always a ratio variable
– For example, elapsed time (essentially the
difference between two dates) is a ratio variable
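That differences of interval values behave like ratio values can be illustrated with temperature changes. In this sketch the two temperature rises are hypothetical values chosen for the example; the point is that the ratio of the two changes is the same in Celsius and in Fahrenheit:

```python
import math


def c_to_f(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Two temperature rises as (start, end) pairs in degrees Celsius (hypothetical values).
rise_1 = (20.0, 30.0)   # a 10-degree rise
rise_2 = (10.0, 15.0)   # a 5-degree rise


def change(pair, convert=lambda t: t):
    """Difference end - start, optionally after converting units."""
    start, end = pair
    return convert(end) - convert(start)


ratio_in_celsius = change(rise_1) / change(rise_2)
ratio_in_fahrenheit = change(rise_1, c_to_f) / change(rise_2, c_to_f)

# The first rise is twice the second, whichever unit is used.
print(ratio_in_celsius, ratio_in_fahrenheit)
```

A change of zero degrees means "no change" in every unit, which is the meaningful zero that the temperatures themselves lack.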
Examples
• For each of the following, determine the type of the variable
(Nominal, Ordinal, Interval, Ratio). Also determine whether it is
continuous or discrete.
Variable | Type (N/O/I/R) | Continuous/Discrete
Tumor grade | |
Heart rate | |
# Heart attacks in a patient’s lifetime | |
Color | |
Weight (mass) | |
Disease status | |
Pain scale | |
Age | |
Genotype | |
CT values from RT-qPCR | |
Ambiguity in variable types
• Determining the type of variable can depend on context, and/or on the
measurement techniques used.
– In a psychological experiment, patients are exposed to flash cards of various colors and activity
in specific parts of the brain is measured.
• Color here is (most likely) a nominal variable.
– In a cosmological experiment, the colors of stars are observed and used (along with other
data) to determine their relative speeds.
• Color here is measured by wavelength of light, and is a ratio variable.
• Is Age a continuous or discrete variable?
– Age is really a continuous (ratio) variable: it's the amount of time elapsed since birth.
However, it is often collected as a discrete variable, by rounding down to a whole number of
years. The imprecision in this rounding is usually insignificant, since effects of age tend to be
more noisy than this loss of precision anyway. However, it is usually better to collect data on a
subject's date of birth and subsequent dates of important events in the study: this way ages
can be calculated to the number of days if required.
– In statistical analysis, it is usually fine to treat age as a continuous variable, even when the
measurement is rounded to a whole number of years. All continuous data is measured to a
degree of precision, and the loss of precision becomes part of the noise. This is no different
with age.
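The advice to record dates rather than ages can be sketched with Python's standard `datetime` module. The function names and the example dates are hypothetical; the sketch shows that both the day-level age and the rounded-down age in years can be derived later from the stored dates:

```python
from datetime import date


def age_in_days(birth: date, event: date) -> int:
    """Elapsed time from birth to the event, in whole days."""
    return (event - birth).days


def age_in_whole_years(birth: date, event: date) -> int:
    """Age rounded down to a whole number of years, as it is often collected."""
    years = event.year - birth.year
    # Subtract one if the birthday has not yet occurred in the event year.
    if (event.month, event.day) < (birth.month, birth.day):
        years -= 1
    return years


birth = date(1980, 6, 15)        # hypothetical date of birth
enrolment = date(2020, 6, 14)    # hypothetical study event, one day before a birthday

print(age_in_whole_years(birth, enrolment))  # 39
print(age_in_days(birth, enrolment))
```

Going the other way is not possible: an age recorded only as "39" cannot be turned back into a number of days.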
Summarizing Data
• The next sections of the course will focus on
continuous data.
– Or data that may be treated as continuous
• Often, experiments will collect more data than can
reasonably be presented in a poster, presentation, or
manuscript.
– If this is not the case, then present all the data!
• Typically, we collect datapoints in the range of dozens
upwards (to trillions, in the case of sequencing
experiments)
– Data must be summarized for presentation and
interpretation.
Aims of Summarizing Data
• Summarized data may be presented textually
(in a table) or graphically
• A good summary should:
– Demonstrate what a "typical" value looks like.
– Demonstrate the extent to which values deviate
from the "typical" value.
– Provide as much detail as is realistically possible.
– Clearly state how the summary was made.
Measures of Central Tendency
• "Typical" values in a data set are identified by
a measure of central tendency
– Choosing the right measure is important
• Mean
• Median
• Mode
– All these are kinds of "Average"
Mean
• The mean is the measure of central tendency
most commonly understood by the word
"average".
– Sum of all the values divided by the number of
values.
– Since values are summed, mean only makes sense
for interval and ratio data.
– The mean can be dramatically affected by extreme
outliers.
Median
• The median is the "middle" value.
• Computed by ordering all values and taking
the middle one.
– Mean of the middle two if there are an even
number of values.
• Not affected by a small number of outliers, no
matter how extreme.
• A good measure for ordinal data.
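The different sensitivity of the mean and the median to outliers can be seen with a few made-up numbers. This sketch uses Python's standard `statistics` module; the data values are hypothetical:

```python
import statistics

readings = [4, 5, 5, 6, 7]          # hypothetical measurements
with_outlier = [4, 5, 5, 6, 70]     # same data, but the largest value is an extreme outlier

print(statistics.mean(readings), statistics.median(readings))          # 5.4 5
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 18 5

# With an even number of values, the median is the mean of the middle two.
print(statistics.median([1, 2, 3, 4]))  # 2.5
```

One extreme value more than triples the mean, while the median does not move at all.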
Mode
• The mode is the most common value.
– The French word mode means fashion.
– Value that occurs most often.
• Makes no sense for continuous data
– If measured with enough precision, no value could
occur more than once.
• The best measure of central tendency for
nominal data
• Does not always measure the "center" of the data
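A short sketch of the mode for nominal data, and of why it is uninformative for precisely measured continuous data. It uses `statistics.mode` and `statistics.multimode` from the Python standard library (Python 3.8 or later); the genotype labels and weights are hypothetical examples:

```python
import statistics

# Nominal data: genotype calls for five subjects (hypothetical values).
genotypes = ["AA", "AG", "AG", "GG", "AG"]
print(statistics.mode(genotypes))        # AG

# Continuous data measured precisely: every value occurs exactly once,
# so every value is tied for "most common".
weights = [61.27, 74.03, 58.91, 80.44]
print(statistics.multimode(weights))     # [61.27, 74.03, 58.91, 80.44]
```

`multimode` returns all tied values in the order they first appear, which makes the absence of a useful mode explicit.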
Averages do not tell the story
• Merely stating an average can be extremely
misleading.
– The average human being has one breast and one
testicle.
• Example (simulated). Two patients have blood
pressure measured every two hours from 6 a.m.
to 10 p.m.
Patient | Mean systolic blood pressure
A | 115.3
B | 119.6
• Both patients appear healthy…
Example
• However, examine all the data:
Time | Patient A Systolic b.p. | Patient B Systolic b.p.
6 a.m. | 144 | 115
8 a.m. | 108 | 130
10 a.m. | 92 | 121
Noon | 122 | 118
2 p.m. | 67 | 122
4 p.m. | 142 | 120
6 p.m. | 131 | 113
8 p.m. | 99 | 122
10 p.m. | 133 | 115
• Patient A no longer appears healthy…
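The reported means can be reproduced from the nine readings per patient. A minimal sketch using the standard `statistics` module; the variable names are chosen for this example:

```python
import statistics

# Systolic blood pressure readings from 6 a.m. to 10 p.m., every two hours.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]

mean_a = statistics.mean(patient_a)
mean_b = statistics.mean(patient_b)

print(round(mean_a, 1))  # 115.3
print(round(mean_b, 1))  # 119.6
```

The two means differ by only about four units, even though the individual readings behind them look very different.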
Measures of Variability
• Range
– Just the minimum and maximum values in the data
• Interquartile range
– The range of the "middle half" of the data
• Variance and/or standard deviation
– A measure of the average deviation from the mean
• Coefficient of variation
– The standard deviation relative to the mean.
Range
• Range is the simplest measure of variability.
– Just the minimum and maximum values.
– For our simulated blood pressure data, already
gives a good clue as to what is happening.
Systolic Blood pressure | Patient A | Patient B
Mean | 115.3 | 119.6
Range | 67-144 | 113-130
• Very susceptible to outliers
• One bad reading can completely change the range
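The range and its sensitivity to a single bad reading can be sketched as follows. The readings are the simulated blood pressure data; the extra reading of 30 is a hypothetical measurement error added only for illustration:

```python
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def value_range(values):
    """Return the (minimum, maximum) pair of a data set."""
    return min(values), max(values)


print(value_range(patient_a))   # (67, 144)
print(value_range(patient_b))   # (113, 130)

# One bad reading (hypothetical) completely changes Patient B's range.
print(value_range(patient_b + [30]))   # (30, 130)
```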
Interquartile Range
• Similar philosophy to the median
– Order the values in the data set
– Find the 25th percentile and the 75th percentile
• The values ¼ and ¾ the way along the ordering
– The difference is the interquartile range
• The interquartile ranges for the patients in our
blood pressure example are 34 and 7
– Verify this!
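One way to check these values is sketched below with `statistics.quantiles` from the Python standard library. Several conventions exist for computing percentiles, and they can give different answers on small data sets; this sketch assumes the "inclusive" method, which interpolates between the minimum and maximum of the data, and that choice is an assumption rather than something the slides specify:

```python
import statistics

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def interquartile_range(values):
    """75th percentile minus 25th percentile, using the inclusive method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


print(interquartile_range(patient_a))   # 34.0
print(interquartile_range(patient_b))   # 7.0
```

Other software, or the default "exclusive" method of the same function, may report somewhat different quartiles for nine data points, so it is worth stating which method was used.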
Standard Deviation
• Standard Deviation is the most commonly
used measure of variability
• Intuitively, it measures the average difference
between each data point and the mean.
– Gives a sense of the average spread of the data
Computing the Standard Deviation
• The formula for the standard deviation is given by
$$SD = \sqrt{\frac{\sum_{i} (Y_i - \bar{Y})^2}{n - 1}}$$
• $Y_i$ represents each data point
• $\bar{Y}$ is the mean
• $n$ is the number of data points.
• Motulsky (p 73) has a good discussion of why n-1
is used instead of n.
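The standard deviation formula (sum of squared deviations from the mean, divided by n - 1, then square-rooted) can be written out directly and compared with `statistics.stdev`, which also uses the n - 1 denominator. A sketch on the simulated blood pressure data; the function name `sample_sd` is chosen for this example:

```python
import math
import statistics

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def sample_sd(values):
    """Standard deviation with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((y - mean) ** 2 for y in values)
    return math.sqrt(squared_deviations / (n - 1))


for name, readings in (("A", patient_a), ("B", patient_b)):
    # The hand-written formula and the library function should agree.
    print(f"Patient {name}: SD = {sample_sd(readings):.1f} "
          f"(statistics.stdev gives {statistics.stdev(readings):.1f})")
```

Patient A's standard deviation comes out roughly five times that of Patient B, which matches the impression given by the raw readings.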
Variance
• Variance is just the square of the standard
deviation
– Useful quantity for performing some statistical tests
we’ll see later
– Interpretation less intuitive than standard deviation
• Units of standard deviation are the same as the
units of the measurement
• Units of variance are the square of the units of
the measurement
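The relationship between variance and standard deviation can be checked with the standard `statistics` module. This sketch uses Patient A's simulated readings; the unit mmHg mentioned in the comments is an assumption, since the slides do not state the units of the readings:

```python
import math
import statistics

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]

variance_a = statistics.variance(patient_a)   # n - 1 denominator
sd_a = statistics.stdev(patient_a)

# If the readings are in mmHg (assumed), the SD is in mmHg
# and the variance is in mmHg squared.
print(variance_a)          # 667
print(round(sd_a, 2))      # 25.83
print(math.isclose(sd_a ** 2, variance_a))   # True
```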
Coefficient of Variation
• The coefficient of variation (CV) is simply the
standard deviation divided by the mean
– Only makes sense for ratio variables (why?)
• CV has no units
– Often presented as a percentage
• Occasionally useful for comparing scatter in
variables in unrelated units
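A minimal sketch of the coefficient of variation for the simulated blood pressure data, expressed as a percentage. The function name is hypothetical, and the standard deviation used is the n - 1 version from `statistics.stdev`:

```python
import statistics

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a unitless quantity)."""
    return statistics.stdev(values) / statistics.mean(values)


cv_a = coefficient_of_variation(patient_a)
cv_b = coefficient_of_variation(patient_b)

print(f"Patient A: CV = {cv_a:.1%}")   # Patient A: CV = 22.4%
print(f"Patient B: CV = {cv_b:.1%}")   # Patient B: CV = 4.3%
```

Because the units cancel in the division, the same percentages would result whatever unit the readings were recorded in.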
Graphing Data
• We’ll look at four ways of graphing our blood
pressure data:
– Column Scatter Plot
– Box and Whisker Plot
– Column or Bar Chart
– Line Chart
• In all these, it’s important to show both a
measure of central tendency (average) and a
measure of variability
Column Scatter Plot
• A column scatter plot plots all the data as
individual points in a column
– Rarely used
– But very useful, for up to 100 data points
– Not much software support
• GraphPad Prism, for which Marshall SOM has a license,
can do this
Column Scatter Plot Example
Box and Whisker Plot
• A box and whisker plot shows the range,
interquartile range, and median of the data set
– A good choice when the median and interquartile
range are good measures of central tendency and
variation for your data
– The median is marked with a horizontal line
– The interquartile range is marked with a box
– "Whiskers" extend to the full range of the data
• A variation is for the whiskers to extend to most of the
range, and outliers to be marked individually as points
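The quantities a box and whisker plot displays (minimum, lower quartile, median, upper quartile, maximum) can be computed directly. This sketch assumes whiskers that extend to the full range and the "inclusive" quartile method of `statistics.quantiles`; plotting software may use a different quartile convention:

```python
import statistics

patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the whisker ends, box edges, and median line."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


print(five_number_summary(patient_a))   # (67, 99.0, 122.0, 133.0, 144)
print(five_number_summary(patient_b))   # (113, 115.0, 120.0, 122.0, 130)
```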
Marshall University School of Medicine
Box and Whisker Plot Example
Bar Chart
• Bar charts use horizontal or vertical bars to
demonstrate the mean of the data set
• "Error bars" are used to show a measure of variability
• Some important considerations for bar charts:
– It is natural to look at the relative size of the bars in order
to compare the relative values of the means.
– Therefore, bar charts should only be used with ratio data
and should have the base of the bar at zero
– There are various ways the error bars can be drawn (we
will see later), so always clearly state what the error bars
represent
Bar chart example
Line Chart
• A line chart is useful if the data points are
ordered, and the ordering is important
– For example, if we want to track the data over
time
• Like a column scatter plot, a line chart plots all
the data
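A line chart of the simulated blood pressure readings could be drawn along the following lines. This is a sketch assuming the third-party matplotlib library is installed (the slides themselves mention GraphPad Prism, not matplotlib); the function name and output file name are hypothetical:

```python
TIMES = ["6 a.m.", "8 a.m.", "10 a.m.", "Noon", "2 p.m.",
         "4 p.m.", "6 p.m.", "8 p.m.", "10 p.m."]
PATIENT_A = [144, 108, 92, 122, 67, 142, 131, 99, 133]
PATIENT_B = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def plot_line_chart(path="bp_line_chart.png"):
    """Plot both patients' readings against time of day and save the figure."""
    # Imported here so the data above can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")               # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(TIMES, PATIENT_A, marker="o", label="Patient A")
    ax.plot(TIMES, PATIENT_B, marker="s", label="Patient B")
    ax.set_xlabel("Time of day")
    ax.set_ylabel("Systolic blood pressure")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_line_chart()` writes the image to the given path. Every reading appears on the chart in time order, so the swings in Patient A's values are visible at a glance.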
Line Chart Example