Statistics for Science
Journalists
STEVE DOIG
CRONKITE SCHOOL OF JOURNALISM
Journalists hate math
 Definition of journalist: A do-gooder who hates math.
 “Word person, not a numbers person.”
 1936 JQ article noting habitual numerical errors in
newspapers
 Japanese 6th graders more accurate on math test than
applicants to Columbia’s Graduate School of Journalism
 20% of journalists got more than half wrong on a 25-question “math competency test” (Maier)
 18% of 5,100 stories examined by Phil Meyer had math
errors
Bad examples abound
 Paulos: 300% decrease in murders
 Detroit Free Press (2006): Compared ACS to Census
data to get false drop in median income
 KC Star (2000): Priests dying of AIDS at 4 times the
rate of all Americans
 Delaware ZIP Code of infant death
 NYT: 51% of women without spouses
Common problems
 Numbers that don’t add up
 Making the reader do the math
 Failure to ask “Does this make sense?”
 Over-precision
 Ignoring sampling error margins
 Implying that correlation equals causation
Dangers of journalistic innumeracy
 Misleads math-challenged readers/viewers
 Hurts credibility among math-capable
readers/viewers
 Leads to charges of bias, even when cause is
ignorance
 Makes reporters vulnerable to being used for the
agendas of others
Common Research Methods
 Randomized experiments: Measure deliberate
manipulation of the environment
 Observational studies: Measure the differences
that occur naturally
 Meta-analyses: Quantitative review of multiple
studies
 Case Study: Descriptive in-depth examination of
one or a few individuals
Simple Measures...
...don’t exist!
Measurement Variability
 Variable measurements include
unpredictable errors or discrepancies
that aren’t easily explained.
 Natural variability is the result of the
fact that individuals and other things
are different.
Reasons for variable measures
 Measurement error
 Natural variability between individuals
 Natural variability over time in a single
individual
Some Pitfalls in Studies
Deliberate Bias?
If you found a wallet with $20, would you:
 “Keep it?”
(23% would keep it)
 “Do the honest thing and return it?”
(13% would keep it)
Unintentional Bias?
 “Do you use drugs?”
 “Are you religious?”
Desire to Please?
People routinely say they have voted when they
actually haven’t, that they don’t smoke when they do,
and that they aren’t prejudiced.
One study six months after an election:
 96% of actual voters said they voted.
 40% of non-voters said they voted.
Asking the uninformed?
Washington Post poll: “Some people say the 1975
Public Affairs Act should be repealed. Do you agree
or disagree that it should be repealed?”
 24% said yes
 19% said no
 rest had no opinion
Later Washington Post poll: “President Clinton says
the 1975 Public Affairs Act should be repealed. Do
you agree or disagree that it should be repealed?”
 36% of Democrats agreed
 16% of Republicans agreed
 rest had no opinion
Unnecessary Complexity?
 “Do you support our soldiers in Iraq so that
terrorists won’t strike the U.S. again?”
Question Order
 “About how many times a month do you normally go
out on a date?”
 “How happy are you with life in general?”
Sampling
Margin of Error
95% of the time, a random sample’s
characteristics will differ from the
population’s by no more than about
$1/\sqrt{N}$
where N = sample size
Two Important Concepts about Error Margin
 The larger the sample, the smaller the margin of
sampling error.
 The size of the population being surveyed doesn’t
matter.*
*Unless the sample is a significant fraction of the population.
Sampling realities
 Bigger sample means more cost (money and/or
time)
 Diminishing return on error margin improvement as
sample increases.
   N=100: +/- 10 percentage points
   N=400: +/- 5 percentage points
   N=900: +/- 3.3 percentage points
 Sample needs only to be large enough to give a
reasonable answer.
 Sampling error affects subsamples, too.
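The rule of thumb above is easy to check with a few lines of Python. This is a minimal sketch, not a full survey-statistics tool: the function name `margin_of_error` and the optional `population_size` argument are made up for illustration, and the finite population correction is the standard textbook adjustment for the footnoted case where the sample is a significant fraction of the population.

```python
import math

def margin_of_error(sample_size, population_size=None):
    """Approximate 95% margin of sampling error, as a proportion (0.05 = 5 points)."""
    margin = 1 / math.sqrt(sample_size)
    if population_size is not None:
        # Finite population correction: shrinks the margin when the sample
        # is a sizable share of the whole population.
        margin *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return margin

# Quadrupling the sample only halves the margin: diminishing returns.
for n in (100, 400, 900, 1600):
    print(f"N={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")

# The same sample of 500 is more precise when the population is only 1,000.
print(f"{margin_of_error(500) * 100:.1f} vs {margin_of_error(500, 1000) * 100:.1f}")
```

The same function applies to subsamples: a subgroup of 100 respondents inside a 900-person poll carries the margin for N=100, not N=900.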
Describing data sets
Three Useful Features of a Set of Data
 The Center
 The Variability
 The Shape
The Center
 Mean (average): Total of the values, divided by the
number of values
 Median: The middle value of an ordered list of values
 Mode: The most common value
 Outliers: Atypical values far from the center
Yankees’ Baseball Salaries
 Average: $7,404,762
 Median: $2,500,000
 Mode: $500,000 (also the minimum)
 Outlier: $27.5 million (Alex Rodriguez)
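To see how one outlier pulls the mean away from the median, here is a small sketch using Python's standard `statistics` module. The seven salaries are made up for illustration and are not the actual payroll behind the figures above.

```python
import statistics

# Made-up seven-player payroll, for illustration only (not real salary data).
salaries = [500_000, 500_000, 500_000, 1_000_000, 2_500_000, 4_000_000, 27_500_000]

mean_salary = statistics.mean(salaries)      # total divided by count
median_salary = statistics.median(salaries)  # middle value of the ordered list
mode_salary = statistics.mode(salaries)      # most common value

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   ${mode_salary:,.0f}")

# Drop the single largest salary: the mean collapses, the median barely moves.
without_star = sorted(salaries)[:-1]
print(f"Mean without the outlier:   ${statistics.mean(without_star):,.0f}")
print(f"Median without the outlier: ${statistics.median(without_star):,.0f}")
```

When the mean and median are this far apart, the median is usually the fairer description of a "typical" value.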
The Variability
Some measures of variability:
 Maximum and minimum: Largest and smallest
values
 Range: The distance between the largest and
smallest values
 Quartiles: The medians of each half of the ordered
list of values
 Standard deviation: Think of it as the average
distance of all the values from the mean.
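The measures listed above can be computed in a few lines. This sketch follows the slide's definition of quartiles (the medians of each half of the ordered list); other software may use slightly different quartile conventions. The function name `describe_spread` and the sample values are hypothetical. Note that the standard deviation is, strictly, the square root of the average squared distance from the mean, which is why "average distance" is only a way to think about it.

```python
import statistics

def describe_spread(values):
    """Summarize variability; needs at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # skips the middle value when the count is odd
    return {
        "minimum": ordered[0],
        "maximum": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "first_quartile": statistics.median(lower_half),
        "third_quartile": statistics.median(upper_half),
        # Population standard deviation: root of the mean squared distance.
        "std_dev": statistics.pstdev(ordered),
    }

# Hypothetical measurements, for illustration only.
spread = describe_spread([2, 4, 4, 5, 7, 9, 10, 12])
for name, value in spread.items():
    print(f"{name}: {value:.2f}")
```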
What is “normal”?
 Don’t consider the average to be “normal”
 Variability is normal
 Anything within about 3 standard deviations of the
mean is “normal”
Bell-Shaped “Normal” Curve
Some Characteristics
of a Normal Distribution
 Symmetrical (not skewed)
 One peak in the middle, at the mean
 The wider the curve, the greater the standard
deviation
 Area under the curve is 1 (or 100%)
Percentiles
Your percentile for a particular measure (like height or
IQ) is the percentage of the population that falls
below you.
Compared to other American males:
 My height (5’ 11”): 75th percentile
 My weight (230 lbs.): 85th percentile
 My age (66): 88th percentile
Therefore, I am older and heavier than I am tall.
Standardized Scores
A standardized score (also called the z-score) is simply
the number of standard deviations a particular value
is either above or below the mean.
The standardized score is:
 Positive if above the mean
 Negative if below the mean
Useful for defining data points as outliers.
The Empirical Rule
For any normal curve, approximately:
 68% of values within one StdDev of the mean
 95% of values within two StdDevs of the mean
 99.7% of values within three StdDevs of the mean
Outlier
 A value that is more than three standard deviations
above or below the mean.
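Standardized scores, the empirical rule, and the three-standard-deviation outlier test fit together in a short sketch. The helper names `z_score` and `is_outlier` and the example numbers (mean 100, standard deviation 15) are assumptions for illustration; `NormalDist` is part of Python's standard `statistics` module.

```python
from statistics import NormalDist

def z_score(value, mean, std_dev):
    """Number of standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / std_dev

def is_outlier(value, mean, std_dev, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    return abs(z_score(value, mean, std_dev)) > cutoff

# Share of a normal curve lying within 1, 2 and 3 standard deviations.
standard_normal = NormalDist()  # mean 0, standard deviation 1
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} StdDev: {share:.1%}")

# Hypothetical measure with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))     # two standard deviations above the mean
print(is_outlier(130, 100, 15))  # inside the 3-StdDev band
print(is_outlier(150, 100, 15))  # outside the 3-StdDev band
```

For a normally distributed measure, `standard_normal.cdf(z)` also converts a standardized score into a percentile.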
Correlation
Strength of Relationship
Correlation (also called the correlation coefficient or
Pearson’s r) is the measure of strength of the linear
relationship between two variables.
Think of strength as how closely the data points come
to falling on a line drawn through the data.
Features of Correlation
 Correlation can range from +1 to -1
 Positive correlation: As one variable increases,
the other increases
 Negative correlation: As one variable increases,
the other decreases
 Zero correlation means the best line through the
data is horizontal
 Correlation isn’t affected by the units of
measurement
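As a rough sketch of what the correlation coefficient measures, Pearson's r can be computed by hand from the deviations of each variable around its mean. The function name `pearson_r` and the study-hours data are made up for illustration. The last lines check the point about units: converting hours to minutes leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 79]
minutes = [h * 60 for h in hours]

print(f"r (hours):   {pearson_r(hours, scores):.3f}")
print(f"r (minutes): {pearson_r(minutes, scores):.3f}")
```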
Positive Correlations
[Scatterplots: r = +.1, r = +.4, r = +.8, r = +1]
Negative Correlations
[Scatterplots: r = -.1, r = -.4, r = -.8, r = -1]
Zero correlation
[Two scatterplots, each with r = 0]
Number of Points Doesn’t Matter
[Two scatterplots, each with r = .8]
Important!
Correlation does not imply
causation.
Correlation of variables
 When considering relationships between
measurement variables, there are two kinds:
   Explanatory (or independent) variable: The variable that
  attempts to explain or is purported to cause (at least partially)
  differences in the…
   Response (or dependent or outcome) variable
 Often, chronology is a guide to distinguishing them
(examples: baldness and heart attacks, poverty and
test scores)
Some reasons why
two variables could be related
 The explanatory variable is the direct cause of the
response variable
Example: pollen counts and percent of population
suffering allergies, intercourse and babies
 The response variable is causing a change in the
explanatory variable
Example: hotel occupancy and advertising spending,
divorce and alcohol abuse
 The explanatory variable is a contributing -- but not
sole -- cause
Example: birth complications and violence, gun in
home and homicide, hours studied and grade, diet
and cancer
 Both variables may result from a common cause
Example: SAT score and GPA, hot chocolate and
tissues, storks and babies, fire losses and
firefighters, WWII fighter opposition and bombing
accuracy
 Both variables are changing over time
Example: divorces and drug offenses, divorces and
suicides
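A small sketch shows how a shared time trend can manufacture a correlation. Both series below are invented for illustration; they are not real divorce, drug, or suicide figures. The yearly levels both drift upward and correlate strongly, while the year-to-year changes, which strip out the trend, tell a different story.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

def yearly_changes(series):
    """Difference between each year and the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Invented annual counts for two unrelated things that both grow over time.
series_a = [100, 104, 107, 112, 115, 121]
series_b = [20, 21, 24, 25, 29, 30]

level_r = pearson_r(series_a, series_b)
change_r = pearson_r(yearly_changes(series_a), yearly_changes(series_b))
print(f"r between yearly levels:  {level_r:.2f}")
print(f"r between yearly changes: {change_r:.2f}")
```

Comparing changes rather than levels is only one simple check; it does not prove or disprove a real relationship.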
 The association may be nothing more than
coincidence
Example: clusters of disease, brain cancer from cell
phones
So how can we
confirm causation?
The only way to confirm is with a designed
(randomized double-blind) experiment.
But non-statistical evidence of a possible connection
may include:
 A reasonable explanation of cause and effect.
 A connection that happens under varying conditions.
 Potential confounding variables ruled out.
Regression
Linear Regression
In addition to figuring the strength of the relationship,
we can create a simple equation that describes the
best-fit line (also called the “least-squares” line)
through the data.
This equation will help us predict one variable, given
the other.
Best-fit (“least-squares”) Line
Best-fit Line??? (much variance)
Best-fit Line! (least variance)
Remember
9th Grade Algebra?
x = horizontal axis
y = vertical axis
Equation for a line:
y = slope * x + intercept
or as it often is stated:
y = mx + b
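The slope and intercept of the least-squares line can be worked out directly from the data. This is a minimal sketch using the standard closed-form formulas for one explanatory variable; the function name `fit_line` and the study-hours data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The best-fit line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 79]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")

# Use the equation to predict one variable given the other.
predicted = slope * 6 + intercept
print(f"Predicted score for 6 hours: {predicted:.1f}")
```

Predictions are most trustworthy inside the range of the data used to fit the line; extending the line far beyond it is a guess.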
Regression in data journalism
 Public school test scores
 Cheating in school test scores
 Tenure of white vs. black coaches in NBA
 Racial bias in picking jurors
 Racial profiling in traffic stops
Confusion of the inverse
Confusion of the Inverse
Confusing these two:
 Probability of actually having a condition, given a
positive test for it
 Probability of having a positive test, given actually
having the condition
When the incidence of some disease or condition is
very low, and the test for it is not perfect, there will
be a high probability that a positive test result is
a false positive.
Definitions
 Base rate: The probability that someone has a
disease or condition, without knowing any test
results.
 Test Sensitivity: Proportion of people who
correctly test positive when they have the disease
or condition (true positive)
 Test Specificity: Proportion of people who correctly
test negative when they don’t have the disease or
condition (true negative)
Drug Tests
Consider this scenario:
 Base rate: 1% of population to be tested uses
dangerous drugs
 You use a test that’s 99% accurate in both sensitivity
and specificity
 10,000 people are tested
Drug Tests
             Test Positive   Test Negative    Total
Users                   99               1      100
Non-users               99           9,801    9,900
Total                  198           9,802   10,000
(50% of positives are FALSE!)
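The drug-test scenario can be reproduced from the three definitions (base rate, sensitivity, specificity). This is a sketch under the scenario's stated assumptions; the function name `positive_test_breakdown` is made up for illustration.

```python
def positive_test_breakdown(population, base_rate, sensitivity, specificity):
    """Return (true positives, false positives, share of positives that are false)."""
    users = population * base_rate
    non_users = population - users
    true_positives = users * sensitivity             # users correctly flagged
    false_positives = non_users * (1 - specificity)  # non-users wrongly flagged
    share_false = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_false

true_pos, false_pos, share_false = positive_test_breakdown(
    population=10_000, base_rate=0.01, sensitivity=0.99, specificity=0.99
)
print(f"True positives:  {true_pos:.0f}")
print(f"False positives: {false_pos:.0f}")
print(f"Share of positives that are false: {share_false:.0%}")

# The two probabilities that get confused:
print(f"P(positive test | user) = {0.99:.0%}")
print(f"P(user | positive test) = {1 - share_false:.0%}")
```

Raising the base rate changes the picture quickly: rerunning the function with `base_rate=0.10` leaves far fewer false positives among the positives.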
Confidence intervals
and p-values
Confidence Intervals
 Like the error margin around poll results
 A confidence interval is a tradeoff between
certainty and accuracy, like shooting at
targets of different sizes
 The bigger the sample, the smaller the
confidence interval at the 95% level
 When comparing results, if confidence
intervals overlap, treat the difference as NOT
statistically significant (a cautious rule of thumb)
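For a poll percentage, a 95% confidence interval can be sketched with the usual normal-approximation formula, p ± 1.96 × sqrt(p(1 − p)/n). The function names and the poll numbers below are hypothetical. The example compares two candidates at 52% and 45%, first with 400 respondents and then with 2,500.

```python
import math

def confidence_interval_95(proportion, sample_size):
    """Approximate 95% confidence interval for a poll proportion."""
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    margin = 1.96 * standard_error
    return proportion - margin, proportion + margin

def intervals_overlap(first, second):
    """True when two (low, high) intervals share any values."""
    return first[0] <= second[1] and second[0] <= first[1]

# Hypothetical poll: candidate A at 52%, candidate B at 45%.
for n in (400, 2500):
    a = confidence_interval_95(0.52, n)
    b = confidence_interval_95(0.45, n)
    print(f"n={n}: A {a[0]:.1%} to {a[1]:.1%}, B {b[0]:.1%} to {b[1]:.1%}, "
          f"overlap: {intervals_overlap(a, b)}")
```

With the smaller sample the intervals overlap, so the lead could be a product of sampling error; with the larger sample they separate.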
P-values
 P-value is the probability of getting a result at least as
extreme as the sample’s by chance alone, if there were
really no effect
 95% confidence interval (p < 0.05) is the most commonly
used interval in social science research
 Hard science, particularly medicine, often demands higher
confidence levels and smaller p-values, like p<0.01
 Studies are going to be wrong about 5% of the time (and
you won’t know when)
 On the other hand, they probably won’t be very wrong.
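A coin-flip example makes the p < 0.05 threshold concrete. This sketch computes an exact two-sided p-value: the chance that a fair coin produces a result at least as lopsided as the one observed. The function name `two_sided_p_value` and the flip counts are made up for illustration.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact chance of a split at least this lopsided if the coin is fair."""
    distance = abs(heads - flips / 2)
    # Count every outcome at least as far from an even split as the observed one.
    extreme_outcomes = sum(
        comb(flips, k) for k in range(flips + 1) if abs(k - flips / 2) >= distance
    )
    return extreme_outcomes / 2 ** flips

for heads in (55, 60, 61, 65):
    p = two_sided_p_value(heads, 100)
    verdict = "significant at p < 0.05" if p < 0.05 else "not significant"
    print(f"{heads} heads in 100 flips: p = {p:.3f} ({verdict})")
```

A small p-value says the result would be unusual for a fair coin; it does not say how large or important the effect is.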
How to read a research study
 Pay attention to the method: Observational, randomized
double-blind experiment, meta-analysis, case study
 Note the sample size
 Don’t ignore the confidence intervals
 Consider the p-value as a gauge of how easily chance
alone could explain what you’re writing about
 Remember correlation doesn’t necessarily mean
causation.
 Consider the quality of the journal (peer reviewed?)
 Who paid for the research?
Newsroom math bibliography
 “Numbers in the Newsroom,” by Sarah Cohen, IRE
 “News and Numbers,” by Victor Cohn and Lewis Cope
 “Precision Journalism (4th edition),” by Phil Meyer
 “Innumeracy,” by John Allen Paulos
 “A Mathematician Reads the Newspaper,” by John
Allen Paulos
 “Damned Lies and Statistics,” by Joel Best
Questions?