Download Lecture 1 - Lorenzo Marini

Introduction to Biostatistical Analysis
Practical statistics course for first-year PhD students
Session 1: 7/11/2012
Lecture: Basic concepts
Session 2: 8/11/2012
Lecture: Introduction to statistical hypothesis testing
Session 3: 9/11/2012
Lecture: Analysis of Variance
Session 4: 14/11/2012
Lecture: Regression
Session 5: 16/11/2012
Lecture: Synthesis and applications
Lecturer: Lorenzo Marini
DAFNAE, University of Padova
E-mail: [email protected], Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
5 sessions, each with a lecture (theory) and a practical (use of R)
+
Assessment (at least three sessions attended)
Analysis of an assigned data set and writing a report
Your report should consist of the following 2 items:
1. A one-page R script fully documenting your analysis.
2. A Word document presenting the aims, the method of analysis, the results, and an interpretation.
Statistics: Definition
STATISTICS
-Statistical methods can be used to summarize or describe a
collection of data; this is called descriptive statistics.
-In addition, patterns in the data may be modeled from samples,
and then used to draw inferences about the process or population
being studied; this is called inferential statistics.
Statistics = techniques of
• Collecting
• Analysing
• Drawing conclusions from
DATA
“A mode of thought” that will change the way you do science
Statistics: Population and samples
Population: a set of entities of interest (e.g. PhD students, farms, bees, fish, dogs…)
Sample: a subset of entities randomly drawn from the population
[Figure: samples drawn from a population within a given spatial extent]
Statistics answers RESEARCH questions using samples
Statistics: Population and samples
Example: Population of PhD students
Question: do female students perform better in stats than male students?
Spatial extent?
How to draw samples?
[Figure: population of male (♂) and female (♀) students]
Statistics answers RESEARCH questions using samples
Inferential statistics: logic
Statistical testing in five steps:
1. Construct a null hypothesis (H0) (RESEARCH QUESTION)
E.g. Question: do female students perform better in stats than male students?
2. Choose a statistical analysis
E.g. t-test to detect a difference between male and female students
3. Collect the data (sampling)
E.g. Sampling male and female students
4. Calculate the test statistic and the P-value
E.g. Perform the t-test
5. Reject H0 if P is small; do not reject H0 if P is large
(ANSWER THE QUESTION)
Common error: sampling before (1) constructing the hypothesis and (2) choosing the statistical analysis
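The five steps above can be sketched in R as follows. This is only an illustration: the scores are simulated, and the group means, the SD and the sample size of 15 students per gender are arbitrary assumptions, not course data.

```r
## Hypothetical data: simulated stats scores, NOT real results
set.seed(1)                                   # make the simulation reproducible

## 1. H0: the mean score of female students equals that of male students
## 2. Chosen analysis: two-sample t-test
## 3. "Sampling": 15 students per gender (assumed sample size)
scores <- data.frame(
  gender = factor(rep(c("Female", "Male"), each = 15)),
  score  = c(rnorm(15, mean = 7, sd = 1.5),   # assumed female mean
             rnorm(15, mean = 6, sd = 1.5))   # assumed male mean
)

## 4. Calculate the test statistic and the P-value
fit <- t.test(score ~ gender, data = scores)
fit$statistic   # t value
fit$p.value     # P-value

## 5. Decision at the 5% significance level
if (fit$p.value < 0.05) {
  print("Reject H0")
} else {
  print("Do not reject H0")
}
```

Note that the hypothesis and the analysis are fixed before any data are generated, which is the order the slide recommends.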
Strong vs. weak inference
Weak inference (observational)
1. Many hypotheses
2. Correlational
SEVERAL alternative hypotheses for explaining the process
Strong inference (experimental)
1. A clear hypothesis
2. A specified test
ONE clear testable hypothesis
What is a hypothesis?
• a statement that is testable
• Testable: can be falsified (K. Popper)
KEEP IN MIND!
1. The clearer the question, the easier the sampling and the simpler the statistics to be applied
2. If you have blurry and foggy questions and a bad design, you will have to apply complex analyses
Where are you?
Without statistics you cannot write scientific papers (almost…)
1. Construct hypotheses: types of variables
Which is/are your response variable/s?
Object of your research (e.g. animal fitness, yield,
healing time, cow milk production…)
Which is/are your explanatory variable/s?
Variables for explaining your response variable
(e.g. age, risk factors, temperature, hormones,
fertilization, diet...)
Which are your research questions (hypotheses)?
HYPOTHESIS: a statement that can be falsified
1. Construct hypotheses
Response variable
Continuous: weight, height, length, temperature, concentration
Count: number of individuals, days, cells; zero is a common value
Proportion: percentage mortality, infection rate, proportion responding to a treatment, percent leaf area eaten
1. Construct hypotheses
Categorical explanatory variable
Species, Clone, Genotype, Treatment, Diet, Growth Chamber, Breed…
Each categorical factor has two or more levels
[Figure: box-plot of the response for factor levels A to H]
Continuous explanatory variable
Weight, height, length, temperature
[Figure: scatter plot of the response against log(factor)]
1. Construct hypotheses: prepare the data
Each column is one variable
Explanatory 1: Age   Explanatory 2: Gender   Response: Score
26                   Male                    10
24                   Male                    9
26                   Male                    5
28                   Male                    4
32                   Male                    3
24                   Female                  4
26                   Female                  8
28                   Female                  9
28                   Female                  6
Do not confuse the levels of a categorical factor with the factor itself
E.g. male and female are not factors: they are the two levels of the factor Gender!
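As a minimal sketch, the values listed above can be entered in R as a data frame with one column per variable. The object name students is an arbitrary choice for illustration.

```r
## One row per student, one column per variable
students <- data.frame(
  Age    = c(26, 24, 26, 28, 32, 24, 26, 28, 28),
  Gender = factor(c(rep("Male", 5), rep("Female", 4))),
  Score  = c(10, 9, 5, 4, 3, 4, 8, 9, 6)
)

str(students)             # check the type of each column
levels(students$Gender)   # the levels of the single factor Gender
```

Gender is stored as one factor with two levels, which is the distinction the slide insists on.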
2. Choose a statistical analysis
Univariate analysis
- One response variable (Y) (e.g. Y = hormone concentration)
- One or more explanatory variables (Xi) (e.g. age, size, breed…)
Multivariate analysis
- More than one response variable (Y1, Y2, Y3, Y4, Y5, …) (e.g. Yi = species composition)
- One or more explanatory variables (Xi) (e.g. age, training, breed, sex)
2. Choose a statistical analysis
How many response variables?
- 1: univariate analysis (response variable: continuous, count or proportion, each with its own distribution)
- More than 1: multivariate analysis
2. Choose a statistical analysis
Response variable (distributions):
- Continuous (normal)
- Count (normal or Poisson)
- Proportion (normal or binomial)
Explanatory variables (statistical analyses):
- Continuous: e.g. regression
- Categorical: e.g. ANOVA
- Categorical + continuous: e.g. ANCOVA, GLMs
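The mapping between the type of explanatory variable and the analysis can be sketched with base R model functions. The data are simulated, and the variable names (age, diet, y, count) and effect sizes are assumptions made only for this example.

```r
set.seed(2)
n <- 40
d <- data.frame(
  age  = runif(n, 1, 10),                         # continuous explanatory variable
  diet = factor(rep(c("A", "B"), each = n / 2))   # categorical explanatory variable
)
## Simulated continuous response (assumed effects of age and diet)
d$y <- 5 + 0.8 * d$age + 2 * (d$diet == "B") + rnorm(n)
## Simulated count response
d$count <- rpois(n, lambda = exp(0.2 * d$age))

m_reg    <- lm(y ~ age, data = d)          # continuous X: regression
m_anova  <- aov(y ~ diet, data = d)        # categorical X: ANOVA
m_ancova <- lm(y ~ age + diet, data = d)   # categorical + continuous X: ANCOVA
m_glm    <- glm(count ~ age, family = poisson, data = d)  # count response: Poisson GLM

summary(m_ancova)
```

The same formula syntax (response ~ explanatory) is used throughout; only the fitting function and, for the GLM, the family change.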
2. Choose a statistical analysis
Parametric statistics
The population is assumed to follow a parameterized distribution.
A probability distribution describes the values that a random variable can take and the probability of each
-Normal
-Poisson
-Gamma
-Binomial…
General Linear Models
(ANOVA, regression, ANCOVA)
Generalized Linear Models GLMs
NB The distribution depends on the
nature of your response variable
Non-parametric statistics
Nonparametric methods are often referred to as ‘distribution free’ methods, as they do not rely on the assumption that the data are drawn from a given probability distribution.
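As a small illustration of the difference, the same two groups can be compared with a parametric test (t-test) and a rank-based nonparametric one (Wilcoxon rank-sum test). The data are simulated and the group means are assumptions.

```r
set.seed(3)
a <- rnorm(12, mean = 10, sd = 2)   # hypothetical group A
b <- rnorm(12, mean = 12, sd = 2)   # hypothetical group B

p_param    <- t.test(a, b)$p.value        # parametric: assumes normally distributed data
p_nonparam <- wilcox.test(a, b)$p.value   # nonparametric: uses only the ranks

c(t_test = p_param, wilcoxon = p_nonparam)
```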
2. Choose a statistical analysis
Statistical analysis
Assumptions: each analysis (even a nonparametric one) requires some background assumptions
Sampling design: each analysis requires an appropriate sampling design
If both these conditions are met:
5. We can reject or not reject our hypotheses
3. Collect the data (sampling)
SAMPLING: key step in any research
[Figure: samples drawn from a population]
Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern
Key concepts:
-Randomization
-Replication (no pseudo-replication!!!)
-Independence
3. Collect the data: randomization
Population (N) -> Sample (n<N): units 1, 2, 3, …, n drawn without replacement
A simple random sample is selected so that every
possible sample has an equal chance of being drawn
from the population
Copy and paste in R
### Randomly draw one student among you
students <- seq(1, 30)   ## assign an ID to each of the 30 students
students                 ## show the ID numbers
sample(students, 1)      ## randomly select one of the 30 students
## Repeat the draw 1000 times: every ID should come up about equally often
hist(replicate(1000, sample(students, 1)), breaks = 8)
3. Collect the data: sufficient replication!
True replication vs. pseudoreplication
Replication means having replicate observations at a spatial and temporal scale that matches the application of the experimental treatments
True replicates must be independent
n replicates -> degrees of freedom -> P
Replicates MUST NOT:
- Come from a time series
- Be grouped in space
- Be repeated measures on the same individuals
(but it depends: wait for mixed models!!!)
3. Collect the data: independence
[Figure: 4 replicates drawn from the population by random sampling]
Independence: no meaningful relation between the sampling units
3 MAIN PROBLEMS IN BIOSTATISTICAL ANALYSES
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)
3. Collect the data: independence
ID   Gender   Score
1    Male     10
2    Male     9
3    Male     5
4    Male     4
5    Male     3
6    Male     4
7    Female   8
8    Female   9
9    Female   6
10   Female   7
11   Female   8
12   Female   5
Each row is one observation or measurement
MAIN QUESTION:
Are your observations true replicates?
Do not confuse observations with replicates!
3. Collect the data: spatial dependence
Which is your replicate and which is the right scale?
Birds (♂ and ♀): 10 birds per sex x 15 feathers per bird = 300 measurements
Rats: 18 rats (18 livers) x 3 samples per liver x 2 analyses per tissue sample = 108 measurements
3. Collect the data: spatial dependence
Which is your replicate and which is the right scale?
Response variable: feather length
Explanatory variable: sex (♀ and ♂)
15 measures per bird
3. Collect the data: spatial dependence
Which is your replicate and which is the right scale?
E.g. ANOVA
> Bird_level<-aov(feather ~ sex)
            df   SS        MS      F value   P
sex          1    3.4279    3.42    5.887    0.025 *
Residuals   18   10.4813    0.58
--------------------------------------------------
> Feather_level<-aov(feather ~ sex)
            df   SS        MS      F value   P
sex          1    51.41    51.41   63.19     3.9e-14 ***
Residuals  298   242.48     0.81
Pseudo-replication!!! Do you have a solution?
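One possible answer, sketched here under stated assumptions, is to average the feathers within each bird so that the bird becomes the replicate. The data below are simulated (20 birds, 15 feathers each, arbitrary effect sizes) and will not reproduce the numbers in the tables above; only the residual degrees of freedom match.

```r
set.seed(4)
n_birds    <- 20   # 10 birds per sex
n_feathers <- 15   # feathers measured per bird

birds <- data.frame(bird = factor(1:n_birds),
                    sex  = factor(rep(c("F", "M"), each = 10)))
## Assumed bird-level means: males slightly larger, plus between-bird variation
bird_effect <- rnorm(n_birds, mean = ifelse(birds$sex == "M", 10.5, 10), sd = 0.7)

feathers <- data.frame(
  bird    = rep(birds$bird, each = n_feathers),
  sex     = rep(birds$sex,  each = n_feathers),
  feather = rep(bird_effect, each = n_feathers) +
            rnorm(n_birds * n_feathers, sd = 0.5)   # within-bird variation
)

## Pseudo-replicated analysis: every feather is treated as a replicate
feather_level <- aov(feather ~ sex, data = feathers)

## Analysis at the proper scale: one mean per bird
bird_means <- aggregate(feather ~ bird + sex, data = feathers, FUN = mean)
bird_level <- aov(feather ~ sex, data = bird_means)

df.residual(feather_level)   # 298 error df
df.residual(bird_level)      # 18 error df
summary(bird_level)
```

Averaging discards the within-bird information; mixed models, mentioned later in the lecture, are the alternative that keeps it.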
3. Collect the data: temporal dependence
Temporal dependence: time series (not covered)
When we measure experimental units repeatedly
over time, we are collecting longitudinal data,
also called repeated measures data.
Time 1: measure 1
Time 2: measure 2
Time 3: measure 3
Time 4: measure 4
3. Collect the data: biological dependence
Biological dependence
Unknown genetic relations between sampled units
Individuals belonging to same litter or brood
Litter A
Litter B
Drug A
Drug B
Biased sampling (factor + noise)
Proper sampling (2 litter + diet)
3. Collect the data (sampling)
If temporal or spatial dependence exists
SIMPLEST SOLUTION
Have an appropriate number of replicates at the proper scale
[P.A. Murtaugh 2007. Simplicity and complexity in ecological data analysis. Ecology, 88, 56–62]
ALTERNATIVE SOLUTION
Wait for mixed models (they can deal with non-independent data)
Agricultural research: easy to control possible confounding factors
Ecological research: more difficult to apply traditional models.
Modern mixed models with REML estimation (extremely dangerous analyses!!!)
3. Collect the data (sampling)
Examples of ‘Bad’ design
no replication (you cannot use statistics)
clumped segregation (totally uninformative)
isolative segregation (growth chambers etc.)
systematic (problem: periodic variations)
Example of ‘Good’ Designs
randomized block (paired or block design)
completely randomized (if enough time, space, money)
3. Collect the data (sampling): one example
Factor: Irrigation -> Response: Maize yield
Step i: identify your sampling unit
Step ii: identify your replicate and the sample size
Step iii: decide the spatial distribution
Step iv: one or repeated measurements?
[Figure: arable field and ditch]
3. Collect the data (sampling)
‘Good’ design (multifactorial ANOVA)
Latin square (in case of two gradients)
A C D B
D A B C
B D C A
C B A D
Split-plot (very common!): e.g. agronomic trials, greenhouses, Petri dishes
Nested (especially in medicine): e.g. liver samples from individual rats
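As a hedged illustration of randomization within a design, the sketch below generates a randomized complete block layout (one of the ‘good’ designs listed earlier) with four treatments in four blocks. The treatment labels and the number of blocks are assumptions; this is not a Latin square generator.

```r
set.seed(5)
treatments <- c("A", "B", "C", "D")
n_blocks   <- 4

## Within each block, assign the four treatments to the four plots in random order
design <- do.call(rbind, lapply(seq_len(n_blocks), function(b) {
  data.frame(block = b, plot = 1:4, treatment = sample(treatments))
}))

design
table(design$block, design$treatment)   # each treatment occurs once per block
```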
3. Collect the data (sampling)
Before sampling you MUST know which is your
REPLICATE and the right SCALE of your study!!!
Sampling with appropriate replication!!!
3. Collect the data (sampling)
Manipulative experiments
Natural experiments
Observational studies
If we know exactly the analysis, the choice of the sampling is straightforward
Spatial patterns in the samples
If you don’t work with experiments you should always consider the spatial patterns in your sampling design
Basic concepts: mean and variance
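MEAN AND VARIANCE
Whereas the mean is a way to describe the location of a distribution, the variance is a way to capture its scale or degree of being spread out
[Figure: distribution of body size with the mean marked]
$$\text{mean} = \bar{y} = \frac{\sum_{i} y_i}{n}$$
$$\text{deviance} = SS = \sum_{i} (y_i - \text{mean})^2$$
$$\text{var} = \frac{\sum_{i} (y_i - \text{mean})^2}{n - 1}$$
$$SD = \sqrt{\frac{\sum_{i} (y_i - \text{mean})^2}{n - 1}}$$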
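The definitions of mean, deviance (SS), variance and SD can be checked by hand in R against the built-in functions. The five body-size values are made up for the example.

```r
y <- c(12, 15, 11, 14, 18)       # hypothetical body sizes
n <- length(y)

y_mean <- sum(y) / n             # mean
SS     <- sum((y - y_mean)^2)    # deviance (sum of squares)
y_var  <- SS / (n - 1)           # variance
y_sd   <- sqrt(y_var)            # standard deviation

c(mean = y_mean, SS = SS, var = y_var, SD = y_sd)

## The built-in functions give the same values
all.equal(y_var, var(y))
all.equal(y_sd, sd(y))
```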
Basic concepts: Residuals
RESIDUALS
[Figure: residuals shown as deviations of the observations from the mean]
A residual is an observable estimate of the unobservable statistical error.
The simplest case involves a random sample of n men whose heights are measured. The difference between the height of each man in the sample and the observable sample average is a residual.
Residuals represent what we cannot explain
Basic concepts: Uncertainty
We normally compute means using samples. It is not feasible to measure all the individuals in a population.
As we work with samples, there is always a degree of UNCERTAINTY in the estimation.
[Figure: distribution of single-cow milk production (x-axis: milk production, y-axis: frequency). Whole population: mean = 40; sample means shown: 42 and 39.5]
How can we reduce the degree of uncertainty?
Basic concepts: Law of Large Numbers
As the sample size (n) grows, the sample mean approaches the population mean
When is a sample large enough?
Population with mean = 0, SD = 1
Run the simulation to answer
[Figure: sample mean plotted against sample size, up to n = 100]
library(animation)
ani.options(ani.height = 480, ani.width = 600, outdir = getwd(), nmax =100,
interval = 0.1, title = "Demonstration of the Law of Large Numbers",
description = "The sample mean approaches to the population mean as
the sample size n grows.")
ani.start()
par(mar = c(3, 3, 1, 0.5), mgp = c(1.5, 0.5, 0))
lln.ani(FUN = rnorm, mu = 0, np = 50, pch = 20, col.poly = "grey")
ani.stop()
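If the animation package is not available, the same idea can be sketched with base R only. This is an alternative illustration, not the lecturer's script: it draws 100 values from a normal population with mean 0 and SD 1 and plots the running sample mean.

```r
set.seed(6)
n_max <- 100
x <- rnorm(n_max, mean = 0, sd = 1)          # population with mean = 0, SD = 1

## Sample mean of the first 1, 2, ..., n_max observations
running_mean <- cumsum(x) / seq_len(n_max)

plot(seq_len(n_max), running_mean, type = "l",
     xlab = "n", ylab = "sample mean")
abline(h = 0, lty = 2)                       # population mean
```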
Basic concepts: Uncertainty
Standard error (SE)
$$SE = \frac{SD}{\sqrt{n}}$$
SE is simply the SD of the probability distribution of a specific statistic.
E.g. SE of the mean
Confidence intervals (CI)
CI is an interval estimate of a population parameter. How likely the interval is to contain the parameter is determined by the confidence level (95%)
t distribution (n<30)
$$CI_{0.975} = \text{mean} + t_{0.975,df} \times SE$$
$$CI_{0.025} = \text{mean} + t_{0.025,df} \times SE$$
Normal distribution (n>30)
$$CI_{0.975} = \text{mean} + z_{0.975} \times SE$$
$$CI_{0.025} = \text{mean} + z_{0.025} \times SE$$
(the 0.025 quantiles are negative, so CI 0.025 is the lower limit)
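The SE and the 95% confidence interval of a mean can be computed step by step and compared with the interval reported by t.test(). The sample is simulated; its size, mean and SD are assumptions for illustration.

```r
set.seed(7)
y  <- rnorm(20, mean = 40, sd = 5)   # hypothetical sample (e.g. milk production)
n  <- length(y)
se <- sd(y) / sqrt(n)                # standard error of the mean

## 95% CI based on the t distribution with n - 1 degrees of freedom
ci_t <- mean(y) + qt(c(0.025, 0.975), df = n - 1) * se
ci_t

## Large-sample version based on the normal distribution
ci_z <- mean(y) + qnorm(c(0.025, 0.975)) * se
ci_z

## t.test() reports the same t-based interval
t.test(y)$conf.int
```

For a small sample the t-based interval is wider than the normal-based one, because the t quantiles are larger in absolute value.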
Basic concepts
DEGREES OF FREEDOM (df)
The number of INDEPENDENT measurements minus the number of parameters estimated from the data
The df do not correspond to the total number of single measurements but must be computed on our replicates
The number of df defines the scale of our analyses!!!
Look at the error df to spot pseudo-replication
Basic concepts: Why distributions?
We use distributions in statistics mainly in 2 ways:
1. To model our response variable we need to know its distribution (assumptions)
2. Once we know the distribution we can run a statistical test
(there are loads of tests: F test, t test etc.)
No matter what test we do, the principle is always the same:
-> Calculate a test statistic (e.g. z, t, F) which follows a DEFINED DISTRIBUTION
-> Look up the critical value related to our level of significance
-> Compare the calculated value with the critical value
-> We can then associate a PROBABILITY to our decision
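The sequence "test statistic, critical value, comparison, probability" can be sketched for a one-sample t-test. The eight values and the null value mu0 = 0 are assumptions chosen only to make the steps concrete.

```r
y   <- c(1.2, 0.4, -0.3, 2.1, 1.5, 0.9, 1.8, 0.2)   # hypothetical measurements
mu0 <- 0                                            # value of the mean under H0
n   <- length(y)

## Test statistic: follows a t distribution with n - 1 df if H0 is true
t_obs <- (mean(y) - mu0) / (sd(y) / sqrt(n))

## Critical value for a two-sided test at the 5% significance level
t_crit <- qt(0.975, df = n - 1)

## Compare the calculated value with the critical value
abs(t_obs) > t_crit

## Probability associated with the decision (two-sided P-value)
p_value <- 2 * pt(-abs(t_obs), df = n - 1)
p_value
```

The built-in t.test(y, mu = 0) performs the same calculation in one call.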
Basic concepts: Normal distribution
Normal standardized (mean=0, sd=1)
$$z = \frac{y - \text{mean}}{sd}$$
$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$
E.g. Suppose we have measured the heights of 100 people.
The mean height was 170 cm and the sd was 8 cm. What is the probability that a randomly selected person will be:
• shorter than a particular height?
• taller than a particular height?
• between one specified height and another?
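These three questions can be answered with pnorm(), the cumulative normal distribution function in R. The mean (170 cm) and sd (8 cm) come from the example; the particular heights 160, 185, 165 and 180 cm are assumptions picked for illustration.

```r
m <- 170   # mean height (cm)
s <- 8     # standard deviation (cm)

## Shorter than 160 cm
p_shorter <- pnorm(160, mean = m, sd = s)

## Taller than 185 cm
p_taller <- pnorm(185, mean = m, sd = s, lower.tail = FALSE)

## Between 165 and 180 cm
p_between <- pnorm(180, mean = m, sd = s) - pnorm(165, mean = m, sd = s)

c(shorter = p_shorter, taller = p_taller, between = p_between)

## Same first answer via the standardized value z = (y - mean) / sd
z <- (160 - m) / s
pnorm(z)
```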
Basic concepts: Poisson distribution
The Poisson distribution describes the number of events observed when a very large number of individually unlikely events can happen (count data)
Non-negative integer values
Right skewed
1 Parameter: λ (mean = variance)
Use: count data
[Figure: sample from a Poisson distribution (n=1000, mean=variance=0.2)]
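A sample like the one described on the slide (n = 1000, λ = 0.2) can be drawn with rpois() to check that the mean and the variance are close to each other. The seed is arbitrary.

```r
set.seed(8)
counts <- rpois(1000, lambda = 0.2)   # 1000 counts with mean = variance = 0.2

table(counts)    # mostly zeros: strongly right skewed
mean(counts)     # close to 0.2
var(counts)      # close to the mean

## Theoretical probability of observing a zero
dpois(0, lambda = 0.2)
```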
Basic concepts: Binomial distribution
The binomial distribution describes the number of successes in a finite series of independent Yes/No experiments.
2 Parameters: sample size (number of trials), probability of success
Use: proportion data and power analysis
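The two parameters can be explored with dbinom() and rbinom(). The values used here (10 trials, probability of success 0.3) are assumptions for the example.

```r
n_trials <- 10    # sample size (number of Yes/No experiments)
p        <- 0.3   # probability of success in each experiment

## Probability of exactly 3 successes out of 10
dbinom(3, size = n_trials, prob = p)

## Probabilities of 0, 1, ..., 10 successes sum to 1
probs <- dbinom(0:n_trials, size = n_trials, prob = p)
sum(probs)

## Simulate 1000 series and look at the observed proportion of successes
set.seed(9)
successes <- rbinom(1000, size = n_trials, prob = p)
mean(successes) / n_trials   # close to p
```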
What’s R?
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
R has a home page at http://www.R-project.org/.
It is free software distributed under a GNU-style copyleft, and an official part of the GNU project (“GNU S”).
Why R?
The benefits of R are:
+ R is free. R is open-source and runs on UNIX, Windows and Macintosh
+ R has an excellent built-in help system
+ R has excellent graphing capabilities
+ Students can easily migrate to the commercially supported S-Plus
program if commercial software is desired
+ R's language has a powerful, easy to learn syntax with many built-in
statistical functions
+ The language is easy to extend with user-written functions
+ R is a computer programming language
What is R lacking compared to other software solutions?
- It has a limited graphical interface (S-Plus has a good one). This means it can be harder to learn at the outset.
- There is no commercial support. (Although one can argue the international mailing list is even better)
- The command language is a programming language, so students must learn to appreciate syntax issues etc.
Appendix 1
Box-plot explanation