Introduction to Bioinformatics 1. Course Overview

4. Functional Genomics and
Microarray Analysis (1)
Version 1.0 – 19 Jan 2009
Background
Functional Genomics

Functional Genomics:
– Systematic analysis of gene activity in healthy and diseased tissues.
– Obtaining an overall picture of genome functions, including the expression profiles at the mRNA level and the protein level.

Functional Genome Analysis:
– Used to understand the functions of genes and proteins in an organism. This is typically known as genome annotation.
– Used in integrative biology and systems biology studies aiming to understand health and disease states (e.g. cancer, obesity, etc.).
– Used as an important step in the search for new target molecules in the drug discovery process (which genes and proteins to target, and how).

What is…?

DNA sequence (split into genes) codes for Amino Acid Sequence, which folds into Protein; Protein has 3D Structure, which dictates Protein Function, which determines Cell Activity.

Gene Expression:
– The process by which the information encoded in a gene is converted into an observable phenotype (most commonly production of a protein).
– The degree to which a gene is active in a certain tissue of the body, measured by the amount of mRNA in the tissue.

Microarrays:
– Tools used to measure the presence and abundance of gene expression (measured as mRNA) in tissue.
– Microarray technologies provide a powerful tool by which the expression patterns of thousands of genes can be monitored simultaneously and measured quantitatively.

Applications of
Microarray Technology

Applications are covered only as example contexts; the emphasis is on analysis methods
– Identify genes expressed in different cell types (e.g. liver vs kidney)
– Learn how expression levels change in different developmental stages (embryo vs. adult)
– Learn how expression levels change in disease development (cancerous vs non-cancerous)
– Learn how groups of genes inter-relate (gene-gene interactions)
– Identify cellular processes that genes participate in (structure, repair, metabolism, replication, etc.)

Microarrays
Basic Idea

Affymetrix Inc. is the leading provider of Microarray technology (GeneChip®): http://www.affymetrix.com/

A Microarray is a device that detects the presence and abundance of labelled nucleic acids in a biological sample.

In the majority of experiments, the labelled nucleic acids are derived from the mRNA of a sample or tissue.

The Microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at special locations.
– Each array location is typically known as a probe and contains many replicates of the same molecule.
– The molecules in each array location are carefully chosen so as to hybridise only with mRNA molecules corresponding to a single gene.

Several companies sell equipment to make DNA chips, including
spotters to deposit the DNA on the surface and scanners to detect
the fluorescent or radioactive signals.
Basic Idea

A Microarray works by exploiting the ability of a given mRNA
molecule to bind specifically to, or hybridize to, the DNA template
from which it originated.

By using an array containing many DNA samples, scientists can
determine, in a single experiment, the expression levels of
hundreds or thousands of genes within a cell by measuring the
amount of mRNA bound to each site on the array.

With the aid of a computer, the amount of mRNA bound to the
spots on the Microarray is precisely measured, generating a
profile of gene expression in the cell.
Microarray Process

The molecules in the target biological sample are labelled using a fluorescent dye before the sample is applied to the array
– If a gene is expressed in the sample, the corresponding mRNA hybridises with the molecules on a given probe (array location).
– If a gene is not expressed, no hybridisation occurs on the corresponding probe.

Reading the array output
– After the sample is applied, a laser light source is applied to the array.
– The fluorescent label enables the detection of which probes have hybridised (presence) via the light emitted from the probe.
– If a gene is highly expressed, more mRNA exists and thus more mRNA hybridises to the probe molecules; this abundance is detected via the intensity of the light emitted.

The array
Chemistry Basics:
– Surface chemistry is used to attach the probe molecules to the glass substrate.
– Chemical reactions are used to attach the fluorescent dyes to the target molecules.
– Probe and target hybridise to form a double helix.

Affymetrix GeneChip
Example of Single Label Chips

Hundreds of thousands of oligonucleotide probes are packed at extremely high densities. The probes are designed to maximize sensitivity, specificity, and reproducibility, allowing consistent discrimination between specific and background signals, and between closely related target sequences.

RNA is labeled and scanned in a single “color”, one sample per chip.

From Microarray images to Gene Expression Matrices

[Figure: data flow from raw data (array scans, images) through intermediate data (spot/image quantitations, one value per spot) to final data (the gene expression matrix of gene expression levels, genes by samples)]

Steps of a Microarray Experiment

– Biological question
– Experimental design (sample attributes, platform choice)
– Microarray experiment
– Image analysis (16-bit TIFF files, quantify the dots)
– Normalization
– Analysis: clustering, statistical analysis, data mining, pattern discovery, classification
– Biological verification and interpretation

Qualitative Interpretation of Reads

GREEN represents High Control hybridization
RED represents High Sample hybridization
YELLOW represents a combination of Control and Sample
where both hybridized equally.
BLACK represents areas where neither the Control nor
Sample hybridized.

Main issue is to quantify the results:
– How green is green?
– What is the ratio of the signal to background noise?
– How to compare multiple experiments using different chips?
– How to quantify cross hybridization (if any)?

Normalization

Normalisation is a general term for a collection of methods that are
directed at reasoning about and resolving the systematic errors
and bias introduced by microarray experimental platforms

Normalisation methods stand in contrast with the data analysis
methods described in other lectures (e.g. differential gene
expression analysis, classification and clustering).

Our overall aim is to be able to quantify measured/calculated variability, differentials and similarity:
– Are they biologically significant or just side effects of the experimental platforms and conditions?

Why Normalization
Sources of Microarray Data Variability

The measured gene expression in any experiment includes true gene expression, together with contributions from many sources of variability.

There are several levels of variability in measured gene expression of a feature.
At the highest level, there is biological variability in the population from which the sample derives.
At an experimental level, there is
– variability between preparations and labelling of the sample,
– variability between hybridisations of the same sample to different arrays, and
– variability between the signal on replicate features on the same array.

[Figure: true gene expression of individual, plus variability between individuals, between sample preparations, between arrays and hybridisations, and between replicate features, giving the measured gene expression]

Normalisation Examples
Probe Intensity Value

Typical Problem: Usually more variability at low intensity

The raw intensities of signal from each spot on the array are not directly
comparable. Depending on the types of experiments done, a number of
different approaches to normalization may be needed. Not all types of
normalization are appropriate in all experiments. Some experiments may
use more than one type of normalization.

Reasonable Assumption: intensities of fluorescent molecules reflect the
abundance of the mRNA molecules – generally true but could be
problematic

Example:
– intensity of gene A spot is 100 units in normal-tissue array
– intensity of gene A spot is 50 units in cancer-tissue array
– Conclusion: gene A’s expression level in normal tissue is significantly higher than in cancer tissue

Normalisation Examples
Probe Intensity Value

[Figure: images showing examples of how background intensity can be calculated]

Problem? What if the overall background intensity of the normal-tissue array is 95 units while the background intensity of the cancer-tissue array is 10 units?

Solutions:
– Subtract background intensity value
– Take ratio of spot intensity to background intensity (preferable)
– In both cases have to decide where to measure background intensity (e.g. local to spot or globally per chip)

In general, there could be many factors contributing to the background intensity of a microarray chip
– To compare microarray data across different chips, data (intensity levels) need to be normalized to the “same” level

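The gene A example can be checked numerically. The following is a minimal sketch, assuming a single global background value per array; the function name `background_corrected` is hypothetical and only illustrates the two correction options listed on the slide.

```python
def background_corrected(spot, background):
    """Return (difference, ratio) of a spot intensity relative to background."""
    if background <= 0:
        raise ValueError("background intensity must be positive to form a ratio")
    return spot - background, spot / background

# Gene A from the slides: (spot intensity, background intensity) per array.
normal = background_corrected(100, 95)  # normal-tissue array
cancer = background_corrected(50, 10)   # cancer-tissue array

print("normal: difference, ratio =", normal)
print("cancer: difference, ratio =", cancer)
```

With either correction the cancer-tissue signal now stands out more above its background than the normal-tissue signal does, which reverses the naive conclusion drawn from the raw intensities.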
Differential Gene Expression Analysis

Consider a microarray experiment
– that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).
– The rat tissues come from two groups:
  • WT: Wild-Type rat tissue,
  • KO: Knock Out Treatment rat tissue
– Gene expression for each group is measured under similar conditions
– Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?

Calculating Expression Ratios

In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
– Tumour cell lines vs. Normal cell lines
– Treated tissue vs. diseased tissue
– Different tissues, same organism
– Same tissue, different organisms
– Same tissue, same organism
– Time course experiments

We can quantify the difference (effect) by taking a ratio:

$$R_k = \frac{E_{ka}}{E_{kb}}$$

i.e. for gene k, this is the ratio of expression in state a to expression in state b
– This provides a relative value of change (e.g. expression has doubled)
– If the expression level has not changed, the ratio is 1

Fold change (Fold ratio)

A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2.
A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2.

Ratios are troublesome since
– Up-regulated & down-regulated genes are treated differently
  • Genes up-regulated by a factor of 2 have a ratio of 2
  • Genes down-regulated by the same factor (2) have a ratio of 0.5
– As a result
  • down-regulated genes are compressed between 1 and 0
  • up-regulated genes expand between 1 and infinity

Using a logarithmic transform to the base 2 rectifies the problem; this is typically known as the fold change:

$$F_k = \log_2(R_k) = \log_2\left(\frac{E_{ka}}{E_{kb}}\right) = \log_2(E_{ka}) - \log_2(E_{kb})$$

Examples of fold change

Gene ID   Expression in state 1   Expression in state 2   Ratio   Fold Change
A         100                     50                      2       1
B         10                      5                       2       1
C         5                       10                      0.5     -1
D         200                     1                       200     7.64
E         10                      10                      1       0

Here Ratio = (expression in state 1) / (expression in state 2) and Fold Change = log2(Ratio).
A, B and D are down-regulated in state 2 compared to state 1; C is up-regulated; E has no change.

You can calculate fold change between pairs of expression values:
– e.g. between State 1 and State 2 for gene A
– or between the mean values of all measurements for a gene in the WT/KO experiments
  • mean(WT1..WT4) vs mean(KO1..KO4)

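The ratio and fold change columns can be reproduced with a few lines of Python. This is a minimal sketch; the function names `ratio` and `fold_change` are illustrative, and state 1 is taken as state a and state 2 as state b, which is what the table's ratio column implies.

```python
import math

def ratio(e_a, e_b):
    """Expression ratio R_k = E_ka / E_kb."""
    return e_a / e_b

def fold_change(e_a, e_b):
    """Fold change F_k = log2(E_ka) - log2(E_kb)."""
    return math.log2(e_a) - math.log2(e_b)

# (expression in state 1, expression in state 2) for each gene in the table.
genes = {"A": (100, 50), "B": (10, 5), "C": (5, 10), "D": (200, 1), "E": (10, 10)}

table = {g: (ratio(a, b), round(fold_change(a, b), 2)) for g, (a, b) in genes.items()}
for gene, (r, f) in table.items():
    print(gene, r, f)
```

Note how genes A and C, which change by the same factor in opposite directions, get fold changes of equal size and opposite sign, which is the symmetry the log transform is meant to provide.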
Statistics
Significance of Fold Change

For our problem we can calculate an average fold ratio for each gene (each row)

This will give us an average effect value for each gene
– 2, 1.7, 10, 100, etc

Question: which of these values are significant?
– Can use a threshold, but what threshold value should we set?
– Use statistical techniques based on number of members in each group, type of measurements, etc. -> significance testing.

Statistics
Unpaired statistical experiments

[Figure: Condition applied to Group 1 members; Condition applied to Group 2 members]

Overall setting: 2 groups of 4 individuals each
– Group1: Imperial students
– Group2: UCL students

Experiment 1:
– We measure the height of all students
– We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant

Experiment 2:
– We measure the weight of all students
– We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant

Experiment 3:
– ………

Statistics
Unpaired statistical experiments

[Figure: Condition applied to Group 1 members; Condition applied to Group 2 members]

In unpaired experiments, you typically have two groups of people that are not related to one another, and measure some property for each member of each group

e.g. you want to test whether a new drug is effective or not, you divide similar patients in two groups:
– One group takes the drug
– Another group takes a placebo
– You measure (quantify) the effect in both groups some time later

You want to establish whether there is a significant difference between both groups at that later point

The WT/KO example is an unpaired experiment if the rats in the experiments are different!

Statistics
Unpaired statistical experiments

The WT/KO example is an unpaired experiment if the rats in the experiments are different!

Experiments for WT rats and for KO rats for Gene 96608_at:

Rat #   WT gene expression   Rat #   KO gene expression
WT1     100                  KO1     150
WT2     100                  KO2     300
WT3     200                  KO3     100
WT4     300                  KO4     300

Statistics
Unpaired statistical experiments

How do we address the problem?
Compare two sets of results (alternatively calculate mean for each group and compare means)

Graphically:
– Scatter Plots
– Box plots, etc

Compare Statistically
– Use unpaired t-test

[Figure: two plots of the two series, each captioned "Are these two series significantly different?"]

Statistics
Paired statistical experiments

[Figure: Condition 1 and Condition 2 applied to the same group members]

In paired experiments, you typically have one group of people, and you typically measure some property for each member before and after a particular event (so measurements come in pairs of before and after)

e.g. you want to test the effectiveness of a new cream for tanning
– You measure the tan in each individual before the cream is applied
– You measure the tan in each individual after the cream is applied

You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole

Statistics
Paired statistical experiments

The WT/KO example is a paired experiment if the rats in the experiments are the same!

Experiments for Gene 96608_at:

Rat #   WT gene expression   KO gene expression
Rat1    100                  200
Rat2    100                  300
Rat3    200                  400
Rat4    300                  500

Statistics
Paired statistical experiments

How do we address the problem?
– Calculate difference for each pair
– Compare differences to zero
– Alternatively, compare the average difference to zero

Graphically:
– Scatter plot of differences
– Box plots, etc

Statistically
– Use paired t-test

[Figure: two plots of the pairwise differences, captioned "Are differences close to Zero?"]

Statistics
Significance testing



In both cases (paired and unpaired) you want to establish whether the difference is significant.

Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance.

To do this, you need to review some basic statistics
– Normal distributions: mean, standard deviations, etc
– Hypothesis Testing
– t-distributions
– t-tests and p-values

Mean and standard deviation

Mean and standard deviation tell you the basic features of a distribution.

mean = average value of all members of the group

$$u = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

standard deviation = a measure of how much the values of individual members vary in relation to the mean

$$\text{s.d.} = \sqrt{\frac{(x_1 - u)^2 + (x_2 - u)^2 + (x_3 - u)^2 + \dots + (x_N - u)^2}{N}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - u)^2}$$

The normal distribution is symmetrical about the mean.
68% of the normal distribution lies within 1 s.d. of the mean.

[Figure: normal curve with 68% of the distribution lying within 1 s.d. either side of the mean]

Note on s.d. calculation

Through the following slides and in the tutorials, I use the following formula for calculating standard deviation:

$$\text{s.d.} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - u)^2}$$

Some people use the unbiased form below (for good reasons):

$$\text{s.d.} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - u)^2}$$

Please use the simple form if you want the answers to add up at the end.

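The two forms can be compared directly with Python's standard `statistics` module, where `pstdev` divides by N (the simple form used in these slides) and `stdev` divides by N - 1 (the unbiased form). This is a small sketch; the choice of the WT values for gene 96608_at as sample data is only for illustration.

```python
import statistics

# WT expression values for gene 96608_at, used here purely as sample data.
values = [100, 100, 200, 300]

simple = statistics.pstdev(values)   # divides by N (form used in the slides)
unbiased = statistics.stdev(values)  # divides by N - 1

print("mean:", statistics.mean(values))
print("s.d. (divide by N):", round(simple, 2))
print("s.d. (divide by N - 1):", round(unbiased, 2))
```

The unbiased form is always the larger of the two, and the gap is most noticeable for small groups such as the four-member groups used in these examples.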
The Normal Distribution

Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in:
• The x-axis represents the values of a particular variable
• The y-axis represents the proportion of members of the population that have each value of the variable
• The area under the curve represents probability, i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range

[Figure: normal curve with 68% of the distribution lying within 1 s.d. either side of the mean]

Hypothesis Testing: Are two data sets different?

We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same).

We pose a null hypothesis (Ho) that the means are equal.

We try to refute the hypothesis using the curves to calculate the probability (p) of observing a difference at least as large as the one measured if the null hypothesis were true (both means are equal)
– If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (Ha: the means are different)
– If the probability is high (high p), do not reject the null hypothesis (the data are consistent with equal means)

[Figure: distributions of Population 1 and Population 2 under Ho and under Ha]

If the standard deviation is known use the z-test, else use the t-test.

Comparing Two Samples
Graphical interpretation

To compare two groups you can compare the mean of one group with the mean of the other graphically.

The graphical comparison allows you to visually see the distribution of the two groups.

If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.

We can set a critical value for the x-axis based on the threshold of p-value.

t-test terminology

t-test: Used to compare the mean of a sample to a known number.

Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.

Test: The hypotheses for a single sample t-test are:
– Ho: u = u0
– Ha: u <> u0
(where u0 denotes the hypothesized value to which you are comparing a population mean)

p-value: probability of error in rejecting the hypothesis of no difference between the two groups.

t-Tests Intuitively
t-test terminology
Unpaired vs. paired t-test

Same as before!! Depends on your experiment

Unpaired t-test: The hypotheses for the comparison of two independent groups are:
– Ho: u1 = u2 (means of the two groups are equal)
– Ha: u1 <> u2 (means of the two groups are not equal)

Paired t-test: The hypotheses for paired measurements in the same individuals are:
– Ho: D = 0 (the difference between the two observations is 0)
– Ha: D <> 0 (the difference is not 0)

Calculating t-test (t statistic)
Remember these formulae!!

First calculate the t statistic value and then calculate the p value.

For the paired t-test, t is calculated using the following formula:

$$t = \frac{\operatorname{mean}(d)}{\sigma(d) / \sqrt{n}}$$

where d is calculated by $d_i = x_i - y_i$ and n is the number of pairs being tested.

For an unpaired (independent group) t-test, the following formula is used:

$$t = \frac{\operatorname{mean}(x) - \operatorname{mean}(y)}{\sqrt{\dfrac{\sigma^2(x)}{n(x)} + \dfrac{\sigma^2(y)}{n(y)}}}$$

where σ(x) is the standard deviation of x and n(x) is the number of elements in x.

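Both formulae translate directly into code. The sketch below is one possible reading, assuming the simple (divide by N) standard deviation that the slides ask for; the function names `paired_t` and `unpaired_t` are illustrative, and the data are the WT/KO values for gene 96608_at from the earlier tables.

```python
import math
import statistics

def paired_t(x, y):
    """t = mean(d) / (sd(d) / sqrt(n)), with d_i = x_i - y_i and divide-by-N s.d."""
    d = [xi - yi for xi, yi in zip(x, y)]
    sd = statistics.pstdev(d)
    if sd == 0:
        raise ValueError("all differences are identical, so t is not finite")
    return statistics.mean(d) / (sd / math.sqrt(len(d)))

def unpaired_t(x, y):
    """t = (mean(x) - mean(y)) / sqrt(var(x)/n(x) + var(y)/n(y)), divide-by-N variance."""
    spread = statistics.pvariance(x) / len(x) + statistics.pvariance(y) / len(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(spread)

# Paired table (same rats measured twice) and unpaired table (different rats).
t_paired = paired_t([100, 100, 200, 300], [200, 300, 400, 500])
t_unpaired = unpaired_t([100, 100, 200, 300], [150, 300, 100, 300])

print("paired t:", round(t_paired, 2))
print("unpaired t:", round(t_unpaired, 2))
```

The sign of t only reflects the order of subtraction (WT minus KO here); its magnitude is what gets compared against the t-distribution. Statistics packages normally use the N - 1 form of the standard deviation, so their t values will differ slightly from these.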
Calculating p-value for t-test

When carrying out a test, a P-value can be calculated based on the t-value and the ‘Degrees of freedom’.

There are three methods for calculating P:
– One Tailed >: $P = p(t, v) / 2$
– One Tailed <: $P = 1 - p(t, v) / 2$
– Two Tailed: $P = p(t, v)$

Where p(t,v) is looked up from the t-distribution table

The number of degrees of freedom (v) is calculated as:
– Unpaired: n(x) + n(y) - 2
– Paired: n - 1 (where n is the number of pairs)

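Instead of a table lookup, p(t, v) can be approximated numerically. The sketch below is an assumption-laden illustration, not the method used in the course: it integrates the standard Student t density with Simpson's rule, and the names `t_pdf` and `two_tailed_p` are hypothetical.

```python
import math

def t_pdf(t, v):
    """Density of the Student t-distribution with v degrees of freedom."""
    coef = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coef * (1 + t * t / v) ** (-(v + 1) / 2)

def two_tailed_p(t, v, steps=10000):
    """Approximate p(t, v) = P(|T| > |t|) using Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    central = total * h / 3  # probability of a value between 0 and |t|
    return 1 - 2 * central

v = 10
p_two = two_tailed_p(2.228, v)      # two-tailed: P = p(t, v)
p_one = two_tailed_p(1.812, v) / 2  # one-tailed (>): P = p(t, v) / 2

print("two-tailed p for t = 2.228:", round(p_two, 3))
print("one-tailed p for t = 1.812:", round(p_one, 3))
```

Both values come out close to 0.05, matching the standard t-table entries for 10 degrees of freedom. In practice a statistics library would be used for this step.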
p-values

Results of the t-test: If the p-value associated with the t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.

In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis: the data are consistent with the mean being equal to the hypothesized value, although this does not prove that it is.

t-value and p-value

Given a t-value and degrees of freedom, you can look up a p-value.

Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for critical t.


Finding a critical t

The table provides the t values (tc) for which P(tx > tc) = A

Degrees of Freedom   t.100   t.05    t.025    t.01     t.005
1                    3.078   6.314   12.706   31.821   63.657
2                    1.886   2.92    4.303    6.965    9.925
.                    .       .       .        .        .
.                    .       .       .        .        .
10                   1.372   1.812   2.228    2.764    3.169
.                    .       .       .        .        .
.                    .       .       .        .        .
200                  1.286   1.653   1.972    2.345    2.601
∞                    1.282   1.645   1.96     2.326    2.576

Example: for 10 degrees of freedom and A = .05, tc = 1.812 (and, by symmetry, -tc = -1.812).

Meaning of t-value
High t-value

Take Gene A, assuming paired test:

Gene   R1a   R2a   R3a   R4a   R1b   R2b   R3b   R4b
A      10    20    30    40    110   120   130   140

For either type of test:
– Average Difference = 100, SD = 0
– t value is near infinity, p is extremely low

$$t = \frac{\operatorname{mean}(d)}{\sigma(d) / \sqrt{n}}$$

where d is calculated by $d_i = x_i - y_i$

Consider Gene M for a paired experiment

Gene   R1a   R2a   R3a   R4a   R1b   R2b   R3b   R4b
M      10    20    30    40    10    20    30    40

– Average Difference = 0
– t value is zero, what does this mean?

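$$t = \frac{\operatorname{mean}(d)}{\sigma(d) / \sqrt{n}}$$

where d is calculated by $d_i = x_i - y_i$

Consider Gene T for a paired experiment

Gene   R1a   R2a   R3a   R4a   R1b   R2b   R3b   R4b
T      11    19    32    39    110   120   130   140

Taking the differences as b minus a (99, 101, 98, 101):

$$\text{Average Change} = \frac{99 + 101 + 98 + 101}{4} = 99.75$$

$$\text{SD} = \sqrt{\frac{(99 - 99.75)^2 + (101 - 99.75)^2 + (98 - 99.75)^2 + (101 - 99.75)^2}{4}} \approx 1.30$$

$$t = \frac{99.75}{1.30 / \sqrt{4}} \approx 154$$
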
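The Gene T arithmetic can be reproduced step by step. This is a minimal sketch that mirrors the slide: differences are taken as b minus a, and the standard deviation divides by N as in the rest of the course.

```python
import math

# Gene T measurements from the slide (state a, then state b, for rats R1..R4).
a = [11, 19, 32, 39]
b = [110, 120, 130, 140]

d = [bi - ai for ai, bi in zip(a, b)]  # pairwise differences, b minus a
n = len(d)
mean_d = sum(d) / n
sd_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / n)  # divide-by-N s.d.
t = mean_d / (sd_d / math.sqrt(n))

print("differences:", d)
print("average change:", mean_d)
print("SD:", round(sd_d, 2))
print("t:", round(t, 1))
```

A large, consistent change across all four rats gives a small SD relative to the mean difference, which is what drives t so high. For Gene A the differences are all identical, so the SD is exactly zero and the division cannot be carried out, which is the "near infinity" case on the earlier slide.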
Hypothesis Testing

Uses hypothesis testing methodology.

For each gene (>5,000)
– Pose Null Hypothesis (Ho) that gene is not affected
– Pose Alternative Hypothesis (Ha) that gene is affected
– Use statistical techniques to calculate the probability of rejecting the hypothesis (p-value)
– If p-value < some critical value, reject Ho and accept Ha

The issues:
– Large number of genes (or experiments)
– Need quick way to filter out significant genes that have high fold change
– Need also to sort genes by fold change and significance

Volcano Plots
A visual approach

Volcano plots are a graphical means for visualising the results of large numbers of t-tests, allowing us to plot both the effect and the significance of each test in an easy to interpret way.

– For each gene, compare the value of the effect between population WT vs. KO (fold change)
– For each gene, calculate the significance of the change (t-test, p-value)
– Volcano Plot: identify genes with high effect and high significance

Volcano plots

In a volcano plot:

The x-axis represents the effect, measured as fold change:

$$\text{Effect} = \log_2(WT) - \log_2(KO) = \log_2(WT / KO)$$

If WT = KO, Effect Fold Change = 0; if WT = 2 KO, Effect Fold Change = 1; ...

The y-axis represents the number of zeroes in the p-value.

Numerical Interpretation (Significance)

Using log10 for the Y axis:
– p < 0.1 (1 decimal place)
– p < 0.01 (2 decimal places)
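
The coordinates of one point on a volcano plot can be sketched as follows. This assumes the usual convention of plotting -log10(p) on the y-axis, which matches the "number of zeroes" reading above; the function name `volcano_point` and the example values are hypothetical.

```python
import math

def volcano_point(wt, ko, p):
    """Return (x, y) for one gene: x = log2(WT) - log2(KO), y = -log10(p)."""
    effect = math.log2(wt) - math.log2(ko)
    significance = -math.log10(p)
    return effect, significance

# Hypothetical gene: WT expression is twice KO expression, with p = 0.01.
x, y = volcano_point(200, 100, 0.01)
print("effect (fold change):", round(x, 2))
print("significance (-log10 p):", round(y, 2))
```

Genes of interest are those far from zero on the x-axis (large effect) and high on the y-axis (small p-value), which is where the upper left and upper right arms of the plot lie.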