Week 7: Bioinformatics Module Microarrays

Beyond the Human Genome:
Transcriptomics
Dr Jen Taylor
Henry Wellcome Centre for Gene Function
Bioinformatics
Department of Statistics
Beyond the Human Genome:
1995 - Human Genome sequencing begins in earnest: “Mapping the Book of Life”
1999 - Human Genome = approx. 140,000 genes
2000 - First Draft: Human Genome = 30,000 – 40,000 genes ??
2003 - Essential Completion: Human Genome = 24,195 genes !!!???
Commemorative stained glass window for F.C. Crick, designed by
Maria McClafferty.(Photograph: Paul Forster)
Gonville & Caius College, Cambridge, UK.
Beyond the Human Genome:
Gene Number ≠ Complexity
Complexity: Gene Regulation, Transcriptome
Introduction:
The scope of transcriptomics – a definition of the transcriptome
Part I: Observing the transcriptome
Experimental methodology
Data analysis
Part II: Using the transcriptome
The regulation of the transcriptome
The transcriptome and the genome
The transcriptome and the proteome
Beyond the Human Transcriptome
Transcriptome:
“transcriptome, the mRNAs expressed by a genome at any given time..”
(Abbott, 1999)
Central Dogma of Molecular Biology
mRNA:
- Single-stranded RNA molecule
- Complementary to DNA
- Processed (spliced and polyadenylated) RNA transcript
- Carries the sequence of a gene out of the nucleus into the cytoplasm, where it can be translated into a protein
Image: Access Excellence, National Institutes of Health
Transcriptome: An evolving definition
- (the population of) mRNAs expressed by a genome at any given time (Abbott, 1999)
- The complete collection of transcribed elements of the genome (Affymetrix, 2004)
  - mRNAs: 35,913 transcripts (including alternatively spliced variants)
  - Non-coding RNAs
    - tRNAs (497 genes)
    - rRNAs (243 genes)
    - snmRNAs (small non-messenger RNAs)
    - microRNAs and siRNAs (small interfering RNAs)
    - snoRNAs (small nucleolar RNAs)
    - snRNAs (small nuclear RNAs)
  - Pseudogenes (~2,000)
The human transcriptome
High-density oligonucleotide arrays across 11 different cell lines
~70% of transcripts non-coding
~79-88% have multiple transcripts
Kapranov et al., 2002
~90% of transcribed nucleotides outside annotated exons
The dimensions of the unique transcriptome?? >>> current 40,000 estimate
Kampa et al., Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Research. 2004
Transcriptomics
Scope
- the population of functional RNA transcripts
- the mechanisms that regulate the production of RNA transcripts
- dynamics of the transcriptome (time, cell type, genotype, external stimuli)
Definition
The study of the characteristics and regulation of the functional RNA transcript population of a cell, cells or organism at a specific time.
Observing the transcriptome
High-throughput friendly
Genome
Predicts Biology
**
Regulatory
network
Transcriptome
Context dependent and dynamic
Proteome
**Li et al., 2004
Publications: Expression Profiling vs Proteomics
[Chart: publications per year, 1995 to 2003, for Expression Profiling and Proteomics; y-axis from 0 to 3500. Data from PubMed]
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Stanford University Medical Center, CA.
“The challenge is no longer in the expression arrays themselves, but in developing experimental designs to exploit the full power of a global perspective.” Eric Lander
Observing the transcriptome?
Classic Human Transcriptome Profiling Studies:
Transcriptome reflects Biology
Golub et al.,
Molecular classification of cancer: class
discovery and class prediction by gene
expression monitoring. Science 1999.
ALL – acute lymphoblastic leukemia
AML – acute myeloid leukemia
Scherf et al.,
A gene expression database for the
molecular pharmacology of cancer. Nature
Genetics 2000
60 human cancer cell lines
Observing the transcriptome
Focussed Experimental Approaches:
- Northern Blotting Analysis
- Real-time PCR (quantitative or semi-quantitative)
High-throughput Approaches:
- Closed System Profiling:
  Microarray expression profiling
- Open System Profiling:
  Serial analysis of gene expression (SAGE)
  Massively Parallel Signature Sequencing (MPSS)
Experimental overview (in order):
- Cell population A and cell population B
- RNA extraction
- Reverse transcription
- Klenow label incorporation: sample A labelled with Cy5 dye, sample B labelled with Cy3 dye
- Hybridisation
- Washing
- Scan Cy5 channel and Cy3 channel
- “Overlay images”
- Quantify pixel intensities
Red – increase of Cy5 sample transcripts
Green – increase of Cy3 sample transcripts
Yellow – equal abundance
Limit of Detection: 1 in 30,000 transcripts
~ 20 transcripts/cell
Platforms and Formats
- Isotope
  - Nylon – cDNA (300-900 nt)
- Two-colour
  - Glass
  - cDNA or oligo (80 nt)
  - 500 – 11,000 elements
- Affymetrix
  - Silicon – oligo (20 nt)
  - 22,000 elements
- Tissue Arrays
  - Glass
  - Tissue discs (20-150)
Affymetrix GeneChip®
Limit of detection: 1 in 100,000 transcripts
~ 5 transcripts/cell
http://www.affymetrix.com
Affymetrix: Gene Expression Arrays

Array                     Transcripts/Genes
Arabidopsis Genome        24,000
C. elegans Genome         22,500
Drosophila Genome         18,500
E. coli Genome            20,366
Human Genome U133 Plus    47,000
Mouse Genome              39,000
Yeast Genome              5,841 (S. cerevisiae) & 5,031 (S. pombe)
Rat Genome                30,000
Zebrafish                 14,900
Plasmodium/Anopheles      4,300 (P. falciparum) & 14,900 (A. gambiae)

Other arrays:
Barley (25,500), Soybean (37,500 + 23,300 pathogen), Grape (15,700)
Canine (21,700), Bovine (23,000)
B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)
Microarray and GeneChip Approaches
Advantages:
- Rapid
- Method and data analysis well described and supported
- Robust
- Convenient for directed and focussed studies
Disadvantages:
- Closed system approach
- Difficult to correlate with absolute transcript number
- Sensitive to alternative splicing ambiguities
Serial Analysis of Gene Expression (SAGE)
The principles:
- Velculescu et al., Science 1995
- A transcript (known or novel) can be recognised by a small subset (e.g. 14) of its nucleotides: a tag
- Linking tags allows for rapid sequencing
- Open system for transcript profiling
Modified SAGE methods:
- LongSAGE (21 nt tags, compared with 14 nt)
- SAGE-lite, micro-SAGE, mini-SAGE
- RASL/DASL methods (5’ and 3’ tags)
[Figure: a tag is taken from each polyadenylated transcript (AAAAAAAAA – 3’); the tags are linked and sequenced, e.g. AGCTTGAACCGTGACATCA, TGGCCATTGGCCCCAATTG, AGACAGTGAGTTCAATGC]
SAGE
Advantages:
- Potential ‘open’ system method: new transcripts can be identified
- Accuracy of unambiguous transcript observation
- Digital output of data
- Quantitative and qualitative information
Disadvantages:
- Characterising novel transcripts from short tag sequences is often computationally difficult
- Tag specificity (tag length recently increased to 21 bp)
- Length of tags can vary (restriction enzyme activity varies with temperature)
- A subset of transcripts does not contain the enzyme recognition sequence
- Sensitive to a subset of alternative splice variants
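The tag idea behind SAGE can be illustrated with a minimal Python sketch. It assumes the anchoring enzyme is NlaIII (recognition site CATG) and that the tag is the 10 bases downstream of the 3’-most site, giving 14 nt with the site included; neither detail is stated on the slides, and the transcript sequences are made up for illustration.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site (NlaIII); not stated on the slides
TAG_LENGTH = 10       # bases kept downstream of the site (14 nt including CATG)

def sage_tag(transcript):
    """Return the tag next to the 3'-most anchoring site, or None if no usable site exists."""
    pos = transcript.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # transcript lacks the recognition sequence, so SAGE cannot see it
    start = pos + len(ANCHOR_SITE)
    tag = transcript[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None

def count_tags(transcripts):
    """Digital output: how many times each tag was observed."""
    tags = (sage_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)

# Hypothetical transcripts (poly-A tails omitted)
transcripts = [
    "TTTCATGAGCTTGAACCGTGA",
    "TTTCATGAGCTTGAACCGTGA",
    "GGCATGTGGCCATTGGCCCC",
    "AAAGGGTTTCCC",  # no CATG site
]
tag_counts = count_tags(transcripts)
```

The last transcript has no recognition site and is silently missed, which is one of the disadvantages listed above.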
Biological question
Sample Attributes
Experimental design
Platform Choice
Microarray experiment
Image analysis: 16-bit TIFF files give (Rspot, Rbkg), (Gspot, Gbkg)
Normalization
Analysis: Clustering, Statistical Analysis, Data Mining, Pattern Discovery, Classification
Biological verification and interpretation
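The image analysis and normalization steps can be sketched in Python as follows. This assumes simple background subtraction and a global median centring of the log ratios; the slides do not say which normalization was used, and the intensity values are invented.

```python
import math
from statistics import median

def log_ratios(spots):
    """Background-subtract each channel and return log2(Cy5/Cy3) per spot.

    `spots` holds (r_spot, r_bkg, g_spot, g_bkg) tuples. Spots without a
    positive signal in both channels become None (an assumed filtering rule).
    """
    ratios = []
    for r_spot, r_bkg, g_spot, g_bkg in spots:
        red, green = r_spot - r_bkg, g_spot - g_bkg
        ratios.append(math.log2(red / green) if red > 0 and green > 0 else None)
    return ratios

def median_normalise(ratios):
    """Shift the usable log ratios so that their median is zero."""
    centre = median(r for r in ratios if r is not None)
    return [None if r is None else r - centre for r in ratios]

# Hypothetical pixel intensities for four spots
spots = [(1200, 200, 700, 200), (900, 100, 500, 100),
         (300, 100, 900, 100), (150, 200, 400, 100)]
raw = log_ratios(spots)
normalised = median_normalise(raw)
```

Median centring rests on the assumption that most transcripts do not change between the two samples.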
Analysis
- Liver: 47,000 x 2 x 2 = 188,000 datapoints
- Brain: 47,000 x 2 x 2 = 188,000 datapoints
- Lymphocyte: 47,000 x 2 x 2 = 188,000 datapoints
Analysis
Essential problem:
Given a large dataset with technical and biological noise:
Find:
A) Transcripts: patterns (common themes or differences)
measures of robustness or some idea of uncertainty
B) Sample: similarities or differences between samples on global/multi-gene level
Analysis
Liver, Brain, Lymphocytes
Which transcripts are different?
What are the patterns?
Biologist’s Nightmare: Statistician’s Playground
Characteristics of the expression profiling data:
- High dimensionality
- Sample number (n) low and observation number (p) high
- Non-independence of observations
- Complex patterns: visualisation and extraction
- Incorporation of contextual information
- Standardisation and data sharing
- Integration of & with other data types
Analysis Methods
- Classical parametric & non-parametric statistical tests for hypothesis testing
- Unsupervised clustering algorithms
  Hierarchical clustering
  K-means and Self-Organising Maps
- Classification
  e.g. Machine learning and Linear discriminant analysis
- Dimensionality Reduction or Principal Component Analysis
  e.g. Gene Shaving and Multi-dimensional Scaling
- Probabilistic Modelling
  Dynamic Bayesian Networks
  Markov Models
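Analysis Methods
Classical Parametric Statistical Analysis:
$H_0(\text{Gene A}):\ \mu_{AL} = \mu_{AB} = \mu_{ALy}$
where $\mu_{AL}$, $\mu_{AB}$ and $\mu_{ALy}$ are the mean abundances of gene A in liver, brain and lymphocyte.
Tools:
- Fold change
- T-test
- ANOVA
- Mann-Whitney U test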
Analysis Methods
Classical Parametric Statistical Analysis:
$H_0:\ \mu_{L} = \mu_{B} = \mu_{Ly}$
At P = 0.01, testing 20,000 transcripts gives about 200 transcripts called significant by chance alone.
Difficulties:
- Assumes that observations are normally distributed and independent
- ‘Statistical significance’ does not equal biological significance
- Appropriate multiple testing corrections are difficult
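The multiple testing problem can be made concrete with a short Python sketch. It reproduces the 20,000 x 0.01 arithmetic and shows two common corrections, Bonferroni and Benjamini-Hochberg. The choice of these two corrections and the example p-values are assumptions for illustration; the slides do not name a method.

```python
def expected_false_positives(n_tests, alpha):
    """Number of tests expected to pass the threshold when no transcript truly differs."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr):
    """Return the indices of tests rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under the stepped threshold
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

by_chance = expected_false_positives(20000, 0.01)
strict = bonferroni_threshold(20000, 0.01)

# Hypothetical p-values for ten transcripts
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)
```

Bonferroni is very conservative at this scale, which is one reason false discovery rate methods are often preferred for expression data.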
Analysis Methods
Clustering Approaches:
Divide genes/samples into groups (“clusters”) based on similarities and differences
Number of groups is user defined
Algorithms:
- Hierarchical clustering
- K-means clustering
- Self-organising maps
Distance Metrics
[Figure: expression vectors plotted as log2(cy5/cy3), from -2 to 2, against time]
Distance between 2 expression vectors:
Euclidean    Pearson (r x -1)
1.4          -0.90
4.2          -1.00
Distance Metric
[Figure: log2(cy5/cy3), from -2 to 2, for a transcription factor transcript, target transcript 1 and target transcript 2, comparing Pearson distance with Euclidean distance]
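The difference between the two metrics can be sketched in Python. The profiles are hypothetical; the Pearson "distance" follows the slide's convention of multiplying r by -1, so profiles with the same shape score -1 whatever their amplitude.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_r(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    """Correlation turned into a distance as on the slide: r multiplied by -1."""
    return -pearson_r(a, b)

# Hypothetical log2(cy5/cy3) time courses
tf_transcript = [0.0, 1.0, 2.0, 1.0, 0.0]
target_1 = [0.0, 2.0, 4.0, 2.0, 0.0]  # same shape, larger amplitude
target_2 = [2.0, 1.0, 0.0, 1.0, 2.0]  # mirror-image shape

same_shape = pearson_distance(tf_transcript, target_1)
opposite_shape = pearson_distance(tf_transcript, target_2)
amplitude_gap = euclidean(tf_transcript, target_1)
```

Euclidean distance separates the transcription factor from target 1 because of amplitude, while the Pearson measure treats them as identical in shape.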
Hierarchical Clustering
Start: g1, g2, g3, g4, g5, g6, g7, g8
g1 is most like g8: {g1, g8}, g2, g3, g4, g5, g6, g7
g4 is most like {g1, g8}: {g1, g8, g4}, g2, g3, g5, g6, g7
Hierarchical Tree (leaf order): g1, g8, g4, g5, g7, g2, g3, g6
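The merging procedure can be sketched as a small agglomerative clustering in Python. Average linkage and Euclidean distance are assumptions (the slides do not specify a linkage rule), and the five profiles are toy values chosen so that g1 joins g8 first and g4 joins {g1, g8} next.

```python
import math
from itertools import combinations

def average_linkage(c1, c2, profiles):
    """Mean pairwise Euclidean distance between the members of two clusters."""
    total = sum(math.dist(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

# Hypothetical two-condition expression profiles
profiles = {
    "g1": (2.0, 1.0),
    "g8": (2.1, 1.1),
    "g4": (1.6, 0.8),
    "g2": (-1.0, 0.0),
    "g3": (-2.0, 1.0),
}
merges = agglomerate(profiles)
```

Reading `merges` from first to last gives the tree from the leaves upwards.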
Clustering: Case Study
Sorlie et al., 2001
Breast tissue subtypes
Hierarchical clustering
K-means clustering
Partition or centroid algorithms
[Figure: samples plotted by liver expression level against brain expression level, with K=3 centroids marked x]
Step 1: User specifies K clusters
Step 2: Using Euclidean distance, nearest points are assigned to clusters (k)
Step 3: New centroids calculated
Step 4: Points re-assigned to nearest centroid
Step 5: New centroids calculated
Iterates until centroids don’t move
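Steps 1 to 5 can be sketched in plain Python. The starting centroids are supplied by the caller and the data points are hypothetical (liver, brain) expression pairs; real implementations usually pick random starting centroids and run several restarts.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid where it was
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical (liver, brain) expression levels for nine transcripts
points = [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5),
          (5.0, 5.0), (5.5, 4.5), (4.5, 5.5),
          (9.0, 1.0), (9.5, 0.5), (8.5, 1.5)]
labels, centroids = kmeans(points, centroids=[(0, 0), (4, 4), (10, 0)])  # K=3
```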
Classification
[Figure: samples plotted by Transcript A against Transcript B]
- K-nearest neighbour methods (KNN)
- Linear Discriminant Analysis (LDA)
- Machine Learning: Support Vector Machines
- Neural Network Analysis
Adapted from Florian Markowetz
Classification
- Training Set: 2/3 sample set
- Test Set: 1/3 sample set
Define Classification Rule
[Figure: Gene A against Gene B, comparing Linear Discriminant Analysis and KNN]
Classification
More complex classifiers
[Figure: Gene A against Gene B]
KNN – Voting scheme – (k=3) Use three closest points to classify
Adapted from Florian Markowetz
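The KNN voting scheme with a 2/3 training and 1/3 test split can be sketched as follows. The expression values are invented, the ALL/AML labels are borrowed from the Golub et al. example only as class names, and the split is taken in list order rather than at random to keep the sketch deterministic.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (gene A, gene B) expression values with class labels
samples = [
    ((1.0, 1.0), "ALL"), ((1.5, 0.5), "ALL"), ((0.5, 1.5), "ALL"),
    ((4.0, 4.0), "AML"), ((4.5, 3.5), "AML"), ((3.5, 4.5), "AML"),
    ((1.0, 0.5), "ALL"), ((4.0, 4.5), "AML"), ((2.0, 2.0), "ALL"),
]
split = len(samples) * 2 // 3  # 2/3 training set, 1/3 test set
train, test = samples[:split], samples[split:]

predictions = [knn_predict(train, point, k=3) for point, _ in test]
accuracy = sum(pred == label for pred, (_, label) in zip(predictions, test)) / len(test)
```

The test set is never used to define the rule, so `accuracy` estimates how the classifier behaves on new samples.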
Probabilistic Modelling
- Incorporate dependencies and prior knowledge into the identification of patterns/clusters:
  - relationships in time between samples
  - relationships between genes
- Handle measures of uncertainty well
- Conceptually simple, consideration needed on implementation
- Markov modelling
- Dynamic Bayesian networks
Analysis Methods
- Classical parametric & non-parametric statistical tests for hypothesis testing
- Unsupervised clustering algorithms
  Hierarchical clustering
  K-means and Self-Organising Maps
- Classification
  Machine learning and Linear discriminant analysis
- Dimensionality Reduction or Principal Component Analysis
  Gene Shaving and Multi-dimensional Scaling
- Probabilistic Modelling
  Dynamic Bayesian Networks and Pattern recognition
  Markov Models
Introduction:
The scope of transcriptomics – a definition of the transcriptome
Part I: Observing the transcriptome
Experimental methodology
Data curation and analysis pipelines
Part II: Using the transcriptome
The regulation of the transcriptome
The transcriptome and the genome
The transcriptome and the proteome
Beyond the Human Transcriptome
…. to be continued.
Regulation of Gene Expression
Abundance (transcript) = Rate of Transcription – Rate of Decay
Transcription (protein/DNA interactions):
- cis and trans regulatory sequence motifs
- chromatin structure
- methylation
Decay (protein/RNA interactions):
- cis-acting regulatory motifs
- secondary structure
Regulation of Transcription
Wray et al., 2003
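The abundance relation from the Regulation of Gene Expression slide (abundance = rate of transcription minus rate of decay) can be sketched as a simple rate model. This assumes a constant synthesis rate `k_syn` and first-order decay with rate constant `k_dec`, i.e. dA/dt = k_syn - k_dec * A; the model form, the parameter names and the values are illustrative assumptions, not taken from the slides.

```python
import math

def simulate_abundance(k_syn, k_dec, a0=0.0, dt=0.01, t_end=20.0):
    """Euler integration of dA/dt = k_syn - k_dec * A; returns the abundance trajectory."""
    steps = int(round(t_end / dt))
    a = a0
    trajectory = [a]
    for _ in range(steps):
        a += (k_syn - k_dec * a) * dt
        trajectory.append(a)
    return trajectory

def half_life(k_dec):
    """Half-life of a transcript under first-order decay."""
    return math.log(2) / k_dec

# Hypothetical rates: 10 transcripts made per hour, decay constant 0.5 per hour
trajectory = simulate_abundance(k_syn=10.0, k_dec=0.5)
steady_state = 10.0 / 0.5  # abundance at which synthesis and decay balance
t_half = half_life(0.5)
```

Under this model a stable transcript (small `k_dec`) reaches a higher steady state, while an unstable one responds faster to changes in transcription.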
Regulation of Decay
Stabilisation – facilitates rapid increase in potential protein production
Destabilisation – facilitates precise time and dose control of transcripts
[Figure: abundance against time for stable and decaying transcripts]
Sequence-mediated mRNA decay – AU-rich elements (AREs)
- 3’ UTR, 50 – 150 nucleotides
- usually multiple copies (e.g. AUUUA x 5)
- protein recruitment for destabilisation
- size and content variation (functionally critical motif unknown)
- >30% of vertebrate homologous mRNAs have highly conserved elements in the 3’ UTR, often in both sequence and position
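A first-pass scan for ARE-like sequence can be sketched by counting AUUUA pentamers in a 3’ UTR. Counting overlapping copies of this one pentamer is a simplification, since the functionally critical motif is not known, and the example sequence is made up.

```python
import re

def count_are_pentamers(utr):
    """Count AUUUA pentamers, including overlapping copies, in a 3' UTR sequence.

    DNA input is accepted: T is converted to U before searching.
    """
    rna = utr.upper().replace("T", "U")
    # A zero-width lookahead lets overlapping matches (AUUUAUUUA) be counted
    return len(re.findall(r"(?=AUUUA)", rna))

# Hypothetical 3' UTR fragment carrying five overlapping copies
utr_fragment = "GC" + "A" + "UUUA" * 5 + "GC"
copies = count_are_pentamers(utr_fragment)
```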
The importance of the decay process
- BMP2 (bone morphogenetic protein 2): developmentally critical, highly conserved protein in vertebrates (Fritz et al., 2004)
  3’ UTR conservation:
  - 73% /100 nucleotides, 450 myr evolution
  - 95% within mammals
- Cancer related genes:
  c-fos, c-myc, c-jun, MMP-13, Cyclooxygenase-2, Cyclin D, Cyclin E, Cyclins A and B, Cdk inhibitors, DNA methyltransferase 1...
  (Review: Audic and Hartley, 2004)
Regulation of Transcription
Wray et al., 2003
Regulation of Transcription
Diverse orientations, structure and functional properties of
regulatory modules
Wray et al., 2003
Regulation of the transcriptome
Finding regulatory elements using co-abundant transcripts
Assumption:
shared abundance profile = same cluster = shared regulatory machinery
Pennacchio and Rubin, 2001
The transcriptome & the genome
Using the genome to infer/observe the transcriptome:
- Construction of whole genome/transcriptome arrays and SAGE tags
- Using sequence features to predict gene expression:
  Beer and Tavazoie. Predicting gene expression from sequence. Cell 2004
- Using chromatin structure to predict regulation of gene expression:
  Sabo et al. Genome-wide identification of DNaseI hypersensitive sites. PNAS 2004
Quantitative trait loci mapping
- Morley et al., Genetic analysis of genome-wide variation in human gene expression. Nature 2004
- Schadt et al., Genetics of gene expression surveyed in maize, mouse and man. Nature 2003
Transcriptome & Genome
Beer and Tavazoie, Cell. 2004
Abundance profile
Transcription factor binding site
Predict potential gene expression patterns
Transcriptome & Genome
Beer and Tavazoie, Cell. 2004
[Figure panels: AND Logic; AND Logic, OR Logic; OR Logic, NOT Logic. Solid: actual expression. Dashed: predicted]
Combinatorial patterns help identify groups of transcripts predicted to show similar abundance profiles
The transcriptome & the proteome
Functional annotations of co-abundant genes
Yang et al., 2003. Decay rates of human mRNAs: Correlation with functional characteristics and sequence attributes. Genome Research.
- Co-ordinated patterns of decay rates within functional classes of transcripts
- Transcription factor functional classes have “fast-decaying” mRNAs (<2 hr half-lives)
- Transcripts of multi-subunit proteins have correlated decay patterns and rates
The transcriptome & the proteome
Do they agree?
Studies of direct correlation between mRNA abundance and protein abundance (r = 0.6) (Hegde et al., 2003)
Biological Issues:
- Post-translational modifications
- Protein stability and folding
- Alternative splicing products
Technical Issues:
- Inter-platform variability (microarray and RT-PCR: r = 0.8)
- Protein abundance measures – 2D gel electrophoresis
The transcriptome & the proteome
The integration of transcriptomics and proteomics
Hegde et al., 2003
Synergistic approaches to biological problems using both transcriptomics and proteomics
Beyond the Human Transcriptome
Challenges for the Future: (short and long term)
- Integration of different datatypes
  - sequence, exon structure, transcript abundance, protein abundance and function
- Dealing with alternative splice variants
- The regulatory processes behind any given RNA abundance
- Dealing with gene ontologies in a quantitative manner
Beyond the Human Transcriptome
Future Directions:
- ‘Open’ systems for comprehensively cataloguing the transcriptome
  - between tissues/cells/developmental time points
  - between individuals
- Variation of transcriptome between individuals
  - coding variants, epigenetic variation and inheritance
- Clinical deployment of transcriptome profiling approaches in diagnostics and pharmacogenetics
- Human Regulatory Network Resources for Tissues
Acknowledgements
OX-FORD
BIOINFORMATICS
GROUP
Genomes, Sequences and Function
Oxford Centre for Gene Function
Jotun Hein
Chris Holmes
Gerton Lunter
Lizhong Hao
Ben Holtom
Karen Lees
http://www.stats.ox.ac.uk/~taylor/Presentations