TRANSCRIPTOMICS AND
GENE ANNOTATION
UNIT 7
GENE EXPRESSION – A MISNOMER?
• In reality, gene expression can only be quantified
by looking at protein products in the cell (via
proteomic approaches).
• However, the term has been co-opted to describe
differences in transcript (mRNA) levels.
• Transcripts may or may not be translated into
protein and thus don’t necessarily reflect gene
expression.
DIFFERENTIAL GENE EXPRESSION
• Responsible for differences between cell types of the same
organism (e.g., kidney vs. brain cells)
• Means by which development is controlled
• Involves gene feedback loops and induction/repression
initiated by external (environmental stimuli) and/or internal
(transcription factors) forces
A TYPICAL EUKARYOTIC PROTEIN CODING GENE
TRANSCRIPTOMICS
• The study of the complete set of RNAs
(transcriptome) encoded by the genome
• 1 - of a specific cell or organism
• 2 - at a specific time or
• 3 - under a specific set of conditions
• Dependent on:
• The organism
• The cell, cell line, or tissue
• The developmental stage
• The condition/treatment
• Usually, we tend to ignore the rRNAs and tRNAs
TRANSCRIPTOME COMPLEXITY
HOW MUCH OF THE HUMAN GENOME IS
TRANSCRIBED?
• ENCODE project (Nature, 2007)
• Examined 1% of the genome (~30 Mb)
• “The human genome is pervasively transcribed, such that the majority of its bases
are associated with at least one primary transcript and many transcripts link distal
regions to established protein coding loci.”
• “Many novel non-protein-coding transcripts have been identified, with many of
these overlapping protein-coding loci and others located in regions of the genome
previously thought to be transcriptionally silent.”
• “Numerous previously unrecognized transcription start sites have been identified,
many of which show chromatin structure and sequence-specific protein binding
properties similar to well-understood promoters.”
• 74% of bases are represented in a primary transcript, with evidence coming from 2
or more experimental technologies
• This project was published before more advanced techniques were developed,
and its conclusions have been contested.
TRANSCRIPTOMICS
• The main/common tasks in a transcriptome analysis
• 1 – Identify your targets/goals (protein coding, non-coding)
• 2 – Transcriptome reconstruction
• Identify the genes transcribed
• Identify isoforms
• 3 – Expression quantification
• 4 – Identify differential expression
• 5 – Transcription mapping
• 6 – Identify gene variants within/among individuals
• 7 – Identify allele-specific transcription
HISTORY OF GENE EXPRESSION ANALYSIS
• Northern blotting
• EST sequencing
• Microarrays
• RNA-Seq
RNA-SEQ: WHOLE TRANSCRIPTOME SHOTGUN
SEQUENCING
• The current state-of-the-art
• Process
• Isolate mRNA from a tissue
• Reverse transcribe mRNA into cDNA
• Sequence cDNA with a next-generation sequencer (e.g., Illumina)
• Quantify number of different transcripts, copy number of each transcript, and
identify SNPs, splicing variations, etc.
Workflow: sample preparation, next generation sequencing (NGS), then data analysis
(mapping reads, visualization (Gbrowser), de novo assembly, quantification)
RNA-SEQ
• Pros
• High sensitivity, quantitative, rapid
• Less expensive than microarray development (but not microarray screening)
• Not limited to detecting transcripts that correspond to known genomic sequence
• Can provide single-nucleotide resolution for alternative splicing and exon boundaries
• Large dynamic range for transcript detection
• Cons
• May provide more information than you want
• Not cost-effective if you are only looking at a handful of genes
RNA-SEQ
• Library construction challenges
• How to avoid rRNA?
• Use oligo-dT enrichment
• Bias toward 3’ end
• Protocols to remove rRNA followed by random fragmentation
• More even coverage but bias against the ends
• “Not-so-random” (NSR) priming
• Subtract the random hexamers and heptamers that are likely complementary to
rRNAs before first round cDNA synthesis
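The NSR idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rRNA sequence is a made-up toy string (written in DNA letters), only hexamers are handled, and "likely complementary" is simplified to a perfect reverse-complement match. The function names are hypothetical and do not come from any published protocol.

```python
from itertools import product

# Map each base to its Watson-Crick partner (DNA alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def nsr_primers(rrna: str, k: int = 6) -> list[str]:
    """Return every k-mer primer that does NOT perfectly match the rRNA.

    A primer anneals to the rRNA when it is the reverse complement of
    some k-mer present in the rRNA, so those primers are subtracted.
    """
    rrna_kmers = {rrna[i:i + k] for i in range(len(rrna) - k + 1)}
    blocked = {reverse_complement(kmer) for kmer in rrna_kmers}
    all_primers = ("".join(p) for p in product("ACGT", repeat=k))
    return [p for p in all_primers if p not in blocked]


# Assumption: a toy stand-in for an rRNA sequence, not real data.
toy_rrna = "ACGTTGCAAGGCTTAACCGGATCG"
primers = nsr_primers(toy_rrna)
print(len(primers), "of", 4 ** 6, "random hexamers kept")
```

With real rRNA sequences, which are thousands of bases long, many more hexamers would be subtracted than in this toy case.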
RNA-SEQ
• Strand-specific library or not?
• Transcription can occur in both directions
• Genes can be located on either DNA strand and sometimes overlap.
• An RNA molecule complementary to a given mRNA can also be transcribed
(antisense transcription); such transcripts are involved in regulatory
mechanisms.
• Knowing which DNA strand an RNA molecule originates from is an important
piece of information: it helps resolve annotation ambiguities for known and
novel genes, provides hints to the function of the studied RNA, and helps with
quantification
• Knowing the strand of origin can resolve questions about the gene of origin,
the function of the RNA, and its expression level
RNA-SEQ
• Standard library example
Steps shown in the figure: first strand synthesis, second strand synthesis,
A addition, adapter ligation
RNA-SEQ
• Major methods for strand-specific libraries
• 1. Differential adapter ligation to RNA
• 2. Differential adapter priming (RT method)
• 3. ‘Strand marking’ of the RNA or second-strand cDNA (dUTP)
The problem:
- Reconstruct full-length transcripts (1000s of bp) from short reads (100 bp)
- Read coverage highly variable
- Capture alternative isoforms
- Annotation? Expression differences? Novel non-coding?
Solution(?):
- Read-to-reference alignments, assemble transcripts
- Assemble transcripts directly
Read mapping (good reference) vs. de novo assembly (no genome)
Cufflinks Workflow
- Map reads to reference genome:
- Disambiguate alignments
- Allow for gaps (introns)
- Use pairs (if available)
- Build sequence consensus:
- Identify exons & boundaries
- Identify alternative isoforms
- Quantify isoform expression
- Differential expression:
- Between isoforms
- Between samples
- Annotation-based and novel transcripts
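For the "quantify isoform expression" step, Cufflinks reports values in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only that normalization arithmetic; it does not reproduce the statistical model Cufflinks uses to assign fragments to isoforms. The isoform names, counts, and lengths are hypothetical.

```python
def fpkm(fragments: int, length_bp: int, total_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


# Assumption: made-up fragment counts and transcript lengths.
counts = {"isoform_A": 500, "isoform_B": 500}
lengths = {"isoform_A": 1000, "isoform_B": 4000}
total = 2_000_000  # total mapped fragments in the sample

values = {name: fpkm(counts[name], lengths[name], total) for name in counts}
for name, value in values.items():
    print(name, value)
```

Both isoforms receive the same number of fragments, but the longer isoform gets a lower FPKM because a longer transcript yields more fragments per molecule.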
Transcriptome assembly with Trinity:
How it works
Brian Haas
Moran Yassour
Kerstin Lindblad-Toh
Aviv Regev
Nir Friedman
David Eccles
Alexie Papanicolaou
Michael Ott
…
Trinity Workflow
- Compress data (inchworm):
- Cut reads into k-mers (k consecutive nucleotides)
- Overlap and extend
- Report all sequences (“contigs”)
- Build de Bruijn graph (chrysalis):
- Collect all contigs that share k-1-mers
- Build graph
- Map reads to components
- Enumerate all consistent possibilities (butterfly):
- Unwrap graph into linear sequences
- Use reads and pairs to eliminate false sequences
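The k-mer and graph steps above can be illustrated with a toy de Bruijn graph. This is a minimal sketch, not Trinity's implementation: it ignores k-mer abundance, sequencing errors, read pairs, and branching, and the reads and function names are invented for the example.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """Cut a read into its overlapping k-mers (k consecutive nucleotides)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph: dict[str, set[str]], start: str) -> str:
    """Extend a contig from start while the path is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Assumption: three made-up overlapping reads from one transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATG")
print(contig)
```

Where a node has more than one outgoing edge (for example at an alternative splice junction), this simple walk stops; resolving such branches with reads and pairs is the job the slides assign to butterfly.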
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• ENCODE project (Nature, 2012)
• “…assign biochemical functions for 80% of the genome, in particular outside of the
well-studied protein-coding regions.”
• “We define a functional element as a discrete genome segment that
• 1 - encodes a defined product (e.g. protein or non-coding RNA)
• 2 - or displays a reproducible biochemical signature (e.g. protein-binding or a specific
chromatin structure).”
• Even more criticism of this work (and I think deservedly so).
• The definition above is much too loose and allows for just about anything to be
considered ‘functional’.
• Essentially, anything that produces a transcript or is bound by a protein is ‘functional’.
• Criticized most soundly, in my opinion, by Dan Graur in: “On the immortality of
television sets: ‘Function’ in the human genome according to the evolution-free
gospel of ENCODE”
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• On the immortality of television sets: ‘Function’ in the human genome
according to the evolution-free gospel of ENCODE (GBE 2013)
• Main points – “This absurd conclusion was reached through…”
• “employing the seldom used ‘causal role’ definition of biological function and then
applying it inconsistently to different biochemical properties”
• “committing a logical fallacy known as ‘affirming the consequent’”
• “failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage
DNA’”
• “using analytical methods that yield biased errors and inflate estimates of
functionality”
• “favoring statistical sensitivity over specificity”
• “emphasizing statistical significance rather than the magnitude of the effect.”
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
• “Employing the seldom used ‘causal role’ definition of biological function…”
• What is the meaning of ‘function’?
• Selected effect definition is historical and evolutionary
• For a trait, T, to have a proper biological function, F, it is necessary and sufficient that
two conditions hold
• 1 – T originated as a reproduction of some prior trait that performed F (or something
similar) in the past
• 2 – T exists because of F
• The ‘selected effect’ function of a trait is the effect for which it was selected or is
maintained
• Causal role definition
• For a trait, Q, to have a causal role function, G, it is necessary and sufficient that Q
performs G.
• The heart
• Selected effect – to pump blood, Causal role – to add mass to the body
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
• “Employing the seldom used ‘causal role’ definition of biological function…”
• Two identical sequences (TATAAA) in the genome at distinct loci
• Instance 1 has been selected for and maintained by natural selection with the
effect of binding a transcription factor to initiate gene expression
• Instance 2 has arisen by chance and, because of its sequence, can also bind a
transcription factor, but probably has no impact on function
• Instance 1 – selected effect, Instance 2 – causal role
• “ENCODE adopted a strong version of the causal role definition of function,
according to which a functional element is a discrete genome segment that
produces a protein or an RNA or displays a reproducible biochemical signature
(e.g., protein binding). Oddly, ENCODE not only uses the wrong concept of
functionality, it uses it wrongly and inconsistently.”
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• Committing a logical fallacy known as ‘affirming the consequent’
• If P, then Q. Q. Therefore, P.
• According to ENCODE, DNA segments that ‘function’ in a process (e.g. gene
regulation) tend to display a certain property (e.g. transcription factor binding)
• Another DNA segment displays said property (e.g. it binds a transcription factor)
• Therefore, the DNA segment is functional (e.g. is involved in gene regulation)
• All of my ‘nopes’ apply.
• One of my favorite passages, “the ENCODE authors singled out transcription as a
function, as if the passage of RNA polymerase through a DNA sequence is in some
way more meaningful than other functions. But, what about DNA polymerase and
DNA replication? Why make a big fuss about 74.7% of the genome that is
transcribed, and yet ignore the fact that 100% of the genome takes part in a
strikingly “reproducible biochemical signature”—it replicates!”
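Why "If P, then Q. Q. Therefore, P." is a fallacy can be checked by brute force over the four truth assignments. In this sketch, read P as "the segment is functional" and Q as "the segment binds a transcription factor"; the function names are made up for the illustration.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: 'if p, then q'."""
    return (not p) or q


def affirming_consequent_counterexamples() -> list[tuple[bool, bool]]:
    """Cases where 'if P then Q' and Q both hold, yet P is false."""
    return [(p, q) for p, q in product([True, False], repeat=2)
            if implies(p, q) and q and not p]


def modus_ponens_counterexamples() -> list[tuple[bool, bool]]:
    """Cases where 'if P then Q' and P both hold, yet Q is false."""
    return [(p, q) for p, q in product([True, False], repeat=2)
            if implies(p, q) and p and not q]


print(affirming_consequent_counterexamples())  # [(False, True)]
print(modus_ponens_counterexamples())          # []
```

The single counterexample (P false, Q true) is the case of a segment that binds a transcription factor without being functional; the valid form, modus ponens, has no counterexample.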
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• Committing a logical fallacy known as ‘affirming the consequent’
• “Transcription ≠ function”
• “Histone modification ≠ function”
• “Open chromatin ≠ function”
• “Transcription factor binding ≠ function”
• “DNA methylation ≠ function”
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage
DNA’”
• Misconceptions about ‘junk DNA’
• 1 – lack of knowledge of original definition
• 2 – belief that evolution can always get rid of nonfunctional DNA
• 3 – belief that ‘future potential’ constitutes ‘function’
• Original definition of junk DNA – a genomic segment on which selection does not
operate.
• “This sense of the term “junk DNA” was used by Jacob (1977) in his famous paper
“Evolution and Tinkering”: “[N]atural selection does not work as an engineer … It
works like a tinkerer—a tinkerer who does not know exactly what he is going to
produce but uses whatever he finds around him whether it be pieces of string,
fragments of wood, or old cardboards … The tinkerer … manages with odds and
ends. What he ultimately produces is generally related to no special project, and it
results from a series of contingent events, of all the opportunities he had to enrich
his stock with leftovers.”
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage
DNA’”
• Misconception 2 about ‘junk DNA’ – belief that evolution can always get rid of
nonfunctional DNA
• “Evolution can only produce a genome devoid of “junk” if and only if the effective
population size is huge and the deleterious effects of increasing genome size are
considerable.”
• In bacteria, this generally applies. Generation time is correlated with genome size
and effective population sizes are enormous.
• In eukaryotes, not so much.
HOW MUCH OF THE HUMAN GENOME IS
FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage
DNA’”
• Misconception 3 about ‘junk DNA’ – belief that ‘future potential’ constitutes
‘function’
• Teleology – the philosophy that nature has goals
• “Junk DNA may, in fact, exhibit a very similar behavior to the regular junk in one's
garage, which is kept for years and years, and then thrown out a day before it may
become useful.”
• “Some years ago I noticed that there are two kinds of rubbish in the world and that
most languages have different words to distinguish them. There is the rubbish we
keep, which is junk, and the rubbish we throw away, which is garbage. The excess
DNA in our genomes is junk, and it is there because it is harmless, as well as being
useless, and because the molecular processes generating extra DNA outpace those
getting rid of it. Were the extra DNA to become disadvantageous, it would become
subject to selection, just as junk that takes up too much space, or is beginning to
smell, is instantly converted to garbage … ”. Brenner 1998
NORTHERN BLOTTING
• What is it?
• Detection of RNA on a substrate via hybridization with
a probe
• Pros
• No amplification involved
• Can study expression of multiple genes (e.g., 5-10) on same
gel as long as they are of different molecular weights
• Allows detection of some alternative splicing
• Cons
• Must blot gels
• Requires lots of starting mRNA
• RNA highly vulnerable to degradation
• Not high-throughput
NORTHERN BLOTTING
• Isolate mRNAs from multiple samples that differ with regard to
tissue type, developmental stage, disease resistance, exposure
to stimulus, etc.
• Place each mRNA population in its own well of a denaturing
agarose gel (formaldehyde added to gel to keep inter- and
intramolecular base pairing from occurring)
• Separate mRNAs by electrophoresis
• Blot mRNA onto membrane. Fix RNA to membrane.
• Hybridize labeled DNA probe(s) to membrane
• Quantify differences in transcript levels between samples
HOW NORTHERN BLOTTING IS USED
• β-actin expression in the brain of a mouse
• Can also look at changes in expression in
multiple tissues as a function of time
EST SEQUENCING (SANGER)
• Sequence ESTs isolated from different tissues or
different experimental trials
• Compare similarities and differences in EST expression
patterns
• Dominance of certain transcripts can make EST
sequencing an inefficient means of measuring changes
in gene expression
• For example, in estrogen-treated chicken oviduct, >50% of transcripts in the
cell are the product of the same gene
EST SEQUENCING (SANGER)
• Pros
• Lots of sequence information that can be used for lots of
different purposes
• Can study expression variation of whole transcriptome, not
just a handful of genes
• Allows detection of alternative splicing
• Cons
• Expensive as dominant cDNAs will be sequenced over and
over again
• Not likely to be truly quantitative due to RT and cloning
biases
MICROARRAYS
• DNAs are spotted onto a glass microscope slide or similar substrate
• Fluorescence techniques rather than radioisotopic techniques are
used in visualization
• Spots are about the size of a typed period using 10 pt font.
• Each spot contains roughly the same amount of DNA
• The density of spots on a microarray depends upon the type of
robot and the wishes of the scientist
• A typical microarray robot can make about 12 slides an hour
• Slides can be stored at room temperature
• Many PCR products can be spotted on a single array (currently up
to 390,000 spots)
• For species with relatively few genes (e.g., yeast), it is possible to
spot all the genes in an ordered manner onto a single array
MICROARRAY
TWO–CHANNEL ARRAY EXPERIMENTS
• Expression patterns are compared by hybridizing both
control and test cDNA populations to the same
microarray(s)
• Each cDNA population is labeled with a different
fluorophore (fluorescent tag)
• While relative changes in gene expression can be
detected, using two cDNA populations (which can
compete with each other) does not permit estimation
of absolute expression levels
OVERVIEW OF TWO–CHANNEL EXPERIMENT
DETAIL OF TWO–CHANNEL EXPERIMENT
MICROARRAY
• Use the fluorescence data to determine exactly which genes
are expressed differently between two tissue types
• Quantify differences in expression for individual genes
• Actually know which genes correspond to which spots
• Find genes that may be activated together (gene expression
pathways)
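Quantifying a difference for an individual gene in a two-channel experiment usually comes down to a ratio of the two fluorescence intensities at that spot, commonly on a log2 scale. The sketch below assumes background-corrected, already normalized intensities; the gene names and numbers are invented, and real analyses add normalization and replicate statistics.

```python
import math


def log2_ratio(test_intensity: float, control_intensity: float) -> float:
    """Relative expression of test vs. control at one spot (log2 scale)."""
    return math.log2(test_intensity / control_intensity)


# Assumption: made-up (test, control) intensities for three spots.
spots = {
    "gene_1": (8000.0, 2000.0),  # higher in test sample
    "gene_2": (1500.0, 1500.0),  # unchanged
    "gene_3": (500.0, 4000.0),   # lower in test sample
}

ratios = {gene: log2_ratio(test, control)
          for gene, (test, control) in spots.items()}
for gene, ratio in ratios.items():
    print(gene, ratio)
```

A value of 0 means no change, positive values mean higher signal in the test sample, and negative values mean lower signal. Because only the ratio is used, the result says nothing about absolute expression levels.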
MICROARRAYS
• Pros
• Can study expression variation of whole transcriptome, not
just a handful of genes
• Definitely high-throughput
• Rapid screening possible
• Cons
• Very few once good slides/chips are made
• Not truly as quantitative as qPCR
• Not practical with poorly characterized genomes (expense in
designing chip requires a commitment from a relatively large
scientific community)