Download Genomics Post-ENCODE

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Pharmacogenomics wikipedia , lookup

Genomic imprinting wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

RNA silencing wikipedia , lookup

History of RNA biology wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Epitranscriptome wikipedia , lookup

Copy-number variation wikipedia , lookup

Point mutation wikipedia , lookup

NUMT wikipedia , lookup

Oncogenomics wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Gene desert wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Short interspersed nuclear elements (SINEs) wikipedia , lookup

Gene expression programming wikipedia , lookup

Primary transcript wikipedia , lookup

Genetic engineering wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Non-coding RNA wikipedia , lookup

Gene expression profiling wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Human genetic variation wikipedia , lookup

Transposable element wikipedia , lookup

NEDD9 wikipedia , lookup

Long non-coding RNA wikipedia , lookup

Microevolution wikipedia , lookup

Metagenomics wikipedia , lookup

History of genetic engineering wikipedia , lookup

Gene wikipedia , lookup

Designer baby wikipedia , lookup

Tag SNP wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Helitron (biology) wikipedia , lookup

Genome (book) wikipedia , lookup

Pathogenomics wikipedia , lookup

Minimal genome wikipedia , lookup

Genomic library wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Public health genomics wikipedia , lookup

Human genome wikipedia , lookup

Genomics wikipedia , lookup

Human Genome Project wikipedia , lookup

Genome editing wikipedia , lookup

Non-coding DNA wikipedia , lookup

RNA-Seq wikipedia , lookup

Genome evolution wikipedia , lookup

ENCODE wikipedia , lookup

Transcript
Nuevas perspectivas en análisis genomico: implicaciones del proyecto ENCODE
Rory Johnson
Bioinformatics and Genomics
Centre for Genomic
Regulation
AEEH
21 / 2 / 14
1
This talk:
• Our view of the human genome today thanks to ENCODE
• What it means for translational research
2
Epigenetics: the intermediate between genome and phenotype
3
Our changing view of the genome
2000
2014
(Hong Kong)
4
Our changing view of the genome
CAGGCATTAACCTTAGTCCTAATGGTTAGAGTCGTCCCTGATAATCTTAGTGAGGAAGGGACATTTCCAGAGTCGCCCAG CAGCAAATTCCAGATGTCTAAGGTCCCCAAACAGAACAAAATTGCATAAT
Histones,
+
modifications
Chromatin
Transcription
factors
This organisation is encoded
in non-protein coding
genome sequence
Enhancers
5
The Genome and Epigenome
Genome sequence:
Simple
Static
Epigenome sequence:
Multi-layered
Dynamic
Cell-specific
=> Hence ENCODE
6
The human genome in numbers
• 3 x10^9 base pairs
• 20,345 protein coding genes
• 13,870 Long noncoding RNA genes
• 9013 Small noncoding RNA genes
• 3x10^6 regulatory regions (enhancers)
• 12,460 known trait-associated SNPs (short nucleotide variants)
• 88% of trait-associated SNPs lie outside protein coding sequence
7
Next Generation Sequencing
The high throughput reading of DNA or RNA.
The main system now is Illumina Hiseq
Statistics:
Read length: ~150nt
Reads per lane: ~150 million
Lanes per run: 16
Total nt per run: ~400 billion
Cost per run: ~16,000 euro
(Human genome project took 13 years and $3billion to
sequence 3 billion nt, ending 2003)
http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html
8
NGS based methods for genome analysis: towards the clinic
ChIP-seq (chromatin
immunoprecipation)
Transcription factor binding /
chromatin state
Dnase-seq
Transcription factor binding /
chromatin state
RNAseq
mRNA transcription / splicing
Ribosome footprinting
Translation rate
Hiseq
Genome 3D structure
These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics:
• Rui et al Cell, Volume 148, Issue 6, 1293-1307, “iPOP”
• Buenrostro et al Nat Methods Nature Methods 10, 1213–1218 (2013)
“Using ATAC-seq maps of human CD4+ T cells from a proband obtained on consecutive days, we demonstrated the
feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.”
9
The ENCODE Project
• ENCODE: Encyclopedia of DNA Elements
(http://www.genome.gov/10005107)
• International consortium dedicated to comprehensively
mapping the human epigenome.
• Created high quality ongoing gene annotations: GENCODE
• 32 laboratories, $400million
• In Spain: Roderic Guigo (CRG) was one of the leaders
(with Tom Gingeras, CSHL) of the transcriptomics section.
• 147 cell types (mainly transformed cell lines)
• 1640 genome-wide datasets
10
ENCODE integrates multiple data types across cell types
RNAseq
Gene expression
ChIP
Chromatin
ChIP
Transcription Factors
ChIA-PET
Genome structure / folding
GENCODE
Gene annotation catalogue
11
Visualizing ENCODE data at the UCSC Genome Browser
http://genome-euro.ucsc.edu/
12
ENCODE data of relevance to hepatology
Genes
Chromatin
Transcription
Factors
RNA
ENCODE Tier 2:
HepG2 cell line hepatocellular carcinoma
(see http://www.genome.gov/26524238 for other cell types)
Including:
8 RNAseq experiments
114 Transcription Factor ChIP experiments (inc CEBPB, HNF4A, HNF4G)
http://genome-euro.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html
13
Chromatin state is extremely cell type specific
14
Other projects of relevance: Epigenomics Roadmap Project
15
Other projects of relevance: eQTL
• Gtex – Genotype Tissue Expression project
• Hunting for genetic variants that influence gene
expression
 Linking genetic variants to changes in gene expression
– regulatory variants or “expression quantitative trait
loci” (eQTL)
 These will be different between tissues
16
What does this mean for translational research?
• Protein-focussed studies will miss the majority of functional disease causing variants / mutations
• Non-coding variants will usually be regulatory
• Non-coding variants will usually be cell type specific
• Large projects like ENCODE are producing rich data that can be used to interpret clinical results
`
17
How can genetic variants (SNPs) in noncoding regions cause phenotype?
• By altering the nucleotide sequence recognized by regulatory protein
Hawkins et al Nature Reviews Genetics 11, 476-486
18
How can genetic variants (SNPs) in noncoding regions cause phenotype?
Genetic Variant (SNP)
Gene Expression
Disease
19
How does ENCODE affect translational research projects?
• Genome wide association study (GWAS)
• Exome sequencing
• Gene expression profiling
20
Translational research approaches 1: Genetic approaches
Genomic approaches to identify genetic variants underlying disease:
GWAS – genome wide association study
Advantages
Disadvantages
Genome wide
Depends on limited # of marker SNPs
Not biased towards coding regions
Low resolution
Good at identifying common variants
Does not yield insights into mechanism
Exome sequencing – target genome sequencing
Advantages
Disadvantages
Proteome wide
No information about noncoding variants
Can identify rare causative variants
Likely missing most causative variants
Usually yields mechanistic hypothesis
High resolution
21
Interpretation of GWAS results
GWAS gives an unbiased genome wide set of candidate SNPs
The majority of these lie outside protein coding regions
Two main challenges:
1. Identifying the causative SNP
2. Understanding the mechanism of action of that SNP
Hepatocellular carcinoma
Li et al PLoS Genet 8(7): e1002791.
22
Identifying the causative SNP using ENCODE data
Hunt for the likely functional
SNP in LD with marker
Schaub et al Genome
Res. 2012
Sep;22(9):1748-59. doi:
10.1101/gr.136127.111.
e
23
Understanding the mechanism of a noncoding SNP using ENCODE data
Schaub et al Genome
Res. 2012
Sep;22(9):1748-59. doi:
10.1101/gr.136127.111.
e
24
RegulomeDB: A web server for functional prediction of SNPs using ENCODE data
25
Exome sequencing
Exome sequencing: targeted genome sequencing
of protein coding exons
Relies on capturing a selected subset of
genome
Advantages:
• lower cost and
• higher statistical power
• can detect rare private mutations
Disadvantages:
• Presently ignoring the noncoding genome
(~99%)
26
Exome sequencing: whats next?
Whole genome sequence not likely to be practical: no statistical power
Exome technology is highly customisable could be adapted to noncoding regions
The main question: what are the target regions?
• How to define the target space?
• regulatory regions?
• Noncoding RNAs?
• Protein binding sites?
• Likely to be organ / disease specific
• Will require bioinformatic analysis to design reagents before experimental project begins.
27
Translational research approaches 2: Transcriptomic approaches
ENCODE has made a major contribution to gene expression
studies, by providing high quality annotations of novel
noncoding genes through GENCODE.
Microarray studies
• Microarrays are restricted by the catalogue of probes
chosen
• Commercial arrays: usually protein coding genes
• MicroRNA arrays available
• Long noncoding RNA arrays available (CRG provide free
designs) – based on ENCODE annotations
28
Translational research approaches 2: Transcriptomic approaches
RNAseq
• Unbiased > can discover novel RNAs
• Can quantify expression of known and novel
genes, and discover RNA from non “genic” loci
• Analysis requires more bioinformatic analysis
• Still more expensive than arrays
29
Translational research approaches 2: Transcriptomic approaches
Problems:
It is easy to discover and quantify the expression of novel genes
It is difficult to understand the function of such genes
We have no bioinformatic tools to predict the function of most novel ncRNAs
We have limited experimental tools to investigate them
30
What does ENCODE mean for these studies?
GWAS
• GWAS study design will not likely be affected
• ENCODE will allow better interpretation of discovered
SNPs
Exome
• Whole genome cohort studies may never be feasible
• Capture sequence approach can be redesigned to
study noncoding variants in disease of choice
• ENCODE and other public data will aid in the design of
these projects
Gene expression
• New gene annotations can help in both microarray and
RNAseq projects to discover novel noncoding gene
targets.
• RNAseq will eventually replace arrays as costs drop,
but right now new array designs are competitive in
large experiments and given bioinformatic
requirements
31
Nothing would have been possible without…
CRG Bioinformatics & Genomics
ENCODE / GENCODE
Roderic Guigó
Jennifer Harrow
Tim Hubbard
(GENCODE, Sanger)
Bioinformatics and Genomics group
FUNDING
Ramón y Cajal RYC-2011-08851
Plan Nacional BIO2011-27220
32
The main message of ENCODE
To understand genotypes and phenotypes, we must look beyond the protein coding gene.
Further reading:
Interpreting noncoding genetic variation in complex traits and human disease
•Lucas D Ward & Manolis Kellis
•Affiliations
Nature Biotechnology 30, 1095–1106 (2012)
33
How could variants in noncoding regions cause phenotype?
• By altering the nucleotide sequence recognized by regulatory protein
• By altering a noncoding RNA gene, either in expression levels or mature sequence
Hawkins et al Nature Reviews Genetics 11, 476-486
Haas et al RNA Biol. 2012 Jun;9(6):924-37
34
Levels of genome regulation
We now appreciate the genome is regulated at multiple
levels:
• “Epigenetically” – chromatin structure
• Transcriptionally – RNA production
• Post-transcriptionally – RNA processing
(splicing, transport, stability)
• Translationally – protein production at ribosome
• Structurally – the folding structure of the
genome
=> These sequences all have effects on
phenotype and thus may contribute to disease
=>
All of these are encoded in
noncoding DNA sequence
35
Case study: Studying disease-associated regulatory SNPs
incorporating cohort epigenome data
A SNP for breast cancer
creates a NFκB binding site
Karczewski KJ et al
Proc Natl Acad Sci U
S A. 2013 Jun
4;110(23):9607-12
36