An Introduction to ENSEMBL
Cédric Notredame
The Top 5 Surprises in the Human Genome Map
1. The blue gene exists in 3 genotypes: Straight Leg, Loose Fit and Button-Fly.
2. Tiny villages of Hobbits actually live in our DNA and produce minute quantities of wool -- which we've been ignorantly referring to as "navel lint" and throwing away for centuries.
3. It's nearly impossible to re-fold it along the original creases.
4. Beer-drinking gene conveniently located next to bathroom-locating gene.
and the Number 1 Surprise In The Human Genome Map...
5. Now that there's a map, male scientists will attempt to cure diseases by randomly throwing stuff into beakers, stubbornly refusing to use the map or ask for directions -- all the while insisting the cure is right around the next corner.
ENSEMBL: Our Scope
-What Is ENSEMBL?
-Searching Genes in ENSEMBL
-Viewing Genes in ENSEMBL
-Doing Research With ENSEMBL
-Where Do ENSEMBL Genes Come From?
Accessing Genomes
• Genome sequences are becoming available very rapidly
– Large and difficult to handle computationally
– Everyone expects to be able to access them immediately
• Bench Biologists
– Has my gene been sequenced?
– What are the genes in this region?
– Where are all the GPCRs?
– Connect the genome to other resources
• Research Bioinformatics
– Give me a dataset of human genomic DNA
– Give me a protein dataset
What is It ?
• Set of high quality gene predictions
– From known human mRNAs aligned against the genome
– From similar proteins and mRNAs aligned against the genome
– From Genscan predictions confirmed via BLAST against protein, cDNA and EST databases
• Initial functional annotation from Interpro
• Integration with external resources (SNPs, SAGE, OMIM)
• Comparative analysis
– DNA sequence alignment
– Protein orthologs
Mr ENSEMBL ?
Richard Durbin (ACEDB)
Ewan Birney (EBI)
Challenges ?
• Scale and data flow
– mainly engineering problems
• Presentation, ease of use
– mainly engineering problems
• Algorithmic
– Partly engineering
– Partly research
ENSEMBL Home
Help!
• context sensitive help pages - click
• access other documentation via
generic home page
• email the helpdesk
HelpDesk / Suggestions
Finding What You Need
Human homepage
Text search
BLAST/SSAHA
BLAST/SSAHA ????
Changing Angle…
Map View
Anchor View
Contig View
– Chromosome
– Overview: Genes and Markers (1Mb)
– Configuration
– Detailed View: Genes, ESTs, CpG etc. (100kb)
Contig View close-up
– Customising & short cuts
– Transcripts: red & black (Ensembl predictions)
– Evidence
– Pop-up menu
Cyto View
Marker View
SNP View
Synteny View
Dotter View
Gene View
Trans View
Exon View
Protein View
Family View (CDK-like)
The Right View On My Gene
-Where Is My Gene ?
Map View
Cyto View
Contig View
-How Many Transcripts for My Gene?
Gene View
Exon View
-What Is the Function of My Gene?
-How Does My Gene Compare with Other Species?
Protein View
SNP View
Family View
Synteny View
Dotter View
Getting The Stuff
Back Home
Export-View
Data Mining with EnsMart
• The aim of EnsMart is to integrate Ensembl data into a
single, multi-species, query-optimised database
– Requirement for cross-database joins removed.
– Query-optimised schema improves speed of data
retrieval.
• Examples
– Coding SNPs for all novel GPCRs
– The sequence in the 5kb upstream region of known
proteases between D1S2806 and D1S2907
– Mouse homologues of human disease genes containing
transmembrane domain located between 1p23 and 1q23
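The kind of question EnsMart answers can be pictured as a filter over one wide, pre-joined table. The sketch below is plain Python over made-up records; it does not use the EnsMart interface, and the field names, gene ids and coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GeneRow:
    """One row of a hypothetical pre-joined (denormalised) gene table."""
    gene_id: str
    chromosome: str
    start: int
    end: int
    family: str
    known: bool       # False means 'novel'
    coding_snps: int  # number of coding SNPs attached to the gene


def select_genes(
    rows: Sequence[GeneRow],
    chromosome: Optional[str] = None,
    region: Optional[Tuple[int, int]] = None,
    family: Optional[str] = None,
    known: Optional[bool] = None,
) -> List[GeneRow]:
    """Keep the rows that satisfy every filter that was supplied."""
    selected = []
    for row in rows:
        if chromosome is not None and row.chromosome != chromosome:
            continue
        # Region filter: keep genes that overlap the inclusive interval.
        if region is not None and (row.end < region[0] or row.start > region[1]):
            continue
        if family is not None and row.family != family:
            continue
        if known is not None and row.known != known:
            continue
        selected.append(row)
    return selected


# Made-up rows, for illustration only.
TABLE = [
    GeneRow("GENE_A", "1", 1_000, 5_000, "GPCR", False, 3),
    GeneRow("GENE_B", "1", 8_000, 9_500, "GPCR", True, 1),
    GeneRow("GENE_C", "2", 2_000, 4_000, "protease", True, 0),
    GeneRow("GENE_D", "1", 20_000, 26_000, "GPCR", False, 0),
]

# "Coding SNPs for all novel GPCRs", reduced to a toy query.
novel_gpcrs = select_genes(TABLE, family="GPCR", known=False)
with_coding_snps = [row.gene_id for row in novel_gpcrs if row.coding_snps > 0]
```

Because every attribute sits in the same row, no join is needed at query time, which is the point made by the two sub-bullets above.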
EnsMart I
EnsMart II
Asking Questions With
ENSEMBL
Asking Questions
1-Selecting and Downloading Genes using
-Functional
-And Evolutionary Criteria
2-Comparing Two Pieces of Genome
Asking A Question with ENSMART
What Do You Want ???
All The Human Genes
-Involved in Cell Death
-Associated with a Disease
-With a Homologue in Mouse and Chicken
Which Species?
Select the region
Where?
What Kind of Gene?
Select the kind of data
What Kind of Function?
Choose an Evolutionary Trace
Select the kind of data
Control of Regulatory Region
Control of Biochemical Function
Control of Genetic Variation
Human Gene + Cell Death: 1133 genes
Human Gene + Cell Death + Mouse: 1106 genes
Human Gene + Cell Death + Chicken: 880 genes
Human Gene + Cell Death + C. elegans: 338 genes
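To read the four counts as proportions, a small calculation helps. The sketch below assumes the counts pair with the filters in the order shown on the slide (human cell-death genes, then the same set restricted to genes with a mouse, chicken or C. elegans homologue); that pairing is an interpretation of the slide layout.

```python
# Gene counts from the slide; the pairing of count to filter is assumed.
COUNTS = {
    "human, cell death": 1133,
    "+ mouse": 1106,
    "+ chicken": 880,
    "+ C. elegans": 338,
}


def retained_fraction(counts, baseline_key):
    """Express every count as a fraction of the baseline count."""
    baseline = counts[baseline_key]
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return {label: value / baseline for label, value in counts.items()}


fractions = retained_fraction(COUNTS, "human, cell death")
for label, fraction in fractions.items():
    print(f"{label:20s} {fraction:6.1%}")
```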
Asking A Question with ENSMART
How Do You Want it Packed ???
I would like
-Chromosome Information
-The ID of my sequences
-The corresponding OMIM Id
-The corresponding Chicken id
Asking A Question with ENSMART
How Do You Want it Packed ???
Come to think of it…
-I’d like to take a look at the
5’ upstream regions
Asking A Question with ENSMART
What Do You Want ???
I Want To know if the Mouse and the Human
Genome are conserved around the Human
Gene SNX5
Where Do ENSEMBL
Genes Come From
Genebuild
Evaluating genes and transcripts
• Ensembl gene set
• Ensembl EST genes
• Ab initio predictions
• Manual curation (Vega / Sanger)
• Gene models from other groups
• Known v. novel genes
• Gene names & descriptions
The Aim…
Overview…
manual curation
Ensembl transcript
predictions
evidence
other groups’ models
Automatic Gene Annotation
[Figure: annotation pipeline. Inputs: human proteins, other proteins, cDNAs, ESTs. Tools: Pmatch, Exonerate, Genewise, Est2Genome. Steps: add UTRs, Genscan exons, merge with other evidence. Outputs: Ensembl genes, EST genes.]
ENSEMBL Geneset
• Place all available species-specific proteins to
make transcripts
• Place similar proteins to make transcripts
– Use mRNA data to add UTRs
• Build transcripts using cDNA evidence
• Build additional transcripts using Genscan +
homology evidence
• Combine annotations to make genes with
alternative transcripts
Getting Genes from Known Proteins
Human protein sequences
SwissProt/TrEMBL/RefSeq
pmatch* v. assembly
blast and Miniseq
Genewise
*R. Durbin, unpublished
Adding the UTRs
proteins - Genewise – phases, no UTRs
cDNAs - Est2Genome – UTRs, no phases
Translatable gene with UTRs
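A toy version of the UTR step: the protein-based model supplies the coding exons, the cDNA alignment supplies the outer exon boundaries. This is a simplified sketch on forward-strand, inclusive coordinates, not the Ensembl implementation; the function names and the overlap rule are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def add_utrs(cds_exons, cdna_exons):
    """Extend a coding model with UTR sequence taken from a cDNA alignment.

    cds_exons:  sorted coding exons from the protein-based model
    cdna_exons: sorted exons from the cDNA alignment
    Returns (transcript_exons, coding_span). If the cDNA does not overlap
    both terminal coding exons, the coding model is returned unchanged.
    """
    first, last = cds_exons[0], cds_exons[-1]
    five = next((e for e in cdna_exons if overlaps(e, first)), None)
    three = next((e for e in cdna_exons if overlaps(e, last)), None)
    coding_span = (first[0], last[1])
    if five is None or three is None:
        return list(cds_exons), coding_span

    # cDNA exons lying entirely outside the coding model are pure UTR exons.
    upstream = [e for e in cdna_exons if e[1] < five[0]]
    downstream = [e for e in cdna_exons if e[0] > three[1]]

    exons = list(cds_exons)
    exons[0] = (min(five[0], exons[0][0]), exons[0][1])      # 5' extension
    exons[-1] = (exons[-1][0], max(three[1], exons[-1][1]))  # 3' extension
    return upstream + exons + downstream, coding_span


cds = [(200, 300), (400, 500), (600, 650)]
cdna = [(50, 100), (150, 300), (400, 500), (600, 800)]
transcript, coding = add_utrs(cds, cdna)
```

The coding span is carried through untouched, so the result keeps the translation from the protein alignment while gaining the untranslated ends from the cDNA.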
Gene Build is Protein-Based
• DNA-DNA alignments don’t give translatable genes
• Protein-level alignments give:
– frameshifts and splice sites
• Genewise (Ewan Birney)
– Protein – genomic alignment
– Has splice site model
– Penalises stop codons
– Allows for frameshifts
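What "translatable" means can be made concrete with a toy check on a spliced coding sequence. This is an illustrative test only, assuming the standard stop codons; it is not part of Genewise, which handles stops and frameshifts inside its alignment model.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def is_translatable(cds: str) -> bool:
    """True if the sequence is a whole number of codons with no internal stop.

    A terminal stop codon is allowed; any earlier stop, or a length that is
    not a multiple of three (for example after a frameshift), fails.
    """
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return not any(codon in STOP_CODONS for codon in codons[:-1])


intact = "ATGGCTGATTAA"      # ATG GCT GAT TAA
shifted = "ATGAGCTGATTAA"    # one inserted base breaks the frame
premature = "ATGTAAGCTGAT"   # stop codon in second position
```

A DNA-DNA alignment can place `shifted` or `premature` on the genome perfectly well, which is why the build relies on protein-level alignment to obtain models that translate.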
Making Genes
• Combine results of all Genewises and Genscans:
– Group transcripts which share exons
– Reject non-translating transcripts
– Remove duplicate exons
– Attach supporting evidence
– Write genes to database
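The grouping and de-duplication steps can be sketched with a union-find over transcripts. The example assumes that "sharing an exon" means identical exon coordinates, which is a simplification; transcript names and coordinates are made up.

```python
from collections import defaultdict


def group_transcripts(transcripts):
    """Group transcripts into genes when they share an identical exon.

    transcripts: dict mapping transcript name -> list of (start, end) exons.
    Returns a sorted list of sorted name lists, one list per gene.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Walk to the root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner = {}  # exon -> first transcript seen with that exon
    for name, exons in transcripts.items():
        for exon in exons:
            if exon in first_owner:
                root_a, root_b = find(first_owner[exon]), find(name)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_owner[exon] = name

    genes = defaultdict(list)
    for name in transcripts:
        genes[find(name)].append(name)
    return sorted(sorted(members) for members in genes.values())


def unique_exons(transcripts, members):
    """Exons of one gene with duplicates removed, in genomic order."""
    return sorted({exon for name in members for exon in transcripts[name]})


MODELS = {
    "t1": [(1, 10), (20, 30)],
    "t2": [(20, 30), (40, 50)],
    "t3": [(100, 120)],
    "t4": [(40, 50), (60, 70)],
}
genes = group_transcripts(MODELS)
```

Note that t1 and t4 end up in the same gene without sharing an exon directly; they are linked through t2.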
A Typical Human Release: NCBI 34 (Dec 2003)
• NCBI 34 assembly, released Dec 2003
• Ensembl genes: 21,787 (23,762 in release 35)
• Ensembl coding transcripts: 31,609 (plus 1,744 pseudogenes)
• Ensembl exons: 225,897
• Input human seqs: 48,176 proteins; 86,918 cDNAs
• Transcripts made from:
– Human proteins with (without) UTRs: 68% (19%)
– Non-human proteins with (without) UTRs: 2% (9%)
– cDNA alignment only: 0.8%
Manual Vs Automatic Annotation
Genes
– Sensitivity: ~90% of manual genes are in Ensembl
– Specificity: ~75% of Ensembl genes are in the manual sets
Exon bps
– Sensitivity: ~70% of manual bps are in Ensembl exons (90% of coding bps)
– Specificity: ~80% of Ensembl bps are in manual exons
Alternative transcripts per gene
– manual 3
– Ensembl 1.3
Figures are for the gene build on NCBI 33 (human) and manual annotation for chromosomes 6, 13 & 14
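The base-pair figures follow the definitions given on the slide: sensitivity is the share of manually annotated bases that fall in predicted exons, specificity the share of predicted bases that fall in manual exons. A minimal sketch of that calculation on inclusive intervals, with made-up coordinates:

```python
def covered_bases(intervals):
    """Set of base positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def sensitivity_specificity(predicted, reference):
    """Base-pair sensitivity and specificity of predicted vs reference exons.

    sensitivity = shared bases / reference bases
    specificity = shared bases / predicted bases
    """
    predicted_bases = covered_bases(predicted)
    reference_bases = covered_bases(reference)
    if not predicted_bases or not reference_bases:
        raise ValueError("both exon sets must cover at least one base")
    shared = len(predicted_bases & reference_bases)
    return shared / len(reference_bases), shared / len(predicted_bases)


manual_exons = [(1, 100)]                   # 100 bases, made up
predicted_exons = [(11, 90), (201, 210)]    # 80 shared + 10 extra bases
sn, sp = sensitivity_specificity(predicted_exons, manual_exons)
```

Using sets of positions keeps the example short; for whole chromosomes an interval-based intersection would be the practical choice.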
Each Genebuild is a Story…
Data availability
– Hard evidence in mouse, rat, human
– Similarity build more important for other species
Structural issues
– Zebrafish: many similar genes near each other; genome from different haplotypes
– C. briggsae: very dense genome; short introns
– Mosquito: many single-exon genes; genes within genes
Configuration files provide flexibility
Life in Release 2003
Species                              Gene number   Exons/gene
Homo sapiens                         21787         8.7
Mus musculus                         24948         8.7
Rattus norvegicus                    23751         7.9
Danio rerio (zebra fish)             20062         7.9
Caenorhabditis briggsae (nematode)   11884         7.2
Anopheles gambiae (mosquito)         14707         4.0
Using ESTs
EST analysis
Map ESTs using Exonerate
(determine coverage, % identity and location in genome)
Filter on %identity and depth
(5.5 million ESTs from dbEST – mapping of about 1/3)
Map to genome using Est2Genome
(determine strand, splicing)
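The filtering step between the two mapping passes might look like the sketch below. The thresholds are illustrative, not the values used in the build, and "depth" is read here as the maximum number of hits kept per EST, which is only one possible interpretation of the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One EST-to-genome hit (hypothetical record layout)."""
    est_id: str
    contig: str
    start: int
    end: int
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the EST that is aligned


def filter_hits(hits, min_identity=97.0, max_depth=1):
    """Drop low-identity hits, then keep the best max_depth hits per EST."""
    by_est = defaultdict(list)
    for hit in hits:
        if hit.identity >= min_identity:
            by_est[hit.est_id].append(hit)

    kept = []
    for est_id in sorted(by_est):
        ranked = sorted(
            by_est[est_id],
            key=lambda h: (h.identity, h.coverage),
            reverse=True,
        )
        kept.extend(ranked[:max_depth])
    return kept


HITS = [
    Hit("EST1", "c1", 1, 400, 99.0, 95.0),
    Hit("EST1", "c2", 10, 300, 98.0, 70.0),
    Hit("EST2", "c1", 500, 900, 92.0, 99.0),
    Hit("EST3", "c3", 1, 200, 97.5, 60.0),
]
best_hits = filter_hits(HITS)
```

Only the surviving hits would then go on to the slower, splice-aware Est2Genome alignment.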
Exonerate
[Figure: cDNA hits placed on golden path contigs]
• Exonerate positions cDNA sequences on assembly contigs
• Store hits as Ensembl FeaturePairs in database
EST2Genome
[Figure: Blast and Est2Genome. Labels: virtual contig, cDNA hits, Filter, Blast & Miniseq, Est_genome]
Reconstructing Alternative Splicing
ESTs
Merge ESTs according to consecutive exon
overlap and set splice ends
Genomewise
Alternative transcripts with translation and UTRs
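The merging idea can be illustrated with plain interval merging over the exons of several aligned ESTs. This sketch collapses all overlapping exons into one set of consensus exons; the real step also keeps alternative splice forms apart, which this toy version does not attempt. Coordinates are made up.

```python
def merge_est_exons(est_alignments):
    """Merge overlapping exons from several ESTs into consensus exons.

    est_alignments: list of ESTs, each a list of inclusive (start, end) exons.
    Returns a sorted list of merged, non-overlapping exons.
    """
    intervals = sorted(exon for est in est_alignments for exon in est)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous exon: extend it if this one reaches further.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


ESTS = [
    [(100, 200), (300, 400)],
    [(150, 200), (300, 450), (600, 700)],
]
consensus = merge_est_exons(ESTS)
```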
Display of EST Evidence
EST transcripts
Human ESTs
Display limited to 7 at any one point –
full data accessible in the databases
Ab initio Genscan predictions
Genscan
prediction
Evidence
supporting
Genscan
exons
Manual Curation: VEGA (VErtebrate Genome Annotation)
Sanger / Vega manual curation
Other Gene-Models
Turn on DAS sources
Other models
as ‘DAS
sources’
FASTAView display
Known Vs novel transcripts
• Naming takes place after the gene build is completed
• Transcripts/proteins mapped to SwissProt, RefSeq and SPTrEMBL
entries
• If mapped, the transcript is ‘known’; if not, it is ‘novel’
• Require high sequence similarity, but allow incomplete coverage
• Note:
– Difficult for families of closely-related genes
– Wrongly annotated pseudogenes may also cause problems
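The known/novel rule reduces to a threshold on identity with no threshold on coverage. A minimal sketch follows; the 95% cut-off, the record layout and the accessions are assumptions, since the slide only says "high sequence similarity".

```python
from typing import NamedTuple, Sequence


class Mapping(NamedTuple):
    """A transcript-to-database mapping (hypothetical record layout)."""
    database: str     # e.g. "SwissProt", "RefSeq", "SPTrEMBL"
    accession: str
    identity: float   # percent identity over the aligned part
    coverage: float   # percent of the entry that is aligned


def classify_transcript(mappings: Sequence[Mapping], min_identity: float = 95.0) -> str:
    """Return 'known' if any mapping is similar enough, otherwise 'novel'.

    Coverage is deliberately not tested: incomplete coverage is allowed.
    """
    for mapping in mappings:
        if mapping.identity >= min_identity:
            return "known"
    return "novel"


partial_but_identical = [Mapping("SwissProt", "ACC_EXAMPLE_1", 99.0, 40.0)]
complete_but_diverged = [Mapping("RefSeq", "ACC_EXAMPLE_2", 80.0, 100.0)]
```

The first caveat in the note above is visible here: two members of a closely related family can both pass the identity test against the same entry.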
Gene Names and Descriptors
Names and descriptions
• Names taken from mapped database entries
• Official HGNC (HUGO) name used if available (or equivalent for other species)
• Otherwise SwissProt > RefSeq > SPTrEMBL
• Novel transcripts have only Ensembl stable ids
• Genes named after ‘best-named’ transcript
• Gene description taken from mapped database entries (source given)
• Hints:
– Orthology can provide useful confirmation
– If no description, check for any Family description
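The naming order above is a simple priority lookup. The sketch below encodes it; the identifiers are placeholders, and the tie-breaking between transcripts of equal rank (first one wins) is an assumption the slide does not address.

```python
# Naming sources in decreasing order of preference, as listed on the slide.
NAME_PRIORITY = ("HGNC", "SwissProt", "RefSeq", "SPTrEMBL")


def name_transcript(stable_id, xrefs):
    """Return (name, rank) for a transcript.

    xrefs maps a source database to the name mapped from it. A transcript
    with no mapping is novel and keeps its stable id, with the worst rank.
    """
    for rank, source in enumerate(NAME_PRIORITY):
        if source in xrefs:
            return xrefs[source], rank
    return stable_id, len(NAME_PRIORITY)


def name_gene(gene_stable_id, transcripts):
    """Name a gene after its best-named transcript.

    transcripts maps transcript stable id -> xrefs dict. If every transcript
    is novel, the gene keeps its own stable id.
    """
    best_name, best_rank = gene_stable_id, len(NAME_PRIORITY)
    for stable_id, xrefs in transcripts.items():
        name, rank = name_transcript(stable_id, xrefs)
        if rank < best_rank:
            best_name, best_rank = name, rank
    return best_name


# Placeholder identifiers, for illustration only.
EXAMPLE = {
    "TRANSCRIPT_1": {"RefSeq": "REFSEQ_NAME"},
    "TRANSCRIPT_2": {"SwissProt": "SWISSPROT_NAME", "SPTrEMBL": "TREMBL_NAME"},
    "TRANSCRIPT_3": {},
}
gene_label = name_gene("GENE_1", EXAMPLE)
```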
Stability…
www.ensembl.org/Docs/wiki/html/EnsemblDocs/Answer006.html
Geneview and Exonview
Gene name & description
Alternative transcripts
Links to ExonView
Links to putative orthologues
Transcript name
Mapping to external databases
Evidence used to build the transcript
Evidence Tracks in ContigView
Expanded tracks
Compressed tracks
Future Directions
• Improved pseudogene annotation, for all species
• Upstream regulatory elements - using CpG islands, Eponine predictions, motifs to aid in prediction of transcription start sites
• Improve use of cDNAs - can already use to add alternatively spliced transcripts
• Improve UTR extension
• Make use of comparative data
• Non coding RNAs - currently filtered out of build sets
ENSEMBL
-Finding the right DATA:
ENSMART and BLAST
-The central View of ENSEMBL: ContigView
-Genome Comparison: Synteny View
-ENSEMBL incorporates all the evidence into its gene models
Genebuild overview
[Figure: genebuild flow chart. Inputs: human proteins, other proteins, human cDNAs, human ESTs. Tools: Pmatch, Exonerate, Genewise, Est2Genome, ClusterMerge, Genebuilder, Gene Combiner. Intermediates: Genewise genes, aligned cDNAs, Genewise genes with UTRs, supported genscans (optional), aligned ESTs, preliminary gene set, cDNA genes, pseudogenes, final set + pseudogenes. Outputs: core Ensembl genes, Ensembl EST genes.]
Annotation Stages
• Place all known genes
– Map all AVAILABLE species-specific proteins to the genome and find gene structure using Genewise
• Annotate novel genes
– Use proteins from other species to build new transcripts based on homology
– Use AVAILABLE mRNAs to add UTRs to the built transcripts
– Use further homology to proteins, mRNAs and ESTs to build transcripts using Genscan exons
• Combine annotations
Manual Vs Automatic Annotation
Gene locus level
         Sn     Sp
chr13    0.90   0.74
chr14    0.92   0.77
chr6     0.94   0.72
ENSEMBL predictions cover 90% or more of manually annotated gene structures, with around 75% of the predictions covered by a manual annotation.
Exon level (based on transcript pairs)
         Coding exons only   All exons
         Sn     Sp           Sn     Sp
chr13    0.83   0.90         0.73   0.78
chr14    0.78   0.88         0.69   0.77
chr6     0.85   0.89         0.73   0.76
UTR exon predictions are less accurate than coding exons.
92% of coding exons and 80% of all exons are exact matches.
Numbers are for NCBI33 genebuild