Ensembl
Database and Web Browser
www.ensembl.org
Stephen Baird
Apoptosis Research Centre
Children’s Hospital of Eastern Ontario
[email protected]
Lecture/Lab 7.3
Lecture 7.1
• Focus on vertebrates
• No fungi/plants
• Brassica/Arabidopsis genome browser is at http://ensembl.warwick.ac.uk/
What is Ensembl?
• Joint project of EBI and Sanger
• Automated annotation of eukaryotic genomes
• Open source software
• Relational database system
• Web interface
“The main aim of this campaign is to encourage scientists across the world - in academia, pharmaceutical companies, and the biotechnology and computer industries - to use this free information.”
- Dr. Mike Dexter, Director of the Wellcome Trust
Ensembl components
Data (Search tools):
• Chromosomes (FeatureView, KaryoView, CytoView, MapView)
• Diseases (DiseaseView)
• SNPs and Haplotypes (SNPView, GeneSNPView, HaploView, LDView)
• Functions (GOView)
• Sequence Similarity (BLAST, SSAHA)
• Genes (GeneView, TransView, ExonView, GeneSeqView)
• Protein (ProtView, DomainView, FamilyView)
• Genome Sequence (ContigView)
• Markers (MarkerView)
• Comparative Genomics (ContigView, MultiContigView, SyntenyView, GeneView)
• Text (TextView)
• Other Annotations
• Anything (BioMart/MartView)
Ensembl Gene Annotation
• “Basis for initial analysis and publication of most vertebrate genomes”
• Genome assembly from NCBI
• Gene build system
– Targeted gene builds predict known genes
– Similarity gene builds predict novel genes
Curwen et al, Genome Res 14: 942-950, 2004
Targeted gene build
• Align known proteins with pmatch and BLAST
• Incorporate aligned cDNA sequences to find splice sites, UTRs with genewise
[Figure: ContigView of best in genome gene with associated evidence. Labels: Known gene (p53), UTRs predicted, Proteins aligned, cDNAs aligned, Unigene clusters aligned]
Similarity gene build
• Identify novel exons ab initio using Genscan
• Confirm exons by BLAST to known proteins, mRNAs, UniGene clusters
[Figure: ContigView of homology gene with associated evidence. Labels: Novel gene, GenScan predictions, Proteins aligned, Unigene clusters aligned]
Ensembl Gene Annotation
• Resulting “Ensembl genes” are highly accurate with low false positive rates
• Ensembl human gene identifiers are 95% stable between builds
• Ensembl and RefSeq differ for 8-12% of the genes
– The Consensus CDS (CCDS) project is a collaborative effort between Ensembl/EBI, UCSC and NCBI to identify a core set of human protein coding regions that are consistently annotated and of high quality (~13,000 genes).
Manually curated genes: VEGA
• Some chromosomes contain manually curated genes from the VEGA database
• The “Otter manual annotation system” allows integration of automatic and manual annotations (e.g. from Apollo) into Ensembl by the Human and Vertebrate Analysis and Annotation (HAVANA) group annotators at the Sanger Centre
[Figure label: VEGA gene]
Ensembl EST genes
• ESTs not accurate enough to produce Ensembl genes, but important for identifying alternative transcripts
• ESTs aligned to genome and merged to create an independent set of “EST genes”
[Figure labels: Known gene, EST genes, Unigene clusters aligned]
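The merging step can be pictured as interval clustering: EST alignments on the same chromosome and strand that overlap are collapsed into one candidate locus. The sketch below is only an illustration of that idea, not the Ensembl EST gene build itself. The function name, the tuple layout and the coordinates are all hypothetical, and each alignment is reduced to a single span rather than a spliced alignment.

```python
from collections import defaultdict


def merge_est_alignments(alignments):
    """Cluster overlapping EST alignments into candidate loci.

    alignments: iterable of (chrom, strand, start, end), 1-based inclusive.
    Returns a sorted list of (chrom, strand, start, end, n_ests).
    """
    # Group spans so that only alignments on the same chromosome and
    # strand can be merged with each other.
    by_key = defaultdict(list)
    for chrom, strand, start, end in alignments:
        by_key[(chrom, strand)].append((start, end))

    loci = []
    for (chrom, strand), spans in by_key.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        count = 1
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the current cluster: extend it.
                cur_end = max(cur_end, end)
                count += 1
            else:
                # Gap: close the current cluster and start a new one.
                loci.append((chrom, strand, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        loci.append((chrom, strand, cur_start, cur_end, count))
    return sorted(loci)


# Made-up alignments, for illustration only.
ests = [
    ("4", "-", 100, 400),
    ("4", "-", 350, 900),
    ("4", "-", 2000, 2500),
    ("4", "+", 300, 600),
]
est_loci = merge_est_alignments(ests)
```

Here the two overlapping minus-strand ESTs collapse into one locus, while the distant EST and the plus-strand EST stay separate.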
Pseudogenes
• Processed pseudogenes are identified in the annotation (lack of introns, frameshifts, presence of multi-exon version elsewhere in genome, etc.)
[Figure label: Pseudogene]
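The criteria listed for processed pseudogenes can be read as a simple rule-based flag. The sketch below is a toy version under stated assumptions: the `GeneModel` fields and the way the criteria are combined (intronless, plus either a frameshift or a multi-exon copy elsewhere) are hypothetical choices for illustration, not the rules Ensembl actually applies.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    # Hypothetical summary of a predicted gene model.
    name: str
    exon_count: int
    has_frameshift: bool
    multi_exon_homolog_elsewhere: bool


def looks_like_processed_pseudogene(model: GeneModel) -> bool:
    """Flag intronless models that also show a second pseudogene signal."""
    intronless = model.exon_count == 1
    second_signal = model.has_frameshift or model.multi_exon_homolog_elsewhere
    return intronless and second_signal


# Made-up examples.
candidate = GeneModel("modelA", exon_count=1, has_frameshift=False,
                      multi_exon_homolog_elsewhere=True)
spliced = GeneModel("modelB", exon_count=7, has_frameshift=False,
                    multi_exon_homolog_elsewhere=False)
flags = [looks_like_processed_pseudogene(m) for m in (candidate, spliced)]
```

Requiring two signals rather than one is a design choice made here so that genuine single-exon genes are not flagged on exon count alone.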
Noncoding RNA Genes
• Genes with no ORFs that are functional (tRNAs, rRNAs, miRNAs …)
• 7220 annotations from Sean Eddy and Tom Jones
[Figure labels: miRNAs, Coding gene]
Example 1: Exploring Caspase-3
• Aim to demonstrate basic browsing and views
• Caspase-3 is a gene involved in apoptosis (cell suicide)
• We will look at:
– Gene annotation
– SNPs
– Orthologs and genome alignments
– Alternative transcripts and EST genes
– Protein Structure
[Screenshot: species-specific homepage, with a Gene text search for “caspase-3”]
GeneView
[Screenshot labels: GeneSpliceView, GeneRegulationView, ContigView, GeneSNPView, ExportView, TransView of transcript, ExonView, ProteinView]
Orthologs predicted by sequence similarity and synteny
GeneView
DAS - Distributed Annotation System
- external annotation of splicing, transcripts, array expression, PubMed links, associated phenotypes, Protonet, Reactome, UniProt.
Information for each Transcript
- similarity matches, links to RefSeq, OMIM, PDB, Array probes, GO, InterPro, Protein FamilyView, transcript structure, protein properties.
GeneView
GeneSNPView
GeneSNPView
Other SNP/Haplotype tools
• SNPView – info on a single SNP
• ProteinView (protein sequence with SNP markup)
• LDView: View linkage disequilibrium (only limited regions)
• HaploView: View haplotypes (only limited regions)
Click Back to GeneView
ContigView
[Screenshot labels: Chromosome and bands; Sequence contigs; To Detailed view]
ContigView: Detailed View
[Screenshot labels: See other tracks, options in menus; Genscan predictions; Gene annotations; Targeted gene predictions (2 alternative transcripts); EST genes; Other tracks: Aligned sequences etc.; Base View Region]
ContigView- Features menu
Export image (ps, pdf, svg) or fasta file
Click on ‘close menu’
MultiContigView
[Screenshot labels: Conserved regions; Rat ortholog]
Other Comparative Genomics Tools
• Up to 6 genome alignments with MLAGAN in AlignSliceView
• Other view is SyntenyView
• Also access comparative genomics through EnsMart
DAS-Distributed Annotation System
Data Mining with BioMart
• Allows very fast, cross-data source querying
• Search for genes (features, sequences, etc.) or SNPs based on
– Position; function; domains; similarity; expression; etc.
• Accessible from Ensembl website (MartView) as well as stand-alone
• Extremely powerful for data mining
Example 2: BioMart
• A new disease locus has been mapped between markers D21S1991 and D21S171. It may be that the gene involved has already been identified as having a role in another disease. What candidates are in this region?
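The same question can also be written down as a BioMart XML query instead of being clicked together in MartView. The sketch below only builds the query text with the Python standard library and does not contact any server. The dataset, filter and attribute names (`hsapiens_gene_ensembl`, `chromosome_name`, `marker_start`, `marker_end`, `ensembl_gene_id`, `external_gene_name`, `mim_morbid_description`) are assumptions that should be checked against the filter and attribute lists of the BioMart release in use.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string (no network access)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict which genes are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes choose which columns appear in the output.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are assumed, not taken from the slides.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={
        "chromosome_name": "21",
        "marker_start": "D21S1991",
        "marker_end": "D21S171",
    },
    attributes=["ensembl_gene_id", "external_gene_name", "mim_morbid_description"],
)
print(query_xml)
```

Keeping filters and attributes as plain Python collections makes it easy to reuse the same builder for other regions or output columns.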
BioMart: Choosing your dataset
BioMart: Filtering
[Screenshot: filter values entered are chromosome 21, markers D21S1991 and D21S171]
BioMart: Output
Note you can output different types of information
BioMart: Output
Sequence Similarity Searching
• Use SSAHA for exact matches (fast)
• Use BLAST for more distant similarity (slow)
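The speed of a hashing-based search such as SSAHA comes from indexing the target sequence once as a table of short words (k-mers), so a query is located by lookup rather than by scanning. The following is a deliberately simplified sketch of that idea, not the SSAHA implementation. The function names and the toy sequence are hypothetical, and the real tool differs in how it samples words and combines hits.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer in the sequence to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_exact_matches(query, sequence, index, k):
    """Return 0-based start positions where query occurs exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    hits = []
    # Look up the first k-mer of the query, then verify the full match
    # only at the candidate positions the index returns.
    for pos in index.get(query[:k], []):
        if sequence[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


# Toy sequence, for illustration only.
genome = "ACGTACGTTAGCACGTTAGG"
index = build_kmer_index(genome, 4)
matches = find_exact_matches("ACGTTAG", genome, index, 4)
print(matches)
```

A single mismatch in the seed word means the lookup finds nothing, which is why this style of search suits near-exact matches and a slower alignment tool is needed for distant similarity.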
Looking for Help?
DAS: Getting your Own Data in Ensembl
• DAS (Distributed Annotation System)
– Anyone can load data into Ensembl and allow others to view it in the same view (e.g. ContigView) as other Ensembl annotations
– Click on ‘Manage sources’ in DAS dropdown menu
Other Ways to Access Ensembl
• MySQL database directly accessible
• APIs for Perl and Java
• Other software
– Apollo Java genome annotation viewer/editor
– Sockeye Java viewer
• You can get your own local version of Ensembl: software and data freely available
– http://www.ensembl.org/Docs/
[Figure label: Sockeye]
Exercises
• Ex 1. Homologues of human genes are often present in Fugu rubripes in more condensed form (with shorter introns). Is this true for the gene PTEN, a tumor suppressor often mutated in advanced cancers?
– Try MultiContigView; can you think of another way to get this information as well?
• Ex 2. The microRNA bantam regulates the Drosophila (fruitfly) gene hid by binding the 3’ UTR. Hid is involved in apoptosis, and it is possible that binding sites for bantam could be found in the 3’ UTR of other apoptosis genes as well. Obtain the 3’ UTR sequence of all Drosophila genes known to be involved in apoptosis.
– Using BioMart, the GO term for apoptosis is GO:0006915, evidence code TAS
• Ex 3. The file “PCR_product.txt” on the webserver contains the sequence of a PCR product amplified from a mouse cDNA library. What gene does the product correspond to? Does it contain the complete coding sequence of that gene?
– Would it be better to use BLAST or SSAHA?
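For Ex 1, once exon coordinates for the two orthologs have been exported, "more condensed" can be checked numerically by comparing intron lengths. The sketch below assumes each transcript is given as a list of (start, end) exon coordinates. The function name is hypothetical and the coordinates are invented placeholders, not real PTEN data.

```python
def intron_lengths(exons):
    """Return intron lengths for one transcript.

    exons: list of (start, end) pairs, 1-based inclusive, non-overlapping.
    """
    ordered = sorted(exons)
    # An intron is the gap between the end of one exon and the start of the next.
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Invented coordinates, for illustration only.
human_like = [(1, 100), (5001, 5100), (12001, 12200)]
fugu_like = [(1, 100), (401, 500), (901, 1100)]

human_introns = intron_lengths(human_like)
fugu_introns = intron_lengths(fugu_like)
print(sum(human_introns), sum(fugu_introns))
```

With real exported coordinates, a much smaller intron total for the Fugu transcript alongside similar exon lengths would support the claim in the exercise.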