The Ensembl Gene set
The “Genebuild”
21 April 2008
Outline
* The GeneBuild
  (determining the Ensembl gene set)
* What does it mean for the scientist?
* 'Annotation pipeline' vs 'manual curation'
* Pseudogenes
* ncRNAs
* The CCDS project
2 of 32
Introduction
What is available?
I) Sequence Assemblies from genome
sequencing efforts
3 of 32
Gene Sequencing and the Assembly
Clone-based sequencing generates clones, unlike newer sequencing methods
http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
4 of 32
Clones Available
Human: Tilepath (used in the assembly)
Ciona intestinalis: Shotgun assembly
5 of 32
ContigView: Clones and Contigs
Contigs
Clones
(Plate/well numbers)
Ensembl
Transcripts
6 of 32
Task:
View the tilepath clone in ContigView
for the region containing the human
BRCA2 gene.
Hint: Start with a search for the BRCA2
gene.
7 of 32
The Ensembl Geneset
How does Ensembl use mRNA and
protein information along with
the sequence assembly to define
distinct genes on the genome?
Protein
Sequence Assembly
Ensembl Geneset
8 of 32
Once the Assembly is Imported…
Proteins/mRNAs are aligned.
These have been submitted to
databases such as:
UniProt (manually curated) and
RefSeq (partially manually curated)
9 of 32
The Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
* UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
* NCBI RefSeq: a partially manually curated database
* UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
* EMBL / GenBank / DDBJ: primary nucleotide sequence repositories
10 of 32
Database Relationship
[Diagram: individual labs' submissions go to EMBL-Bank, DDBJ and GenBank; NCBI RefSeq and UniProt (Swiss-Prot, TrEMBL) draw on these repositories.]
11 of 32
Genebuild
[Diagram: inputs to the Genebuild. Sequence (assembly), mRNA and ESTs from EMBL-Bank / GenBank / DDBJ; proteins (e.g. Swiss-Prot); manual annotation (HAVANA). Outputs: Ensembl genes and, from ESTs, EST genes.]
12 of 32
Why do I want to know?…
Ensembl genes may be based on
multiple proteins/mRNAs
What is an Ensembl gene based on?
13 of 32
Task
Look at the evidence for the
human EPO gene.
What was this gene based on?
Hint: Go to Exon Information from
the GeneView page
14 of 32
EPO gene supporting evidence
15 of 32
Species-Specific GeneBuilds
Pan troglodytes genes are built by
projection from human genes.
Zebrafish has many gene
duplications.
Homo sapiens genes must have
protein evidence, not just mRNA.
16 of 32
Task
When was the chimpanzee (Pan
troglodytes) Genebuild
performed?
Can you find information as to
how genes were annotated?
Hint: Look on the chimpanzee
index page
17 of 32
External Gene Set: VEGA/Havana
Human, zebrafish, mouse and dog
Havana transcripts in blue or
gold…
What are Havana transcripts?
18 of 32
Havana and Ensembl match
When Havana (manual curation) and Ensembl (automatic methods) predict
the same transcript, base pair for base pair, the transcripts are merged and
coloured gold.
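The "base pair for base pair" rule can be pictured with a minimal sketch. It assumes each transcript is described only by its exon coordinates on the same chromosome and strand; the function names and the "separate" label are hypothetical and are not Ensembl's actual merge code.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based inclusive coordinates.
Exon = Tuple[int, int]


def same_transcript(havana_exons: List[Exon], ensembl_exons: List[Exon]) -> bool:
    """Return True only if both models have exactly the same exon boundaries."""
    return sorted(havana_exons) == sorted(ensembl_exons)


def display_colour(havana_exons: List[Exon], ensembl_exons: List[Exon]) -> str:
    """Identical models are merged and shown in gold; anything else stays separate."""
    if same_transcript(havana_exons, ensembl_exons):
        return "gold"
    return "separate"


if __name__ == "__main__":
    havana = [(100, 200), (300, 400)]
    ensembl_same = [(300, 400), (100, 200)]
    ensembl_off_by_one = [(100, 201), (300, 400)]
    print(display_colour(havana, ensembl_same))
    print(display_colour(havana, ensembl_off_by_one))
```

A single base of difference at any exon boundary is enough to keep the two transcripts apart in this sketch.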
20 of 32
Manually-curated gene sets in
Ensembl
Vega (Havana)
Homo sapiens, Danio rerio,
Mus musculus and Canis familiaris
WormBase
Caenorhabditis elegans
FlyBase
Drosophila melanogaster
SGD
Saccharomyces cerevisiae
21 of 32
What Can Go Wrong?
I) A gap in the assembly
BLAST hit (Swiss-Prot entry)
Gene might not be found in Ensembl
II) Fused genes
Gene might be associated with two names
23 of 32
Outline
* The genome sequence
* The Genebuild
* 'Manual curation' by Havana
* Other: EST gene set
* Pseudogenes
* ncRNAs
24 of 32
Expressed Sequence Tags vs 'cDNA'
ESTs are annotated separately. Why?
* mRNA and cDNA used in the GeneBuild:
  sequenced to a high standard, often complete.
* EST: lower-quality sequence.
  'One-shot' sequencing of cDNA from the 5' and 3' ends
  creates the EST sequence.
  ESTs are only 500-800 nucleotides long.
  Low-quality fragments: sequence error of ~2%.
BUT ESTs confer useful expression information:
* Discovery of new genes, especially in diseased organisms
* Tissue type
* Timing/developmental stage
* Sampling of more transcripts and variants
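To see why a ~2% error rate matters for fragments of 500-800 nucleotides, the arithmetic can be sketched as below. The function names are hypothetical, and the error-free probability assumes that errors occur independently at each base, which is a simplification.

```python
def expected_errors(length_nt: int, error_rate: float = 0.02) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return length_nt * error_rate


def error_free_probability(length_nt: int, error_rate: float = 0.02) -> float:
    """Chance that a read has no errors at all, assuming independent errors."""
    return (1.0 - error_rate) ** length_nt


if __name__ == "__main__":
    for length in (500, 800):
        print(
            f"{length} nt: about {expected_errors(length):.0f} expected errors, "
            f"P(error-free) = {error_free_probability(length):.1e}"
        )
```

Under these assumptions a typical EST carries roughly 10 to 16 errors, which is one reason ESTs are kept out of the main GeneBuild and annotated as a separate set.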
25 of 32
Where Can I See This EST Geneset?
In ContigView: choose 'EST genes' to show the EST track.
26 of 32
Pseudogenes: ‘False’ Genes
Processed: an mRNA (with poly-A tail, AAAAAA) undergoes reverse transcription and re-integration into the genome, giving a pseudogene that keeps the poly-A tract.
Unprocessed: produced by gene duplication and rearrangement.
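The two origins leave different footprints, which a toy classifier can illustrate. This is a sketch under stated assumptions, not the Ensembl pseudogene pipeline: it assumes a processed pseudogene has no introns and ends in a poly-A run, and that an unprocessed one keeps the intron structure of its parent gene. All names and the `min_run` threshold are hypothetical.

```python
def has_polya_tail(sequence: str, min_run: int = 6) -> bool:
    """True if the sequence ends in at least `min_run` consecutive A bases."""
    seq = sequence.upper()
    trailing_a = len(seq) - len(seq.rstrip("A"))
    return trailing_a >= min_run


def classify_pseudogene(sequence: str, intron_count: int) -> str:
    """Toy rule: intronless with a poly-A tail suggests a processed pseudogene."""
    if intron_count == 0 and has_polya_tail(sequence):
        return "processed"
    if intron_count > 0:
        return "unprocessed"
    return "undetermined"


if __name__ == "__main__":
    print(classify_pseudogene("ATGGCCTGAAAAAAAA", intron_count=0))
    print(classify_pseudogene("ATGGCCTGA", intron_count=3))
```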
27 of 32
ncRNAs (non coding RNAs)
What types are in Ensembl?
tRNA (transfer RNA)
rRNA (ribosomal RNA)
scRNA (small cytoplasmic)
snRNA (small nuclear)
snoRNA (small nucleolar)
miRNA (microRNA)
28 of 32
ncRNAs (2 types)
I) RNAs with low sequence homology can be
identified through conserved secondary
structure (search the genome using
Rfam patterns)
II) High sequence conservation (miRNA)
BLAST alignment, then
'RNAfold' applied to make sure
sequences can fold (hairpin)
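The idea behind the folding check can be shown with a toy stand-in. This is not RNAfold, which predicts minimum free energy structures; the sketch below only tests whether the two ends of a sequence could base-pair into a stem around a loop. The function name and the `stem_len` and `min_loop` defaults are assumptions for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_hairpin(rna: str, stem_len: int = 4, min_loop: int = 3) -> bool:
    """Check whether the first and last `stem_len` bases can pair into a stem."""
    seq = rna.upper().replace("T", "U")
    if len(seq) < 2 * stem_len + min_loop:
        return False  # too short to hold a stem plus a loop
    five_prime_arm = seq[:stem_len]
    three_prime_arm = seq[-stem_len:][::-1]  # reversed so paired bases line up
    return all((a, b) in PAIRS for a, b in zip(five_prime_arm, three_prime_arm))


if __name__ == "__main__":
    print(can_form_hairpin("GGGGAAAACCCC"))
    print(can_form_hairpin("AAAAAAAAAAAA"))
```

A real pipeline would score the whole structure thermodynamically; the sketch only captures the requirement that a candidate miRNA precursor must be able to fold back on itself.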
29 of 32
ncRNAs… where can I see them?
Find them in ContigView, or use BioMart.
30 of 32
Summary – Ensembl Genes
* All Ensembl genes are based on biological evidence
  (protein and mRNA).
* One Ensembl gene may come from proteins and
  mRNAs in various databases.
* Havana (manually curated) genes are incorporated
  into the Ensembl gene set, merged for human.
* The CCDS set strives for consensus coding
  sequences across databases.
* Pseudogenes and ncRNAs are annotated, along with a
  separate EST gene set.
31 of 32
For more on GeneBuild:
Help and Documentation
(About Ensembl)
http://www.ensembl.org/info/about/docs/genome_annotation.html
32 of 32