The Ensembl Gene Set: The “Genebuild”
21 April 2008

Outline
  The GeneBuild (determining the Ensembl gene set)
  What it means for the scientist
  ‘Annotation pipeline’ vs ‘manual curation’
  Pseudogenes
  ncRNAs
  The CCDS project

Introduction: What is available?
  I) Sequence assemblies from genome sequencing efforts

Gene Sequencing: the Assembly
  This generates clones, vs new sequencing methods
  http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html

Clones Available
  Human: tilepath (used in the assembly)
  Ciona intestinalis: shotgun assembly

ContigView: Clones and Contigs
  Contigs
  Clones (plate/well numbers)
  Ensembl transcripts

Task
  View the tilepath clone in ContigView for the region containing the human BRCA2 gene.
  Hint: Start with a search for the BRCA2 gene.

The Ensembl Geneset
  How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome?
  [Diagram: Protein, Sequence Assembly, Ensembl Geneset]

Once the Assembly is Imported…
  Proteins/mRNAs are aligned.
  These have been submitted to databases such as UniProt (manually curated) and RefSeq (partially manually curated).

The Biological Evidence
  All Ensembl gene predictions are based on experimental evidence:
  UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
  NCBI RefSeq: a partially manually curated database
  UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
  EMBL / GenBank / DDBJ: primary nucleotide sequence repository

Database Relationship
  [Diagram: NCBI RefSeq, Individual Lab’s Submission, EMBL-Bank, DDBJ, GenBank, UniProt, SwissProt, TrEMBL]

Genebuild
  [Diagram: EMBL-Bank, GenBank, DDBJ, Sequence (Assembly), Proteins (e.g. Swiss-Prot), Manual annotation (HAVANA), Ensembl, mRNA, EST, EST genes]

Why do I want to know?…
  Ensembl genes may be based on multiple proteins/mRNAs.
  What is an Ensembl gene based on?
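
Since one Ensembl gene may rest on several proteins and mRNAs, it can help to order a gene's supporting evidence by how heavily curated its source database is. The following is a minimal sketch of that idea, not the Ensembl pipeline: the numeric ranks, the placement of EMBL/GenBank/DDBJ last, the Evidence class and the accession strings are all illustrative assumptions. Only the curation descriptions come from the evidence list above.

```python
from dataclasses import dataclass

# Curation levels as described on "The Biological Evidence" slide.
# The numeric ranks are an illustrative assumption, not Ensembl values.
CURATION_RANK = {
    "UniProt/Swiss-Prot": 0,  # manually curated
    "NCBI RefSeq": 1,         # partially manually curated
    "UniProt/TrEMBL": 2,      # automatically annotated
    "EMBL/GenBank/DDBJ": 3,   # primary nucleotide repository (assumed last)
}


@dataclass(frozen=True)
class Evidence:
    accession: str  # hypothetical identifier, for illustration only
    source: str     # one of the keys of CURATION_RANK


def rank_evidence(items):
    """Order evidence from most to least curated, then by accession."""
    return sorted(items, key=lambda e: (CURATION_RANK[e.source], e.accession))


def summarise(items):
    """Count evidence items per source, most curated source first."""
    counts = {}
    for e in rank_evidence(items):
        counts[e.source] = counts.get(e.source, 0) + 1
    return counts


# Made-up supporting evidence for a single gene.
evidence = [
    Evidence("EXAMPLE_MRNA_1", "EMBL/GenBank/DDBJ"),
    Evidence("EXAMPLE_PROT_2", "UniProt/TrEMBL"),
    Evidence("EXAMPLE_PROT_1", "UniProt/Swiss-Prot"),
    Evidence("EXAMPLE_REFSEQ_1", "NCBI RefSeq"),
]

for e in rank_evidence(evidence):
    print(e.source, e.accession)
```

In practice the supporting evidence for a real gene is read from the Exon Information page in GeneView; the sketch only shows how such a list could be sorted once it has been collected.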

Task
  Look at the evidence for the human EPO gene. What was this gene based on?
  Hint: Go to Exon Information from the GeneView page.

EPO gene supporting evidence

Species-Specific GeneBuilds
  Pan troglodytes genes are built by projection from human genes.
  Zebrafish has many gene duplications.
  Homo sapiens genes must have protein evidence, not just mRNA.

Task
  When was the chimpanzee (Pan troglodytes) Genebuild performed? Can you find information on how the genes were annotated?
  Hint: Look on the chimpanzee index page.

External Gene Set: VEGA/Havana
  Human, zebrafish, mouse and dog
  Havana transcripts in blue or gold…
  What are Havana transcripts?

Havana and Ensembl match
  When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.

Manually-curated gene sets in Ensembl
  Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris
  WormBase: Caenorhabditis elegans
  FlyBase: Drosophila melanogaster
  SGD: Saccharomyces cerevisiae

What Can Go Wrong?
  I) A gap in the assembly
     BLAST hit (SwissProt entry)
     Gene might not be found in Ensembl
  II) Fused genes
     Gene might be associated with two names

Outline
  The genome sequence
  The Genebuild
  ‘Manual curation’ by Havana
  Other: EST gene set, pseudogenes, ncRNAs

Expressed Sequence Tags vs ‘cDNA’
  ESTs are annotated separately. Why?
  mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.
  EST: lower-quality sequence.
    ‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence.
    ESTs are only 500-800 nucleotides long.
    Low-quality fragments: sequence error of ~2%.
  BUT ESTs confer useful expression information:
    Discovery of new genes, especially in diseased organisms
    Tissue type
    Timing/developmental stage
    Sampling of more transcripts and variants

Where Can I See This EST Geneset?
  In ContigView, choose ‘EST genes’ to display the EST track.

Pseudogenes: ‘False’ Genes
  Processed: mRNA (AAAAAA) -> reverse transcription and re-integration -> pseudogene (AAAAAA)
  Unprocessed: produced by gene duplication and rearrangement

ncRNAs (non-coding RNAs)
  What types are in Ensembl?
  tRNA (transfer RNA)
  rRNA (ribosomal RNA)
  scRNA (small cytoplasmic RNA)
  snRNA (small nuclear RNA)
  snoRNA (small nucleolar RNA)
  miRNA (microRNA)

ncRNAs (2 types)
  I) RNA with low sequence homology can be identified through conserved secondary structure (search the genome using an Rfam pattern).
  II) High sequence conservation (miRNA):
     BLAST alignment
     ‘RNA fold’ applied to make sure the sequences can fold (hairpin)

ncRNAs… where can I see them?
  Find them in ContigView, or use BioMart.

Summary – Ensembl Genes
  All Ensembl genes are based on biological evidence (protein and mRNA).
  One Ensembl gene may come from proteins and mRNAs in various databases.
  Havana (manually curated) genes are incorporated into the Ensembl geneset, and merged for human.
  The CCDS set strives for consensus coding sequences across databases.
  Pseudogenes and RNAs are annotated, along with a separate EST gene set.

For more on GeneBuild
  Help and Documentation (About Ensembl)
  http://www.ensembl.org/info/about/docs/genome_annotation.html
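
As a recap of how Havana and Ensembl transcripts are merged, the rule "same transcript, base pair for base pair" can be sketched as a comparison of exon structures. This is a simplified illustration under stated assumptions, not the Ensembl merge code: the Transcript class, the labels and the coordinates are hypothetical, and identical exon coordinates on the same assembly and strand are taken to stand in for base-pair identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    source: str   # "havana" (manual) or "ensembl" (automatic)
    chrom: str
    strand: int   # +1 or -1
    exons: tuple  # tuple of (start, end) pairs on the assembly


def structure(t):
    """Key that is equal only when two transcripts have identical exons."""
    return (t.chrom, t.strand, tuple(sorted(t.exons)))


def merge_transcripts(havana, ensembl):
    """Label each transcript as merged, Havana only, or Ensembl only.

    A Havana transcript whose structure matches an Ensembl transcript
    exactly is reported once as merged (the gold transcripts in the browser).
    """
    ensembl_structs = {structure(t) for t in ensembl}
    havana_structs = {structure(t) for t in havana}
    result = []
    for t in havana:
        if structure(t) in ensembl_structs:
            result.append(("merged (gold)", t))
        else:
            result.append(("havana only", t))
    for t in ensembl:
        if structure(t) not in havana_structs:
            result.append(("ensembl only", t))
    return result


# Made-up coordinates: the second Ensembl transcript differs by one exon end.
havana = [Transcript("havana", "13", 1, ((100, 200), (300, 400)))]
ensembl = [
    Transcript("ensembl", "13", 1, ((100, 200), (300, 400))),
    Transcript("ensembl", "13", 1, ((100, 200), (300, 450))),
]

labels = [label for label, _ in merge_transcripts(havana, ensembl)]
print(labels)
```

A transcript that differs by even a single exon boundary is kept separate here, which mirrors the strictness of the base-pair-for-base-pair criterion.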