Download 2.4.databases_ensembl - T

An Introduction to ENSEMBL
Cédric Notredame

The Top 5 Surprises in the Human Genome Map
1. The blue gene exists in 3 genotypes: Straight Leg, Loose Fit and Button-Fly.
2. Tiny villages of Hobbits actually live in our DNA and produce minute quantities of wool -- which we've been ignorantly referring to as "navel lint" and throwing away for centuries.
3. It's nearly impossible to re-fold it along the original creases.
4. Beer-drinking gene conveniently located next to bathroom-locating gene.
and the Number 1 Surprise In The Human Genome Map...
5. Now that there's a map, male scientists will attempt to cure diseases by randomly throwing stuff into beakers, stubbornly refusing to use the map or ask for directions -- all the while insisting the cure is right around the next corner.

ENSEMBL: Our Scope
-What is ENSEMBL?
-Searching Genes in ENSEMBL
-Viewing Genes in ENSEMBL
-Doing Research With ENSEMBL
-Where do ENSEMBL Genes Come From?

Accessing Genomes
• Genome sequences are becoming available very rapidly
  – Large and difficult to handle computationally
  – Everyone expects to be able to access them immediately
• Bench Biologists
  – Has my gene been sequenced?
  – What are the genes in this region?
  – Where are all the GPCRs?
  – Connect the genome to other resources
• Research Bioinformatics
  – Give me a dataset of human genomic DNA
  – Give me a protein dataset

What is It?
• Set of high quality gene predictions
  – From known human mRNAs aligned against the genome
  – From similar proteins and mRNAs aligned against the genome
  – From Genscan predictions confirmed via BLAST against protein, cDNA and EST databases
• Initial functional annotation from Interpro
• Integration with external resources (SNPs, SAGE, OMIM)
• Comparative analysis
  – DNA sequence alignment
  – Protein orthologs

Mr ENSEMBL?
Richard Durbin (ACEDB)
Ewan Birney (EBI)

Challenges?
• Scale and data flow
  – mainly engineering problems
• Presentation, ease of use
  – mainly engineering problems
• Algorithmic
  – Partly engineering
  – Partly research

ENSEMBL Home

Help!
• context sensitive help pages - click
• access other documentation via generic home page
• email the helpdesk

HelpDesk / Suggestions

Finding What You Need
Human homepage
Text search
BLAST/SSAHA
????

Changing Angle…
Map View
Anchor View
Contig View
  Chromosome Overview: Genes and Markers (1Mb)
  Configuration
  Detailed View: Genes, ESTs, CpG etc. (100kb)
Contig View close-up
  Customising & short cuts
  Transcripts red & black (Ensembl predictions)
  Evidence
  Pop-up menu
Cyto View
Marker View
SNP View
Synteny View
Dotter View
Gene View
Trans View
Exon View
Protein View
Family View (CDK-like)

The Right View On My Gene
-Where Is My Gene?
  Map View, Cyto View, Contig View
-How Many Transcripts for My Gene?
  Gene View, Exon View
-What is the Function of my Gene?
-How does My Gene compare with other Species?
  Protein View, SNP View, Family View, Synteny View, Dotter View

Getting The Stuff Back Home
Export-View

Data Mining with EnsMart
• The aim of EnsMart is to integrate Ensembl data into a single, multi-species, query-optimised database
  – Requirement for cross-database joins removed.
  – Query-optimised schema improves speed of data retrieval.
• Examples
  – Coding SNPs for all novel GPCRs
  – The sequence in the 5kb upstream region of known proteases between D1S2806 and D1S2907
  – Mouse homologues of human disease genes containing a transmembrane domain, located between 1p23 and 1q23

EnsMart I
EnsMart II

Asking Questions With ENSEMBL

Asking Questions
1-Selecting AND Downloading Genes using
  -Functional
  -And Evolutionary Criteria
2-Comparing Two Pieces of Genome

Asking A Question with ENSMART: What Do You Want ???
All The Human Genes
-Involved in Cell Death
-Associated with a Disease
-With a Homologue in Mouse and Chicken

Which Species
Select the region
Where?
What kind of Gene?
Select the kind of data
What Kind of Function?
Choose An Evolutionary Trace
Select the kind of data
Control of Regulatory Region
Control of Biochemical Function
Control of Genetic Variation

Filter                                Genes
Human Gene, Cell Death                1133
Human Gene, Cell Death, Mouse         1106
Human Gene, Cell Death, Chicken       880
Human Gene, Cell Death, C. elegans    338

Asking A Question with ENSMART: How Do You Want it Packed ???
I would like
-Chromosome Information
-The ID of my sequences
-The corresponding OMIM Id
-The corresponding Chicken id

Asking A Question with ENSMART: How Do You Want it Packed ???
Come to think of it…
-I'd like to take a look at the 5' upstream regions

Asking A Question with ENSMART: What Do You Want ???
I want to know if the Mouse and the Human Genome are conserved around the Human Gene SNX5

Where Do ENSEMBL Genes Come From
Genebuild

Evaluating genes and transcripts
• Ensembl gene set
• Ensembl EST genes
• Ab initio predictions
• Manual curation (Vega / Sanger)
• Gene models from other groups
• Known v. novel genes
• Gene names & descriptions

The Aim…

Overview…
[Figure labels: manual curation; Ensembl transcript predictions; evidence; other groups' models]

Automatic Gene Annotation
[Diagram labels: human proteins, other proteins, cDNAs, ESTs; Pmatch, Exonerate, Genewise, Est2Genome; Add UTRs; Genscan exons; Merge; other evidence; Ensembl Genes; EST genes]

ENSEMBL Geneset
• Place all available species-specific proteins to make transcripts
• Place similar proteins to make transcripts
• Use mRNA data to add UTRs
• Build transcripts using cDNA evidence
• Build additional transcripts using Genscan + homology evidence
• Combine annotations to make genes with alternative transcripts

Getting Genes from Known Proteins
Human protein sequences (SwissProt/TrEMBL/RefSeq)
pmatch* v. assembly
blast and Miniseq
Genewise
*R. Durbin, unpublished

Adding the UTRs
proteins: Genewise (phases, no UTRs)
cDNAs: Est2Genome (UTRs, no phases)
Result: translatable gene with UTRs

Gene Build is Protein-Based
• DNA-DNA alignments don't give translatable genes
• Protein-level alignments give:
  – frameshifts and splice sites
• Genewise (Ewan Birney)
  – Protein to genomic alignment
  – Has splice site model
  – Penalises stop codons
  – Allows for frameshifts

Making Genes
• Combine results of all Genewises and Genscans:
  – Group transcripts which share exons
  – Reject non-translating transcripts
  – Remove duplicate exons
  – Attach supporting evidence
  – Write genes to database

A Typical Human Release: NCBI 34 (Dec 2003)
• NCBI 34 assembly, released Dec 2003
• Ensembl genes: 21,787 (23,762 in release 35)
• Ensembl coding transcripts: 31,609 (plus 1,744 pseudogenes)
• Ensembl exons: 225,897
• Input human seqs: 48,176 proteins; 86,918 cDNAs
• Transcripts made from:
  – Human proteins with (without) UTRs: 68% (19%)
  – Non-human proteins with (without) UTRs: 2% (9%)
  – cDNA alignment only: 0.8%

Manual Vs Automatic Annotation
Genes
  Sensitivity: ~90% of manual genes are in Ensembl
  Specificity: ~75% of Ensembl genes are in the manual sets
Exon bps
  Sensitivity: ~70% of manual bps are in Ensembl exons (90% of coding bps)
  Specificity: ~80% of Ensembl bps are in manual exons
Alternative transcripts per gene: manual 3, Ensembl 1.3
Figures are for the gene build on NCBI 33 (human) and manual annotation for chromosomes 6, 13 & 14

Each Genebuild is a Story…
Data availability
  Hard evidence in mouse, rat, human
  Similarity build more important for other species
Structural Issues
  Zebrafish
    Many similar genes near each other
    Genome from different haplotypes
  C. briggsae
    Very dense genome
    Short introns
  Mosquito
    Many single-exon genes
    Genes within genes
Configuration files provide flexibility

Life in Release 2003
Species                              Gene number   Exons/gene
Homo sapiens                         21787         8.7
Mus musculus                         24948         8.7
Rattus norvegicus                    23751         7.9
Danio rerio (zebra fish)             20062         7.9
Caenorhabditis briggsae (nematode)   11884         7.2
Anopheles gambiae (mosquito)         14707         4.0

Using ESTs
EST analysis
• Map ESTs using Exonerate (determine coverage, % identity and location in genome)
• Filter on % identity and depth (5.5 million ESTs from dbEST; about 1/3 are mapped)
• Map to genome using Est2Genome (determine strand, splicing)

Exonerate
[Figure labels: Golden path contigs; cDNA hits]
• Exonerate positions cDNA sequences on assembly contigs
• Store hits as Ensembl FeaturePairs in database

EST2Genome
[Figure labels: Blast and Est2Genome; Virtual contig; cDNA hits; Filter; Blast & Miniseq; Est_genome]

Reconstructing Alternative Splicing
ESTs
Merge ESTs according to consecutive exon overlap and set splice ends
Genomewise
Alternative transcripts with translation and UTRs

Display of EST Evidence
EST transcripts
Human ESTs
Display limited to 7 at any one point; full data accessible in the databases

Ab initio Genscan predictions
Genscan prediction
Evidence supporting Genscan exons

Manual Curation: VErtebrate Genome Annotation (VEGA)
Sanger / Vega manual curation

Other Gene-Models
Turn on DAS sources
Other models as 'DAS sources'
FASTAView display

Known Vs novel transcripts
• Naming takes place after the gene build is completed
• Transcripts/proteins mapped to SwissProt, RefSeq and SPTrEMBL entries
• If mapped = 'known'; if not = 'novel'
• Require high sequence similarity, but allow incomplete coverage
• Note:
  – Difficult for families of closely-related genes
  – Wrongly annotated pseudogenes may also cause problems

Gene Names and Descriptors
Names and descriptions
• Names taken from mapped database entries
• Official HGNC (HUGO) name used if available (or equivalent for other species)
• Otherwise SwissProt > RefSeq > SPTrEMBL
• Novel transcripts have only Ensembl stable ids
• Genes named after 'best-named' transcript
• Gene description taken from mapped database entries (source given)
• Hints:
  – Orthology can provide useful confirmation
  – If no description, check for any Family description

Stability…
www.ensembl.org/Docs/wiki/html/EnsemblDocs/Answer006.html

Geneview and Exonview
[Figure labels: Gene name & description; Alternative transcripts; links to ExonView; Links to putative orthologues; Transcript name; Mapping to external databases; Evidence used to build the transcript]

Evidence Tracks in ContigView
Expanded tracks
Compressed tracks

Future Directions
• Improved pseudogene annotation, for all species
• Upstream regulatory elements - using CpG islands, Eponine predictions and motifs to aid in prediction of transcription start sites
• Improve use of cDNAs - can already be used to add alternatively spliced transcripts
• Improve UTR extension
• Make use of comparative data
• Non coding RNAs - currently filtered out of build sets

ENSEMBL
-Finding the right DATA: ENSMART and BLAST
-The central View of ENSEMBL: ContigView
-Genome Comparison: Synteny View
-ENSEMBL incorporates all the evidence into its gene models

Genebuild overview
[Diagram labels: Human Proteins, Other Proteins, Human cDNAs, Human ESTs; Pmatch, Exonerate, Genewise, Est2Genome; Genewise genes; Aligned cDNAs; Genewise genes with UTRs; Supported genscans (optional); Aligned ESTs; ClusterMerge; Genebuilder; Preliminary gene set; cDNA genes; Gene Combiner; Final set + pseudogenes; Pseudogenes; Core Ensembl genes; Ensembl EST genes]

Annotation Stages
Place all known genes
  Map all AVAILABLE species-specific proteins in the genome and find gene structure using Genewise
Annotate novel genes
  Use proteins from other species to build new transcripts based on homology
  Use AVAILABLE mRNAs to add UTRs to the built transcripts
  Use further homology to proteins, mRNAs and ESTs to build transcripts using Genscan exons
Combine annotations

Manual Vs Automatic Annotation
Gene locus level
         Sn     Sp
chr13    0.90   0.74
chr14    0.92   0.77
chr6     0.94   0.72
ENSEMBL predictions cover 90% or more of manually annotated gene structures, with around 75% of the predictions covered by a manual annotation.

Exon level (based on transcript pairs)
         Coding exons only    All exons
         Sn     Sp            Sn     Sp
chr13    0.83   0.90          0.73   0.78
chr14    0.78   0.88          0.69   0.77
chr6     0.85   0.89          0.73   0.76
UTR exon predictions are less accurate than coding exons. 92% of coding exons and 80% of all exons are exact matches.
Numbers are for NCBI33 genebuild
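
The Sn and Sp figures reported for the manual versus automatic comparison can be read as "fraction of manual annotations recovered by the predictions" and "fraction of predictions confirmed by a manual annotation" (Sp in this gene-prediction sense is what other fields call precision). The following is only a minimal sketch of that calculation at exon level, assuming exons are represented as (start, end) tuples and that only exact coordinate matches count. The function name and the toy coordinates are hypothetical and are not taken from the Ensembl pipeline.

```python
def exon_sn_sp(manual_exons, predicted_exons):
    """Return (sensitivity, specificity) counting exact exon matches only.

    Assumption: an exon is a (start, end) tuple on one strand of one
    chromosome; partial overlaps are treated as misses.
    """
    manual = set(manual_exons)
    predicted = set(predicted_exons)
    shared = manual & predicted  # exons present in both sets

    # Sn: share of manual exons that the prediction recovered.
    sn = len(shared) / len(manual) if manual else 0.0
    # Sp: share of predicted exons that a manual exon confirms.
    sp = len(shared) / len(predicted) if predicted else 0.0
    return sn, sp


# Toy data (hypothetical): four curated exons, three predicted ones,
# of which two match exactly and one has a shifted 3' end.
manual = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 650)]

sn, sp = exon_sn_sp(manual, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f}")
```

Counting exact matches is the strictest choice. A base-pair level comparison, as in the "Exon bps" figures, would instead sum overlapping positions and would give partial credit to the shifted exon.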
