Integrated Data Systems for Genomic Analysis
Genomics and Bioinformatics for the Advancement of Clinical Sciences
Thomas Jefferson University, Oct. 14, 2002
Chris Stoeckert, Ph.D.
Dept. of Genetics & Center for Bioinformatics, University of Pennsylvania

Plasmodium genomics: Genomics and proteomics pave the way for controlling malaria (Nature, October 3, 2002)

Thinking Genomically
Genome
• Genome structure
• Genes and function
• Pathways
• Expression patterns
• (Complex) diseases
Phenotype

Using a Genomics Unified Schema (GUS) to ask genomic questions
The Genomics Unified Schema (GUS) is a relational database that warehouses and integrates biological sequence, sequence annotation, and gene expression data from a number of heterogeneous sources. User-friendly web interfaces present slices of the GUS database and allow researchers to execute structured queries for information concerning gene structure, function, and expression.

GUS Powers Multiple Genomics Projects
• AllGenes: AllGenes is based on a comprehensive mouse and human gene index. The genes are approximated by transcripts predicted from EST and mRNA clustering.
• PlasmoDB: PlasmoDB is the official database of the Plasmodium falciparum genome project. It provides an integrated view of genome sequence data, including expression data from EST, SAGE, and microarray projects.
• EPConDB: EPConDB is an index of genes expressed in the endocrine pancreas. Expression is defined either through microarray experiments or sequence annotation.

allgenes.org query: "Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?"
This query illustrates several aspects of the GUS database, including Data Integration, Data Analysis, and Tools:
• RHMap
• GOFunction
• Sequence
• GOFunction assignments
• Boolean function
• History function
• BLAST
http://www.allgenes.org/
Steve Fischer, Debbie Pinney, Brian Brunk, Joan Mazzarelli, Jonathan Crabtree, Yongchang Gan, Sharon Diskin
Nikolay Kolchanov, Alexey Katohkin

Select the allgenes.org boolean query page
Click on the "AND" button
Choose the RH map and GO function queries
Select mouse chromosome 5 and "transcription factor"
There are 26 mouse RNAs (assemblies) that meet these criteria
This query result set now appears on the query "history" page
Now use the BLAST page to identify RNAs similar to my cDNA
The results of the BLAST search appear in the query history
Intersect ("AND") the BLAST search with the previous query
And we have our answer (the third row on the query history page)

Report page contents:
• Other transcripts from the same gene
• Predicted GO function(s) (some manually reviewed)
• External links
• Mapping information
• Gene trap insertions
• Protein/motifs, etc.
• Predicted protein
• CAP4 assembly
• EST expression profile
• UCSC BLAT

PlasmoDB query: "List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum's merozoite stage."
This query illustrates several aspects of the GUS database, including:
Data Integration
• Genome annotation
• Mass spec
Data Analysis
• Sequence analysis
Tools
• History function
http://plasmodb.org/
David Roos, Jessie Kissinger, Bindu Gajria, Martin Fraunholz, Jules Milgram, Phil Labo, Amit Bahl, Dave Pearson, Dinesh Gupta, Hagai Ginsburg
Jonathan Crabtree, Jonathan Schug, Brian Brunk, Greg Grant, Trish Whetzel, Matt Mailman, Li Li

Select Queries from the PlasmoDB homepage
Choose signal peptide
Choose chromosome and gene/prediction type, then submit
There are 651 genes with predicted signal peptides
Choose Gene Expression from the queries page, then Proteomics
Then choose chromosome, lifecycle stage, and evidence, then submit
There are 828 gene predictions that satisfy this query
Go to the history page and choose which simple queries to combine. Select intersect.
We have an answer: there are 86 predicted genes that satisfy our complex query
Click on a gene to get a full report
A variety of information is available from the report page, including predicted protein features and gene models

EPConDB query: "Which DoTS assemblies (RNA) represented on the Endocrine Pancreas Consortium’s chip 2.0 are constituents of the insulin-initiated signal transduction pathway?"
Data Integration
• Sequence
• Microarray experiment
• Transduction pathway
Data Analysis
• BLAST
Tools
• History function
http://www.cbil.upenn.edu/EPConDB
Klaus Kaestner, Marie Scearce, John Brestelli, Phillip Le
Elisabetta Manduchi, Angel Pizarro, Debbie Pinney, Greg Grant, Joan Mazzarelli, Jonathan Crabtree, Hongxian He, Shannon McWeeney, Matt Mailman

Go to the gene information query page and click on “DOTS assemblies involved in a pathway”
Choose the insulin pathway, a p-value, pancreas, the species, and whether an assembly must include an mRNA, then submit
There are 59 DoTS assemblies that are constituents of the insulin pathway
Return to the gene information query page and select clone sets.
Choose chip 2.0, then submit
There are 3242 assemblies represented on chip 2
Go to the history page, select the queries to combine, select intersect, then view the results
There are 8 assemblies that satisfy the complex query. Clicking on an RNA retrieves an allgenes report.

Using Databases to Think Genomically
• Draw attention to these resources
• Show how different data sources and approaches can be used to ask powerful questions
• This can be done for different organisms, different systems

How GUS Works
Diagram labels: AllGenes, PlasmoDB, EPConDB; Java Servlets; DoTS, RAD, TESS, SRes, Core; Oracle RDBMS; Object Layer for Data Loading; Other sites, other projects, e.g. GeneDB

Goals of GUS
• Generic platform for model organism or disease specific databases
• Integration of genome, transcript and protein data, including:
  – Sequence
  – Function
  – Expression
  – Interaction
  – Regulation
  – Orthologs and paralogs
• Support for:
  – automated annotation and integration
  – manual curation
  – data mining/analysis and sophisticated queries
  – web access
http://www.gusdb.org
Jonathan Crabtree, Jonathan Schug, Steve Fischer, Elisabetta Manduchi, Angel Pizarro, Junmin Liu, Debbie Pinney, Greg Grant, Trish Whetzel, Li Li, Sharon Diskin, Hongxian He

Architecture of GUS
Diagram labels: GenBank, InterPro, GO, etc.; Genomic Sequence; Automated Analysis & Integration; GSSs & ESTs; Annotation; Object Layer; DoTS; Oracle/SQL; TESS; RAD; WWW queries, browsing, & download; Mapping Data; Java Servlets & Perl CGI; Core; SRes; Mining Applications; microarray & SAGE Experiments; QTL, POP, SNP, Clinical; Annotator’s Interface

Five domains
GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                     Highlights
Core                             Data Provenance            Evidence
SRes (Shared Resources)          Shared Resources           Ontologies
DoTS (DB of Transcribed Seqs)    Sequence and annotation    Central dogma
RAD (RNA Abundance DB)           Gene expression            MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation            Grammars

* Protein Abundance DB domain underway

DoTS central dogma schema

Entity     Instance            Feature                              Sequence
Gene       Gene Instance       Gene Feature (isa NA Feature)        Genomic Sequence (isa NA Sequence)
RNA        RNA Instance        RNA Feature (isa NA Feature)         RNA Sequence (isa NA Sequence)
Protein    Protein Instance    Protein Feature (isa NA Feature)     Protein Sequence (isa AA Sequence)

RAD schema uses MAGE/MIAME
[Figure: RAD schema diagram with relationship multiplicities (not recoverable from the transcript).]
MAGE packages: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol, Descr., HigherLevelAnalysis
MIAME sections: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization
RAD tables shown: Study, StudyDesign, StudyDesignDescription, StudyDesignAssay, StudyAssay, StudyFactor, StudyFactorValue, Assay, AssayLabeledExtract, Array, ArrayAnnotation, ElementImp, ElementAnnotation, CompositeElementImp, CompositeElementAnnotation, Control, BioMaterialImp, BioMaterialCharacteristic, BioMaterialMeasurement, Treatment, LabelMethod, Channel, Acquisition, AcquisitionParam, RelatedAcquisition, Quantification, QuantificationParam, RelatedQuantification, ElementResultImp, CompositeElementResultImp, ProcessResult, ProcessIO, ProcessInvocation, ProcessInvocationParam, ProcessImplementation, ProcessImplementationParam, Protocol, ProtocolParam, OntologyEntry, MAGEDocumentation, MAGE_ML, Analysis, AnalysisImplementation, AnalysisImplementationParam, AnalysisInvocation, AnalysisInvocationParam, AnalysisInput, AnalysisOutput
http://www.mged.org

Journals are Adopting the MGED Standards
Use of Minimum Information About a Microarray Experiment (MIAME)

TESS Schema
• TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
• TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
• TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
• DoTS.NaFeature: BindingSite, Promoter, ...
• Other tables shown: TESS.FootprintInstance, TESS.TrainingSet, TESS.ParameterGroup, TESS.Note, DoTS.NaSequence

Diagram labels: RAD; DoTS; TESS; EST clustering and assembly; Identify shared TF binding sites; Genomic alignment and comparative sequence analysis

Using GUS for Genomic Research
• Annotating mouse chromosome 5 – Maja Bucan
• Identifying novel genes expressed in the endocrine pancreas – Klaus Kaestner, Alan Permutt, Doug Melton
• Identifying genes regulated by CREB – Allan Pack, Mirek Mackiewicz

Annotation of Mouse Chromosome 5
• What are all the genes?
• What is their structure and function?
• Where are they expressed and how is this regulated?
Maja Bucan, Otto Valladeres, Kyle Gaulton
Jonathan Crabtree, Yongchang Gan, Joan Mazzarelli, Jonathan Schug

Areas of Focus on Mouse Chromosome 5
[Figure: map of mouse chromosome 5. Position ticks: 8, 12, 15, 20, 23, 30, 40, 43. Genes labeled: Reln; Sema3a,c,d,e; Nos3; Dpp6; Htr5a; Hdh; Adra2c; Drd5; Qdpr; Gabrb1, a2, g1; Pdgfra; Kit; Flk1; Clock. Chromosome band labels: 4p16.3, 4p15.31, 4q12, 7q21-22, 7q36. Annotation: Rw as a balancer.]

Approach to Annotating Mouse Chromosome 5
• Genomic sequence
  – Public release: chromosome 5 has many gaps
  – Celera
  – Combine to eliminate gaps where possible
• Gene models
  – Ensembl predictions
  – Celera predictions
  – BLAT alignment of DoTS
  – Comparison to human regions

Known RefSeq Genes in (72-76Mb) Region as Viewed in UCSC Genome Browser
Only 14 RefSeq genes, plus an additional 7 from Ensembl

Known Genes on Mouse Chromosome 5
MGI approved symbols (~72Mb to ~76Mb): 5033405K12Rik, 6030432N09Rik, 1810027I20Rik, AI836376, Sgcb, 1700067I02Rik, C78283, 2700023E23Rik, 1190017B18Rik, 6720475M21Rik, 1300019H17Rik, Lnx1, Chic2, Gsh2, Pdgfra, Kit, Kdr, Gabarapl2 (homolog), Srd5a2l, Tparl, Clock, Pdcl2, Nmu
Gene symbol synonyms: KIAA1458, KIAA0826, LOC231293, KIAA0276, FLJ12552
Identified 28 known genes
15 genes have assigned GO Functions:
• 5 enzyme
• 4 signal transducer
• 4 ligand binding or carrier
• 3 nucleic acid binding
• 2 transporter

Example of Known Mouse Chromosome 5 Gene - Chic2 *
* Alignment reveals exon differences between RNAs belonging to the gene (alternative forms)

Putative Genes on Mouse Chromosome 5
• putative gene mouse chr5. Note: multi-exon alignment; single IMAGE clone 583253; polyA signal suggests 3’ end of gene
• putative gene mouse chr5. Note: singleton ESTs from IMAGE clone 551428 align
• putative gene mouse chr5. Note: multi-exon alignment; ESTs from single IMAGE clone 515319; possible polyA signal in 3' sequence
• putative gene mouse chr5. Note: multiple span alignment; 9/02 - RNAs also aligning to another region of mouse chr5
• putative gene mouse chr5. Note: 3 ESTs in assembly from embryo
• ...
Total 21 (some putative genes may later be merged)

Example of a Putative Mouse Gene
Example DT.40155293: IMAGE clone sequences (5’ and 3’ in same assembly)

Genes on Mouse Chromosome 5
• 72-76 Mb region
  – 65 genes from automated DoTS analysis
  – 49 manual evaluation
  – 21 Ensembl genes
  – 14 RefSeq genes
• Whole chromosome 5 (151 Mb)
  – 2157 genes from automated DoTS analysis
  – 1275 Ensembl genes

Summary
• To make links between genotype and phenotype, the output of technologies such as genomic sequencing, microarrays, mass spec, etc., must be integrated
• Our solution is GUS, the Genomics Unified Schema, used for multiple systems: AllGenes, PlasmoDB, EPConDB
  – GUS is freely available as a system for use and development
  – RAD is part of GUS and uses the microarray standards now available
• Using GUS for genomic research such as annotating mouse chromosome 5
  – Possibly doubling the number of genes in annotated regions!
http://www.cbil.upenn.edu
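
The "possibly doubling" estimate can be checked against the counts quoted on the Genes on Mouse Chromosome 5 slide. The following is a minimal sketch, assuming the intended baseline is the Ensembl gene count (the slides do not say which comparison is meant). The dictionary keys and the function name `fold_over_ensembl` are illustrative only and are not part of GUS.

```python
# Gene counts as quoted on the "Genes on Mouse Chromosome 5" slide.
# Key names are hypothetical labels chosen for this sketch.
counts = {
    "72-76 Mb region": {
        "dots_automated": 65,
        "dots_manual": 49,
        "ensembl": 21,
        "refseq": 14,
    },
    "whole chromosome 5": {
        "dots_automated": 2157,
        "ensembl": 1275,
    },
}


def fold_over_ensembl(region: dict) -> dict:
    """Divide every non-Ensembl count by the Ensembl count (assumed baseline)."""
    base = region["ensembl"]
    return {
        source: round(n / base, 2)
        for source, n in region.items()
        if source != "ensembl"
    }


ratios = {name: fold_over_ensembl(region) for name, region in counts.items()}

for name, per_source in ratios.items():
    print(name, per_source)
```

Under that assumption, the manually evaluated DoTS genes in the 72-76 Mb region come to about 2.3 times the Ensembl count, and the automated whole-chromosome figure to about 1.7 times.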