The GeneCards™ Project at the Weizmann Institute of Science
• For each gene: a card with displayed data and links to entries in major databases
• Genes with HUGO nomenclature symbols and others
• Automatic data mining and integration
• Advanced human-computer interaction
http://bioinformatics.weizmann.ac.il/cards/
[Figure labels: chromosome, gene, DNA sequence, disease, protein, research article, RNA, gene alias, similar mouse gene, mutation, medical applications, marker, genetic map, chromosomal location]

Databases Containing Human Genome Information
EMBL, Swiss-Prot, UDB, GeneMap'98, Genethon, GDB, GAC, GENATLAS, Whitehead/MIT, CFTR, Sanger Centre, GeneCards, NCBI, OMIM, GenBank, CEPH, Stanford, UniGene, dbSNP, PKR, Marshfield, TIGR, PRODOM, BRCA1, HOVERGEN, CHLC, Pfam, GESTEC, PDB, IMGT, BLOCKS, HGMD, PRINTS, Utah, TP53, WashU, LDB, TGD, COPE, PIR, DDBJ

GeneCards: From Chaos to Order
A card for each gene
• Aliases
• DNA, RNA
• Protein
• Chromosomal location
• Disorders
• Medical applications
• Related mouse gene
• Research articles
• Links to more data
Data is retrieved and integrated automatically

Data Related to Genes
• Nucleotide SEQUENCE: genomic/cDNA, coding/regulatory
• VARIATION (polymorphism, mutation)
• Chromosomal LOCATION
• EXPRESSION (tissues, developmental, disease)
• PROTEIN: sequence, domains, 3D; subcellular location; 2D electrophoresis
• Biological PATHWAYS
• DISEASE
• PHARMA (diagnostics, vaccines, drugs)
• ORTHOLOGS (model organisms, knockout)
• Commercial DNA ARRAYS
• PATENTS

GeneCard: Integrated Data and Starting Point
• Mining and integration of data
• A starting point for more data
[Diagram labels: entries in data sources of GeneCards; GeneCard; "link to" data sources of GeneCards; "link to" other data sources]

A typical GeneCard: RUNX1
• HUGO nomenclature gene symbol
• Accession ID to other databases
• LocusLink or HUGO location
• If chromosome 21
• Information on proteins
• For chromosome 21 only
• Sequence accessions
• Single nucleotide polymorphisms
• Homologues
• Disorders and mutations
• Medical news from Doctor's Guide
• Published literature

Snapshot of additional GeneCard fields
• Additional information
• Start new search

Improved Single Nucleotide Polymorphisms Summaries

Current GeneCards Data Sources and Links
HUGO, GDB, OMIM, SWISS-PROT, LocusLink, UDB, UniGene, MGD, DOTS, UCSC, GenBank, PubMed, CroW 21, Doctor's Guide, HUGE, euGenes, Genatlas, ATLAS, HGMD, TGDB, BCGD, MTDB, RZPD, MIPS, PDB, BLOCKS, HORDE, dbSNP, ENSEMBL, SBCELEGANS, GeneLynx, IMGT, SOURCE
[Figure: gene sources. Labels: HUGO, LocusLink, MGD, CroW 21. Counts shown: 13,046; 360; 8,951; 63]

How to search and find?
• Simple search box: search keywords
• Results: gene 1: name - ... keyword ...; gene 2: name - keyword ...
• No results: spell corrections, query modification, outside resources

Some GeneCards Statistics
• 27,612 GeneCards (November 2001)
• 13,548 HUGO-approved genes
• 2,646,185 accesses to GeneCards (at WIS since January 1, 1998)
• 25 mirror sites around the world

The Affymetrix System: GeneChip Procedure
• Sample preparation
• Hybridization (fluidic station)
• Signal detection (scanner)
• Data analysis (software)

ChipCards - A Functional Integration Tool for DNA Array Data
Tsviya Olender, Shirley Horn-Saban, Marilyn Safran, Vered Chalifa-Caspi, Michal Ronen and Doron Lancet
The Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100

About ChipCards
• ChipCards correlates DNA array data with comprehensive information from gene-specific databases.
• It is currently implemented for the Affymetrix GeneChip.
• The ChipCards output is an HTML table with essential additional information for each gene, including gene symbol, functional definition, accession number, protein information, chromosomal location and EST data.
• Human data is integrated with GeneCards, UDB and UniGene.
• Mouse data is integrated with information about the human orthologue via GeneCards, HomoloGene and MGD.
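The kind of correlation described on the About ChipCards slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ChipCards implementation: the probe IDs, accession numbers, gene symbols, field names and the functions `annotate` and `to_html` are all hypothetical, and the lookup table stands in for the gene-specific databases.

```python
from html import escape

# Hypothetical array output: (probe set ID, accession number, expression value).
ARRAY_ROWS = [
    ("probe_1", "ACC0001", 812.5),
    ("probe_2", "ACC0002", 40.0),
    ("probe_3", "ACC9999", 15.2),  # deliberately has no annotation
]

# Hypothetical gene annotations keyed by accession number (placeholder values).
ANNOTATIONS = {
    "ACC0001": {"symbol": "GENE1", "definition": "example definition 1", "location": "chr A"},
    "ACC0002": {"symbol": "GENE2", "definition": "example definition 2", "location": "chr B"},
}

ANNOTATION_COLUMNS = ("symbol", "definition", "location")


def annotate(rows, annotations):
    """Attach annotation fields to each array row; unknown accessions get empty fields."""
    records = []
    for probe, accession, value in rows:
        info = annotations.get(accession, {})
        record = {"probe": probe, "accession": accession, "value": value}
        record.update({col: info.get(col, "") for col in ANNOTATION_COLUMNS})
        records.append(record)
    return records


def to_html(records):
    """Render the annotated records as a plain HTML table, one row per probe set."""
    header = ("probe", "accession", "value") + ANNOTATION_COLUMNS
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for rec in records:
        cells = "".join(f"<td>{escape(str(rec[h]))}</td>" for h in header)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


records = annotate(ARRAY_ROWS, ANNOTATIONS)
table_html = to_html(records)
print(table_html)
```

Keying the join on the accession number is a design assumption made for the sketch; a real tool would also have to add the database links and handle probe sets that map to several genes.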
Example of GeneChip output before ChipCards processing

An Extract of Human Expression Data After ChipCards Processing
[Screenshot callouts: NCBI link, GeneCards link, UDB link]
A snapshot of the ChipCards result, with human Affymetrix expression data as input. Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene is extracted from UniGene and is given as a separate column in the table, as are the UDB coordinates.

Murine Expression Data After ChipCards Processing
[Screenshot callouts: human orthologue data, NCBI link, GeneCards link, murine UniGene link, human UniGene link, NCBI link]
A snapshot of the ChipCards output for mouse Affymetrix expression data. Each probe set is linked to NCBI and UniGene. Information about the human orthologue is integrated into the table, including links to NCBI, GeneCards and UniGene.

Current Research - Adding Cards for Genes that Don't Yet Have a Name
[Diagram labels: UniGene cluster; assembly-based resources; gene sequence tag; unique persistent gene identifier; GeneCard for novel gene]

Version 3.0 Project Goals
• Improving flexibility, allowing automated parameterized generation from partial sets of sources and/or genes, and appending to an existing database
• Providing an Application Programming Interface for users of the generation software to incorporate their own data
• Standardizing the format of the database to use XML

Project Goals (cont'd)
• Providing a foundation for supplying a stable identifier for each GeneCard, even when no known gene symbol exists
• Improving the maintainability, testability, and quality of the software
• Providing a seamless migration path from Version 2.xx while maintaining the current look and feel and functionality

Pros and Cons of Using OOP
Cons:
• Perl was not originally designed as an OOP language
• Type safety, proper encapsulation and aggregation aren't enforced
• Can be between 20% and 50% slower
Pros:
• Allows for more robust implementations
• Greater modularity
• More comprehensible interface to modules
• Better abstraction of software components
• Less namespace pollution
• Greater code reusability
• Software scalability
• Cleaner and more compact code

The 3.0 Hybrid Solution
Combines an object-oriented skeleton with some non-object-oriented internals
• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations
• All other major components, including the extractors and administration classes, are implemented as objects

GeneCards Architecture
[Diagram labels: Generation Software; UniGene Extractor; SwissProt Extractor; Customized Extractor; API; Support Functions; GeneCards Database; Display Software]

Generation Software Classes
• An underlying layer of support tools that manage extracting data from locally mirrored files and the internet, proxy connections, verification, security, file management, caching, conflict detection, error handling, statistics, and XML output formatting
• A set of extractor classes, one for each source of information, using source-specific algorithms and heuristics (adapted from previous versions of GeneCards). Methods include new, prepare and search
• A template for building extractor classes.
• All such classes can create new entries or append to old ones, and can generate data for all entries (genes) at once or one at a time
• A main class that handles building sets of cards according to parameterized partial ordering rules

The XML-Based Database
• XML is a meta-language that supports customized tags for describing and providing semantic meaning to structured data
• Typed elements are arranged within other elements to form a nested hierarchy
• The data is grouped by source in the XML files, but can be retrieved by function:

    <GCresource>SWISSPROT
      <protein>
        <disorder>Colorectal Cancer</disorder>
      </protein>
    </GCresource>
    <GCresource>OMIM
      <disorder>Germline Cancer</disorder>
    </GCresource>
    <GCresource>GENECLINICS
      <disorder>Li-Fraumeni Syndrome</disorder>
    </GCresource>

• Each extractor module is responsible for its own Document Type Definition (DTD) specification to ensure that the XML is well formed and valid
• Files are stored in a hierarchical directory structure, one file per gene

The Display Software
• Currently in the design phase
• The aim is to maintain the current look and feel while providing the flexibility of easy customization
• Will use XML Perl parser modules in CGI scripts
• Search will be expanded beyond current text-based capabilities to include context-specific searches

3.0 Project Status and Open Issues
• Procedural programs/ad-hoc flat file format
• Object-oriented methodology/standardized XML
• Easy to add new extractors
• Flexible and extensible
• Performance, searching strategies

Unified Database (UDB): Data mining and integration
[Diagram labels: original public databases; data mining; semantic integration; thesaurus; source-specific information; megabase integration; UDB integrated chromosomal maps]

Sequence-Based Repositioning (SBR)
• Placing finished genomic sequences on the UDB map
• Map fine-tuning in sequenced regions
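The XML-Based Database slide above states that data is grouped by source in the files but can be retrieved by function. A minimal Python sketch of that idea follows. It is not the GeneCards software (which the slides describe as Perl); the `<GeneCard>` root element, the exact nesting of the fragment, and the function name `by_function` are assumptions made so that the example parses and runs.

```python
import xml.etree.ElementTree as ET

# One plausible reading of the slide's fragment. The <GeneCard> root element is
# an assumption, added so the fragment forms a single well-formed document.
CARD_XML = """
<GeneCard>
  <GCresource>SWISSPROT
    <protein>
      <disorder>Colorectal Cancer</disorder>
    </protein>
  </GCresource>
  <GCresource>OMIM
    <disorder>Germline Cancer</disorder>
  </GCresource>
  <GCresource>GENECLINICS
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
""".strip()


def by_function(xml_text, tag):
    """Return (source, value) pairs for every <tag> element, whichever source holds it."""
    root = ET.fromstring(xml_text)
    results = []
    for resource in root.iter("GCresource"):
        # The source name is the text that precedes the first child element.
        source = (resource.text or "").strip()
        # iter() walks all descendants, so nested elements are found as well.
        for element in resource.iter(tag):
            results.append((source, (element.text or "").strip()))
    return results


disorders = by_function(CARD_XML, "disorder")
for source, name in disorders:
    print(f"{source}: {name}")
```

Because the lookup walks every source block and collects matching tags at any depth, a display layer could ask for all disorders of a gene without knowing which extractor supplied each one.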
SBR (Sequence-Based Repositioning)
• Elimination of overlaps between contigs
• Object repositioning
[Figure: UDB original map compared with SBR map]

Search Results - a Map Slice
[Screenshot callouts: to GeneCard, to UniGene, to MarkerCard]

A MarkerCard

GeneCards Success Stories
• GeneCards as a bookmark for linkage analysis
• Mutations that were polymorphisms and not disease-causing
• Adult-onset diabetes without obesity in India
• Work on Chromosome 21 at the Weizmann Institute
• PVT: a heart disease found in Israeli Bedouins
• Parkinson's disease paper

Frequently Asked Questions
• What's special about GeneCards?
• Can I interface my own data?
• Can I access my own in-house database mirrors instead of public internet sites?

GeneCards/UDB Team
Current: Avital Adato, Vered Chalifa-Caspi, Michal Lapidot, Zvia Olender, Naomi Rosen, Marilyn Safran (head), Orit Shmueli, Irina Solomon, Doron Lancet (PI)
Alumni: Michael Rebhan, Shai Shen-Orr, Inga Peter, Jaime Prilusky, Michal Ronen, Hershel Safer, Julie Stampnitzky, Liora Yaar