Towards an automated procedure for annotation of gene products through SAS® Methods
Henrik Tveit
Norwegian University of Science and Technology

Overview
- Biology: the Structure of Life
- Functional genomics
- The Challenge of automated gene annotation
- Sources of information
- Methods and results
- Conclusion and general business intelligence experience

The Structure of Life
- The body consists of cells
- Each cell contains a DNA molecule
- Each DNA molecule contains protein-coding parts known as genes
- Each gene produces different proteins

Functional genomics
- Which genes are active in e.g. cancer cells?
- How can the activity be controlled to repair/prevent illness?
- Gene analysis
- Gene annotation

The Challenge
- Automatic annotation of genes using terms from the Gene Ontology (GO)
[Figure: gene symbols (PEG3, DNMT1, BRCA1, RES1, R8928, TRD@1, R37302, PCTP, GRA3, ODC1, ...) alongside Gene Ontology terms (death, cell death, apoptosis, development, transduction, signal transduction, imprinting, DNA replication, methyltransferase)]

The Source: Medline
[Figure: the same gene symbols and Gene Ontology terms as on the previous slide]

Methods
- Information exploration
- Data Mining
- Text Mining
- Combining the results

Using Medline entry key terms
- MeSH to GO
  - Implicit relations via similar term usage
  - Association analysis
[Medline entry fields: Title, Abstract, MeSH, EC, RN]

Example Medline entry
Title: Dnmt1 overexpression causes genomic hypermethylation, loss of imprinting, and embryonic lethality.
Abstract: Biallelic expression of Igf2 is frequently seen in cancers because Igf2 functions as a survival factor. In many tumors the activation of Igf2 expression has been correlated with de novo methylation of the imprinted region. We have compared the intrinsic susceptibilities of the imprinted region of Igf2 and H19, other imprinted genes, bulk genomic DNA, and repetitive retroviral sequences to Dnmt1 overexpression. At low Dnmt1 methyltransferase levels repetitive retroviral elements were methylated and silenced. The nonmethylated imprinted region of Igf2 and H19 was resistant to methylation at low Dnmt1 levels but became fully methylated when Dnmt1 was overexpressed from a bacterial artificial chromosome transgene. Methylation caused the activation of the silent Igf2 allele in wild-type and Dnmt1 knockout cells, leading to biallelic Igf2 expression. In contrast, the imprinted genes Igf2r, Peg3, Snrpn, and Grf1 were completely resistant to de novo methylation, even when Dnmt1 was overexpressed. Therefore, the intrinsic difference between the imprinted region of Igf2 and H19 and of other imprinted genes to postzygotic de novo methylation may be the molecular basis for the frequently observed de novo methylation and upregulation of Igf2 in neoplastic cells and tumors. Injection of Dnmt1-overexpressing embryonic stem cells in diploid or tetraploid blastocysts resulted in lethality of the embryo, which resembled embryonic lethality caused by Dnmt1 deficiency.

Term occurrence frequency
- Occurrences of GO term synonyms in Medline texts
[Medline entry fields: Title, Abstract, MeSH, EC, RN; same example entry as above]

Compare term definitions and texts
- Compare gene texts with GO term definitions ('dictionary'-based)
[Medline entry fields: Title, Abstract, MeSH, EC, RN]

Locating frequent phrases
- Locating known phrases indicating gene functions
- Pattern: [symbol] [something] may be associated with [process]
- Example: The DNMT1 gene may be associated with imprinting
[Medline entry fields: Title, Abstract, MeSH, EC, RN]

Singular Value Decomposition
- Term per document matrix
- Decomposition and reduction
- Conceptual relations revealed
- Finding conceptually similar texts
- Mixed GO/Medline space
  - Compare GO definition texts and gene texts
- Medline model space
  - Measure new gene texts against a pre-made model
[Medline entry fields: Title, Abstract, MeSH, EC, RN]

Mixed GO/Medline space
- Mix GO term definitions and gene texts from Medline
- Nearest neighbours
- Clustering
[Figure: GO and Medline points in the mixed space]

Mixed space with rolled up terms
- Roll up GO terms
  - Larger GO texts
  - Fewer GO points
[Figure: GO and Medline points in the mixed space]

The training set: Medline entries with related GO terms
[Figure: Gene Ontology terms linked to Medline entries, each with Title, Abstract, MeSH, EC, RN]

Medline model space
- Training set -> model space
- All model points have a GO term
- Project new documents into the model space
- Memory-based reasoning
- Neural Networks
[Figure: a new document and its nearest neighbour in the model space]

Model space with binary target
- Same model space
- Top to bottom
- Put documents into 1 to n GO nodes
- Process each target GO term independently
[Figure: GO nodes labelled 10, 6, 5, 5, 2, 4]

Combining the results
- Voting scheme
  - Each method votes
  - Recommend the top 10-15 annotations
- Introduce a 'trust weight'
  - Emphasize suggestions from the best-performing methods
- Neural Networks
- Expansion of the annotations

Conclusion
- Designed and implemented methods for gene annotation
- Combined text and data information
- Performs better than a widely used public tool
- Methods to be part of the annotation process at Medical Research Center
- SAS® allowed for quick exploration of ideas:
  - Easy handling of large data sets
  - Pre-made tools (Enterprise Miner™)

General experience
- We have transformed key terms (atomic data) and free text into hierarchical decisions and categories
- Experience useful in
  - Organizations with much information stored as free text
  - Situations where atomic data fields and free text fields are used together
- Error report forms -> error prediction
- Customer feedback forms

Acknowledgments
- Dr. Torulf Mollestad of SAS® Norway and Norwegian University of Science and Technology
- Dr. Astrid Lægreid of Medical Research Center, Trondheim, Norway
- SAS® Norway
- SAS® Academic Initiative
  - Ms. Nina Hanke

?
[email protected]
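
Supplementary sketch: locating known phrases

The "Locating frequent phrases" slide gives the pattern "[symbol] [something] may be associated with [process]". The original work was done in SAS; the Python below is only a minimal illustration of how such a pattern could be matched with a regular expression. The function name `find_function_phrases`, the limit of three filler words, and the `known_symbols` set are assumptions made for this sketch and do not come from the slides.

```python
import re

# Hypothetical regular expression for the slide's pattern:
#   [symbol] [something] may be associated with [process]
PATTERN = re.compile(
    r"\b(?P<symbol>[A-Z][A-Z0-9@]+)\b"        # upper-case gene symbol, e.g. DNMT1
    r"(?:\s+\w+){0,3}?"                       # "[something]": up to three filler words (assumed limit)
    r"\s+may\s+be\s+associated\s+with\s+"     # the fixed phrase from the slide
    r"(?P<process>[\w ]+?)(?=[.,;]|$)"        # process text up to punctuation or end of text
)


def find_function_phrases(text, known_symbols):
    """Return (gene symbol, process) pairs for each match whose symbol is known."""
    hits = []
    for match in PATTERN.finditer(text):
        symbol = match.group("symbol")
        if symbol in known_symbols:
            hits.append((symbol, match.group("process").strip()))
    return hits


# The example sentence is the one shown on the slide.
example = "The DNMT1 gene may be associated with imprinting."
pairs = find_function_phrases(example, known_symbols={"DNMT1", "PEG3"})
```

Restricting matches to a list of known gene symbols keeps ordinary capitalised words from being reported as genes. A real system would also need to map the extracted process text onto GO terms, which this sketch does not attempt.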
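
Supplementary sketch: voting with a trust weight

The "Combining the results" slide describes a voting scheme in which each method votes, a 'trust weight' emphasizes the better-performing methods, and the top 10-15 annotations are recommended. The slides do not say how the weights are set or how votes are tallied, so the Python below is one plausible reading rather than the author's implementation. The method names, the weights, and the function `combine_votes` are hypothetical.

```python
from collections import defaultdict


def combine_votes(suggestions, trust, top_n=10):
    """Rank GO terms by the summed trust weight of the methods suggesting them.

    suggestions: {method name: list of suggested GO terms}
    trust:       {method name: weight}; methods without a weight count as 1.0
    """
    scores = defaultdict(float)
    for method, terms in suggestions.items():
        weight = trust.get(method, 1.0)
        for term in set(terms):          # each method votes at most once per term
            scores[term] += weight
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Illustrative input only: these suggestions and weights are made up.
suggestions = {
    "mesh_association": ["imprinting", "DNA replication"],
    "term_frequency": ["imprinting", "methyltransferase"],
    "svd_nearest_neighbour": ["apoptosis", "imprinting"],
}
trust = {"mesh_association": 0.5, "term_frequency": 1.0, "svd_nearest_neighbour": 2.0}
top_two = combine_votes(suggestions, trust, top_n=2)
```

A plain sum of weights is the simplest choice. The slide also lists neural networks as a way to combine results, which would replace this fixed rule with a learned one.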