Data Mining in DNA: Using the SUBDUE Knowledge Discovery System to Find Potential Gene Regulatory Sequences
by Ronald K. Maglothin

Committee Members
• Dr. Lawrence B. Holder, Supervisor
• Dr. Diane J. Cook
• Dr. Lynn L. Peterson

Outline
• DNA Sequence Domain
• SUBDUE Knowledge Discovery System
• Experiments with Unsupervised SUBDUE
• Experiments with Supervised SUBDUE
• Conclusion and Future Work

DNA Structure
• All cells use DNA to store their genetic information.
• A DNA molecule is composed of two linear strands coiled in a double helix.
• Each strand is made of the bases adenine (A), thymine (T), cytosine (C), and guanine (G), joined in a linear sequence.

DNA Sequence
• These four bases constitute a four-letter alphabet that cells use to store genetic information.
• Molecular biologists can break up a DNA molecule and determine its base sequence, which can be stored as a character string in a computer:
  TTCAGCCGATATCCTGGTCAGATTCTCT
  AAGTCGGCTATAGGACCAGTCTAAGAGA

Genes
• A gene is a DNA sequence that encodes instructions for building a protein.
• Gene expression is the process of using a gene to make a protein:
  DNA gene → (transcription) → RNA transcript → (translation) → protein product

Gene Regulation
• The primary mechanism is to control the rate of DNA transcription:
  – Faster transcription → more protein
  – Slower transcription → less protein
• Transcription rate is controlled by transcription factors, proteins that bind to specific DNA sequences.

Human Genome Project
• A U.S.-led, worldwide effort to determine the complete DNA sequence for humans, as well as several other organisms.
• These sequences will be used to study:
  – mechanisms of disease
  – growth and development
  – evolutionary relationships

A Genome is a LOT of Data
• Raw sequence (text)
  – Human (2005): 3 x 10^9 base pairs
  – Yeast (finished): 1.2 x 10^7 base pairs
• Annotated sequence (Relational DB)
  – Links to 3D structures of protein products, other genes in family, known transcription factors, journal references, and other databases.

A Rich Domain for Knowledge Discovery
• Most of the sequences (and genes) have unknown function.
• Efficient algorithms are needed to:
  – identify important patterns
  – identify and classify possible genes
  – infer relationships between genes
  – predict protein structure

The SUBDUE Knowledge Discovery System
• Input: A graph G
• Output: A list of substructures that compress G well
• Uses a computationally-constrained beam search and inexact graph match

What is a substructure?
• A definition subgraph and a list of subgraph instances:
  [Figure: input graph A(1) →next→ T(2) →next→ A(3) →next→ C(4) →next→ A(5) →next→ T(6) →next→ G(7); substructure definition A →next→ T; instances at vertices 1-2 and 5-6]

MDL Heuristic
• SUBDUE uses the Minimum Description Length Principle to evaluate substructures.
• Description Length of a graph is the number of bits needed to send the graph’s adjacency matrix to a remote computer.
• Goal is to minimize DL(S) + DL(G|S).
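
The substructure and MDL slides above can be made concrete with a minimal sketch. It builds the simple linear graph for the seven-base example, finds exact instances of the A →next→ T substructure, and evaluates DL(S) + DL(G|S) under a toy encoding. The function names and the toy encoding (a fixed log2 cost per symbol) are assumptions made for illustration; SUBDUE's real description length is based on the adjacency matrix, and its matcher is inexact.

```python
import math

def linear_graph(seq):
    """Simple linear representation: one vertex per base, 'next' edges between neighbors."""
    vertices = {i + 1: base for i, base in enumerate(seq)}  # 1-based ids, as on the slide
    edges = [(i, i + 1, "next") for i in range(1, len(seq))]
    return vertices, edges

def find_instances(vertices, edges, pattern):
    """Return vertex-id tuples whose labels spell `pattern` along 'next' edges (exact match only)."""
    succ = {src: dst for src, dst, label in edges if label == "next"}
    instances = []
    for start in vertices:
        path, node = [], start
        for wanted in pattern:
            if node is None or vertices[node] != wanted:
                break
            path.append(node)
            node = succ.get(node)
        else:  # every pattern position matched
            instances.append(tuple(path))
    return instances

def toy_dl(num_symbols, alphabet_size):
    """Toy description length (an assumption): log2(alphabet size) bits per symbol."""
    return num_symbols * math.log2(alphabet_size)

def toy_total_dl(seq, pattern):
    """DL(S) + DL(G|S) under the toy encoding, replacing non-overlapping instances by one symbol."""
    n_inst = seq.count(pattern)  # str.count counts non-overlapping occurrences
    dl_s = toy_dl(len(pattern), 4)
    compressed_len = len(seq) - n_inst * (len(pattern) - 1)
    dl_g_given_s = toy_dl(compressed_len, 5)  # four bases plus one substructure symbol
    return dl_s + dl_g_given_s

vertices, edges = linear_graph("ATACATG")
print(find_instances(vertices, edges, "AT"))        # [(1, 2), (5, 6)]
print(toy_total_dl("AT" * 20, "AT"), toy_dl(40, 4))  # compressed total vs. uncompressed DL(G)
```

On the seven-base example the substructure does not pay for itself under this toy encoding, while on a highly repetitive sequence it does, which is the trade-off the MDL heuristic is meant to capture.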

SUBDUE Parameters
• Iterations: Graph is compressed using the best substructure, discovery is restarted
• Threshold: Controls how much two subgraphs can differ to be considered similar
• Beam Width: The number of substructures in the expansion list

Graph Representations
• Simple linear: A →next→ C →next→ A →next→ T →next→ G
• Downstream edges
  [Figure: the same sequence with numbered edges (1 to 4) from each base to the bases downstream of it]
• Start vertex
  [Figure: linear chain of bases joined by "next" edges, plus a Start vertex linked to each base by numbered edges (1 to 5)]
• Backbone
  [Figure: chain of "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to its letter (A, C, A, T, G)]
• Backbone-star
  [Figure: the backbone graph plus a "*" vertex linked to every "base" vertex by a "star" edge]

Unsupervised SUBDUE
• Input: An entire yeast chromosome (A →next→ C →next→ A →next→ T →next→ G ...)
• Heuristic:
  $$\mathrm{Value} = \frac{1}{DL(S) + DL(G \mid S)}$$
• Results: Not good; patterns with two to three bases

Polynomial Heuristic
  $$\mathrm{Value} = \mathrm{SizeOf}(\mathrm{Definition})^2 \times \mathrm{NumberOfInstances}$$

Threshold 0.2
  Pattern        Instances
  TTTTTTTTTTTG   196
  AAATTTTTTATT   158
  TTTTTTTTTTGC   158
  TTTTAATTTTTT   155
  GAAATTTTTTAA   144

Unsupervised SUBDUE Discussion
• Random noise is not a meaningful kind of pattern variation in DNA.
• Unsupervised SUBDUE finds DNA patterns that are hard to evaluate and that are not focused on any target concept.
• We need to give SUBDUE more targeted input data and to modify the system to use it effectively.

Supervised SUBDUE
• Give SUBDUE two graphs: a graph of positive instances of a target concept, and a graph of negative instances.
• SUBDUE discovers substructures in the positive graph, finds instances in the negative graph, and bases the overall heuristic value on the values in both graphs.

New Data Sets
• Clusters of coexpressed yeast genes compiled by Brazma et al., from expression data generated by DeRisi et al.
• The expression level of each gene in a cluster changed at the same time and by a similar degree during the experiment; perhaps some genes in a cluster are regulated by similar mechanisms?
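
The supervised setup described above, finding a pattern's instances in both a positive and a negative set, can be sketched on plain strings. This is only an illustration under stated assumptions: `count_instances` is a hypothetical helper, the sequences are made-up toy data, and a regular expression stands in for SUBDUE's graph matching. A "." in the pattern is treated as a don't-care position.

```python
import re

def count_instances(pattern, sequences):
    """Count matches of `pattern` (overlaps included) across a list of DNA strings.

    The pattern may contain only A, C, G, T and '.', where '.' matches any base.
    """
    # A lookahead lets the regex report matches that overlap each other.
    regex = re.compile("(?=(" + pattern + "))")
    return sum(len(regex.findall(seq)) for seq in sequences)

# Hypothetical toy data, not taken from the yeast clusters.
positive = ["ACCCCTTAGG", "TTCCCCTTAC"]
negative = ["ACGTACGTAC", "GGCCCCTTAT"]

n_pos = count_instances("CCCCTTA", positive)
n_neg = count_instances("CCCCTTA", negative)
print(n_pos, n_neg)  # 2 1
```

A supervised heuristic would then combine the two counts (or the corresponding description lengths) into a single value for ranking candidate patterns.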

New Data Sets
• Positive examples:
  – 300-bp upstream windows (both strands) for all genes in a given cluster
• Negative examples:
  – 300-bp upstream windows for genes not in the cluster, OR
  – 300-bp windows randomly selected from the complete genome (probably not involved in gene regulation)

Supervised Heuristic
• Based on the substructure’s values in the positive and negative graphs
  $$\mathrm{Value} = \frac{\mathrm{Value}^+}{\mathrm{Value}^-} = \frac{1 / \left(DL(S) + DL(G^+ \mid S)\right)}{1 / \left(DL(S) + DL(G^- \mid S)\right)} = \frac{DL(S) + DL(G^- \mid S)}{DL(S) + DL(G^+ \mid S)}$$
• Numerator set to 1 when no negative instances

Compression Ratio
• Normalize the graph values by using the inverse of the graph compression
  $$\mathrm{Value}^+ = \frac{DL(G^+)}{DL(S) + DL(G^+ \mid S)} \qquad \mathrm{Value}^- = \frac{DL(G^-)}{DL(S) + DL(G^- \mid S)}$$

Negative Graph Value
  $$\mathrm{Value} = \frac{DL(S) + DL(G^- \mid S)}{DL(S) + DL(G^+ \mid S)}$$
• When there are no negative instances, setting numerator to 1 actually penalizes such substructures.
• Using 2 x DL(G-) in this situation gave better results.

Ratio Heuristic Results
  Cluster            Best Pattern     Instances
  c2_4.2222200.39    CCCCTTA          7
  c2_4.2222201.41    ATATAATA         10
  c2_4.2222210.37    GATATATA         6
  cr2_4.222202.55    ATATATATATATAT   6
  cr4.111101.77      CCCCTTA          10

Concept DL Heuristic
• Based on the size of a message containing the compressed positive graph, plus the errors (negative instances).
  $$DL(S) + DL(G^+ \mid S) + DL(G^-) - DL(G^- \mid S)$$
  $$\mathrm{Value} = -\left(DL(S) + DL(G^+ \mid S) - DL(G^- \mid S)\right)$$

Concept DL Heuristic Results
• Relative graph size affected results
  Cluster            Best Pattern   Instances
  c2_4.2222200.39    AAAAAA         53
  c2_4.2222201.41    AAAAAA         41
  c2_4.2222210.37    AAAAAA         38
  cr2_4.222202.55    AAAAAA         66
  cr4.111101.77      ATATAA         57

Backbone Representation
  [Figure: chain of "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to its letter (A, C, A, T, G)]
• “Base” vertices allowed don’t-care positions, but heuristic had to be changed to accommodate them.
• Overlap became very important.

DL Equations
  $$DL(G) = vbits + rbits + ebits$$
  $$vbits = \lg v + v \lg l_v$$
  $$rbits = (v + 1) \lg (b + 1) + \sum_{i=1}^{v} \lg \binom{v}{k_i}$$
  $$ebits = e \, (1 + \lg l_e) + (K + 1) \lg m$$

Negative Graph Value
• Using 2 x DL(G-) for no negative instances favored such substructures too strongly.
  $$\mathrm{Value}^- = \begin{cases} \dfrac{DL(G^-)}{DL(S) + DL(G^- \mid S)} & \text{if } S \text{ has negative instances} \\[2ex] \dfrac{DL(G^-)}{DL(S) + DL(G^-)} \ \text{ where } l_v = l_v + 1 & \text{if } S \text{ has no negative instances} \end{cases}$$

Compression Difference Heuristic
• Use subtraction with the compression values instead of division.
  $$\mathrm{Value} = \mathrm{Value}^+ - \mathrm{Value}^- = \frac{DL(G^+)}{DL(S) + DL(G^+ \mid S)} - \frac{DL(G^-)}{DL(S) + DL(G^- \mid S)}$$

Results
Cluster cr4.111101.77
  Pattern    N+    N-    TRANSFAC matches
  ATCCAT     16    12    MOUSE(3), HS(4), RAT(2), HCMV(1), RICE(1)
  GGGA.G.A   19    16    MOUSE(2), HS(3), RABBIT(1)
  TCCCT      65    35    Y$G3PDH_01
  AAGGG      95    37    CAMV(2), RAT(9), AD(3), DROME(8), MOUSE(14), HS(30), DROOR(1), PCF(1), HPV(1), CHICK(2), RABBIT(1), EBV(1)
  CCCT       128   76    Y$BCY_01, Y$GAL1_04, Y$X40_01, Y$CYC1_12, Y$GAL1_14, Y$G3PDH_01, Y$POX1_01, Y$DDR2_01, Y$DDR2_02, Y$TPI_02

Results
Cluster c2_4.2222200.39
  Pattern    N+    N-    TRANSFAC matches
  CATCC.T    6     10    Y$RP51A_01, Y$RPL16A_01, Y$FBP1_01
  T.CTGCT    13    13    DROME(2), HS(8), MOUSE(2), RAT(2)
  AGGGA      30    35    Y$HIS3_03, Y$STE6_01, Y$STE6_02, Y$DAL4_01
  GCTGC.G    2     10    Y$CTT1_02
  TTGC       111   202   Y$TEF2_01, Y$HMR_02, Y$CUP1_01, Y$CUP1_02, Y$MAL61_01, Y$URA3_04, Y$CYB2_02, Y$ARS1_06, Y$DAL7_01, Y$DAL7_02, Y$PGK_03

Results of Brazma et al.
Cluster c2_4.2222200.39
  Pattern    N+    TRANSFAC matches
  CCCCT..T   27    Y$DDR2_01, Y$DDR2_02, Y$TPI_02
  A..AGGGG   27    none
  GGGGC      27    Y$GAL2_02, Y$SUC2_02, Y$RRNA_01, Y$ERG11_01
  GCCCC      27    Y$CYB2_02
  G..GGGG    28    Y$CYC1_04, Y$CYC1_05, Y$CYC1_06

Brazma Heuristic
• Based on number of positive and negative instances
  [Equation: Value expressed in terms of N+, N-, and the constant 4.667]

SUBDUE Using Brazma Heuristic
Cluster c2_4.2222200.39
  Pattern     N+    N-    TRANSFAC matches
  CCCCT.AT    10    0     Y$DDR2_02
  AT.AGGGG    10    0     CHICK$VIT2_18
  A…GGGGG     10    2     Y$SUC2_02
  CCCC..CT    14    4     Y$GAL3_01, Y$MAL2R_01
  CCGGG.T     5     0     Y$CYC1_04, Y$CYC1_05, Y$CYC1_06, Y$MAL63_01

Conclusion
• SUBDUE can be used to discover likely transcription factor binding sites.
• Patterns found by SUBDUE are different from those found by string-based algorithms, due to the graph representation, beam search, and different search heuristic.

Conclusion
• Patterns found by unsupervised SUBDUE in DNA are difficult to evaluate.
• Using supervised SUBDUE can greatly focus the search on the target concept.
• Choosing the right graph representation and heuristic is critical to success.

Future Work
• Further refinement of the supervised MDL heuristic.
• Application of graph grammar theory to SUBDUE’s search.
• Close collaboration with molecular biologists to select data sets and evaluate results.
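
Since the choice of heuristic is called out as critical, a small numeric sketch may help compare the ratio and difference forms of the supervised heuristic discussed earlier. Both are built from the inverse compression DL(G) / (DL(S) + DL(G|S)) of the positive and negative graphs. The function names and all description-length numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def inverse_compression(dl_graph, dl_sub, dl_graph_given_sub):
    """DL(G) / (DL(S) + DL(G|S)); larger means the substructure compresses G better."""
    return dl_graph / (dl_sub + dl_graph_given_sub)

def ratio_value(value_pos, value_neg):
    """Ratio form: positive-graph value divided by negative-graph value."""
    return value_pos / value_neg

def difference_value(value_pos, value_neg):
    """Difference form: subtraction with the compression values instead of division."""
    return value_pos - value_neg

# Hypothetical description lengths in bits (not measured values).
dl_s = 20.0
value_pos = inverse_compression(1000.0, dl_s, 780.0)   # 1000 / 800
value_neg = inverse_compression(2000.0, dl_s, 1980.0)  # 2000 / 2000

print(ratio_value(value_pos, value_neg))       # 1.25
print(difference_value(value_pos, value_neg))  # 0.25
```

One design consequence is visible directly: the ratio form is undefined or unstable when the negative value approaches zero, whereas the difference form stays bounded.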