
Data Mining in DNA: Using the SUBDUE Knowledge Discovery System to Find Potential Gene Regulatory Sequences
by Ronald K. Maglothin

Committee Members
• Dr. Lawrence B. Holder, Supervisor
• Dr. Diane J. Cook
• Dr. Lynn L. Peterson

Outline
• DNA Sequence Domain
• SUBDUE Knowledge Discovery System
• Experiments with Unsupervised SUBDUE
• Experiments with Supervised SUBDUE
• Conclusion and Future Work

DNA Structure
• All cells use DNA to store their genetic information.
• A DNA molecule is composed of two linear strands coiled in a double helix.
• Each strand is made of the bases adenine (A), thymine (T), cytosine (C), and guanine (G), joined in a linear sequence.

DNA Sequence
• These four bases constitute a four-letter alphabet that cells use to store genetic information.
• Molecular biologists can break up a DNA molecule and determine its base sequence, which can be stored as a character string in a computer:
TTCAGCCGATATCCTGGTCAGATTCTCT
AAGTCGGCTATAGGACCAGTCTAAGAGA

Genes
• A gene is a DNA sequence that encodes instructions for building a protein.
• Gene expression is the process of using a gene to make a protein:
DNA gene → (transcription) → RNA transcript → (translation) → Protein product

Gene Regulation
• Primary mechanism is to control the rate of DNA transcription:
– Faster transcription → more protein
– Slower transcription → less protein
• Transcription rate is controlled by transcription factors, which are proteins which bind to specific DNA sequences.

Human Genome Project
• A U.S.-led, worldwide effort to determine the complete DNA sequence for humans, as well as several other organisms.
• These sequences will be used to study:
– mechanisms of disease
– growth and development
– evolutionary relationships

A Genome is a LOT of Data
• Raw sequence (text)
– Human (2005): 3 x 10^9 base pairs
– Yeast (finished): 1.2 x 10^7 base pairs
• Annotated sequence (Relational DB)
– Links to 3D structures of protein products, other genes in family, known transcription factors, journal references, and other databases.

A Rich Domain for Knowledge Discovery
• Most of the sequences (and genes) have unknown function.
• Efficient algorithms are needed to:
– identify important patterns
– identify and classify possible genes
– infer relationships between genes
– predict protein structure

The SUBDUE Knowledge Discovery System
• Input: A graph G
• Output: A list of substructures that compress G well
• Uses a computationally-constrained beam search and inexact graph match

What is a substructure?
• A definition subgraph and a list of subgraph instances:
[Figure: input graph A(1) next T(2) next A(3) next C(4) next A(5) next T(6) next G(7); substructure definition A next T; instances A(1) next T(2) and A(5) next T(6)]

MDL Heuristic
• SUBDUE uses the Minimum Description Length Principle to evaluate substructures.
• Description Length of a graph is the number of bits needed to send the graph’s adjacency matrix to a remote computer.
• Goal is to minimize DL(S) + DL(G|S).
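The substructure slide above can be illustrated with a minimal sketch, assuming the simple linear representation (one vertex per base, joined by "next" edges) and exact matching only. The function names `linear_graph` and `find_instances` are hypothetical and are not part of SUBDUE, which uses beam search and inexact graph match rather than this direct scan.

```python
def linear_graph(seq):
    """Build a chain graph: vertex i (1-based) is labeled with base i of seq."""
    vertices = {i + 1: base for i, base in enumerate(seq)}
    edges = [(i, i + 1, "next") for i in range(1, len(seq))]
    return vertices, edges


def find_instances(vertices, edges, pattern):
    """Return vertex-id tuples whose labels spell `pattern` along 'next' edges."""
    successor = {src: dst for src, dst, label in edges if label == "next"}
    instances = []
    for start in sorted(vertices):
        path, current = [], start
        for base in pattern:
            # Stop if the chain ended or the label does not match exactly.
            if current is None or vertices[current] != base:
                break
            path.append(current)
            current = successor.get(current)
        else:
            # Loop finished without a break: every base of the pattern matched.
            instances.append(tuple(path))
    return instances


vertices, edges = linear_graph("ATACATG")
instances = find_instances(vertices, edges, "AT")
print(instances)  # [(1, 2), (5, 6)]
```

The two tuples correspond to the two instances of the definition "A next T" in the seven-vertex input graph.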
SUBDUE Parameters
• Iterations: Graph is compressed using the best substructure, and discovery is restarted
• Threshold: Controls how much two subgraphs can differ to be considered similar
• Beam Width: The number of substructures in the expansion list

Graph Representations
• Simple linear
[Figure: chain A next C next A next T next G]
• Downstream edges
[Figure: bases A, C, A, T, G, with edges labeled 1 to 4 linking each base to the bases downstream of it]

Graph Representations
• Start vertex
[Figure: chain A next C next A next T next G, plus a Start vertex with edges labeled 1 to 5]
• Backbone
[Figure: chain of "base" vertices joined by "next" edges; each base vertex has a "name" edge to its letter (A, C, A, T, G)]

Graph Representations
• Backbone-star
[Figure: the backbone graph plus a "*" vertex with "star" edges to the base vertices]

Unsupervised SUBDUE
• Input: An entire yeast chromosome
[Figure: simple linear graph, A next C next A next T next G]
• Heuristic:
$$\mathrm{Value} = \frac{1}{\mathrm{DL}(S) + \mathrm{DL}(G \mid S)}$$
• Results: Not good; patterns with two to three bases

Polynomial Heuristic
$$\mathrm{Value} = \mathrm{SizeOf}(\mathrm{Definition})^{2} \times \mathrm{NumberOfInstances}$$
Threshold = 0.2

Pattern         Instances
TTTTTTTTTTTG    196
AAATTTTTTATT    158
TTTTTTTTTTGC    158
TTTTAATTTTTT    155
GAAATTTTTTAA    144

Unsupervised SUBDUE Discussion
• Random noise is not a meaningful kind of pattern variation in DNA.
• Unsupervised SUBDUE finds DNA patterns that are hard to evaluate and that are not focused on any target concept.
• We need to give SUBDUE more targeted input data and to modify the system to use it effectively.

Supervised SUBDUE
• Give SUBDUE two graphs: a graph of positive instances of a target concept, and a graph of negative instances.
• SUBDUE discovers substructures in the positive graph, finds instances in the negative graph, and bases the overall heuristic value on the values in both graphs.

New Data Sets
• Clusters of coexpressed yeast genes compiled by Brazma et al., from expression data generated by DeRisi et al.
• The expression level of each gene in a cluster changed at the same time and by a similar degree during the experiment; perhaps some genes in a cluster are regulated by similar mechanisms?
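The supervised setup described above (discover a pattern in the positive data, then look for its instances in the negative data) can be sketched on plain strings. This is only a string-level stand-in for SUBDUE's graph matching: the function `count_instances` and the toy sequences are hypothetical, and real inputs would be upstream windows taken from the yeast gene clusters.

```python
import re


def count_instances(pattern, sequences):
    """Count matches of an exact base pattern, including overlapping ones."""
    # A zero-width lookahead lets the regex engine report overlapping matches.
    regex = re.compile("(?=" + re.escape(pattern) + ")")
    return sum(len(regex.findall(seq)) for seq in sequences)


# Toy data (hypothetical): two "positive" and two "negative" sequences.
positive = ["CCCCTTAGG", "ATCCCCTTA"]
negative = ["ATATATATA", "GGCCCCTTA"]

n_pos = count_instances("CCCCTTA", positive)
n_neg = count_instances("CCCCTTA", negative)
print(n_pos, n_neg)  # 2 1
```

A supervised heuristic would then combine the two counts (or the two graph compressions) into a single value, so that a pattern frequent in both sets scores lower than one frequent only in the positive set.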
New Data Sets
• Positive examples:
– 300-bp upstream windows (both strands) for all genes in a given cluster
• Negative examples:
– 300-bp upstream windows for genes not in the cluster, OR
– 300-bp windows randomly selected from the complete genome (probably not involved in gene regulation)

Supervised Heuristic
• Based on the substructure’s values in the positive and negative graphs
$$\mathrm{Value} = \frac{\mathrm{Value}^{+}}{\mathrm{Value}^{-}} = \frac{1 / (\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S))}{1 / (\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S))} = \frac{\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S)}{\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S)}$$
• Numerator set to 1 when no negative instances

Compression Ratio
• Normalize the graph values by using the inverse of the graph compression
$$\mathrm{Value} = \frac{\mathrm{DL}(G^{+}) / (\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S))}{\mathrm{DL}(G^{-}) / (\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S))}$$

Negative Graph Value
$$\mathrm{Value} = \frac{\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S)}{\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S)}$$
• When there are no negative instances, setting numerator to 1 actually penalizes such substructures.
• Using 2 x DL(G-) in this situation gave better results.

Ratio Heuristic Results
Cluster            Best Pattern      Instances
c2_4.2222200.39    CCCCTTA           7
c2_4.2222201.41    ATATAATA          10
c2_4.2222210.37    GATATATA          6
cr2_4.222202.55    ATATATATATATAT    6
cr4.111101.77      CCCCTTA           10

Concept DL Heuristic
• Based on the size of a message containing the compressed positive graph, plus the errors (negative instances).
$$\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S) + \mathrm{DL}(G^{-}) - \mathrm{DL}(G^{-} \mid S)$$
$$\mathrm{Value} = -\left(\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S) - \mathrm{DL}(G^{-} \mid S)\right)$$

Concept DL Heuristic Results
• Relative graph size affected results
Cluster            Best Pattern    Instances
c2_4.2222200.39    AAAAAA          53
c2_4.2222201.41    AAAAAA          41
c2_4.2222210.37    AAAAAA          38
cr2_4.222202.55    AAAAAA          66
cr4.111101.77      ATATAA          57

Backbone Representation
[Figure: chain of "base" vertices joined by "next" edges; each base vertex has a "name" edge to its letter (A, C, A, T, G)]
• “Base” vertices allowed don’t-care positions, but heuristic had to be changed to accommodate them.
• Overlap became very important.
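To make the effect of the compression-ratio normalization concrete, here is a small numeric sketch. It assumes the supervised heuristic is the ratio (DL(S) + DL(G-|S)) / (DL(S) + DL(G+|S)) and that the compression-ratio form divides each graph's uncompressed DL by its compressed DL before taking the ratio. The function names and all description-length values are hypothetical and chosen only for easy arithmetic.

```python
def basic_ratio(dl_s, dl_pos_given_s, dl_neg_given_s):
    """Supervised heuristic without normalization: Value+ / Value-."""
    return (dl_s + dl_neg_given_s) / (dl_s + dl_pos_given_s)


def compression_ratio(dl_s, dl_pos, dl_pos_given_s, dl_neg, dl_neg_given_s):
    """Supervised heuristic with each graph value normalized by its own DL."""
    value_pos = dl_pos / (dl_s + dl_pos_given_s)
    value_neg = dl_neg / (dl_s + dl_neg_given_s)
    return value_pos / value_neg


# Hypothetical description lengths (bits); the negative graph is twice as large.
dl_s = 20
dl_pos, dl_pos_given_s = 1000, 780
dl_neg, dl_neg_given_s = 2000, 1980

unnormalized = basic_ratio(dl_s, dl_pos_given_s, dl_neg_given_s)
normalized = compression_ratio(dl_s, dl_pos, dl_pos_given_s, dl_neg, dl_neg_given_s)
print(unnormalized, normalized)  # 2.5 1.25
```

With these numbers, half of the unnormalized score comes purely from the negative graph being twice the size of the positive graph; the normalized value reflects only how much better the substructure compresses the positive graph.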
DL Equations
$$\mathrm{DL}(G) = \mathit{vbits} + \mathit{rbits} + \mathit{ebits}$$
$$\mathit{vbits} = \lg v + v \lg l_v$$
$$\mathit{rbits} = (v + 1) \lg (b + 1) + \sum_{i=1}^{v} \lg \binom{v}{k_i}$$
$$\mathit{ebits} = e \, (1 + \lg l_e) + (K + 1) \lg m$$

Negative Graph Value
• Using 2 x DL(G-) for no negative instances favored such substructures too strongly.
$$\mathrm{Value}^{-} = \frac{\mathrm{DL}(G^{-})}{\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S)} = \frac{\mathrm{DL}(G^{-})}{\mathrm{DL}(S) + \mathrm{DL}(G^{-})} \quad \text{where } l_v' = l_v + 1$$

Compression Difference Heuristic
• Use subtraction with the compression values instead of division.
$$\mathrm{Value} = \mathrm{Value}^{+} - \mathrm{Value}^{-} = \frac{\mathrm{DL}(G^{+})}{\mathrm{DL}(S) + \mathrm{DL}(G^{+} \mid S)} - \frac{\mathrm{DL}(G^{-})}{\mathrm{DL}(S) + \mathrm{DL}(G^{-} \mid S)}$$

Results
Cluster cr4.111101.77
Pattern     N+     N-    TRANSFAC matches
ATCCAT      16     12    MOUSE(3), HS(4), RAT(2), HCMV(1), RICE(1)
GGGA.G.A    19     16    MOUSE(2), HS(3), RABBIT(1)
TCCCT       65     35    Y$G3PDH_01
AAGGG       95     37    CAMV(2), RAT(9), AD(3), DROME(8), MOUSE(14), HS(30), DROOR(1), PCF(1), HPV(1), CHICK(2), RABBIT(1), EBV(1)
CCCT        128    76    Y$BCY_01, Y$GAL1_04, Y$X40_01, Y$CYC1_12, Y$GAL1_14, Y$G3PDH_01, Y$POX1_01, Y$DDR2_01, Y$DDR2_02, Y$TPI_02

Results
Cluster c2_4.2222200.39
Pattern    N+     N-     TRANSFAC matches
CATCC.T    6      10     Y$RP51A_01, Y$RPL16A_01, Y$FBP1_01
T.CTGCT    13     13     DROME(2), HS(8), MOUSE(2), RAT(2)
AGGGA      30     35     Y$HIS3_03, Y$STE6_01, Y$STE6_02, Y$DAL4_01
GCTGC.G    2      10     Y$CTT1_02
TTGC       111    202    Y$TEF2_01, Y$HMR_02, Y$CUP1_01, Y$CUP1_02, Y$MAL61_01, Y$URA3_04, Y$CYB2_02, Y$ARS1_06, Y$DAL7_01, Y$DAL7_02, Y$PGK_03

Results of Brazma et al.
Cluster c2_4.2222200.39
Pattern     N+    TRANSFAC matches
CCCCT..T    27    Y$DDR2_01, Y$DDR2_02, Y$TPI_02
A..AGGGG    27    none
GGGGC       27    Y$GAL2_02, Y$SUC2_02, Y$RRNA_01, Y$ERG11_01
GCCCC       27    Y$CYB2_02
G..GGGG     28    Y$CYC1_04, Y$CYC1_05, Y$CYC1_06

Brazma Heuristic
• Based on number of positive and negative instances
$$\mathrm{Value} = \frac{N^{+}}{N^{-} + 4.667}$$

SUBDUE Using Brazma Heuristic
Cluster c2_4.2222200.39
Pattern     N+    N-    TRANSFAC matches
CCCCT.AT    10    0     Y$DDR2_02
AT.AGGGG    10    0     CHICK$VIT2_18
A…GGGGG     10    2     Y$SUC2_02
CCCC..CT    14    4     Y$GAL3_01, Y$MAL2R_01
CCGGG.T     5     0     Y$CYC1_04, Y$CYC1_05, Y$CYC1_06, Y$MAL63_01

Conclusion
• SUBDUE can be used to discover likely transcription factor binding sites.
• Patterns found by SUBDUE are different from those found by string-based algorithms, due to the graph representation, beam search, and different search heuristic.

Conclusion
• Patterns found by unsupervised SUBDUE in DNA are difficult to evaluate.
• Using supervised SUBDUE can greatly focus the search on the target concept.
• Choosing the right graph representation and heuristic is critical to success.

Future Work
• Further refinement of the supervised MDL heuristic.
• Application of graph grammar theory to SUBDUE’s search.
• Close collaboration with molecular biologists to select data sets and evaluate results.
