Download 02_incob1 - NDSU Computer Science

Peano Count Trees and Association Rule Mining for Gene Expression Profiling using DNA Microarray Data

Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard, Francis Larson; North Dakota State University
{william.perrizo, willy.valdivia, edward.deckard, francis.larson}@ndsu.nodak.edu
Patents pending on bSQ and Ptree technology

The Problem
• There is a lot of data available today (e.g., gene expression data), but too little information.
• Data mining attempts to reduce raw data to information for decision support.
[Figure: layers running from decisions at one end to raw data at the other]
• Decisions (often 1 bit: Y/N, T/F, Do/Don’t_do)
• Data mining
  – Classification (supervised learning)
  – Clustering (unsupervised learning)
  – Association Rule Mining (ARM)
• Statistics
• Machine Learning
• Data Structuring
• Signal Processing
• 0/1 raw data (gigs, teras, petas, exas…)

A Solution?
Currently the predominant method employed in bioinformatics is clustering (with a little classification) on isolated microarray datasets.
Needed: a data mining software suite able to:
• transform copies of pertinent data from a variety of databases into a data-mining-ready form in real time (our solution, based on P-trees?). We say “transform copies” rather than “standardize” since standardization rarely works! There will always be an MS (and I don’t mean Martha Stewart) to frustrate/destroy the standardization effort.
• facilitate Association Rule Mining, Clustering, and Classification in a uniform way (so data mining results from other areas can be used).

Bioinformatics: a Walmart or a Kmart?!?
• Walmart took DM seriously (an early, comprehensive approach borrowing useful techniques from a variety of application areas).
• Kmart? Too little, too late.

Using data mining techniques developed for other application areas in bioinformatics?
Remotely Sensed Images (RSI) can be viewed as collections of pixels.
Each pixel has a value for each feature attribute.
[Figure: TIFF image and Yield Map]
For example, the RSI dataset above has 1320 rows and 1320 columns of pixels (1,742,400 pixels) and 4 feature attributes (Red, Green, Blue, Yield). The (R, G, B) feature bands are in the TIFF image and the Y feature is color coded in the Yield Map.
Microarray or DNA chip data is not much different (multiple attributes corresponding to treatments or conditions).
Much data mining (ARM) has been done on RSI data. Can it be useful in bioinformatics?

Regulation Pathway Discovery is not very different from Market Basket Research (à la Walmart)
• The results of clustering microarray data may indicate that genes 1 to 9 are involved in a regulation pathway.
• High-confidence rule mining on that cluster can discover the relationships among those genes. For example, the expression of one gene, Gene2, might be discovered to be regulated by Genes 1, 3, 5, 6, 8 and 9, while Gene4 and Gene7 may not be directly regulating Gene2 and can therefore be excluded.
[Figure: Clustering groups Gene1 through Gene9; ARM then relates the remaining genes to Gene2]

ARM for Microarray Data
• A gene regulatory pathway component can be represented as an association rule, {G1, …, Gn} → Gm, where {G1, …, Gn} is the antecedent and Gm is the consequent.
• Microarray data is most often represented as a relation G(Gid, T1, …, Tn), where Gid is the gene identifier, T1, …, Tn are the treatments (or conditions), and the data values represent gene expression levels. Call this the “Gene Table”.
• Currently, data mining techniques concentrate on the Gene Table, specifically on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the Gene Table).

Gene Table (gene expression values)
Gene-ID   T1    T2    T3    T4
G1        ….    ….    ….    ….
G2        ….    ….    ….    ….
G3        ….    ….    ….    ….
G4        ….    ….    ….    ….

ARM for Microarray Data (Contd.)
• An alternate data format exists, called the “Treatment Table”: T(Tid, G1, G2, …, Gn), where Tid is the treatment identifier and G1, …, Gn are the gene identifiers.
• The Treatment Table provides a convenient form for ARM of gene expression levels.
• The goal is to mine for rules among genes by associating Treatment Table columns.

Treatment Table (gene expression values)
Trmt-ID   G1    G2    G3    G4
T1        ….    ….    ….    ….
T2        ….    ….    ….    ….
T3        ….    ….    ….    ….
T4        ….    ….    ….    ….

The form of the Treatment Table with binary values (coding only whether an expression level exceeds or does not exceed a threshold) is identical to Market Basket Data, for which a wealth of rule mining techniques has been developed in the last 8 years.

The Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments. It is a 2-D data cube which can be rotated (to the Treatment Table), rolled up, sliced, diced, drilled down, association rule mined, etc.

What are Peano Trees? First, what are the Spatial Data Formats?
Example: a 2 x 2 image with two bands (binary values in parentheses)
BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)
BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)

Band SeQuential (BSQ) format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

Band InterLeaved by Line (BIL) format (1 file)
254 127 37 240 14 193 200 19

Band Interleaved by Pixel (BIP) format (1 file)
254 37 127 240 14 200 193 19
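The three band layouts above differ only in the order in which the same values are written out. A minimal Python sketch, assuming the two 2 x 2 bands from the slide are held as nested lists (the function names `to_bsq`, `to_bil` and `to_bip` are illustrative, not part of any P-tree software):

```python
# The two 2 x 2 bands from the slide, stored row by row.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]


def to_bsq(bands):
    """Band SeQuential: one flat list (file) per band."""
    return [[v for row in band for v in row] for band in bands]


def to_bil(bands):
    """Band Interleaved by Line: for each image row, that row of every band in turn."""
    n_rows = len(bands[0])
    return [v for r in range(n_rows) for band in bands for v in band[r]]


def to_bip(bands):
    """Band Interleaved by Pixel: for each pixel, its value in every band in turn."""
    n_rows, n_cols = len(bands[0]), len(bands[0][0])
    return [band[r][c] for r in range(n_rows) for c in range(n_cols) for band in bands]


print(to_bsq(bands))  # [[254, 127, 14, 193], [37, 240, 200, 19]]
print(to_bil(bands))  # [254, 127, 37, 240, 14, 193, 200, 19]
print(to_bip(bands))  # [254, 37, 127, 240, 14, 200, 193, 19]
```

The printed lists match the BSQ, BIL and BIP listings on the slide.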
bit SeQuential (bSQ) format (16 files) (related to bit planes in graphics)
Each file Bij holds bit j (high-order bit first) of every Band-i value, in BSQ pixel order.
Band 1 (254 127 14 193)      Band 2 (37 240 200 19)
B11   1 0 0 1                B21   0 1 1 0
B12   1 1 0 1                B22   0 1 1 0
B13   1 1 0 0                B23   1 1 0 0
B14   1 1 0 0                B24   0 1 0 1
B15   1 1 1 0                B25   0 0 1 0
B16   1 1 1 0                B26   1 0 0 0
B17   1 1 1 0                B27   0 0 0 1
B18   0 1 0 1                B28   1 0 0 1

Reasons for using bSQ format
– Different bits contribute to the value differently.
– bSQ format facilitates representation of a precision hierarchy (1-bit, 2-bit, …, n-bit precision).
– bSQ format facilitates the creation of an efficient P-tree data structure and P-tree algebra.

BSQ and bSQ formats
– BSQ and bSQ are “tabular” formats.
  • BSQ consists of a separate table for each band (e.g., Gene or Treatment).
  • bSQ consists of a separate table for each bit of each band.
– One can view it this way:
  • The data set is initially one relation or table, R(K1, …, Kk, A1, A2, …, An), where K1, …, Kk are structure attributes and each Ai is a feature attribute.
– Structure attributes of an RSI are the X and Y coordinates (one could put the same structure on the Gene Table, but I want to focus on the Treatment Table).
– Structure attributes of the Treatment Table might be a collection of treatment dimensions, based on the MIAME standard (Minimum Information About a Microarray Experiment): http://www.mged.org/Annotations-wg/index.html
  » Experimental design
  » Array design
  » Samples
  » Hybridisations
  » Measurements
  » Normalization Control

A Universal Format?
• E.g., one large universal table with 5 dimensions based on the MIAME standard, for data mining across all treatments and genes?
– E = Experimental design
– Hybridisation Procedures
– A = Array design
– S = Samples
– M = Measurements
– N = Normalization Control
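The bSQ decomposition described earlier (one bit file per bit position of each band) can be sketched in a few lines of Python. This is only an illustration under the assumption of 8-bit values in BSQ pixel order; the helper names `to_bsq_bits` and `from_bsq_bits` are hypothetical.

```python
def to_bsq_bits(values, n_bits=8):
    """Split one band into n_bits bit files.

    File j (1-based) holds bit j of every value, high-order bit first,
    so to_bsq_bits(band1)[1] corresponds to B11 on the slide.
    """
    return {
        j: [(v >> (n_bits - j)) & 1 for v in values]
        for j in range(1, n_bits + 1)
    }


def from_bsq_bits(bit_files, n_bits=8):
    """Reassemble the original values, showing the split loses nothing."""
    n_values = len(bit_files[1])
    return [
        sum(bit_files[j][i] << (n_bits - j) for j in range(1, n_bits + 1))
        for i in range(n_values)
    ]


band1 = [254, 127, 14, 193]
band2 = [37, 240, 200, 19]
b1 = to_bsq_bits(band1)
b2 = to_bsq_bits(band2)

print(b1[1])              # B11: [1, 0, 0, 1]
print(b2[8])              # B28: [1, 0, 0, 1]
print(from_bsq_bits(b1))  # [254, 127, 14, 193]
```

Keeping only the first k bit files of a band gives its k-bit precision approximation, which is the precision hierarchy mentioned among the reasons for using bSQ.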
"GREASMN" (5-D Universal Gene Expression Cube) Gene-Rep G1 G2 … Gn Tid (E,A,S,M,N) E,A,S,M,N1 E,A,S,M,N2 …. …. …. …. …. …. …. …. …. Gene expression values ... E,A,S,M,Nm Cardinatlity is high, but compression will be substantial (next slide). GREASMN datacube rolled up onto (E,S) E (Lab…) S (Organism..) E1 E2 1 5 2 0… 90. The non-zero blocks may occur off the diagonal. The Point: Massive but very sparse dataset! 0 8 1 7 6 5... zeros 70. Sn . zeros 1 7 0... . . . . En Yeast S1 S2 . Peano Count Tree (P-tree) P-tree represents spatial bSQ data bit-by-bit in a recursive quadrant-by-quadrant arrangement. P-tree is a lossless, compressed, data-miningready representation of the data. – partially run-length compressed using the structure attributes. – “count pre-computed”. 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 An example of Peano Count tree Given a bSQ file, Bij, (shown in spatial positions below) we create its basic PC-tree, Pij as follows. 
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Level-3 (root):                 55
Level-2 (quadrants 0 1 2 3):    16   8   15   16
Level-1 (children of 8):        3 0 4 1
Level-1 (children of 15):       4 4 3 4
Level-0 (bits of the mixed Level-1 nodes 3, 1, 3):   1110   0010   1101

Terms illustrated:
• Peano or Z-ordering
• Level
• Pure (Pure-1/Pure-0) quadrant
• Fan-out
• Root Count
• QID (Quadrant ID)

An example of PC-tree (Contd.)
The pixel at ( 7, 1 ) = ( 111, 001 ) has QID 2.2.3 = 10.10.11.

Alternative forms for Ptrees (all lossless)
• P1: 1 means the quadrant is pure-1, 0 otherwise (pure-0 if there are no sub-tree pointers, otherwise mixed).
• P0: 1 means the quadrant is pure-0, 0 otherwise.
• PNZ (= P0’): 1 means the quadrant is Not pure-Zero, 0 otherwise. (Note: PM = PNZ XOR P1.)

Tree forms (root; Level-2; Level-1 children of quadrants 01 and 10; Level-0 bits):
        root   Level-2    under 01   under 10   Level-0
P1:     0      1 0 0 1    0 0 1 0    1 1 0 1    1110 0010 1101
P0:     0      0 0 0 0    0 1 0 0    0 0 0 0    0001 1101 0010
PNZ:    1      1 1 1 1    1 0 1 1    1 1 1 1    1110 0010 1101

Vector forms (a table entry for each mixed inode containing its qid and its children bit-vector; this eliminates the need for subtree pointers):
qid        P1V     P0V     PNZV
[ ]        1001    0000    1111
[01]       0010    0100    1011
[10]       1101    0000    1111
[01.00]    1110    0001    1110
[01.11]    0010    1101    0010
[10.10]    1101    0010    1101
Since there is no qid = [01.01] in the P1V table, we know that quadrant is pure-0, not mixed.

Basic, Value and Tuple Ptrees
• Basic Ptrees (i.e., P11, P12, …, P18, P21, …, P28, …, P71, …, P78) are ANDed to form
• Value Ptrees (i.e., P1,001 = P11’ AND P12’ AND P13), which are ANDed to form
• Tuple Ptrees (i.e., P001,010,111 = P1,001 AND P2,010 AND P3,111).

Distributed P trees?
P11:
qid        NZ      P1
[ ]        1111    1001
[01]       1011    0010
[10]       1111    1101
[01.00]            1110
[01.11]            0010
[10.10]            1101
P12:
qid        NZ      P1
[ ]        1010    1000
[10]       1111    1110
[10.11]            0111
P13:
qid        NZ      P1
[ ]        0111    0001
[01]       1111    1110
[10]       1110    0110
[01.11]            0110
[10.00]            1000

Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11. Send an entry to Nodeij if its qid ends in ij (Bp is the bit position, i.e., the basic Ptree the entry belongs to):
Node     Bp   qid        NZ      P1
NodeC    11   [ ]        1111    1001
NodeC    12   [ ]        1010    1000
NodeC    13   [ ]        0111    0001
Node00   11   [01.00]            1110
Node00   13   [10.00]            1000
Node01   11   [01]       1011    0010
Node01   13   [01]       1111    1110
Node10   11   [10]       1111    1101
Node10   11   [10.10]            1101
Node10   12   [10]       1111    1110
Node10   13   [10]       1110    0110
Node11   11   [01.11]            0010
Node11   12   [10.11]            0111
Node11   13   [01.11]            0110

• A data mining request involves a series of multicast invocations and at most one unicast reply for each receiving node.
• A distributed genomic data mining federation of Beowulf clusters?
• Each node computes only a tiny portion of the necessary count information, then sends it to the requesting node?

Non-ARM Ptree-based Microarray data mining methods
[Figure: taxonomy of methods with example dendrograms]
• Hierarchical Clustering: Agglomerative, Divisive
• Non-Hierarchical Clustering: K-clustering, SOM
• PCA
• Supervised Learning or Classification: SVM, Decision Trees, KNN

bSQ format: bit files of intervalized, normalized red/green ratios for each microarray.
Ptree format: one P-tree for each bit position of each bSQ file (e.g., the high-order bit):
depth=0  level=3:   55
depth=1  level=2:   16   8   15   16
depth=2  level=1:   3 0 4 1   4 4 3 4
depth=3  level=0:   1110   0010   1101

A plan
[Figure: architecture diagram with the following labels]
• Temporal Gene Exp. Analysis; Spatial Gene Exp. Analysis; Genotypic Gene Exp. Analysis
• Data Repository (bSQ, Ptrees)
• Development of Data Mining Tools
• User; JAVA Graphical Interface
• SQL, XML
• Other Microarray Data Repositories: Stanford, EMBL, SGDB

Data Mining in Genomics: Conclusion
• Application areas with huge raw data stores, such as Market Basket Research, Remotely Sensed Imagery, and Genomics (Proteomics?, Transcriptomics, Metabolomics?), are remarkably similar in terms of data and data mining needs.
• There should be more collaboration across applications.
• In these application areas, data cube rotation can open data mining possibilities.
• We suggest a universal data structure (GREASMN Table and P-trees)
  • striped across a wide federation of computer nodes,
  • using P-tree technology to facilitate data mining,
  • eliminating barriers introduced by scale limitations, incompatible data formats, etc.
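The Peano Count tree example from the slides (root count 55, quadrant counts 16, 8, 15, 16) can be reproduced with a short recursive sketch in Python. It is a sketch under stated assumptions rather than the patented implementation: the nested-tuple tree, the function names `build_pc_tree`, `qid` and `count_at`, and the reading of ( 7, 1 ) as (row, column) are all choices made here, and recursion continues down to single bits instead of stopping at 4-bit leaf vectors.

```python
# The 8 x 8 bit file from the example, in spatial (row, column) positions.
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_pc_tree(bits, row=0, col=0, size=None):
    """Return (count, children) for the size x size quadrant at (row, col).

    children is None for a pure quadrant (all 0s or all 1s, including a
    single bit); otherwise it lists the four sub-quadrants in Peano (Z)
    order: upper-left, upper-right, lower-left, lower-right.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: no subtree needed
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    children = [build_pc_tree(bits, row + dr, col + dc, half)
                for dr, dc in offsets]
    return (count, children)


def qid(row, col, n_levels=3):
    """Quadrant ID of a pixel: interleave row and column bits, high-order first."""
    return [2 * ((row >> k) & 1) + ((col >> k) & 1)
            for k in range(n_levels - 1, -1, -1)]


def count_at(tree, path):
    """Follow QID digits from the root and return the count of the deepest
    node reached (the walk stops early at a pure quadrant)."""
    count, children = tree
    for digit in path:
        if children is None:
            break
        count, children = children[digit]
    return count


tree = build_pc_tree(BITS)
root_count, quadrants = tree
print(root_count)                    # 55
print([q[0] for q in quadrants])     # [16, 8, 15, 16]
print(qid(7, 1))                     # [2, 2, 3]
print(count_at(tree, qid(7, 1)))     # 1
```

Pure quadrants carry no children, which is where the compression comes from; the two pure-1 quadrants of this example are each stored as a single count of 16.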
