WPI Center for Research in Exploratory Data and Information Analysis (CREDIA)

KDDRG Research Projects

Prof. Carolina Ruiz
[email protected]
Department of Computer Science
Worcester Polytechnic Institute

Some Current Analytical Data Mining Research Projects at WPI

• Mining Complex Data: Set and Sequence Mining
  – Systems performance data
  – Sleep data
  – Financial data
  – Web data
• Data Mining for Genetic Analysis
  – Correlating genetic information with diseases
  – Predicting gene expression patterns
• Data Mining for Electronic Commerce
  – Collaborative and content-based filtering
    • Using association rules and using neural networks

Analyzing Sleep Data

Purpose:
• Find associations between sleep patterns and health/pathology
• Obtain patterns of the different sleep stages (4 sleep stages + REM + Wake)

Data set:
• Clinical (sequential): electro-encephalogram (EEG), electro-oculogram (EOG), electro-myogram (EMG), probe measuring flow of oxygen in blood, etc. (Source: http://www.blsc.com)
• Diagnostic (tabular): questionnaire responses, patient's demographic info, patient's medical history
Potential Rules:
(A) Association rules
    (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
    confidence = 92%, support = 13%
(B) Classification rules
    (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian)
    confidence = 70%, support = 8%

*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
WPI, UMass Medical, BC

Input Data

• Each instance: [tabular | set | sequential]* attributes

ID   attr1: illnesses               attr2: heart rate   attr3: age   attr4: oxygen        attr5: gender   [class]: Epworth
P1   {depression, fatigue}          [not recovered]     27           [not recovered]      M               5
P2   {stroke, dementia, fatigue}    97,72,67,80,…       73           90,92,96,89,86,…     F               23
P3   {arthritis}                    102,99,87,96,…      49           97,100,82,80,70,…    M               14
…    …                              …                   …            …                    …               …

Analyzing Financial Data

• Sequential data
  – daily stock values
• “Normal” (tabular/relational) data
  – sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
  – If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases

Events – Financial Data

• Basic events: 16 or so financial templates [Little&Rhodes78]
• Difficult pattern matching: alignments and time warping
• Example templates: Panic Reversal, Rounding Top Reversal, Head & Shoulders Reversal, Descending Triangle Reversal

WPI Weka

Tool for mining complex temporal/spatial associations

Data Mining for Genetic Analysis

w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis – discovering correlations between sequence variations and diseases
• Gene expression – discovering patterns that cause a gene to be expressed in a particular cell

Correlating Genetics with Diseases

• Utilize data mining techniques with actual genetic data sampled from research
• Spinal Muscular Atrophy (SMA): inherited disease that results in progressive muscle degeneration and weakness

Genomic Data Resources

Patient gender   SMA type (severity)   SNP location   C212 (father / mother)   AG1-CA (father / mother)
Female           Severe                Y272C          31 / 28 29               102 / 108 112
Male             Mild                  Y272C          28 29 / 25               108 112 / 114

Source: Wirth, B. et al. Journal of Human Molecular Genetics

Our System: CAGE

To predict gene expression based on DNA sequences.
[Figure: CAGE takes Gene 1, Gene 2, and Gene 3 in muscle, neural, and seam cells and predicts whether each gene is on or off.]

Gene Expression Analysis

Promoter   Motifs           Sequence                                      Gene     Cell type
PR1        M1 M4 M2 M5      CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT    Gene 1   neural
PR2        M1 M4 M5         AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA    Gene 2   neural
PR3        M4 M1            CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA    Gene 3   muscle
PR4        M1 M2 M5         GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA    Gene 4   neural
PR5        M1 M4            ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA    Gene 5   muscle
PR6        M3 M4 M5         GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC    Gene 6   neural
PR7        M5 M2 M3 M1      TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA    Gene 7   neural
PR8        M2 M4 M5         ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC   Gene 8   neural
PR9        M4 M3            ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA    Gene 9   muscle

Gene Expression

• Transcription of DNA into RNA
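The promoter table on the Gene Expression Analysis slide can be read as a small transaction database: each promoter is an instance, its motifs are the items, and the cell type is the class. The sketch below shows how support and confidence for a rule such as {M1, M4, M5} => neural could be computed from that table. It assumes that promoter PRi belongs to Gene i, uses the standard definitions of support and confidence, and the names `PROMOTERS` and `rule_stats` are hypothetical, not part of CAGE or WPI Weka.

```python
# Motifs and cell type per promoter, transcribed from the slide.
# Assumption: promoter PRi is the promoter of Gene i.
PROMOTERS = {
    "PR1": ({"M1", "M4", "M2", "M5"}, "neural"),
    "PR2": ({"M1", "M4", "M5"}, "neural"),
    "PR3": ({"M4", "M1"}, "muscle"),
    "PR4": ({"M1", "M2", "M5"}, "neural"),
    "PR5": ({"M1", "M4"}, "muscle"),
    "PR6": ({"M3", "M4", "M5"}, "neural"),
    "PR7": ({"M5", "M2", "M3", "M1"}, "neural"),
    "PR8": ({"M2", "M4", "M5"}, "neural"),
    "PR9": ({"M4", "M3"}, "muscle"),
}


def rule_stats(antecedent, cell_type, data=PROMOTERS):
    """Return (support, confidence, supporting_ids) for antecedent => cell_type.

    support    = instances containing the antecedent AND having the cell type,
                 divided by all instances
    confidence = the same count divided by instances containing the antecedent
    """
    antecedent = set(antecedent)
    # Instances whose motif set contains every motif in the antecedent.
    covered = [pid for pid, (motifs, _) in data.items() if antecedent <= motifs]
    # Of those, the instances that also have the predicted cell type.
    supporting = [pid for pid in covered if data[pid][1] == cell_type]
    support = len(supporting) / len(data)
    confidence = len(supporting) / len(covered) if covered else 0.0
    return support, confidence, supporting


if __name__ == "__main__":
    sup, conf, ids = rule_stats({"M1", "M4", "M5"}, "neural")
    print(f"support={sup:.0%} confidence={conf:.0%} instances={ids}")
```

Under the PRi-to-Gene-i assumption, {M1, M4, M5} => neural is supported by PR1 and PR2 only, giving support 2/9 (about 22%) and confidence 100%.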
[Figure: transcriptional proteins TF 1, TF 2, and TF 3; promoter region with motifs M1, M4, M2 (..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA) followed by the gene; labeled distances 240 and 100; motifs M1, M2, M4; muscle cell.]

Association rules mined from the promoter table (PR1 to PR9 with Gene 1 to Gene 9, as listed on the Gene Expression Analysis slide):

R1: M1, M4, M5 => Neural (supp = 22%, conf = 100%) [Supporting instances: PR1, PR2]
R2: M2, M4, M5 => Neural (supp = 22%, conf = 100%) [Supporting instances: PR1, PR8]

“Well-clustered” motifs

Coefficient of variation of distances (cvd) between two motifs:

  cvd_{IR_n}(M_j, M_k) = \sigma_{IR_n}(M_j, M_k) / \mu_{IR_n}(M_j, M_k)

where \sigma is the standard deviation and \mu the mean of the distances between M_j and M_k.

[Figure: promoters drawn with motifs M1 to M5 and the distances between adjacent motifs; example itemset IR1 = {M1, M2, M5}.]

Example: \sigma(M1, M2) = 120.1, \mu(M1, M2) = 216.6, cvd(M1, M2) = 0.55

Distance-based Association Rules

• Given:
  – min-support
  – min-confidence
  – max-cvd
  thresholds
• Mine:
  – all distance-based association rules

Sample distance-based association rule:

R1: M1, M2, M5 => Neural (sup = 33%, conf = 100%)

Distance statistics for the motif pairs of R1 (pair labels for the last two columns are not recoverable):

        (M1, M2)   pair 2   pair 3
cvd     0.554      0.076    0.433
mean    216.6      462.0    237.0
sdev    120.1      35.0     103.0

Grad. & Undergrad. Students

• Ali Benamara
• Dharmesh Thakkar
• Senthil K Palanisamy
• Zachary Stoecker-Sylvia
• Keith A. Pray
• Jonathan Freyberger
• Maged El-Sayed
• Parameshvyas Laxminarayan
• Aleksandar Icev
• Wendy Kogel
• Michael Sao Pedro
• Christopher Shoemaker
• Weiyang Lin

• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov
• Takeshi Kawato
• Cindy Leung and Sam Holmes
• John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young
• Zachary Stoecker-Sylvia
• Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB)
• Wendy Kogel, Brooke LeClair, Christopher St. Yves
• Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB)
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB)
• Christopher Cole
• Michael Ciman and John Gulbrandsen
• Tara Halwes
• Christopher Martino
• Matthew Berube
• Anna Novikov
• Amy Kao and Dana Rock
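Supplementary sketch for the Distance-based Association Rules slide: the cvd values reported there are the listed sdev divided by the listed mean. The code below is a minimal illustration under stated assumptions: it uses the population standard deviation (the slides do not say whether the population or sample form is used), and the function names `cvd`, `cvd_from_summary`, and `is_well_clustered` are hypothetical.

```python
from statistics import mean, pstdev


def cvd(distances):
    """Coefficient of variation of a list of motif-to-motif distances.

    Assumption: population standard deviation; the slides do not specify.
    """
    m = mean(distances)
    if m == 0:
        raise ValueError("cvd is undefined when the mean distance is zero")
    return pstdev(distances) / m


def cvd_from_summary(sdev, mean_distance):
    """cvd computed from already-summarized sdev and mean values."""
    if mean_distance == 0:
        raise ValueError("cvd is undefined when the mean distance is zero")
    return sdev / mean_distance


def is_well_clustered(distances, max_cvd):
    """True when the distances vary little relative to their mean."""
    return cvd(distances) <= max_cvd


if __name__ == "__main__":
    # Summary values taken from the slides for the pair (M1, M2).
    print(round(cvd_from_summary(120.1, 216.6), 3))
    # Hypothetical raw distances, for illustration only.
    print(is_well_clustered([90, 110], max_cvd=0.2))
```

With the slide's values for (M1, M2), sdev 120.1 and mean 216.6, the ratio rounds to 0.554, which matches the reported cvd. A max-cvd threshold then keeps only motif pairs whose spacing is consistent across the supporting promoters.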