WPI Center for Research in Exploratory Data and Information Analysis (CREDIA)
Research Bytes 2004
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

Need for Data Mining
• Data are being gathered and stored extremely fast
  – Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000).
• Computational tools and techniques are needed to help humans summarize, understand, and take advantage of accumulated data

What is Data Mining?
or more generally, Knowledge Discovery in Databases (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
• Raw Data
    Data Mining
• Patterns
  » Analytical and Statistical Patterns (rules, decision trees, …)
  » Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Data Analysis (KDD) Process
[Diagram: stages of the KDD process and what flows between them]
• data sources
• data
  – quantitative
  – qualitative
• data management
  – databases
  – data warehouses
• data “pre”processing, producing clean data
  – noisy/missing data
  – dimensionality reduction
• data analysis / data mining, producing models
  – analytical, statistical
  – visual
• model/pattern evaluation, yielding a “good” model
• model/patterns deployment on new data
  – prediction
  – decision support

KDD is Interdisciplinary
techniques come from multiple fields
• Machine Learning (AI)
  – Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  – Contributes language, framework, and techniques
• Pattern Recognition
  – Contributes pattern extraction and pattern matching techniques
• Databases
  – Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  – Contributes visual data displays and data exploration
• High Performance Computing
  – Contributes techniques for handling complexity efficiently
• Application Domain
  – Contributes domain knowledge

What do you want to learn from your data?
KDD approaches
[Diagram: Data at the center, surrounded by the approaches below, each with a small illustration]
• regression
• classification
• clustering
• change/deviation detection
• summarization
• dependency/assoc. analysis
  – IF a & b & c THEN d & k
  – IF k & a THEN e
  – A, B -> C 80%
  – C, D -> A 22%

Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
  – Systems Performance Data
  – Sleep Data
  – Financial Data
  – Web Data
• Data Mining for Genetic Analysis
  – Correlating genetic information with diseases
  – Predicting gene expression patterns
• Data Mining for Electronic Commerce
  – Collaborative and Content-Based Filtering
    • Using Association Rules and using Neural Networks

Analyzing Sleep Data
Purpose:
• Find associations between sleep patterns and health/pathology
• Obtain patterns of different sleep stages (4 sleep + REM + Wake)
DATA SET
• Clinical (sequential)
  – Electro-encephalogram (EEG)
  – Electro-oculogram (EOG)
  – Electro-myogram (EMG)
  – Probe measuring flow of oxygen in blood
  – etc.
• Diagnostic (tabular)
  – Questionnaire responses
  – Patient’s demographic info.
  – Patient’s medical history
(Source: http://www.blsc.com)
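Association rules such as "A, B -> C 80%" in the KDD approaches overview are commonly scored with two measures, support and confidence. The sketch below assumes the standard definitions: support is the fraction of all records that contain both sides of the rule, and confidence is the fraction of records containing the left-hand side that also contain the right-hand side. The function name `support_and_confidence` and the toy records are hypothetical illustrations, not code or data from the projects described in these slides.

```python
from typing import Iterable, List, Set, Tuple


def support_and_confidence(
    records: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = |records containing antecedent and consequent| / |records|
    confidence = |records containing antecedent and consequent|
                 / |records containing antecedent|
    """
    left = frozenset(antecedent)
    both = left | frozenset(consequent)

    n_total = len(records)
    n_left = sum(1 for r in records if left <= r)   # records matching the IF part
    n_both = sum(1 for r in records if both <= r)   # records matching the whole rule

    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_left if n_left else 0.0
    return support, confidence


# Hypothetical toy data: each record is the set of items observed together.
records = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
    {"D"},
    {"A"},
    {"B", "C"},
    {"A", "D"},
]

support, confidence = support_and_confidence(records, {"A", "B"}, {"C"})
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.40, confidence = 0.80
```

In this toy data the rule A, B -> C holds in 4 of the 5 records that contain both A and B, which gives the 80% confidence; the same 4 records out of 10 give a support of 40%.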
Potential Rules:
(A) Association Rules
    (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
    confidence = 92%, support = 13%
(B) Classification Rules
    (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian)
    confidence = 70%, support = 8%
WPI, UMass Medical, BC
*AHI = Apnea-Hypopnea index, **OSA = Obstructive Sleep Apnea

Input Data
• Each instance: [Tabular | set | sequential] * attributes

     | attr1                       | attr2           | attr3 | attr4               | attr5  | [class]
     | illnesses                   | heart rate      | age   | oxygen              | gender | Epworth
  P1 | {depression, fatigue}       |                 | 27    |                     | M      | 5
  P2 | {stroke, dementia, fatigue} | 97,72,67,80,…   | 73    | 90,92,96,89,86,…    | F      | 23
  P3 | {arthritis}                 | 102,99,87,96,…  | 49    | 97,100,82,80,70,…   | M      | 14
  …  | …                           | …               | …     | …                   | …      | …

Analyzing Financial Data
• Sequential data
  – daily stock values
• “Normal” (tabular/relational) data
  – sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
  – If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases

Events: Financial Data
Basic events: 16 or so financial templates [Little&Rhodes78]
difficult pattern matching: alignments and time warping
[Chart templates shown]
• Panic Reversal
• Rounding Top Reversal
• Head & Shoulders Reversal
• Descending Triangle Reversal

Closer Look: WPI Weka
Tool for mining complex temporal/spatial associations

Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis
  – discovering correlations between sequence variations and diseases
• Gene expression
  – discovering patterns that cause a gene to be expressed in a particular cell

Correlating Genetics with Diseases
• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.

Genomic Data Resources

  Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
  Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
  Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114

Wirth, B. et al. Journal of Human Molecular Genetics

Our System: CAGE
To predict gene expression based on DNA sequences.
[Diagram: Gene 1, Gene 2, and Gene 3 in a muscle cell, a neural cell, and seam cells; CAGE predicts whether a gene is On or Off]

Grad. & Undergrad. Students
• Ali Benamara
• Dharmesh Thakkar
• Senthil K Palanisamy
• Zachary Stoecker-Sylvia
• Keith A. Pray
• Jonathan Freyberger
• Maged El-Sayed
• Parameshvyas Laxminarayan
• Aleksandar Icev
• Wendy Kogel
• Michael Sao Pedro
• Christopher Shoemaker
• Weiyang Lin
• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov
• Takeshi Kawato
• Cindy Leung and Sam Holmes
• John Baird, Jay Farmer, Rebecca Gougian, Ken Monterio, Paul Young
• Zachary Stoecker-Sylvia
• Kristin Blitsch, Ben Lucas, Sarah Towey
• Wendy Kogel, Brooke LeClair, Christopher St. Yves
• Brian Murphy, David Phu, Ian Pushee, Frederick Tan
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano
• Christopher Cole
• Michael Ciman and John Gulbrandsen
• Tara Halwes
• Christopher Martino
• Matthew Berube
• Anna Novikov
• Amy Kao and Dana Rock