Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Part I: Introductory Materials Introduction to Data Mining Dr. Nagiza F. Samatova Department of Computer Science North Carolina State University and Computer Science and Mathematics Division Oak Ridge National Laboratory What is common among all of them? 2 Who are the data producers? What data? Application Data • Application Category: Finance • Producer: Wall Street • Data: stocks, stock prices, stock purchases,… • Application Category: Academia • Producer: NCSU • Data: students admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc. 3 Application Categories • • • • • • • Finance (e.g., banks) Entertainment (e.g., games) Science (e.g., weather forecasting) Medicine (e.g., disease diagnostics) Cybersecurity (e.g., terrorists, identity theft) Commerce (e.g., e-Commerce) … 4 What questions to ask about the data? DataQuestions • Academia:NCSU:Admission data 1. Is there any correlation between the students’ GRE scores and their successful completion of a PhD program? 2. What are the groups of students that share common academic performance? 3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that? 4. If the student majors in Physics, what other major is he/she likely double-major? 5 Questions by Types? • • • • • • • • Correlation, similarity, comparison,… Association, causality, co-occurrence,… Grouping, clustering,… Categorization, classification,… Frequency or rarity of occurrence,… Anomalous or normal objects, events, behaviors, Forecasting: future classes, future activity,… … 6 What information we need to answer? QuestionsData Objects and Object Features • Academia:NCSU:Admission data – – Objects: Students Object’s Features=Variables=Attributes=Dimensions & Types • Name:String (e.g., Name=Neil Shah) • GPA:Numeric (e.g., GPA=5.0) • Recommendation:Text (e.g., … the top 2% in my career…) • Etc. 7 How to compare two objects? Data Object Object Pairs • Academia:NCSU:Admission data – – Objects: Students Based on a single feature: • Similar GPA • The same first letter in the last name – Based on a set of features: • Similar academic records (GPA, GRE, etc.) • Similar demographic records – Can you compute a numerical value for your similarity measure used for comparison? Why or Why not? 8 How to represent data mathematically? Data Object & its Features Data Model • What mathematical objects have you studied? – – – – – – – – – – – Scalar Points Vectors Vector spaces Matrices Sets Graphs, networks (maybe) Tensors (maybe) Time series (maybe) Topological manifolds (maybe) … 9 Data object as vector with components… Vector components: • Features, or City=(Latitude, Longitude)--2-dimensional object • Attributes, or Raleigh=(35.46, 78.39) Boston=(42.21, 71.5) • Dimensions Proximity(Raleigh, Boston)=? • Geodesic distance • Euclidean distance • Length of the interstate route 10 A set of data objects as vector spaces 3-dimensional vector space Altitude Moscow Raleigh Latitude Longitude Mining such data ~ studying vector spaces 11 Multi-dimensional vectors… Vector components: • Features, or • Attributes, or • Dimensions Student=(Name, GPA, Weight, Height, Income in K, …) - mutli-dimensional S1=(John Smith, 5.0, 180, 6.0, 200) S2=(Jane Doe, 3.0, 140, 5.4, 70) Proximity(S1, S2)=? • How to compare when vector components are of heterogeneous type, or different scales? • How to show the results of the comparison? 12 as matrices… Example: A collection of text documents on the Web Original Documents Parsed Documents D1: Child Safety at Home D2: Infant & Toddler First Aid Your Baby's Health and D3: Safety: From Infant to Toddler D1: Child Safety Home D2: Infant Toddler Bab Health Safety Infant D3: Toddler t-d term-document matrix Terms=Features=Dimensions T1: T2: T3: T4: T5: T6: T7: D1: 0 1 0 1 0 1 0 D2: 0 0 0 0 1 0 1 D3: 1 0 1 0 1 1 1 T1: T2: T3: T4: T5: T6: T7: Mining such data ~ studying matrices Bab Child Health Home Infant Safety Toddler 13 or as trees t-d term-document matrix T1: T2: T3: T4: T5: T6: T7: D1: 0 1 0 1 0 1 0 D2: 0 0 0 0 1 0 1 document D3: 1 0 1 0 1 1 1 D2 president government party election political elected national districts held district independence vice minister parties Is D2 similar to D3? D3 What if there are 10,000 terms? population area climate city miles province land topography total season 1999 square rate Mining such data ~ studying trees terms economy million products 1996 growth copra economic 1997 food scale exports rice 14 fish 0r as networks, or graphs w/ nodes & links president government party election political elected national districts held district independence vice minister parties Nodes=Documents Links=Document similarity (e.g., if document references another document ) population area climate city miles province land topography total season 1999 square rate economy million products 1996 growth copra economic 1997 food scale exports rice fish Mining such data ~ studying graphs, or graph mining 15 What apps naturally deal w/ graphs? Social Networks Semantic Web Drug Design, Computer networks Chemical compounds Credit: Images are from Google images via search of keywords World Wide Web Sensor networks 16 What questions to ask about graph data? Graph Data Graph Mining Questions • Academia:NCSU:Admission data 1. Nodes=students; links=similar academics/demographics 2. How many distinct academically performing groups of students admitted to NCSU? 3. Which academic group is the largest? 4. Given a new student applicant, can we predict which academic group the student will likely belong to? 5. Are groups of student with similar demographics usually share similar academic performance? 6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased? 7. … 17 Recap: Data Mining and Graph Mining Application Data Questions Data Objects + Features Mathematical Data Representation (Data Model) Vectors Graphs Matrices Not one hat fits all Time series Sets Tensors More than one models are needed Manifolds Models are related 18 How much data? 1PB/year Ecology 20-40TB/simulation Biology 30TB/day Climate Astrophysics Cosmology 850TB Web My laptop: 60 GB (GigaBytes) – 109 Bytes 1 TB (TeraByte) – 1012 Bytes 1 PB (PetaByte) – 1015 Bytes 19 It is not just the Size – but the Complexity Petabytes Data 20 Data Describes Complex Patterns/Phenomena How to untangle the riddles of the complexity? Complex regulation Single gene ~30k genes Analytical tools that find the “dots” from data significantly reduce data. 50 trans elements control single gene expression Challenge: How to “connect the dots” to answer important science/business questions? 21 Connecting the Dots Finding the Dots Sheer Volume of Data Climate Now: 20-40 Terabytes/year 5 years: 5-10 Petabytes/year Fusion Now: 100 Megabytes/15 min 5 years: 1000 Megabytes/2 min Connecting the Dots Advanced Math+Algorithms Huge dimensional space Combinatorial challenge Complicated by noisy data Requires high-performance computers Understanding the Dots Providing Predictive Understanding Produce bioenergy Stabilize CO2 Clean toxic waste 22 Why Would Data Mining Matter? Enables solving many large-scale data problems Finding the Dots Connecting the Dots Understanding the Dots • How to effectively produce bioenergy? • How to stabilize carbon dioxide? • How to convert toxic into non-toxic waste? ... 23 Science Questions How to Move and Access the Data? Technology trends are a rate limiting factor Most of these data will NEVER be touched! Data doubles every 9 months; CPU ―18 months. Naturally distributed but effectively immovable Retrieval Rate Mbytes/s Latency and Speed – Storage Performance Streaming/Dynamic but not re-computable CPU, Disk, Network Trend Doubling: 105 Memory Disk Tape CPU: every 1.2 years Disk: every 1.4 years WAN: 0.7 years log10(Object Size Bytes) Src: Richard Mount, SLAC J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000 MIPS/$M GB/$M kB/s 24 How to Make Sense of Data? Know Your Limits & Be Smart More data Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest. Ultrascale Computations: Must be smart about which probe combinations to see! Petabytes Terabytes Physical Experiments: Must be smart about probe placement! Gigabytes Megabytes More analysis To see 1 percent of a petabyte at 10 megabytes per second takes: 35 8-hour days! 25 What Analysis Algorithms to Use? Even a simple big O analysis can elucidate simplicity. Analysis algorithms fail for a few gigabytes. Algorithmic Complexity: Calculate means O(n) Calculate FFT O(n log(n)) Calculate SVD O(r • c) Clustering algorithms O(n2) If n=10GB, then what is O(n) or O(n2) on a teraflop computers? 1GB = 109 bytes 1Tflop = 1012 op/sec Data size n Algorithm Complexity n nlog(n) n2 100B 10-10sec. 10-10 sec. 10-8 sec. 10KB 10-8 sec. 10-8 sec. 10-4sec. 1MB 10-6 sec. 10-5 sec. 1 sec. 100MB 10-4 sec. 10-3 sec. 3 hrs 10GB 10-2 sec. 0.1 sec. 3 yrs. For illustration chart assumes 10-12 sec. 26 (1Tflop/sec) calculation time per data point