
CSE591 (575) Data Mining
1/21/2003 - 5/6/2003
Computer Science & Engineering, ASU

Introduction
  - Introduction to this Course
  - Introduction to Data Mining

Introduction to the Course
  - First, about you: why take this course?
  - Your background and strengths
      - AI, DBMS, Statistics, Biology, …
  - Your interests and requests
  - What is this course about?
      - Problem solving
      - Handling data
          - transform data to workable data
      - Mining data
          - turn data to knowledge
          - validation and presentation of knowledge

This course
  - What can you expect from this course?
      - Knowledge and experience about DM
      - Problem solving and solution presentation
  - How is this course conducted?
      - Presentations
      - Individual projects
  - Course Format
      - Individual projects 40%
      - Exams and/or quizzes 40%
      - Class participation 20%
          - off-campus students?

Projects - Start NOW!
  - How to start?
  - Projects should be sufficiently challenging but reasonable, suitable for one semester
  - How to choose your individual project
      - Real-world problems
      - Problems that might make a difference
  - Two types of projects
      - Available projects
      - Self-proposed projects (approval needed)

Some project ideas
  - Dealing with high dimensional data
      - Data integration, data reduction (random projection)
  - Image mining
      - Feature extraction, clustering of images
  - Active sampling
      - Data of supervised, unsupervised learning
  - Various data structures (kd-trees, R-trees, Multi-Dimensional Scaling)
  - Meta data (RDF, namespace) for mining
  - Ensemble learning
  - Sequence mining (HMM learning)
  - Bioinformatics and applications (feature selection)
  - Intelligent driving data analysis

How is a project evaluated?
  - It depends on
      - What you want to achieve
      - Its impact
      - Your effort
  - The sooner you start, the better
      - The beginning is not easy

Course Web Site
  - http://www.public.asu.edu/~huanliu/cse591.html
  - My office and office hours
      - GWC 342
      - T 10:30 - 11:30am and Th 4:00 - 5:00pm
  - My email: [email protected]
  - Slides and relevant information will be made available at the course web site

Any questions and suggestions?
  - Your feedback is most welcome!
      - I need it to adapt the course to your needs.
      - Please feel free to provide yours anytime.
  - Share your questions and concerns with the class; very likely others have the same ones.
  - No pain, no gain: there is no magic for data mining.
      - The more you put in, the more you get
      - Your grades are proportional to your efforts.

Introduction to Data Mining
  - Definitions
  - Motivations of DM
  - Interdisciplinary Links of DM

What is DM?
  - Or more precisely KDD (knowledge discovery from databases)?
  - Many definitions
  - A process, not plug-and-play
      - raw data -> transformed data -> preprocessed data -> data mining -> post-processing -> knowledge
  - One definition is
      - A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data

Need for Data Mining
  - Data accumulate and double every 9 months
  - There is a big gap between stored data and knowledge, and the transition won't occur automatically.
  - Manual data analysis is not new, but it is a bottleneck
  - Fast-developing Computer Science and Engineering generates new demands
  - Seeking knowledge from massive data
      - Any personal experience?

When is DM useful
  - Data rich
      - Two invited talks so far have convincingly demonstrated it
  - Large data (dimensionality and size)
      - Image data (size)
      - Gene data (dimensionality)
  - Little knowledge about data (exploratory data analysis)
      - What if we have some knowledge?

DM perspectives
  - Prediction, description, explanation, optimization, and exploration
  - Completion of knowledge (patterns vs. models)
  - Understandability and representation of knowledge
  - Some applications
      - Business intelligence (CRM)
      - Security (Info, Comp Systems, Networks, Data, Privacy)
      - Scientific discovery (bioinformatics)

Challenges
  - Increasing data dimensionality and data size
  - Various data forms
  - New data types
      - Streaming data, multimedia data
  - Efficient search and data access
  - Intelligent update and integration

Interdisciplinary Links of DM
  - Statistics
  - Databases
  - AI
  - Machine Learning
  - Visualization
  - High Performance Computing
      - supercomputers, distributed/parallel/cluster computing

Statistics
  - Discovery of structures or patterns in data sets
      - hypothesis testing, parameter estimation
  - Optimal strategies for collecting data
      - efficient search of large databases
  - Static data
      - constantly evolving data
  - Models play a central role
      - algorithms are a major concern
      - patterns are sought

Relational Databases
  - A relational database can contain several tables
      - Tables and schemas
  - The goal in data organization is to maintain data and quickly locate the requested data
      - Queries and index structures
  - Query execution and optimization
      - Query optimization is to find the best possible evaluation method for a given query
  - Providing fast, reliable access to data for data mining

AI
  - Intelligent agents
      - Perception-Action-Goal-Environment
  - Search
      - uniform cost and informed search algorithms
  - Knowledge representation
      - FOL, production rules, frames with semantic networks
  - Knowledge acquisition
  - Knowledge maintenance and application

Machine Learning
  - Focusing on complex representations, data-intensive problems, and search-based methods
  - Flexibility with prior knowledge and collected data
  - Generalization from data and empirical validation
      - statistical soundness and computational efficiency
      - constrained by finite computing & data resources
  - Challenges from KDD
      - scaling up, cost info, auto data preprocessing

Visualization
  - Producing a visual display with insights into the structure of the data with interactive means
      - zoom in/out, rotating, displaying detailed info
  - Various branches of visualization methods
      - show summary properties and explore relationships between variables
      - investigate large databases and convey lots of information
      - analyze data with geographic/spatial location
  - A pre- and post-processing tool for KDD

Bibliography
  - W. Klosgen & J.M. Zytkow (eds.), 2001, Handbook of Data Mining and Knowledge Discovery.
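
As an illustration of the KDD chain listed under "What is DM?" (raw data, transformed data, preprocessed data, data mining, post-processing, knowledge), here is a minimal Python sketch. It is not from the course material: the toy basket data, the function names, and the choice of item-pair counting as the mining step are all assumptions made only to show that each stage is a separate step in a process rather than one plug-and-play call.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each raw record is a comma-separated basket of items.
RAW_DATA = [
    "bread, milk",
    "bread, butter, milk",
    "",                      # empty record, removed during preprocessing
    "butter, milk",
    "bread, milk, eggs",
]

def transform(raw_records):
    """raw data -> transformed data: parse each string into a set of items."""
    return [
        {item.strip() for item in record.split(",") if item.strip()}
        for record in raw_records
    ]

def preprocess(baskets):
    """transformed -> preprocessed data: drop baskets too small to hold a pair."""
    return [basket for basket in baskets if len(basket) >= 2]

def mine(baskets):
    """data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def postprocess(pair_counts, n_baskets, min_support):
    """post-processing: keep only pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

def kdd(raw_records, min_support=0.5):
    """Run the whole chain and return patterns sorted by descending support."""
    baskets = preprocess(transform(raw_records))
    patterns = postprocess(mine(baskets), len(baskets), min_support)
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    # knowledge: present the surviving patterns in a readable form
    for (a, b), support in kdd(RAW_DATA):
        print(f"{a} and {b} occur together in {support:.0%} of baskets")
```

Keeping the stages as separate functions means each one can be swapped out, for example a different mining step or a stricter `min_support`, without touching the rest of the chain.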
