Data Mining: Introduction
CENG 514 Spring 2011

• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Definition by Gartner Group
• “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

• (Deductive) query processing
• Expert systems or small ML/statistical programs

• The Explosive Growth of Data: from terabytes to petabytes
  – Data collection and data availability: Automated data collection tools, database systems, Web, computerized society
• Data is everywhere, information is nowhere
• Market: From focus on product/service to focus on customer
• IT: From focus on up-to-date balances to focus on patterns in transactions
  – Data Warehouses
  – OLAP
• Increase in complexity of data

[Figure: Data Mining and related fields: Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, Algorithms]

Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started ‘89
  – Now a conference under the auspices of ACM SIGKDD
  – IEEE conference series started 2001

A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  – Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
• More conferences on data mining
  – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

• Market Analysis, Customer Relationship Management (CRM)
• Churn Analysis
• Risk Analysis and Management
• Fraud Detection, Counter Terrorism
• Network Intrusion Detection
• Web Site Restructuring
• Recommendation
• Scientific Applications

Corporate Analysis & Risk Management
• Finance planning and asset evaluation
  – cash flow analysis and prediction
  – contingent claim analysis to evaluate assets
  – cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
  – summarize and compare the resources and spending
• Competition
  – monitor competitors and market directions
  – group customers into classes and a class-based pricing procedure
  – set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
  – Auto insurance: ring of collisions
  – Money laundering: suspicious monetary transactions
  – Medical insurance
    • Professional patients, ring of doctors, and ring of references
    • Unnecessary or correlated screening tests
  – Telecommunications: phone-call fraud
    • Phone call model: destination of the call, duration, time of day or week.
      Analyze patterns that deviate from an expected norm
  – Anti-terrorism

Example: Use in retailing
• Goal: Improved business efficiency
  – Improve marketing (advertise to the most likely buyers)
  – Inventory reduction (stock only needed quantities)
• Information source: Historical business data
  – Example: Supermarket sales records
  – Size ranges from 50k records (research studies) to terabytes (years of data from chains)
  – Data is already being warehoused
• Sample question
  – what products are generally purchased together?
• The answers are in the data, if only we could see them

Example: Churn Analysis
• Business Problem: Prevent loss of customers, avoid adding churn-prone customers
• Solution: Use neural nets, time series analysis to identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers
• Benefit: Retention of customers, more effective promotions

Example: Clicks to Customers
• Business problem: 50% of Dell’s clients order their computer through the web. However, the retention rate is 0.5%, i.e. only 0.5% of visitors to Dell’s web page become customers.
• Solution Approach: Through the sequence of their clicks, cluster customers and design website, interventions to maximize the number of customers who eventually buy.
• Benefit: Increase revenues

What Can Data Mining Do?
• Cluster
• Classify
  – Categorical, Regression
• Summarize
  – Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  – Association rules
• Sequence analysis
  – Time-series analysis, Sequential associations
• Detect Deviations

Clustering
• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g.
between travel profiles) while conceptual techniques use background concepts and logical descriptions
“Group people with similar travel profiles”
  – George, Patricia
  – Jeff, Evelyn, Chris
  – Rob

Classification
• Find ways to separate data items into pre-defined groups
• Requires “training data”: Data items where group is known
“Route documents to most likely interested parties”
  – English or non-English?
  – Domestic or Foreign?

Association Rules
• Identify dependencies in the data:
  – X makes Y likely
• Indicate significance of each dependency
“Find groups of items commonly purchased together”
  – People who purchase fish are extraordinarily likely to purchase wine
  – People who purchase turkey are extraordinarily likely to purchase cranberries

Sequential Associations
• Find event sequences that are unusually likely
“Find common sequences of warnings/faults within 10 minute periods”
  – Warn 2 on Switch C preceded by Fault 21 on Switch B
  – Fault 17 on any switch preceded by Warn 2 on any switch

Recommendation Techniques
• Given database of user preferences, predict preference of new user
• Example:
  – Predict what new movies you will like based on
    • your past preferences
    • others with similar past preferences
    • their preferences for the new movies
  – Predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer)

Knowledge Discovery in Databases: Process
[Figure: KDD process: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al.
(Eds.), AAAI/MIT Press

Data Mining and Business Intelligence
[Figure: layered diagram, with increasing potential to support business decisions toward Making Decisions. Layers as listed:
  – Making Decisions
  – Data Presentation: Visualization Techniques
  – Data Mining: Information Discovery
  – Data Exploration: Statistical Analysis, Querying and Reporting
  – Data Warehouses / Data Marts: OLAP, MDA
  – Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
 Roles labelled alongside the layers: End User, Business Analyst, Data Analyst, DBA]

• Learning the application domain
  – relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
  – Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  – summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  – visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

• Mining methodology
  – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  – Performance: efficiency, effectiveness, and scalability
  – Pattern evaluation: the interestingness problem
  – Incorporation of background knowledge
  – Handling noise and incomplete data
  – Parallel, distributed and incremental mining methods
  – Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
  – Data mining query languages and ad-hoc mining
  – Expression and visualization of data mining results
  – Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  – Domain-specific data mining & invisible data mining
  – Protection of data security, integrity, and privacy

(From J. Ullman’s Notes)
• A big data-mining risk is that you will “discover” patterns that are meaningless.
• Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless results.
• When looking for a property make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.”

• Joseph Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.
• He devised (something like) an experiment where subjects were asked to guess 10 hidden cards (red or blue).
• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What did he conclude?
  – He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
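
The arithmetic behind the Rhine example can be checked directly. With 10 red/blue cards, a subject who guesses at random gets all 10 right with probability (1/2)^10 = 1/1024, which is the "almost 1 in 1000" on the slide. The sketch below is only an illustration under the assumption that nobody has ESP and every guess is a fair coin flip; the function names (`chance_of_perfect_score`, `simulate_rhine`) and the subject count of 100,000 are made up for this example and do not come from the original notes.

```python
import random


def chance_of_perfect_score(num_cards: int = 10) -> float:
    """Probability of guessing every red/blue card correctly by pure chance."""
    return 0.5 ** num_cards


def simulate_rhine(num_subjects: int, num_cards: int = 10, seed: int = 0):
    """Return (perfect scorers in round one, perfect scorers among them in round two).

    Assumption: every subject guesses at random, i.e. nobody has ESP.
    """
    rng = random.Random(seed)

    def guesses_all_cards() -> bool:
        # Each guess is right with probability 1/2, independently of the others.
        return all(rng.random() < 0.5 for _ in range(num_cards))

    first_round = sum(guesses_all_cards() for _ in range(num_subjects))
    # Only the "gifted" subjects from round one are retested.
    second_round = sum(guesses_all_cards() for _ in range(first_round))
    return first_round, second_round


if __name__ == "__main__":
    p = chance_of_perfect_score()  # 1/1024
    first, second = simulate_rhine(100_000)
    print(f"chance of 10/10 by luck: {p:.6f}")
    print(f"expected lucky subjects out of 100000: {100_000 * p:.1f}")
    print(f"simulated: {first} perfect in round one, {second} perfect again")
```

Under these assumptions roughly 98 of 100,000 random guessers look "gifted" in the first test, and each of them again has only a 1/1024 chance of repeating the feat, so almost all of them "lose" their ESP on the retest. This is Bonferroni's principle in miniature: test enough subjects and pure chance produces the pattern.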