Data Mining
Kathy S Schwaig

Outline
• Motivation
• Definitions
• Techniques
• Applications
Portions of this presentation are adapted from J. Han, Simon Fraser University, Canada.

Motivation
Data found in data warehouses is not, by itself, of great intrinsic value. Value comes from the knowledge that can be discovered from data. What do you do with it?

Data Volume Problems
• Magnitude of data due to machine-readable text disseminated across networks.
• Difficult to distill information for analysis.
• Tools needed to “mine” information to bring out key, relevant facts.
• Users need to rapidly filter and assimilate useful information from a variety of data sources.

Data Mining
• The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• Extraction of hidden, predictive information from large databases.
• Provides answers to questions a decision maker had previously not thought to ask.

Data Mining
Search for relationships, patterns, and trends that, prior to the search, were not known to exist or were not visible.
• “Find related buying patterns.”
• “There is a pattern that occurs X% of the time: when someone buys window coverings (not shades, blinds, or other specifics) and within 1 to 3 months buys linens, they buy furniture within the next 4 months.”
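The “occurs X% of the time” phrasing in the buying-pattern example can be made concrete with two standard measures: support (how often a set of items appears at all) and confidence (how often the conclusion holds when the premise does). The following is a minimal sketch, not the method used in the presentation. The purchase histories are invented for illustration, and the sketch ignores the 1-to-3-month and 4-month time windows mentioned in the quote.

```python
# Hypothetical purchase histories: one set of product categories per customer.
BASKETS = [
    frozenset({"window coverings", "linens", "furniture"}),
    frozenset({"window coverings", "linens"}),
    frozenset({"window coverings", "linens", "furniture"}),
    frozenset({"linens", "furniture"}),
    frozenset({"window coverings"}),
]


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    if not baskets:
        return 0.0
    wanted = frozenset(items)
    return sum(1 for basket in baskets if wanted <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction that also contain `consequent`."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(baskets, both) / base


if __name__ == "__main__":
    premise = {"window coverings", "linens"}
    conclusion = {"furniture"}
    print("support of premise:", support(BASKETS, premise))
    print("confidence of rule:", confidence(BASKETS, premise, conclusion))
```

In this invented data, three of the five customers bought both window coverings and linens, and two of those three also bought furniture, so the rule holds for two thirds of the customers it applies to. That fraction is the “X%” in the quoted pattern.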
Data Mining
[Figure: stages labeled Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Data Mining and Business Intelligence
[Figure: layers ordered by increasing potential to support business decisions, listed here from top to bottom]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA.

Data Mining Analysis Techniques: Examples
• Characterization
• Association
• Classification
• Prediction
• Clustering (Data Segmentation)

Characterization
• Demographics: address, income, recreational equipment ownership, etc.
• Psychographics: lifestyle/personality characteristics like “highly protective of children; impulsive shopper”
• Technographics (web-based): attributes of your computer system, such as browser, operating system, modem speed, etc.

Association
• Occurrences linked to a single event.
• Identify items that are likely to be purchased or viewed in the same session (web).
• Example: Amazon.com. Customers who bought Grapes of Wrath also bought Great Gatsby.

Classification
• Recognize patterns that describe the group to which an item belongs by examining existing items that have already been classified and inferring a set of rules.
• Example: Credit card companies have discovered the characteristics of customers likely to leave and have built a model to help predict who will leave in the future.

Prediction
• Guesses an unknown value, such as income, when you know other things about a person.
• Example: lifetime monetary value.
• Often used with demographic data to fill in blank information. For example, we know someone’s address, car preference, and job title but not their income. We can look at others with similar characteristics and infer the missing income figure from their data.

Clustering
• Identify people who share common characteristics.
• A way of identifying differing groups within the data.

Patterns
• Scuba gear and Australian vacations
• Skim milk and whole wheat bread
• AT&T’s stock rises at least 2% after every 3-day slump in the Dow

Camelot Music Inc.
• Discovered what appeared to be a curious purchasing trend.
• The music retailer’s 493 stores were selling a lot of rap and alternative CDs to people older than 65.

Are All the “Discovered” Patterns Interesting?
A data mining query may generate thousands of patterns. Are they interesting? Why or why not?
A pattern is interesting if it:
• is easily understood by humans
• is valid on new or test data with some degree of certainty
• is potentially useful
• is novel
• validates some hypothesis that a user seeks to confirm

Applications: MCI
• How do you find the customers you want to keep from among the millions?
• Comb marketing data on 140 million households, each evaluated on as many as 10,000 attributes (e.g. income, lifestyle, and details about past calling habits).
• But which set of those attributes is the most important to monitor, and within what range of values?

MCI
• MCI, using an IBM SP/2 supercomputer as its data warehouse, has identified the variables it finds most telling about its customers and, from those, compiled a set of 22 very detailed and highly confidential statistical customer profiles, none of which could have been developed without data mining programs.

Wal-Mart
• Point-of-sale transaction data is captured at each retail store and transmitted to Wal-Mart’s Arkansas data warehouse.
• Over 3,500 independent suppliers have online access to information about their respective products in that data warehouse.
• Suppliers may query the data to analyze trends by item and store, use that information to find the products that need replenishment, and thus get the right products to each store on time.

Data Mining Should Not Be Used Blindly!
• Data mining finds regularities from history, but history is not the same as the future.
• Association does not dictate trend or causality!
• “Drinking diet drinks leads to obesity!”
• David Heckerman’s counter-example (1997): barbecue sauce, hot dogs, and hamburgers.

Web Mining: Lots To Be Done!
Types of Web mining
• Web usage mining: which page or graphic was served (URL), linked to date, time, and browser information.
• Web content mining: how visitors are responding to your content (which links they select, where they spend time, which search terms they use, where they browse).
Other than managers, who could REALLY use this information?

Challenges to Web Mining
• The Web: a huge, widely distributed, highly heterogeneous, semistructured, interconnected, evolving hypertext/hypermedia information repository.
• Problems:
  • the “abundance” problem
  • limited coverage of the Web (hidden Web sources)
  • limited query interface: keyword-oriented search
  • limited customisation to individual users
• DBMS and data miners will play an increasingly important role in the new generation of the Internet.

Summary
• Need for data mining
• Approaches
• Problems
• Applications
• Web data mining

Appendix: Market Analysis and Management
• Data sources: credit card transactions, loyalty cards, discount coupons, customer complaint calls, studies.
• Target marketing: clusters of “model” customers who share the same characteristics (interest, income level, spending habits, etc.).
• Customer purchasing patterns: conversion of a single bank account to a joint account (marriage, etc.).
• Cross-market analysis: associations/correlations between product sales; prediction based on the association information.

Appendix: Market Analysis and Management (Cont’d)
• Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification).
• Customer requirements: identify the best products for different customers; use prediction to find what factors will attract new customers.
• Summary information: multi-dimensional summary reports; statistical summary information.

Appendix: Corporate Analysis and Risk Management
• Finance planning and asset evaluation:
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning: summarize and compare resources and spending.
• Competition:
  • Monitor competitors and market directions.
  • Segment customers into classes with a class-based pricing procedure.
  • Set pricing strategy in a highly competitive market.

Appendix: Fraud Detection and Management
• Applications: widely used in health care, retail, credit card services, and telecommunications (phone card fraud).
• Approach: use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances.
• Examples:
  • Auto insurance: detect a group of people who stage accidents to collect insurance.
  • Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
  • Medical insurance: detect professional patients, rings of doctors, and rings of references.

Appendix: Fraud Detection and Management (Cont’d)
• Telephone fraud:
  • Telephone call model: destination of call, duration, time of day or week.
  • Analyze patterns that deviate from the expected norm.
  • British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.

Appendix: Other Applications
• Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
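Mining Web access logs, as described for Surf-Aid and under Web usage mining, starts with turning raw log lines into countable records. The sketch below is illustrative only and says nothing about how Surf-Aid itself works. It assumes logs in the widely used Common Log Format, and the sample lines, hosts, and page paths are invented.

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical log lines; the last one is deliberately malformed.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /products/scuba.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:02 -0700] "GET /travel/australia.html HTTP/1.0" 200 4100',
    '10.0.0.2 - - [10/Oct/2000:14:01:10 -0700] "GET /products/scuba.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:14:01:40 -0700] "GET /missing.html HTTP/1.0" 404 512',
    'not a log line',
]


def successful_requests(lines):
    """Yield (host, path) for every well-formed line with status 200."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200":
            yield match.group("host"), match.group("path")


def page_hits(lines):
    """Count successful requests per page."""
    return Counter(path for _, path in successful_requests(lines))


def pages_by_visitor(lines):
    """Map each visitor (here approximated by host) to the set of pages viewed."""
    visits = {}
    for host, path in successful_requests(lines):
        visits.setdefault(host, set()).add(path)
    return visits


if __name__ == "__main__":
    print(page_hits(SAMPLE_LOG).most_common())
    print(pages_by_visitor(SAMPLE_LOG))
```

The per-visitor page sets produced by `pages_by_visitor` have the same shape as the baskets used in association analysis, so they could feed a search for pages that tend to be viewed together. Treating a host address as one visitor is a simplification; real analyses usually need sessions or cookies.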
Appendix: Decision Support and OLAP
• DSS: information technology that helps the knowledge worker (executive, manager, analyst) make faster and better decisions.
  • What were the sales volumes by region and product category for the last year?
  • How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
  • Will a 10% discount increase sales volume sufficiently?
• OLAP: on-line analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multi-dimensional databases. An element of a decision support system.
• Data mining is a powerful, high-performance data analysis tool for decision support.
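The first decision-support question above (sales volumes by region and product category for the last year) is a typical OLAP roll-up: pick some dimensions, optionally filter on another, and sum the measure. The sketch below shows the idea in plain Python rather than in any particular OLAP product. The fact rows, dimension names, and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, category, year, sales_volume)
DIMENSIONS = ("region", "category", "year")
SALES = [
    ("East", "Hardware", 1999, 120),
    ("East", "Software", 1999, 80),
    ("West", "Hardware", 1999, 200),
    ("West", "Hardware", 1998, 150),
    ("East", "Hardware", 1999, 30),
]


def roll_up(rows, group_by, year=None):
    """Sum sales volume over the chosen dimensions, optionally for one year only."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(int)
    for row in rows:
        if year is not None and row[2] != year:
            continue  # filter (slice) on the year dimension
        key = tuple(row[pos] for pos in positions)
        totals[key] += row[3]
    return dict(totals)


if __name__ == "__main__":
    # Sales volume by region and product category for one year.
    print(roll_up(SALES, ("region", "category"), year=1999))
    # Rolling up further: drop the category dimension.
    print(roll_up(SALES, ("region",), year=1999))
```

Changing the `group_by` tuple is the programmatic equivalent of navigating a multi-dimensional view: fewer dimensions give a coarser summary, more dimensions give a finer one.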