ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)
Introduction to Data Mining
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business

Outline
• Introduction
  – Why data mining?
  – What is data mining?
  – Data mining process
• Types of Data Mining Tasks
• Main Data Mining Tools
• Reading
  – T2, Ch. 1

Why Business Intelligence Systems?
• Knowledge Management Problems (drowning in data, starving for knowledge)
  1. Can’t access data (easily). E.g., data from different branches, years, functional areas, etc.
  2. Give me only what’s important (knowledge). E.g., which products do customers tend to buy together?
  3. I need to reduce data to what’s important by slicing and dicing. E.g., by branch, product, year, etc.
  4. Data inconsistency and poor data quality. E.g., the 2001 PC sales amounts for SLC reported by the CFO and by the SLC Account Manager are not the same.
  5. Need to improve the practice of making informed decisions. E.g., did the VP for Marketing decide on the advertising budgets for branches in the SW region based on their sales performance over the last five years?
  6. The database is hard and slow to query. E.g., the VP for Marketing, the CFO and the Account Manager had to wait for the MIS Department to generate sales performance reports and analyses.
• ROI Problems
  7. Can I get more value out of my data? Ans: Make informed, potent decisions using knowledge extracted from integrated and consistent data over a long period of time.
  8. Can I do this cost-effectively?
  9. Can I easily scale up or change how I get knowledge out of my data? Options: manually versus automatically identifying knowledge.

Why data mining?
• OLAP can only provide shallow data analysis (the "what")
  – Ex: sales distribution by product
• Shallow data analysis is not sufficient to support business decisions (the "how")
  – Ex: how to boost sales of other products
  – Ex: when people buy product 6, what other products are they likely to buy? (cross selling)
• OLAP can only do shallow data analysis
  – OLAP is based on SQL:
      SELECT PRODUCTS.PNAME, SUM(SALESFACTS.SALES_AMT)
      FROM DBSR.PRODUCTS PRODUCTS, DBSR.SALESFACTS SALESFACTS
      WHERE PRODUCTS.PRODUCT_KEY = SALESFACTS.PRODUCT_KEY
      GROUP BY PRODUCTS.PNAME;
  – SQL is designed for retrieving and aggregating data, so complicated algorithms cannot easily be implemented in SQL.
• Complicated algorithms need to be developed to support deep data analysis: data mining

Why Data Mining?
Walmart (!?): Diaper + Beer = $$$ ?

Market Basket (Association Rule) Analysis
A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity.
Ex:
• a customer’s visit to a grocery store
• an online purchase from a virtual store such as ‘Amazon.com’

Market basket analysis is a common analysis run against a transaction database to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it.
Applications:
• improve the placement of items in a store
• the layout of mail-order catalog pages
• the layout of Web pages
• others?
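
The itemset counting described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the algorithm used by any particular data mining tool; the function name `count_itemsets`, its `size` and `min_count` parameters, and the toy baskets are all assumptions invented for this example.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, size=2, min_count=2):
    """Count how many transactions contain each itemset of the given size.

    Hypothetical helper for illustration: it enumerates every combination
    of items in every basket, which is only practical for small data.
    """
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each itemset has one canonical form.
        items = sorted(set(basket))
        for itemset in combinations(items, size):
            counts[itemset] += 1
    # Keep only itemsets that appear in at least min_count transactions.
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Toy transactions, made up for this sketch (not real sales data).
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "beer", "bread"],
    ["milk", "diapers"],
]

frequent_pairs = count_itemsets(baskets, size=2, min_count=2)
for itemset, n in sorted(frequent_pairs.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, n)
```

Each result pairs an itemset with the number of transactions that contain it, which matches the pattern format described above. Enumerating all combinations grows quickly with basket size, so real tools typically prune candidate itemsets instead of counting every one.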

Figure (star schema): the SALES fact table references the TIME, CUSTOMER and PRODUCT dimension tables.
• TIME: TIME_KEY (key), ORDERDATE, DAY_OF_WEEK, DAY_NUMBER_IN_MONTH, DAY_NUMBER_IN_YEAR, WEEK_NUMBER, MONTH, QUARTER, HOLIDAY_FLAG, FISCAL_YEAR, FISCAL_QUARTER
• CUSTOMER: CUSTOMER_KEY (key), CID, CNAME, STATE, CITY
• SALES: TIME_KEY, PRODUCT_KEY, CUSTOMER_KEY (keys), ORDER_NO, PRICE, QUANTITY, SALES
• PRODUCT: PRODUCT_KEY (key), PID, PNAME, PCNAME
• The degenerate key provides additional grouping of fact records
• Degenerate key: ORDER_NO
• Impractical to view market baskets using OLAP tools

Why data mining?
• OLAP results generated from data sets with a large number of attributes are difficult to interpret
  – Ex: cluster customers of my company (target marketing)
  – Pick two attributes related to a customer: income level and sales amount
  – Pick three attributes related to a customer: income level, education level and sales amount

What is data mining?
• Data mining is a process to extract hidden and interesting patterns from data.
• Data mining is a step in the process of Knowledge Discovery in Databases (KDD).

What is NOT Data Mining?
• Not SQL
  – SQL: extraction of detailed data
• Not OLAP
  – OLAP: summary, trends, forecasts
• Not magic
  – Data Mining: based on algorithms that can discover hidden patterns.
    It is interactive, not fully automated.

Major data mining tasks
• Association rule mining
  – e.g., to cross sell, identify other items that a customer tends to buy if the customer has already purchased item A
• Clustering
  – e.g., for target marketing, identify clusters of similar customers
• Classification
  – e.g., for fraud detection, identify which customer or transaction is fraudulent

Steps of the KDD Process
Figure (KDD process): Data -> Step 1: Selection -> Target Data -> Step 2: Cleaning -> Preprocessed Data -> Step 3: Transformation -> Transformed Data -> Step 4: Data Mining -> Patterns -> Step 5: Interpretation & Evaluation -> Knowledge
• Step 1: Select the columns (attributes) and rows (records) of interest to be mined.
• Step 2: Clean errors from the selected data.
• Step 3: Transform the data so that it is suitable for high-performance data mining.
• Step 4: Data mining.
• Step 5: Filter out non-interesting patterns from the data mining results.

Data mining – on what kind of data
• Transactional database
• Data warehouse
• Flat file
• Web data
  – Web content
  – Web structure
  – Web log

Figure (data warehousing and data mining pipelines):
• Data warehouse (DW) pipeline: Raw data -> Step 1: Acquisition -> Step 2: Selection -> Target data for DW -> Step 3: Cleaning & preprocessing -> Preprocessed data for DW -> Step 4: Transformation -> Transformed data for DW -> Data warehouse
• Data mining (DM) pipeline: Data warehouse -> Step 1: Selection -> Target data for DM -> Step 2: Cleaning & preprocessing -> Preprocessed data for DM -> Step 3: Transformation -> Transformed data for DM -> Step 4: Data mining -> Patterns -> Step 5: Interpretation & evaluation -> Discovered knowledge
• Other labels: OLAP & reporting, Interactive querying & report, Domain expert

Data Mining Tools
• Over 100 commercial data mining tools are available, and new entries keep arriving
• Tools offer a variety of functionality and features, making evaluation and comparison difficult

Evaluation Criteria
Figure labels (tool architecture): Server Side, Database or Flat files, Data Mover (Data Access), Data Mining Engine, Tool Manager (Often GUI), Visualization Tools, Client Side, End Users
1. System Requirements
2. Data Access
3. Mining Performance
4. User Interface
5.
Visualization

Data Mining Tools: Market Leaders
Class choice

Web Analytics Software Providers
• http://surfaid.dfw.ibm.com/web/home/index.html
• http://pro.blogger.com/
• http://www.clickstream.com/
• http://www.deepmetrix.com/index.asp?source=google&keyword=web+analytics
• http://www.eloqua.com/srch/analytics.asp
• http://www.intellitracker.com/
• http://www.maxamine.com/
• http://www.mediahouse.com/
• http://www.netiq.com/webtrends/default.asp
• http://www.omniture.com/products.html
• http://www.sitebrand.com/?source=jan
• http://www.statsoftinc.com/
• http://www.urchin.com/
• http://www.webabacus.com/
• http://www.websidestory.com/
• http://www.databeacon.com/index_IE.html
• http://www.sane.com/ads/whoiscoming.html