Databases & Data Mining
Types of database systems
How are they related to data mining

Contemporary Database
• Gain competitive advantage
  – customer information systems
    • data mining
• Develop and market new products
    • micromarketing

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

Systems
• Database
  – Personal, small business level
• On-Line Analytic Processing (OLAP)
  – Ability to use many dimensions, reports & graphics
• Data Mart
  – Usually temporary analysis
• Data Warehouse
  – Usually permanent repository

Data Warehousing
Price Waterhouse definition:
A data warehouse is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. The data warehouse provides a unified repository of consistent data for decision making that is subject oriented, integrated, time variant, and nonvolatile.

Data Warehousing
• Provide business users views of data appropriate to mission
• Consolidate & reconcile data
• Give macro views of critical aspects
• Timely & detailed access to information
• Provide specific information to groups
• Ability to identify trends

Data Warehousing
Price Waterhouse:
Not just a technology; an architecture and process designed to support decision making
special-purpose database systems to improve query performance significantly
index, partition, pre-aggregate data

Data Warehousing
Beyond OLAP:

Data warehouse OLAP              | On-Line Transactional Processing
summary data                     | detailed operational data
few users                        | many concurrent users
data driven                      | transaction driven
effectiveness                    | efficiency
use EIS, spreadsheets to access  |

Data Marts
• Intermediate-level database system
• Often used as temporary storage
  – Gather data for study from data warehouse, other sources (including external)
  – Clean & transform for data mining

OLAP
• Multidimensional spreadsheet
• Hypercube – term to reflect ability to sort on many dimensions
• Many forms
  – MOLAP – multidimensional
  – ROLAP – relational (uses SQL)
  – DOLAP – desktop
  – WOLAP – web enabled
  – HOLAP – hybrid

Key Concepts
• Scalability
  – Ability to accurately cope with changing conditions (especially magnitude of computing)
• Granularity
  – Level of detail
    • Data warehouse – tends to be fine granularity
    • OLAP – tends to aggregate to coarse granularity

Data Warehouse Implementation
• Reliable, comprehensive source of clean data
  – Accurate, complete, in correct format
• Processes
  – System development
  – Data acquisition
  – Data extraction for use

Data Warehouse Generation
• Extract data from sources
• Transform
• Clean
• Load into data warehouse
  – 60-80% of effort in operating data warehouse

Data Extraction Routines
• Interpret data formats
• Identify changed records
• Copy information to intermediate file
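
The three extraction steps above (interpret the format, identify changed records, copy to an intermediate file) can be sketched roughly as follows. This is a minimal illustration, not a method taken from the slides: the CSV layout, the `item_id` key, the hash-based change detection, and all function names are assumptions made for the example.

```python
import csv
import hashlib
import io


def row_fingerprint(row):
    """Stable hash of a record's contents, used to detect changes (assumed technique)."""
    joined = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def extract_changed(source_rows, previous_fingerprints, key="item_id"):
    """Return (changed_rows, new_fingerprints) for one extraction run."""
    changed = []
    fingerprints = {}
    for row in source_rows:
        fp = row_fingerprint(row)
        fingerprints[row[key]] = fp
        # A record is "changed" if it is new or its contents differ from last run.
        if previous_fingerprints.get(row[key]) != fp:
            changed.append(row)
    return changed, fingerprints


def write_intermediate(rows, fieldnames, out):
    """Copy the changed records to an intermediate file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical source data, invented for illustration only.
SOURCE_CSV = "item_id,store,price\n1,A,2.50\n2,A,4.00\n3,B,1.25\n"

rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))  # interpret the data format
_, seen = extract_changed(rows, {})                   # first run: every record is new
rows[1]["price"] = "4.25"                             # simulate an update at the source
changed, seen = extract_changed(rows, seen)           # second run: only the update

buffer = io.StringIO()                                # stands in for the intermediate file
write_intermediate(changed, ["item_id", "store", "price"], buffer)
```

Hashing whole records is only one way to spot changes; source systems that keep update timestamps or change logs could be compared on those instead.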

Data Transformation
• Consolidate data from multiple sources
• Filter to eliminate unnecessary details
• Clean data
  – eliminate incorrect entries
  – eliminate duplications
• Convert & translate data into proper format
• Aggregate data as designed

Data Management
• Retrieve information
• Extraction programs
• Problems:
  – Required data not available
  – Initial data warehouse scope too broad
  – Not enough time to do prototyping, or needs analysis
  – Insufficient senior direction

Meta Data
• Data to keep track of data
• Life cycle:
  – Manage meta data
  – Design data warehouse
  – Ensure data quality
  – Manage system during operations

Business Meta Data
• What data are available
• Source of each data element
• Frequency of data updates
• Location of specific data
• Predefined reports & queries
• Methods of data access

Technical Meta Data
• Data source
  – (internal or external)
• Data preparation features
  – (transformation & aggregation rules)
• Logical structure of data
• Physical structure & content
• Data ownership
• Security aspects
  – (access rights, restrictions)
• System information
  – (date of last update, retention policy, data usage)

Wal-Mart’s Data Warehouse
• Heavy user of IT
• Core competency – supply chain distribution
  – 2900 outlets
  – Data warehouse of 101 terabytes ($4 billion)
  – 65 million transactions per week
  – Subject-oriented, integrated, time-variant, nonvolatile data
  – 65 weeks of data by item, store, day

Wal-Mart
• Use data warehouse to:
  – Support decision making
  – Buyers, merchandisers, logistics, forecasters
  – 3,500 vendor partners can query
  – Can handle 35 thousand queries per week
    • Benefit $12,000 per query
    • Some users about 1 thousand queries per day

Summers Rubber Company
• Distribution firm
  – 7 operating locations
  – 10,000 items
  – 3,000 customers
• Old system:
  – OLAP
  – Databases transactional & summarized, distributed

Summers Data Storage System
• Built in-house, PCs, Access database
• Visual Basic & Excel
• Distributed system
  – Data warehouse server controlled queries, managed resources
• Security
  – Passwords gave some protection
  – To protect from leaving employees, used data marts with small versions of central database

Summers
• Move from transactional databases to new system
• Small prototype, iterative feedback from users
• Data came from many sources
• Scrubbing data
  – Reformatting (time units, scales, currency measures, etc.)

Summers – Negative features
• Too much disk space on user local drives
• Often difficult to understand & use
• Updating multiple data sites slow, limited access
• Summary data often wrong
• Couldn’t use data mining tools
  – Problem was aggregated data stored

Comparison

Product    | Use               | Duration   | Granularity
Warehouse  | Repository        | Permanent  | Finest
Mart       | Specific study    | Temporary  | Aggregate
OLAP       | Report & analysis | Repetitive | Summary
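
The granularity column of the comparison can be made concrete with a small sketch. It rolls fine-grained records (item, store, day) up to coarser summaries, which also hints at why a system that stores only aggregated data is hard to mine: the detail cannot be recovered from the totals. The records, field names, and the `aggregate` helper are all invented for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical finest-granularity records: one row per item, store, and day.
transactions = [
    {"item": "hose", "store": "S1", "day": "day1", "units": 3},
    {"item": "hose", "store": "S1", "day": "day2", "units": 5},
    {"item": "hose", "store": "S2", "day": "day1", "units": 2},
    {"item": "belt", "store": "S1", "day": "day1", "units": 4},
    {"item": "belt", "store": "S2", "day": "day2", "units": 1},
]


def aggregate(records, dims):
    """Sum units over the chosen dimensions, discarding all other detail."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dims)
        totals[key] += record["units"]
    return dict(totals)


# Coarser views, as a data mart or OLAP summary might hold them.
by_item_store = aggregate(transactions, ("item", "store"))
by_item = aggregate(transactions, ("item",))

print(by_item_store)
print(by_item)
```

Going from `transactions` to `by_item` is a one-way step: five detailed rows collapse into two totals, so any analysis that needs store-level or day-level patterns has to start from the finer data.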

Examples of Data Uses
• Customer information systems
• Fingerhut

Customer Information Systems
• Massive databases
• Detailed information about individuals and households
• Use automated analysis
  – identify focused market target

Micromarketing
• Target small groups of highly responsive customers
• Own niches like smaller competitors
• EXAMPLES:
  – Great Atlantic & Pacific Tea Company (A&P)
    • target customers, centralize buying
  – Fingerhut
    • sell on credit to households <$25,000 income

Media Companies
• R. R. Donnelley & Sons
  – world’s largest printer
  – provide consumer & life-style data
  – customized individual publications
• Mass marketing has become less effective
• Profit in developing niche-oriented strategy
• Need marketing information technology

Information Overload
• Retail food (groceries)
  – average store - 20,000 items
    • larger stores 40,000 to 60,000;
    • with weights, flavors, etc., hundreds of thousands
  – every year 10,000 new items
  – 550 corporate and regional buying offices
  – 100,000 salespeople
  – several hundred thousand price changes/year

Information Overload
• Grocery data collection
  – point-of-sale scanning
  – used to allocate shelf space
  – used to optimize product mix
  – control inventories
  – avoid shortages

Customer Information Systems
• tens of thousands of characters of information
• tens of millions of customers
• enormous data storage
  – hundreds of gigabytes
• parallel computing
• YOU HAVE TO BE BIG TO AFFORD

Customer Information Systems
• USES
  – adjust prices
  – see new product possibilities
  – develop promotions
  – personalized advertising

Customer Information Systems
• OPERATION
  – artificial intelligence
    • neural networks to wade through data
    • identify shopping trends
    • segment groups of customers

Customer Information Systems
• AIRLINE INDUSTRY
  – 1980s - deregulation
  – number of possible fares & rates skyrocketed
  – SABRE - 45 million fares, 40 million changes/month
  – industry now dominated by American (SABRE) & United (Apollo)
  – cost - hundreds of millions of dollars

Own the Customer
• A&P
  – point-of-sale scanning
  – frequent shopper programs
    • used to build customer database
    • sign up, get free bonus saver cards, check cashing, hundreds of special discounts
    • A&P gathers list of purchases, feeds database
  – centralized buying, better inventory, advertising

Versioning
• Assemble hundreds of versions of the same ad
• Switch & reassemble products & prices
• Cigarette makers
  – some of most advanced database marketing
  – direct mail, discount coupons, freebies
  – have built databases on smoker demographics
  – anticipate market changes, target promotions

Versioning
• FINGERHUT
  – 150 catalog mailings in 1992
  – based on statistically predicted consumer response
  – 13 million customers, 14% annual growth
  – database captures 1400 pieces of information about a household
    • demographics, purchasing histories

FINGERHUT
• identify your kids’ birthdays, send ideas
  – FRONT-END programs
    • get new customers (purchased from others)
  – TRANSITION programs
    • evaluate new purchasers, keep best
  – BACK-END programs
    • maximize profit

FINGERHUT
• FRONT-END
  – newspaper, magazines, TV, postcards, catalogs
  – predictive models
  – lists from other companies
  – if you respond
• TRANSITION
  – sort out good credit risks, good purchasers

FINGERHUT
• BACK-END
  – 80% of revenue from repeat customers
  – customers segmented
    • 75 specialty catalogs
    • personalized messages

Marketing Budgets
• Saturated advertising channels
  – expenditures more than doubled in 1980s
  – too much advertising, too little relevant
• Shift to
  – promotional discounts
  – slotting - buy shelf space
  – undermines brand loyalty

Narrowcasting
• Cable TV
• In-store coupons
• Special monitors
  – doctors’ offices, airport lounges
• Interactive kiosks
• Interactive home TV shopping

R.R. Donnelley & Sons
• Will manage customer’s database
• Supply consumer data
• Identify market segments
• Printing
  – Farm Journal - 8000 different editions/month
  – tailored editorial & advertising content

Customer Information Systems
• Barriers to competition
• Cost up to $100 million to develop
• Years to gather data and build
• Basic shift in source of competitive advantage