Databases & Data Mining
Types of database systems
How are they related to data mining
3-2
Contemporary Database
• Gain competitive advantage
– customer information systems
• data mining
• Develop and market new products
• micromarketing
McGraw-Hill/Irwin
©2007 The McGraw-Hill Companies, Inc. All rights reserved
3-3
Systems
• Database
– Personal, small business level
• On-Line Analytic Processing (OLAP)
– Ability to use many dimensions, reports & graphics
• Data Mart
– Usually temporary analysis
• Data Warehouse
– Usually permanent repository
3-4
Data Warehousing
Price Waterhouse definition:
A data warehouse is an orderly and accessible
repository of known facts and related data
that is used as a basis for making better
management decisions. The data warehouse
provides a unified repository of consistent
data for decision making that is subject
oriented, integrated, time variant, and
nonvolatile.
3-5
Data Warehousing
• Provide business users views of data
appropriate to mission
• Consolidate & reconcile data
• Give macro views of critical aspects
• Timely & detailed access to information
• Provide specific information to groups
• Ability to identify trends
3-6
Data Warehousing
Price Waterhouse:
Not just a technology;
an architecture and process designed to
support decision making
special-purpose database systems to
improve query performance significantly
index, partition, pre-aggregate data
3-7
Data Warehousing
Beyond OLAP:
Data warehouse
OLAP                              On-Line Transactional Processing
summary data                      detailed operational data
few users                         many concurrent users
data driven                       transaction driven
effectiveness                     efficiency
use EIS, spreadsheets to access
3-8
Data Marts
• Intermediate-level database system
• Often used as temporary storage
– Gather data for study from data
warehouse, other sources (including
external)
– Clean & transform for data mining
3-9
OLAP
• Multidimensional spreadsheet
• Hypercube – term to reflect ability to sort on
many dimensions
• Many forms
– MOLAP – multidimensional
– ROLAP – relational (uses SQL)
– DOLAP – desktop
– WOLAP – web enabled
– HOLAP – hybrid
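The hypercube idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the three dimensions (product, store, month) and the helper name `roll_up` are all hypothetical and do not come from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: each cell of the cube is keyed by three dimensions.
FACTS = [
    {"product": "hose", "store": "A", "month": "Jan", "units": 10},
    {"product": "hose", "store": "B", "month": "Jan", "units": 5},
    {"product": "belt", "store": "A", "month": "Jan", "units": 7},
    {"product": "hose", "store": "A", "month": "Feb", "units": 3},
]

def roll_up(facts, dims, measure="units"):
    """Sum the measure over every dimension that is not listed in dims."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

# The same facts viewed along different dimensions.
by_product = roll_up(FACTS, ["product"])
by_store_month = roll_up(FACTS, ["store", "month"])
grand_total = roll_up(FACTS, [])
```

Choosing a different list of dimensions gives a different view of the same data, which is the sense in which the cube can be sorted on many dimensions.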
3-10
Key Concepts
• Scalability
– Ability to accurately cope with changing
conditions (especially magnitude of
computing)
• Granularity
– Level of detail
• Data warehouse – tends to be fine granularity
• OLAP – tends to aggregate to coarse
granularity
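As a rough illustration of granularity, the sketch below aggregates fine-grained daily figures into coarser weekly totals. The sales numbers and the helper name `to_weekly` are made up for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fine-granularity data: one total per day.
DAILY_SALES = [
    (date(2006, 1, 2), 100),
    (date(2006, 1, 3), 150),
    (date(2006, 1, 9), 90),
]

def to_weekly(daily):
    """Aggregate daily totals to (ISO year, ISO week) totals."""
    weekly = defaultdict(int)
    for day, amount in daily:
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])] += amount
    return dict(weekly)

weekly_sales = to_weekly(DAILY_SALES)
```

The weekly totals can always be rebuilt from the daily rows, but the daily rows cannot be recovered from the weekly totals, which is one reason to store the finest granularity in the warehouse.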
3-11
Data Warehouse Implementation
• Reliable, comprehensive source of
clean data
– Accurate, complete, in correct format
• Processes
– System development
– Data acquisition
– Data extraction for use
3-12
Data Warehouse Generation
• Extract data from sources
• Transform
• Clean
• Load into data warehouse
– 60-80% of effort in operating data warehouse
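The four steps above can be sketched as a small pipeline. This is a minimal sketch, assuming a CSV extract and an in-memory list standing in for the warehouse; the sample data and the function names `extract`, `transform`, `clean` and `load` are hypothetical.

```python
import csv
import io

# Hypothetical source extract; a real one would come from an operational system.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,not-a-number
3,carol,7.25
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: bring customer names into one format."""
    return [{**row, "customer": row["customer"].strip().lower()} for row in rows]

def clean(rows):
    """Clean: drop rows whose amount is not a valid number."""
    good = []
    for row in rows:
        try:
            good.append({**row, "amount": float(row["amount"])})
        except ValueError:
            continue
    return good

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse table (a list here)."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(clean(transform(extract(SOURCE_CSV))), warehouse)
```

Keeping each step as its own function makes it easy to test the cleaning rules separately from the loading code.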
3-13
Data Extraction Routines
• Interpret data formats
• Identify changed records
• Copy information to intermediate file
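One possible way to identify changed records is to compare a fingerprint of each record against the previous extract, then copy only the differences to an intermediate file. The sketch below assumes records are dictionaries with an `id` key; the snapshots and the helper names `fingerprint`, `changed_records` and `write_intermediate` are hypothetical, and hashing is just one technique among several.

```python
import hashlib
import io
import json

def fingerprint(record):
    """Stable hash of a record's contents."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def changed_records(previous, current, key="id"):
    """Return the records in current that are new or differ from previous."""
    old = {rec[key]: fingerprint(rec) for rec in previous}
    return [rec for rec in current if old.get(rec[key]) != fingerprint(rec)]

def write_intermediate(records, handle):
    """Copy records to an intermediate file, one JSON document per line."""
    for rec in records:
        handle.write(json.dumps(rec, sort_keys=True) + "\n")

# Hypothetical snapshots of the same source table at two points in time.
PREVIOUS = [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.50}]
CURRENT = [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.75}, {"id": 3, "price": 1.00}]

delta = changed_records(PREVIOUS, CURRENT)
staging = io.StringIO()  # stands in for the intermediate file
write_intermediate(delta, staging)
```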
3-14
Data Transformation
• Consolidate data from multiple sources
• Filter to eliminate unnecessary details
• Clean data
– eliminate incorrect entries
– eliminate duplications
• Convert & translate data into proper format
• Aggregate data as designed
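A minimal sketch of consolidation, filtering and duplicate elimination follows. It assumes two source systems that describe the same customers with different field names; the sample rows, the field names and the helpers `from_a`, `from_b` and `consolidate` are invented for illustration.

```python
# Hypothetical rows from two source systems with different layouts.
SOURCE_A = [{"cust_id": "17", "name": "Acme Co", "state": "OH"}]
SOURCE_B = [
    {"CustomerNo": 17, "CustomerName": "ACME CO", "St": "oh", "fax": "n/a"},
    {"CustomerNo": 42, "CustomerName": "Birch Ltd", "St": "pa", "fax": "n/a"},
]

def from_a(row):
    """Convert a system A row into the common format."""
    return {"id": int(row["cust_id"]), "name": row["name"].title(),
            "state": row["state"].upper()}

def from_b(row):
    """Convert a system B row; the fax field is filtered out as unneeded."""
    return {"id": row["CustomerNo"], "name": row["CustomerName"].title(),
            "state": row["St"].upper()}

def consolidate(rows_a, rows_b):
    """Merge both sources and keep one record per customer id."""
    merged = {}
    for rec in [from_a(r) for r in rows_a] + [from_b(r) for r in rows_b]:
        merged.setdefault(rec["id"], rec)  # first copy wins, duplicates dropped
    return [merged[key] for key in sorted(merged)]

customers = consolidate(SOURCE_A, SOURCE_B)
```

Here the first copy of a duplicated customer wins; a real design would need an explicit rule for which source to trust.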
3-15
Data Management
• Retrieve information
• Extraction programs
• Problems:
– Required data not available
– Initial data warehouse scope too broad
– Not enough time to do prototyping, or
needs analysis
– Insufficient senior direction
3-16
Meta Data
• Data to keep track of data
• Life cycle:
– Manage meta data
– Design data warehouse
– Ensure data quality
– Manage system during operations
3-17
Business Meta Data
• What data are available
• Source of each data element
• Frequency of data updates
• Location of specific data
• Predefined reports & queries
• Methods of data access
3-18
Technical Meta Data
• Data source
– (internal or external)
• Data preparation features
– (transformation & aggregation rules)
• Logical structure of data
• Physical structure & content
• Data ownership
• Security aspects
– (access rights, restrictions)
• System information
– (date of last update, retention policy, data usage)
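The technical meta data items above could be held in a simple record type. The sketch below is one possible layout, not a standard schema: the class name `TechnicalMetaData`, its field names and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set

@dataclass
class TechnicalMetaData:
    """Meta data kept about one data element in the warehouse."""
    element: str
    source: str
    external_source: bool                      # internal or external source
    transformation_rules: List[str] = field(default_factory=list)
    owner: str = ""                            # data ownership
    access_roles: Set[str] = field(default_factory=set)  # security aspects
    last_update: Optional[date] = None         # system information
    retention_days: Optional[int] = None

    def may_access(self, role: str) -> bool:
        """Check an access right against the recorded roles."""
        return role in self.access_roles

# Hypothetical entry for one aggregated data element.
weekly_sales_meta = TechnicalMetaData(
    element="weekly_sales",
    source="point-of-sale system",
    external_source=False,
    transformation_rules=["sum daily sales by store and week"],
    owner="merchandising",
    access_roles={"buyer", "forecaster"},
    last_update=date(2007, 1, 15),
    retention_days=455,
)
```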
3-19
Wal-Mart’s Data Warehouse
• Heavy user of IT
• Core competency – supply chain distribution
– 2900 outlets
– Data warehouse of 101 terabytes ($4 billion)
– 65 million transactions per week
– Subject-oriented, integrated, time-variant, nonvolatile data
– 65 weeks of data by item, store, day
3-20
Wal-Mart
• Use data warehouse to:
– Support decision making
– Buyers, merchandisers, logistics,
forecasters
– 3,500 vendor partners can query
– Can handle 35 thousand queries per week
• Benefit $12,000 per query
• Some users run about 1,000 queries per day
3-21
Summers Rubber Company
• Distribution firm
– 7 operating locations
– 10,000 items
– 3,000 customers
• Old system:
– OLAP
– Databases transactional & summarized,
distributed
3-22
Summers Data Storage System
• Built in-house, PCs, Access database
• Visual Basic & Excel
• Distributed system
– Data warehouse server controlled queries,
managed resources
• Security
– Passwords gave some protection
– To protect against departing employees, used data
marts holding small versions of the central database
3-23
Summers
• Move from transactional databases to
new system
• Small prototype, iterative feedback from
users
• Data came from many sources
• Scrubbing data
– Reformatting (time units, scales, currency
measures, etc.)
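The reformatting step above can be sketched as follows, assuming records that arrive with mixed time units and currencies. The conversion tables, the field names and the helper `scrub` are hypothetical, and the currency rate is a placeholder rather than a real exchange rate.

```python
# Hypothetical conversion tables; the currency rate is a placeholder value.
MINUTES_PER_UNIT = {"minutes": 1, "hours": 60, "days": 1440}
USD_PER_UNIT = {"USD": 1.0, "CAD": 0.5}

def scrub(record):
    """Reformat one record to hours and US dollars."""
    minutes = record["lead_time"] * MINUTES_PER_UNIT[record["lead_time_unit"]]
    return {
        "item": record["item"],
        "lead_time_hours": minutes / 60,
        "price_usd": record["price"] * USD_PER_UNIT[record["currency"]],
    }

# Hypothetical raw records from two sources using different units.
RAW = [
    {"item": "hose", "lead_time": 90, "lead_time_unit": "minutes",
     "price": 10, "currency": "CAD"},
    {"item": "belt", "lead_time": 2, "lead_time_unit": "days",
     "price": 8, "currency": "USD"},
]

scrubbed = [scrub(rec) for rec in RAW]
```

A record with an unknown unit or currency raises `KeyError` here, which surfaces bad input instead of silently loading it.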
3-24
Summers – Negative features
• Too much disk space on user local drives
• Often difficult to understand & use
• Updating multiple data sites slow, limited
access
• Summary data often wrong
• Couldn’t use data mining tools
– Problem was that the stored data were aggregated
3-25
Comparison
Product      Use                 Duration     Granularity
Warehouse    Repository          Permanent    Finest
Mart         Specific study      Temporary    Aggregate
OLAP         Report & analysis   Repetitive   Summary
3-26
Examples of Data Uses
• Customer information systems
• Fingerhut
3-27
Customer Information Systems
• Massive databases
• Detailed information about individuals
and households
• Use automated analysis
– identify focused market target
3-28
Micromarketing
• Target small groups of highly responsive
customers
• Own niches like smaller competitors
• EXAMPLES:
– Great Atlantic & Pacific Tea Company (A&P)
• target customers, centralize buying
– Fingerhut
• sell on credit to households <$25,000 income
3-29
Media Companies
• R. R. Donnelley & Sons
– world’s largest printer
– provide consumer & life-style data
– customized individual publications
• Mass marketing has become less effective
• Profit in developing niche-oriented strategy
• Need marketing information technology
3-30
Information Overload
• Retail food (groceries)
– average store - 20,000 items
• larger stores 40,000 to 60,000;
• with weights, flavors, etc., hundreds of thousands
– every year 10,000 new items
– 550 corporate and regional buying offices
– 100,000 salespeople
– several hundred thousand price changes/year
3-31
Information Overload
• Grocery data collection
– point-of-sale scanning
– used to allocate shelf space
– used to optimize product mix
– control inventories
– avoid shortages
3-32
Customer Information Systems
• tens of thousands of characters of
information
• tens of millions of customers
• enormous data storage
– hundreds of gigabytes
• parallel computing
• YOU HAVE TO BE BIG TO AFFORD IT
3-33
Customer Information Systems
• USES
– adjust prices
– see new product possibilities
– develop promotions
– personalized advertising
3-34
Customer Information Systems
• OPERATION
– artificial intelligence
• neural networks to wade through data
• identify shopping trends
• segment groups of customers
3-35
Customer Information Systems
• AIRLINE INDUSTRY
– 1980s - deregulation
– number of possible fares & rates
skyrocketed
– SABRE - 45 million fares,
40 million changes/month
– industry now dominated by
American (SABRE) & United (Apollo)
– cost - hundreds of millions of dollars
3-36
Own the Customer
• A&P
– point-of-sale scanning
– frequent shopper programs
• used to build customer database
• sign up, get free bonus saver cards, check cashing,
hundreds of special discounts
• A&P gathers list of purchases, feeds database
– centralized buying, better inventory,
advertising
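How a frequent shopper card feeds a customer database can be sketched as below. This is an assumption-laden illustration, not a description of A&P's actual system: the card ids, the baskets and the helper names `record_basket` and `top_items` are invented.

```python
from collections import Counter, defaultdict

# Customer database: card id -> count of each item bought so far.
purchase_db = defaultdict(Counter)

def record_basket(db, card_id, items):
    """Add one scanned basket to the card holder's purchase history."""
    db[card_id].update(items)

def top_items(db, card_id, n=1):
    """Return the n items this card holder buys most often."""
    return [item for item, _count in db[card_id].most_common(n)]

# Hypothetical point-of-sale scans tied to shopper cards.
record_basket(purchase_db, "card-001", ["coffee", "milk"])
record_basket(purchase_db, "card-001", ["coffee"])
record_basket(purchase_db, "card-002", ["bread"])

favorite = top_items(purchase_db, "card-001")
```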
3-37
Versioning
• Assemble hundreds of versions of the same ad
• Switch & reassemble products & prices
• Cigarette makers
– some of most advanced database marketing
– direct mail, discount coupons, freebies
– have built databases on smoker
demographics
– anticipate market changes, target
promotions
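Assembling many versions of the same ad by switching products and prices can be sketched with a Cartesian product. The headlines, products, prices and the helper name `ad_versions` below are made up for the example.

```python
from itertools import product

# Hypothetical building blocks for one ad template.
HEADLINES = ["Spring Sale", "Members Only"]
PRODUCTS = ["garden hose", "work gloves", "tool belt"]
PRICES = ["$9.99", "$12.49"]

def ad_versions(headlines, products, prices):
    """Build every combination of headline, product and price."""
    return [
        f"{headline}: {item} for {price}"
        for headline, item, price in product(headlines, products, prices)
    ]

versions = ad_versions(HEADLINES, PRODUCTS, PRICES)
```

The number of versions is the product of the list lengths (2 x 3 x 2 = 12 here), so a handful of options per slot quickly reaches hundreds of versions.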
3-38
Versioning
• FINGERHUT
– 150 catalog mailings in 1992
– based on statistically predicted consumer
response
– 13 million customers, 14% annual growth
– database captures 1400 pieces of
information about a household
• demographics, purchasing histories
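Mailing based on statistically predicted response can be illustrated with a small scoring sketch. The source does not say which model Fingerhut used; the logistic score, the weights, the household fields and the helper names below are all hypothetical stand-ins.

```python
import math

# Hypothetical weights; a real model would be fitted to past response data.
WEIGHTS = {"bias": -2.0, "past_orders": 0.5, "has_children": 1.0}

def response_probability(household):
    """Logistic score estimating how likely a household is to respond."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["past_orders"] * household["past_orders"]
         + WEIGHTS["has_children"] * household["has_children"])
    return 1.0 / (1.0 + math.exp(-z))

def select_for_mailing(households, threshold=0.5):
    """Return the ids of households whose score meets the threshold."""
    return [h["id"] for h in households if response_probability(h) >= threshold]

# Hypothetical household records drawn from the customer database.
HOUSEHOLDS = [
    {"id": 1, "past_orders": 4, "has_children": 1},
    {"id": 2, "past_orders": 0, "has_children": 0},
    {"id": 3, "past_orders": 2, "has_children": 0},
    {"id": 4, "past_orders": 3, "has_children": 1},
]

mailing_list = select_for_mailing(HOUSEHOLDS)
```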
3-39
FINGERHUT
• identify your kids’ birthdays, send ideas
– FRONT-END programs
• get new customers (purchased from others)
– TRANSITION programs
• evaluate new purchasers, keep best
– BACK-END programs
• maximize profit
3-40
FINGERHUT
• FRONT-END
– newspaper, magazines, TV, postcards,
catalogs
– predictive models
– lists from other companies
– if you respond
• TRANSITION
– sort out good credit risks, good purchasers
3-41
FINGERHUT
• BACK-END
– 80% of revenue from repeat customers
– customers segmented
• 75 specialty catalogs
• personalized messages
3-42
Marketing Budgets
• Saturated advertising channels
– expenditures more than doubled in 1980s
– too much advertising, too little relevant
• Shift to
– promotional discounts
– slotting - buy shelf space
– undermines brand loyalty
3-43
Narrowcasting
• Cable TV
• In-store coupons
• Special monitors
– doctors’ offices, airport lounges
• Interactive kiosks
• Interactive home TV shopping
3-44
R.R. Donnelley & Sons
• Will manage customer’s database
• Supply consumer data
• Identify market segments
• Printing
– Farm Journal - 8000 different editions/month
– tailored editorial & advertising content
3-45
Customer Information Systems
• Barriers to competition
• Cost up to $100 million to develop
• Years to gather data and build
• Basic shift in source of competitive advantage