Data Mining
Kathy S Schwaig
Outline
• Motivation
• Definitions
• Techniques
• Applications
Portions of this presentation are adapted from J. Han
Simon Fraser University, Canada
Motivation
Data found in data warehouses is
not, by itself, of great intrinsic
value.
Value comes from the knowledge
that can be discovered from data.
What do you do with it?
Data Volume Problems
• Magnitude of data due to machine-readable text
disseminated across networks.
• Difficult to distill information for analysis.
• Tools needed to “mine” information to bring out
key, relevant facts.
• Users need to rapidly filter and assimilate useful
information from a variety of data sources.
Data Mining
• The process of identifying valid, novel,
potentially useful, and ultimately
understandable patterns in data.
• Extraction of hidden, predictive information
from large databases.
• Provides answers to questions a decision
maker had previously not thought to ask.
Data Mining
• Search for relationships, patterns, and trends which,
prior to the search, were not known to exist or were not
visible.
• E.g. “Find related buying patterns.”
“There is a pattern that occurs X% of the time:
when someone buys window coverings (not
shades, blinds, or other specifics) and within 1
to 3 months buys linens, they buy furniture
within the next 4 months.”
Data Mining
[Figure: data mining as one step in a larger process. Labels, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation.]
Data Mining and Business Intelligence
[Figure: layered pyramid. Potential to support business decisions increases toward the top. Layers, from top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining
Analysis Techniques Examples
• Characterization
• Association
• Classification
• Prediction
• Clustering (Data Segmentation)
Characterization
• Demographics: address, income, recreational
equipment ownership, etc.
• Psychographics: lifestyle/personality
characteristics like “highly protective of
children” or “impulsive shopper”
• Technographics (web-based): attributes of your
computer system, such as browser, operating system,
modem speed, etc.
Association
• Occurrences linked to a single event: identify
items that are likely to be purchased or
viewed in the same session (web)
• Example: Amazon.com. Customers who
bought The Grapes of Wrath also bought The Great
Gatsby
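The association idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular retailer does it: the function name `pair_rules`, the `min_support` threshold, and the purchase sessions are all made up for the example. It counts how often two items appear in the same session (support) and how often one item appears given the other (confidence).

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return (a, b, support, confidence) tuples for item pairs.

    support    = fraction of sessions that contain both a and b
    confidence = fraction of sessions containing a that also contain b
    """
    n = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        items = sorted(set(session))  # ignore duplicates within a session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Emit the rule in both directions: a -> b and b -> a.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical purchase sessions (illustrative data only).
sessions = [
    ["Grapes of Wrath", "Great Gatsby"],
    ["Grapes of Wrath", "Great Gatsby", "Moby Dick"],
    ["Grapes of Wrath"],
    ["Moby Dick"],
]
rules = pair_rules(sessions, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, everyone who bought Great Gatsby also bought Grapes of Wrath, but not the other way around, which is why the two directions of a rule can have different confidence.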
Classification
• Recognize patterns that describe the group to
which an item belongs by examining existing
items that have already been classified and by
inferring a set of rules
• Example: Credit card companies have
discovered the characteristics of customers
likely to leave and have built a model to
help predict who will leave in the future.
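One very simple way to infer a rule from already-classified items is the "1R" approach: try each attribute, map each of its values to the most common class, and keep the attribute that makes the fewest mistakes. The sketch below assumes that technique purely for illustration; the attribute names and customer records are hypothetical and are not taken from any credit card company's model.

```python
from collections import Counter, defaultdict


def learn_one_rule(records, label_key):
    """1R: choose the single attribute whose value -> majority-label
    mapping misclassifies the fewest of the labeled records."""
    best = None
    attributes = [key for key in records[0] if key != label_key]
    for attr in attributes:
        # Count labels seen for each value of this attribute.
        labels_by_value = defaultdict(Counter)
        for record in records:
            labels_by_value[record[attr]][record[label_key]] += 1
        rule = {
            value: counts.most_common(1)[0][0]
            for value, counts in labels_by_value.items()
        }
        errors = sum(rule[r[attr]] != r[label_key] for r in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical, already-classified customers ("left" is the class label).
labeled_customers = [
    {"late_payments": "yes", "uses_rewards": "no", "left": True},
    {"late_payments": "yes", "uses_rewards": "yes", "left": True},
    {"late_payments": "no", "uses_rewards": "no", "left": False},
    {"late_payments": "no", "uses_rewards": "yes", "left": False},
    {"late_payments": "no", "uses_rewards": "no", "left": True},
]
attr, rule, errors = learn_one_rule(labeled_customers, "left")
print(attr, rule, errors)
```

The learned rule can then be applied to a new customer by looking up `rule[customer[attr]]`. Real classifiers use many attributes at once; this only shows the "infer rules from classified examples" step.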
Prediction
• Guesses an unknown value, such as income, when
you know other things about a person.
• Example: lifetime monetary value. Often used with
demographic data to fill in blank information. For
example, we know someone’s address, car
preference and job title but not their income. We
can look at others with similar characteristics and
from their data infer the missing income figure.
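The "look at others with similar characteristics" step can be sketched as a nearest-neighbour estimate. This is only an illustration under assumptions: similarity is taken to be the number of matching attribute values, the estimate is the mean over the `k` most similar people, and the records (including the zip codes and incomes) are invented for the example.

```python
from statistics import mean


def infer_missing(target, known, field, k=2):
    """Estimate target[field] as the mean of that field over the k known
    records sharing the most attribute values with the target."""
    def similarity(record):
        # Number of the target's attributes that this record matches.
        return sum(record.get(key) == value for key, value in target.items())

    nearest = sorted(known, key=similarity, reverse=True)[:k]
    return mean(record[field] for record in nearest)


# Hypothetical people with known income (illustrative data only).
people = [
    {"zip": "30144", "car": "sedan", "job": "teacher", "income": 48000},
    {"zip": "30144", "car": "sedan", "job": "nurse", "income": 52000},
    {"zip": "30301", "car": "truck", "job": "lawyer", "income": 110000},
]
unknown = {"zip": "30144", "car": "sedan", "job": "teacher"}
estimate = infer_missing(unknown, people, "income")
print(estimate)
```

Here the two most similar records are the teacher and the nurse from the same area, so the estimate is the average of their incomes.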
Clustering
• Identify people who share common
characteristics. A way of identifying
differing groups within the data.
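A common way to find such groups is k-means, shown here as a bare-bones sketch. The slide does not name an algorithm, so k-means is an assumption chosen for illustration; the starting centroids are simply the first `k` points so the result is deterministic, and the customer data is made up.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; centroids start at the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical shoppers as (age, purchases per year).
shoppers = [(22, 40), (68, 5), (25, 35), (70, 8), (24, 42), (66, 6)]
centroids, clusters = kmeans(shoppers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

On this toy data the two groups that emerge are younger, frequent buyers and older, occasional buyers; no labels were supplied in advance, which is what distinguishes clustering from classification.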
Patterns
• Scuba gear and Australian vacations
• Skim milk and whole wheat bread
• AT&T’s stock rises at least 2% after
every 3-day slump in the Dow
Camelot Music Inc.
• Discovered what appeared to be a
curious purchasing trend.
• Music retailer’s 493 stores were selling
a lot of rap and alternative CDs to
people older than 65.
Are All the “Discovered” Patterns Interesting?
• A data mining query may generate thousands of
patterns.
• Are they interesting? Why or why not?
• Interesting if:
  • easily understood by humans
  • valid on new or test data with some degree of certainty
  • potentially useful
  • novel
  • validates some hypothesis that a user seeks to confirm
Applications: MCI
How to find the customers you want to keep
from among the millions?
Comb marketing data on 140 million
households, each evaluated on as many as
10,000 attributes (e.g. income, lifestyle,
and details about past calling habits).
But which set of those attributes is the most
important to monitor, and within what
range of values?
MCI
• Using an IBM SP/2 supercomputer as its data
warehouse, MCI has identified the variables it
finds most telling about its customers
and, from those, compiled a set of 22 very
detailed and highly confidential statistical
customer profiles, none of which could
have been developed without data mining
programs.
Wal-Mart
Point-of-sale transaction data is captured at each retail
store and transmitted to Wal-Mart’s Arkansas data
warehouse.
Over 3,500 independent suppliers have online access to
information about their respective products in that data
warehouse. They may query that data to analyze trends by
item and store, using that information to find the products
that need replenishment
and thus get the right products to each
store on time.
Data Mining Should Not be Used Blindly!
• Data mining finds regularities in history,
but history is not the same as the future.
• Association does not dictate trend or
causality!
  • “Drinking diet drinks leads to obesity!”
  • David Heckerman’s counter-example (1997):
    barbecue sauce, hot dogs and hamburgers.
Web Mining: Lots To Be Done!
• Types of Web mining
  • Web usage mining: which page or graphic was
    served (URL), linked to date, time, and browser information
  • Web content mining: how visitors are responding to your
    content (which links they select, where they spend time,
    which search terms they use, where they browse)
• Other than managers, who could REALLY use
this information?
Challenges to Web Mining
• Web: a huge, widely distributed, highly heterogeneous, semi-structured, interconnected, evolving hypertext/hypermedia
information repository.
• Problems:
  • the “abundance” problem
  • limited coverage of the Web (hidden Web sources)
  • limited query interface: keyword-oriented search
  • limited customisation to individual users
• DBMSs and data miners will play an increasingly important role in
the new generation of the Internet.
Summary
• Need for data mining
• Approaches
• Problems
• Applications
• Web data mining
Appendix: Market Analysis and Management
• Data sources
  • Credit card transactions, loyalty cards, discount coupons,
    customer complaint calls, studies.
• Target marketing
  • Clusters of “model” customers who share the same
    characteristics: interest, income level, spending habits, etc.
• Customer purchasing patterns
  • Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information.
Appendix: Market Analysis and Management (Cont’d)
• Customer profiling
  • data mining can tell you what types of customers buy
    what products (clustering or classification).
• Customer requirements
  • identify best products for different customers
  • prediction to find what factors will attract new customers
• Summary information
  • multi-dimensional summary reports
  • statistical summary information
Appendix: Corporate Analysis and Risk Management
• Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis
    (financial-ratio, trend analysis, etc.)
• Resource planning
  • summarize and compare resources and spending
• Competition
  • Monitor competitors and market directions.
  • Segment customers into classes with a class-based pricing
    procedure.
  • Set pricing strategy in a highly competitive market.
Appendix: Fraud Detection and Management
• Applications
  • Widely used in health care, retail, credit card services, telecommunications
    (phone card fraud).
• Approach
  • Use historical data to build models of fraudulent behavior and use data
    mining to help identify similar instances.
• Examples
  • Auto insurance: detect a group of people who stage accidents to collect
    insurance
  • Money laundering: detect suspicious money transactions (US Treasury's
    Financial Crimes Enforcement Network)
  • Medical insurance: detect professional patients, rings of doctors, and rings
    of references
Appendix: Fraud Detection and Management (Cont’d)
• Telephone fraud:
  • Telephone call model: destination of call, duration,
    time of day or week.
  • Analyze patterns that deviate from the expected norm.
  • British Telecom identified discrete groups of callers
    with frequent intra-group calls, especially on mobile
    phones, and broke a multimillion-dollar fraud.
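"Patterns that deviate from the expected norm" can be illustrated with a simple statistical check. The sketch below is an assumption-laden toy, not British Telecom's method: it models only call duration, treats a caller's own history as the norm, and flags new calls more than `threshold` standard deviations from the historical mean. The function name and the numbers are hypothetical.

```python
from statistics import mean, stdev


def unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations from the mean of the caller's history."""
    mu = mean(history)
    sigma = stdev(history)
    return [d for d in new_calls if abs(d - mu) > threshold * sigma]


# Hypothetical call durations in minutes (illustrative data only).
history_minutes = [3, 4, 5, 4, 3, 5, 4, 4]
flagged = unusual_calls(history_minutes, [4, 6, 95])
print(flagged)
```

A fuller call model would also include destination and time of day or week, as listed above, and would compare each against its own expected range.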
Appendix: Other Applications
• Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preferences and behavior,
    analyze the effectiveness of Web
    marketing, improve Web site organization,
    etc.
Appendix: Decision Support and OLAP
• DSS: information technology to help the knowledge worker
(executive, manager, analyst) make faster and better decisions
  • What were the sales volumes by region and product category
    for the last year?
  • How did the share price of computer manufacturers
    correlate with quarterly profits over the past 10 years?
  • Will a 10% discount increase sales volume sufficiently?
• OLAP: on-line analytical processing. Refers to
array-oriented database applications that allow
users to view, navigate through, manipulate, and
analyze multi-dimensional databases. An element of
a decision support system.
• Data mining is a powerful, high-performance data
analysis tool for decision support.
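A question like "sales volumes by region and product category" is a roll-up over chosen dimensions of a multi-dimensional data set. The sketch below shows that single operation in plain Python; it is an illustration only, not an OLAP engine, and the `roll_up` helper, field names, and sales figures are all hypothetical.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical sales facts (illustrative data only).
sales = [
    {"region": "East", "category": "Linens", "year": 1999, "volume": 120.0},
    {"region": "East", "category": "Furniture", "year": 1999, "volume": 80.0},
    {"region": "West", "category": "Linens", "year": 1999, "volume": 60.0},
    {"region": "East", "category": "Linens", "year": 1999, "volume": 30.0},
]

# View the same facts at two levels of detail.
by_region_category = roll_up(sales, ["region", "category"], "volume")
by_region = roll_up(sales, ["region"], "volume")
print(by_region_category)
print(by_region)
```

Choosing fewer dimensions gives a coarser summary of the same facts, which is the "navigate through" part of OLAP; data mining goes further by searching for patterns the user did not specify in advance.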