Ceng514-Fall2012-Introduction

CENG 514
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Definition by Gartner Group
• “Data mining is the process of discovering
meaningful new correlations, patterns and
trends by sifting through large amounts of
data stored in repositories, using pattern
recognition technologies as well as
statistical and mathematical techniques.”
• (Deductive) query processing
• Expert systems or small ML/statistical
programs
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability: Automated data collection tools,
database systems, Web, computerized society
• Data is everywhere, information is nowhere
• Market: From focus on product/service to focus on customer
• IT: From focus on up-to-date balances to focus on patterns in
transactions - Data Warehouses - OLAP
• Increase in complexity of data
[Figure: data mining shown at the intersection of artificial intelligence, machine learning, database management, statistics, visualization, and algorithms]
Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started in 1989
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001
• Market Analysis, Customer Relationships Management (CRM)
• Churn Analysis
• Risk Analysis and Management
• Fraud Detection, Counter Terrorism
• Network Intrusion Detection
• Web Site Restructuring
• Recommendation
• Scientific Applications
Corporate Analysis & Risk Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
• Resource planning
– summarize and compare the resources and spending
• Competition
– monitor competitors and market directions
– group customers into classes and a class-based pricing
procedure
– set pricing strategy in a highly competitive market
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
– Anti-terrorism
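The phone-call model above (“analyze patterns that deviate from an expected norm”) can be sketched as a simple deviation test. This is a minimal illustration, not the method used by any particular fraud system: the call durations, the `deviating` helper, and the threshold of three standard deviations are all assumptions made for the example.

```python
import statistics

# Hypothetical call durations in minutes for one subscriber (illustrative data).
history = [3, 4, 5, 4, 3, 5, 4, 4]   # calls that define the "expected norm"
new_calls = [4, 60]                  # calls to be checked against that norm

def deviating(values, baseline, threshold=3.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean of `baseline`."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return [v for v in values if abs(v - mean) / spread > threshold]

flagged = deviating(new_calls, history)
print(flagged)
```

A real model would also use destination and time of day or week, as the slide notes; duration alone is used here only to keep the sketch short.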
Example: Use in retailing
• Goal: Improved business efficiency
– Improve marketing (advertise to the most likely buyers)
– Inventory reduction (stock only needed quantities)
• Information source: Historical business data
– Example: Supermarket sales records
Date/Time/Register   Fish   Turkey   Cranberries   Wine
12/6 13:15 2         N      Y        Y             N
12/6 13:16 3         Y      N        N             Y
...                  ...    ...      ...           ...
– Size ranges from 50k records (research studies) to terabytes (years of
data from chains)
– Data is already being warehoused
• Sample question – what products are generally purchased
together?
• The answers are in the data, if only we could see them
Example: Churn Analysis
• Business Problem: Prevent loss of customers, avoid
adding churn-prone customers
• Solution: Use neural nets, time series analysis to
identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers
• Benefit: Retention of customers, more effective
promotions
Example: Clicks to Customers
• Business problem: 50% of Dell’s clients order their
computer through the web. However, the retention
rate is 0.5%, i.e. only 0.5% of visitors to Dell’s web page
become customers.
• Solution Approach: Through the sequence of their
clicks, cluster customers and design website
interventions to maximize the number of customers
who eventually buy.
• Benefit: Increase revenues
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
Clustering
• Find groups of similar data items
• Statistical techniques require
some definition of “distance”
(e.g. between travel profiles)
while conceptual techniques use
background concepts and logical
descriptions
“Group people with similar travel profiles”
Clusters:
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
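To make the role of a “distance” concrete, here is a minimal k-means sketch using Euclidean distance. The slide does not name an algorithm, so k-means is an assumption, as are the two features (trips per year, average trip length in days), the numeric values, and the choice of starting centroids; only the names come from the slide.

```python
import math

# Hypothetical travel profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 3), "Patricia": (3, 4),
    "Jeff": (12, 2), "Evelyn": (13, 1), "Chris": (11, 2),
    "Rob": (6, 20),
}

def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    clusters = [[] for _ in centroids]
    for name, c in assignment.items():
        clusters[c].append(name)
    return clusters

# Starting centroids are picked by hand so the run is deterministic.
clusters = kmeans(profiles, [profiles["George"], profiles["Jeff"], profiles["Rob"]])
print(clusters)
```

With a different distance definition or different starting centroids the grouping can change, which is why the choice of distance matters for statistical techniques.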
Classification
• Find ways to separate data items
into pre-defined groups
• Requires “training data”: Data
items where group is known
“Route documents to
most likely interested
parties”
– English or non-English?
– Domestic or Foreign?
[Figure: training data → tool produces classifier → groups]
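The “training data in, classifier out” idea can be sketched with a small word-count classifier for the English / non-English routing question. The slide does not say which technique the tool uses; naive Bayes with add-one smoothing is an assumption here, and the four training sentences and the `train` / `classify` names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: items whose group is already known.
training = [
    ("the meeting is on monday", "english"),
    ("please send the report", "english"),
    ("toplantı pazartesi günü", "non-english"),
    ("lütfen raporu gönder", "non-english"),
]

def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Pick the label with the highest naive Bayes log-score (add-one smoothing)."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify("send the report on monday", model))
print(classify("raporu pazartesi gönder", model))
```

Unlike clustering, the groups are fixed in advance; the training data only teaches the tool how to tell them apart.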
Association Rules
• Identify dependencies in the
data:
– X makes Y likely
• Indicate significance of each
dependency
“Find groups of items
commonly purchased
together”
– People who purchase fish are
extraordinarily likely to
purchase wine
– People who purchase Turkey
are extraordinarily likely to
purchase cranberries
Date/Time/Register   Fish   Turkey   Cranberries   Wine
12/6 13:15 2         N      Y        Y             Y
12/6 13:16 3         Y      N        N             Y
…                    …      …        …             …
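One common way to “indicate significance” of a dependency X → Y is to report its support (the fraction of baskets containing both X and Y) and its confidence (the fraction of baskets with X that also contain Y). The slide does not name these measures, so treat them as an assumption; the baskets, thresholds, and the `pair_rules` helper below are hypothetical.

```python
from itertools import permutations

# Hypothetical market baskets (illustrative data only).
baskets = [
    {"turkey", "cranberries", "wine"},
    {"fish", "wine"},
    {"fish", "wine", "bread"},
    {"turkey", "cranberries"},
    {"bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def pair_rules(baskets, min_support=0.4, min_confidence=0.9):
    """Return {(x, y): (support, confidence)} for rules x -> y that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = {}
    for x, y in permutations(items, 2):
        s_xy = support({x, y}, baskets)
        if s_xy < min_support:
            continue
        confidence = s_xy / support({x}, baskets)
        if confidence >= min_confidence:
            rules[(x, y)] = (s_xy, confidence)
    return rules

rules = pair_rules(baskets)
for (x, y), (s, c) in rules.items():
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

Note that the rule is directional: in this data fish → wine passes the confidence threshold, while wine → fish does not, because wine is also bought without fish.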
Sequential Associations
• Find event sequences that are
unusually likely
“Find common sequences of
warnings/faults within 10
minute periods”
– Warn 2 on Switch C preceded
by Fault 21 on Switch B
– Fault 17 on any switch
preceded by Warn 2 on any
switch
Time    Switch   Event
21:10   B        Fault 21
21:11   A        Warn 2
21:13   C        Warn 2
21:20   A        Fault 17
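A first step toward finding such sequences is to count how often one event type is followed by another within the time window. The sketch below does only that counting over the four log rows from the slide; deciding whether a count is “unusually likely” would need a baseline from much more data. The function names and the choice to ignore the switch are assumptions.

```python
from collections import Counter

# The event log from the slide: (time, switch, event).
log = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]

def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight (no handling of day rollover)."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def sequence_counts(log, window=10):
    """Count ordered pairs (earlier event, later event) at most `window` minutes apart,
    on any switch."""
    events = sorted((minutes(t), switch, event) for t, switch, event in log)
    counts = Counter()
    for i, (t1, _, e1) in enumerate(events):
        for t2, _, e2 in events[i + 1:]:
            if t2 - t1 > window:
                break  # events are sorted, so later ones are even further away
            counts[(e1, e2)] += 1
    return counts

counts = sequence_counts(log)
for (first, second), n in counts.items():
    print(f"{first} then {second}: {n}")
```

Keeping the switch in the key instead of dropping it would give switch-specific patterns such as the first example on the slide.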
Recommendation Techniques
• Given database of user preferences, predict preference of new
user
• Example:
– Predict what new movies you will like based on
• your past preferences
• others with similar past preferences
• their preferences for the new movies
– Predict what books/CDs a person may want to buy
(and suggest it, or give discounts to tempt customer)
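The movie example can be sketched as user-based collaborative filtering: weight each other user’s rating of the new movie by how similar their past ratings are to yours. The slide does not prescribe a formula; cosine similarity over co-rated items and a similarity-weighted average are assumptions, and the users, movies, and ratings are invented.

```python
import math

# Hypothetical ratings on a 1-5 scale; "C" is the new movie "you" have not rated.
ratings = {
    "you": {"A": 5, "B": 1},
    "ann": {"A": 5, "B": 1, "C": 5},
    "bob": {"A": 1, "B": 5, "C": 1},
}

def similarity(a, b):
    """Cosine similarity over the items both users have rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = similarity(ratings[user], theirs)
        weighted += w * theirs[item]
        total += w
    return weighted / total if total else None

prediction = predict("you", "C", ratings)
print(round(prediction, 2))
```

The prediction leans toward ann’s rating because her past preferences match yours; bob still contributes, with a smaller weight.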
Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from:
U. Fayyad, et al. (1996), “From Data Mining to Knowledge
Discovery: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, top to bottom:
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts: OLAP, MDA
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
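The steps above can be sketched as a chain of small functions on toy data. Everything here is hypothetical: the customer records, the field names, the min-max scaling chosen for the transformation step, and the very simple “mining” step (churn rate on each side of a spending split) stand in for whatever a real project would use.

```python
# Hypothetical customer records (illustrative data only).
records = [
    {"id": 1, "spend": 120.0, "churned": False, "note": "vip"},
    {"id": 2, "spend": None,  "churned": True,  "note": ""},
    {"id": 3, "spend": 20.0,  "churned": True,  "note": ""},
    {"id": 4, "spend": 100.0, "churned": False, "note": ""},
    {"id": 5, "spend": 30.0,  "churned": True,  "note": ""},
    {"id": 6, "spend": 40.0,  "churned": False, "note": ""},
]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows, field):
    """Transformation: min-max scale one numeric field to the range [0, 1]."""
    lo = min(r[field] for r in rows)
    hi = max(r[field] for r in rows)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]

def mine(rows, field, target, split=0.5):
    """Mining (summarization): rate of `target` below and above `split`."""
    groups = {"low": [], "high": []}
    for r in rows:
        groups["high" if r[field] >= split else "low"].append(r[target])
    return {k: sum(v) / len(v) for k, v in groups.items() if v}

target_data = select(records, ["spend", "churned"])
preprocessed = transform(clean(target_data), "spend")
patterns = mine(preprocessed, "spend", "churned")
print(patterns)  # evaluation and use of the result are left to the analyst
```

Even in this toy version most of the code is selection, cleaning, and transformation, which echoes the slide’s remark about where the effort goes.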
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
(From J. Ullman’s Notes)
• A big data-mining risk is that you will “discover” patterns that
are meaningless.
• Statisticians call it Bonferroni’s principle: (roughly) if you look
in more places for interesting patterns than your amount of
data will support, you are bound to find meaningless results.
• When looking for a property make sure that the property
does not allow so many possibilities that random data will
surely produce facts “of interest.”
• Joseph Rhine was a parapsychologist in the
1950s who hypothesized that some people
had Extra-Sensory Perception.
• He devised (something like) an experiment
where subjects were asked to guess 10 hidden
cards – red or blue.
• He discovered that almost 1 in 1000 had ESP –
they were able to get all 10 right!
• He told these people they had ESP and called
them in for another test of the same type.
• Alas, he discovered that almost all of them
had lost their ESP.
• What did he conclude?
– He concluded that you shouldn’t tell people they
have ESP; it causes them to lose it.
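The arithmetic behind the story is short enough to check directly. Assuming each guess is an independent fair coin flip between red and blue, the chance of getting all 10 cards right by luck is 1 in 1024, which matches the “almost 1 in 1000”. The pool of 1000 subjects below is an assumed figure for illustration; the slide does not say how many people were tested.

```python
n_cards = 10
n_subjects = 1000  # assumed size of the subject pool (not given on the slide)

# Chance that one pure guesser gets all cards right.
p_all_right = 0.5 ** n_cards

# Expected number of "ESP" subjects if everyone is only guessing.
expected_lucky = n_subjects * p_all_right

# Chance that at least one guesser in the pool gets a perfect score.
p_at_least_one = 1 - (1 - p_all_right) ** n_subjects

# Chance that a previously lucky subject repeats the perfect score on a retest.
p_repeat = p_all_right

print(f"P(all {n_cards} right by luck) = 1/{round(1 / p_all_right)}")
print(f"expected lucky subjects among {n_subjects}: {expected_lucky:.2f}")
print(f"P(at least one perfect score): {p_at_least_one:.2f}")
print(f"P(repeat on retest): {p_repeat:.4f}")
```

Random guessing alone therefore explains both findings: a perfect scorer is likely to turn up in a large enough pool, and that person is very unlikely to repeat the result.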