Data Mining
1
Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques
2
Examples
• “On Friday evenings, shoppers who buy diapers also
buy beer”.
– Supermarket transaction database
• “People with good credit ratings have fewer accidents”
– Insurance database, http://wtonline.com
• “A one-dollar gas station credit-card transaction
followed by a large transaction is likely to be indicative
of fraud”.
– Credit card transactions database
3
More Examples
• Marketing
– Targeted marketing using decision trees
• Stock selection / Fraud detection
– Using neural networks
• Telecommunications
– Churn modeling, identifying valuable
customers
4
Even More Examples
• Healthcare
– Fish oil and Raynaud’s disease
• Finding communities on the Web
– Abortion example
• Personalization
– Recommender systems
5
Even More More Examples
• Games (e.g. Hollywood Stock Exchange)
– www.hsx.com
• Viral Marketing
– Social networks and network mining
• Sports – NBA Scout
6
Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques
7
What is Data Mining?
• Querying large databases?
• Learning patterns from data?
• Building models from data?
9
What is Data Mining?
• Learning “structure” from large data
– “reverse engineering”
– “structure” could be patterns or models
• How is this different from statistics?
10
Data mining techniques
• Lots of them exist!
• How to categorize these?
– Two approaches
• Description vs. prediction
• RES (representation, evaluation, search) framework
11
Classification of the main engines/techniques
Main use: Description or Prediction
Techniques:
• OLAP
• Decision Trees
• Support Vector Machines
• Neural Nets
• Rule Discovery Methods
• Clustering Methods
• Genetic Algorithms
• Nearest neighbor
• Expert Systems
• Fuzzy logic systems
• Bayesian Approaches
12
Representation, Evaluation & Search: Linear Model Example
• Representation
– Risk = 0.93*prior_default + 0.23*num_cards - 1.3*employed - 0.734
• Evaluation
– R-squared/degree of fit
• Search
– How did the technique find the coefficients?
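The three pieces of this example can be sketched in code. The sketch below is only an illustration: the slide does not say which search procedure was used, so ordinary least squares (solved through the normal equations) is assumed here, and the applicant records and function names are hypothetical. The targets are generated from the slide's formula without noise, so the search should recover roughly the slide's coefficients.

```python
def risk(prior_default, num_cards, employed):
    """Representation: the linear model quoted on the slide."""
    return 0.93 * prior_default + 0.23 * num_cards - 1.3 * employed - 0.734


def r_squared(actual, predicted):
    """Evaluation: share of the variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Search (assumed): least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical applicants: (prior_default, num_cards, employed).
applicants = [(0, 1, 1), (1, 3, 0), (0, 2, 0), (1, 5, 1), (0, 0, 1), (1, 1, 1)]
targets = [risk(*a) for a in applicants]  # noise-free targets from the slide's model

# coefs is ordered [intercept, prior_default, num_cards, employed].
coefs = fit_least_squares(applicants, targets)
predicted = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], a)) for a in applicants]
fit = r_squared(targets, predicted)
```

On real data the targets would carry noise, the fit would be below 1, and a different technique could swap in its own representation, evaluation measure and search.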
13
Representation, evaluation and
search
• Different techniques represent, evaluate and search for
patterns differently.
– Methods can be characterized based on how they do these things.
• Data mining methods use very different representation
schemes, use predictive accuracy as the main evaluation
measure, and use heuristic search procedures
• Strengths: Can build very accurate models and learn interesting
patterns in a bottom-up manner
• Weaknesses: Can find false patterns and may “overfit” the
learning data
– How to mitigate these?
• This is one way to think about the difference between DM
methods and traditional statistical methods
14
Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques
15
The Industry Space
• Data gathering and management
– External data sources
– Integrating databases to design unified views
• For real-time support
• For historical, warehouse-driven apps
• Firms
– Data vendors, consulting services
16
Customer Centric Architecture
[Diagram labels: channels, email, web, phone, golf course, Database, Other Data Sources, Action]
17
The Industry Space
• Broad Data Analytics
– Traditional statistical tools
– Data mining tools
• Firms
– www.kdnuggets.com
– SPSS, SAS, Trajecta, IBM, SGI, Gainsmarts,
HNC Software
• Other common sources
– In-house analytics development and academia
18
The Industry Space
• Niche Market Analytics and Services
– Fraud detection
– Customer Segmentation
– Direct Marketing
– Bioinformatics
– Internet Advertising
– Personalization
• Firms
– Examples: Doubletwist, Celera, HNC Software,
Knowledge Stream Partners, Adknowledge (acquired by
Engage), Epiphany.
19
The Industry Space
• Broad CRM Technologies and Services
– General features
• Some data collection and integration tools
• Some analytics and profitability analyses
• Some features to streamline operations
• Often customizable based on client needs
• Boils down to client needs
• Firms
– E.g. Siebel.
20
Data Mining Revisited
• Smart techniques
– Data mining
• Not a problem.
• Engineering
– Integrating this into an overall data management
architecture
• The more difficult problem
• When and how to use
– The hard part is figuring out which problem to solve,
what data to use, etc.
– The importance of thinking “bottom up” for solving
problems
21
The Chief Data Officer
23
Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques
24
Example DM Models: Neural
Networks
Attempts to mimic the way
neurons work in translating input
data into an output (dependent
variable)
25
Structure of a Neural Network
26
Surface-fitters or Function Approximators
27
Example DM Models: OLAP
(Online Analytical Processing)
Provides visual tools to slice and
dice the data
28
Browsing a Data Cube
29
Example: Clustering
• Identify homogeneous and separable
groups (“clusters”) with:
– maximum similarity between points within a
group
– maximum difference between groups
• Applications
– Group customers into categories useful for
targeted marketing
– Identify clusters in image data
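As one illustration of the grouping goal above, here is a minimal k-means sketch. The slide does not name a clustering algorithm, so k-means is an assumption, as are the function names, the starting centroids and the toy two-dimensional points (which could stand for two customer attributes).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [centroid(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data: two visibly separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

Points inside each resulting cluster are close to their own centroid (similarity within a group) and far from the other one (difference between groups). With less well-separated data the result depends on the starting centroids.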
30
What clusters can look like
31
Example: Classification
32
Example: Nearest neighbor
methods
Read “Amazon.com
recommendations” paper
33
Online Recommender Systems
• Opportunities
– Customized stores and all the associated
benefits
– Easy measurement
– Permits experimentation
• Challenges
– Scale (tens of millions of users, and millions of
items)
– Need for real-time results
– Amount of information on each customer varies, and
the data are often sparse
34
Simple collaborative filtering
[Purchase matrix: rows are customers C1 … Cn, columns are items I1 … Im; an entry is 1 if the customer bought the item and 0 otherwise]
1. Let C1 be the vector of zeros and ones corresponding to customer 1.
2. Define similarity between customers A and B as cos(A, B) = (A · B) / (||A|| ||B||).
3. In traditional collaborative filtering, for a given customer, find the closest
customer and then recommend the other products purchased by this closest customer.
Advantages and disadvantages?
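The three steps above can be sketched as follows. The purchase vectors, customer labels and function names are hypothetical, and ties between equally close customers are broken arbitrarily in this sketch.

```python
import math


def cosine(a, b):
    """cos(A, B) = (A . B) / (||A|| ||B||); defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(purchases, customer):
    """Find the closest other customer and return the indices of items
    that customer bought but the given customer has not."""
    target = purchases[customer]
    others = [c for c in purchases if c != customer]
    closest = max(others, key=lambda c: cosine(target, purchases[c]))
    return [i for i, (mine, theirs) in enumerate(zip(target, purchases[closest]))
            if theirs == 1 and mine == 0]


# Hypothetical purchase matrix: one 0/1 vector per customer, one position per item.
purchases = {
    "C1": [1, 1, 0, 0],
    "C2": [1, 1, 1, 0],
    "C3": [0, 0, 0, 1],
}
recommended = recommend(purchases, "C1")  # item indices, counted from 0
```

Here C2 is the customer closest to C1, so C1 is offered the one item C2 bought that C1 lacks. Each request compares the target against every other customer, which bears on the scale challenge listed on the recommender systems slide.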
35
Content based recommendations
• Treat recommendations as search for
related items.
• E.g. if you liked “Men In Black” you may
get recommendations for comedy films.
• Advantages and disadvantages?
36
Item-to-Item Collaborative Filtering
[Purchase matrix, transposed: rows are items I1 … In, columns are customers C1 … Cm; entries are 0 or 1]
1. For each item, find all similar items in an offline computation.
2. Create a similar-items table in which, for each item, the set of all related items
is stored.
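A minimal sketch of the offline step, under stated assumptions: the slide does not say how item similarity is measured, so cosine similarity over the item rows is assumed here, and the `top_k` cutoff, item labels and purchase data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two 0/1 vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similar_items(item_rows, top_k=2):
    """Offline computation: for each item, store up to top_k other items
    with non-zero similarity, most similar first."""
    table = {}
    for item, vec in item_rows.items():
        scored = [(cosine(vec, other_vec), other)
                  for other, other_vec in item_rows.items() if other != item]
        scored = [(s, o) for s, o in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by label
        table[item] = [o for _, o in scored[:top_k]]
    return table


# Hypothetical item rows: one 0/1 vector per item, one position per customer.
items = {
    "I1": [1, 1, 0],
    "I2": [1, 1, 1],
    "I3": [0, 0, 1],
}
similar_items = build_similar_items(items)
```

At request time a recommendation is then a table lookup on the items a customer already has, with no similarity computation in the online path.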
37
Example: Rule discovery
methods
Read: “On the discovery of
statistical quantitative rules”
38
On Evaluation
• Apparently I would like watching movies on gang
violence in New York theaters.
– Why?
• Because…
• Hamburger grills product recommendation
• On evaluation
– absolutely critical in a world in which more
interactions are being structured automatically
– ‘evaluation’ has multiple aspects, not just how
“accurate” a model may seem to be.
39
Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques
40