Introduction
Jun Du
The University of Western Ontario
[email protected]
Outline
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technologies Are Used?
• What Kinds of Applications Are Targeted?
• Trends and Challenges in Data Mining
• Data Mining Resources
• Summary
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Hardware
– Data collection and data availability
• Automated data collection tools, database systems, Web
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, …
• Society and everyone: news, digital cameras, YouTube, Facebook, …
• We have everything ready for data
– But data is useless unless it becomes knowledge
• We are drowning in data, but starving for knowledge!
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, predictive modeling, data science,
business intelligence, etc.
• Examples of data mining:
– Search engine (Google, Bing, Yahoo, …)
– Online shopping (Amazon, eBay, …)
– Social network (Facebook, LinkedIn, …)
– Email service (uwo, gmail, hotmail, …)
– ……
Knowledge Discovery (KDD) Process
[Figure: the KDD process as a sequence of steps]
Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation
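The KDD steps named on this slide can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions: the function names, the two source "databases", and their records are all hypothetical, and the "mining" step is reduced to simple frequency counting.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that have a missing (None) field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one 'warehouse' list."""
    return [record for source in sources for record in source]

def select(records, fields):
    """Selection: keep only the task-relevant fields of each record."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, field):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {value: n for value, n in patterns.items() if n >= min_count}

# Two hypothetical source databases with made-up records.
store_a = [{"customer": "c1", "item": "milk", "price": 2.5},
           {"customer": "c2", "item": None, "price": 1.0}]
store_b = [{"customer": "c3", "item": "milk", "price": 2.4},
           {"customer": "c4", "item": "bread", "price": 3.0}]

warehouse = integrate(clean(store_a), clean(store_b))
task_data = select(warehouse, ["item"])
patterns = evaluate(mine(task_data, "item"), min_count=2)
print(patterns)
```

Each function corresponds to one step of the process, so a real mining algorithm could replace `mine` without touching the other stages.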
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases from bottom to top]
Layers, top to bottom:
– Decision Making
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and Statistics
Input Data --> Data Pre-Processing --> Data Mining --> Post-Processing
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
Multi-Dimensional View of DM
• Data to be mined
– What kinds of data can be mined?
• Knowledge to be mined
– What kinds of patterns can be mined?
• Techniques utilized
– What technologies are used?
• Applications adapted
– What kinds of applications are targeted?
On What Kinds of Data?
• Most commonly used:
– Table data (in raw format or in relational database)
• Advanced data sets
– Transaction data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks, and multi-linked data
– Spatial data and spatiotemporal data
– Multimedia data
– Text data
– The World-Wide Web
• Poll (June 2011)
– What data types you analyzed/mined in the past 12 months?
Association Rule
• Given a set of transaction records, each of which contains some items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper} --> {Beer}
• Story of “Diaper” and “Beer”
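One common way to judge rules like the two above is by their support and confidence. The slide does not name these measures, so treating them as the evaluation criteria is an assumption; the function names are hypothetical. A minimal sketch over the five transactions from the slide:

```python
# The five transactions from the slide, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs, transactions),
          "confidence:", round(confidence(lhs, rhs, transactions), 2))
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper} --> {Beer} in 2 of the 3 that contain Diaper.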
Association Rule Application 1
Marketing and Sales Promotion:
• Let the rule discovered be
{Bagels} --> {Potato Chips}
– If bagels are on sale, potato chips might go fast as well.
– If the store stops selling bagels, potato chip sales might be affected.
–…
Association Rule Application 2
Supermarket shelf management
• Let the rule discovered be
{Diaper} --> {Beer}
– Put beer beside diapers: customers might find it convenient;
– Or put beer far away from diapers: customers might pick up some other items on their way from the diapers to the beer;
–…
Classification & Regression
• Construct models (functions) based on some existing data and
make predictions on some future unseen data
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set --> Learning Algorithm --> Model; Test Set --> Model]
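To make the train-then-predict idea concrete, here is a minimal sketch that fits the slide's training records with a k-nearest-neighbor vote and labels the unseen test records. The slide does not say which learning algorithm is used, so k-NN is a stand-in chosen for brevity; the distance function, `k=3`, and all names are assumptions, and the 60K income for record 6 is read from a value displaced in the slide's table.

```python
from collections import Counter

# Training set from the slide: (refund, marital status, income in K, cheat).
# Assumption: record 6 has income 60K (the value is displaced on the slide).
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Test set from the slide: the class attribute (cheat) is unknown.
test = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

# Scale income differences by the range seen in training (220 - 60 = 160).
INCOME_RANGE = max(r[2] for r in training) - min(r[2] for r in training)

def distance(a, b):
    """Hypothetical mixed-attribute distance: 0/1 mismatch on the two
    categorical attributes plus the range-scaled income difference."""
    refund_diff = 0 if a[0] == b[0] else 1
    marital_diff = 0 if a[1] == b[1] else 1
    income_diff = abs(a[2] - b[2]) / INCOME_RANGE
    return refund_diff + marital_diff + income_diff

def predict(record, k=3):
    """Majority vote among the k training records closest to record."""
    nearest = sorted(training, key=lambda row: distance(record, row))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0]

for record in test:
    print(record, "-->", predict(record))
```

With k-NN the "model" is simply the stored training set; a decision tree or another learner could be swapped in behind the same `predict` interface.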
Classification Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the
class attribute.
• Collect customer data (demographic, lifestyle, etc.)
• Use this information as input attributes to learn a
classification model.
• New York Times article (Feb. 2012): How Companies Learn Your Secrets
Classification Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and information about the account holder as attributes.
– When, what, and where a customer buys, etc.
• Label past transactions as fraudulent or fair (class attribute).
• Learn a model for the class of the transactions.
• Use this model to detect fraudulent transactions.
Classification Application 3
Customer Attrition/Churn:
– Goal: To predict whether a cell-phone plan customer is
likely to be lost to a competitor.
– Approach:
• Use detailed records of transactions with each past and present customer to find attributes.
– How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering
• Given a set of data points, each having a set of attributes, group the data points into different clusters.
– Data points in the same cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
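As one concrete way to form such groups, the sketch below implements plain k-means with Euclidean distance. The slide does not prescribe an algorithm, so k-means is an illustrative choice; the sample points, the use of the first k points as initial centroids, and the function name are assumptions.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain k-means. The first k points are used as initial centroids,
    an arbitrary but deterministic choice made for this sketch."""
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # No assignment changed: converged.
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels, centroids

# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Here "similar" means close in Euclidean distance; a different similarity measure would generally lead to different clusters.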
Clustering Application
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers, which may be selected as market targets
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.
Outlier Analysis
• Outlier: A data object that does not comply with the
general behavior of the data
• Noise or exception? One person’s garbage could be another person’s treasure
• Methods: classification, regression, clustering, …
• Application:
– Credit Card Fraud Detection
– Network Intrusion Detection
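A very small illustration of flagging objects that do not follow the general behavior of the data. This uses a simple z-score rule, which is a statistical shortcut and not one of the methods listed on the slide; the threshold of 2.0, the function name, and the transaction amounts are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # All values identical: nothing can stand out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up transaction amounts with one unusually large purchase.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(amounts))  # → [95]
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, a fraudulent charge) still depends on the application.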
Other Patterns
• Recommendation system
– “people you might know” (Facebook)
– “jobs you might be interested in” (LinkedIn)
– “people who bought this product also bought” (Amazon)
– “movies (TV shows) that you might like to watch” (Netflix)
– …
• Social network analysis
– A new and very popular area
– Can be applied in many areas: fraud detection, marketing, terrorism and crime prevention, …
• …
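The “people who bought this product also bought” pattern can be approximated by counting which items appear together in the same basket. This is a deliberately simple sketch, not how any of the named sites actually work; the baskets and function names are made up.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase baskets, one set of items per order.
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "laptop bag"},
]

# also_bought[a][b] = number of baskets containing both a and b.
also_bought = {}
for basket in baskets:
    for a, b in permutations(basket, 2):
        also_bought.setdefault(a, Counter())[b] += 1

def recommend(item, n=2):
    """Top-n items most often bought together with `item`;
    ties are broken alphabetically so the result is deterministic."""
    counts = also_bought.get(item, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]

print(recommend("laptop"))
print(recommend("mouse pad"))
```

An item that never appears in any basket simply yields an empty recommendation list.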
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the disciplines it draws on: Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Database Technology, Algorithm, Applications]
Top 10 Algorithms in DM
• IEEE International Conference on Data Mining (ICDM) 2006
1. Decision Trees
2. The K-Means Algorithm
3. Support Vector Machines
4. The Apriori Algorithm
5. The EM Algorithm
6. PageRank Algorithm
7. AdaBoost Algorithm
8. K-Nearest Neighbor Algorithm
9. Naive Bayes
10. CART Algorithm
Algorithms in DM
• KDnuggets Poll (Nov. 2011)
– Algorithms for data analysis / data mining
• Rexer Analytics Survey (2012)
Applications of Data Mining
• KDnuggets Poll (December 2011):
– Industries / Fields where you applied Data Mining in 2011
• Rexer Analytics Survey (2012)
10 Challenging Problems in DM
• IEEE International Conference on Data Mining (ICDM) 2005
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High-Dimensional Data and High-Speed Data Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
Hot Topics and Trends in DM
• KDnuggets Poll (Jan. 2012)
– Hottest Analytics / Data Mining Topics in 2012
• Rexer Analytics Survey (2012)
Conferences
• Data Mining Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data
Mining (KDD)
– IEEE Int. Conf. on Data Mining (ICDM)
– SIAM Data Mining Conf. (SDM)
– European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other Related Conferences
– DB conferences: ACM SIGMOD, VLDB, ICDE
– Web and IR conferences: WWW, SIGIR, CIKM
– ML conferences: ICML, NIPS
– AI conferences: IJCAI, AAAI
Journals and Online Resources
• Data Mining Journals
– Data Mining and Knowledge Discovery (DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
• Online Resources
– KDnuggets
– Kaggle
– UCI Machine Learning Repository
– …
Software
• KDnuggets poll (May 2012)
– What analytics, data mining, or big data software have you used in the past 12 months for a real project?
• Rexer Analytics Survey (2012)
Programming Languages
• KDnuggets poll (August 2012)
– Programming languages for analytics / data mining?
Summary
• Data mining: discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology, in great demand, with wide applications
• A KDD process includes data pre-processing, data mining, pattern post-processing, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining patterns: association, classification, clustering, outlier analysis, recommendation systems, social network analysis, etc.
• A variety of data mining technologies and applications
• Data mining resources