What is Data Mining?

Module 1:
Introduction
Topics
 Definitions
 Business intelligence
 DW & OLAP
 Data mining
 Data Warehousing and Data Mining Motivation
 Data mining tasks
 Classification,
 clustering,
 association, etc.
Definitions
What is business intelligence?
 The new technology for understanding the past and
predicting the future
 A broad category of technologies that allows for
 Gathering, storing, accessing and analyzing data to help
business users make better decisions
 Analyzing business performance through data-driven
insight
 A broad category of applications, which includes the
activities of
 Decision support systems
 Query and reporting
 OLAP
 Statistical analysis, forecasting and data mining
What is a data warehouse?
 Barry Devlin, IBM Consultant
What is a data warehouse?
 W. H. Inmon, Building the Data Warehouse
Data in OLTP and OLAP
What is data mining?
 Many Definitions
 Search for valuable information (knowledge) from large
volumes of data
 Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns & rules
 Alternative terms:
 Data analysis, pattern analysis, data dredging, data
exploration, data understanding, data summarization
 Data mining: a misnomer?
Knowledge Discovery Process
KDD process
 Data cleaning: remove noise and inconsistent data
 Data integration: from multiple sources -> data
warehouse
 Data selection and transformation: transform data
into forms appropriate for data mining, select
relevant data
 Data mining: extract patterns
 Pattern evaluation/interpretation: using
interestingness measures
 Knowledge presentation: visualization and
knowledge representation are used to present
mined knowledge to the user
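The steps above can be read as a pipeline. The following is only a minimal sketch in Python, assuming two hypothetical in-memory sources (`store_a`, `store_b`) and a deliberately trivial mining step (counting how often an item occurs). None of these names or data come from the slides.

```python
from collections import Counter

# Hypothetical raw sources; None marks a noisy/missing value.
store_a = [{"customer": 1, "item": "milk"}, {"customer": 2, "item": None}]
store_b = [{"customer": 3, "item": "Milk "}, {"customer": 4, "item": "bread"}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select_and_transform(records):
    """Selection and transformation: keep the relevant field and normalize it."""
    return [r["item"].strip().lower() for r in records]

def mine(items):
    """Data mining (trivial stand-in): count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the surviving patterns for the user."""
    return [f"{item}: seen {n} times" for item, n in sorted(patterns.items())]

# Order follows the slide: clean, integrate, select/transform, mine, evaluate, present.
data = integrate(clean(store_a), clean(store_b))
knowledge = evaluate(mine(select_and_transform(data)))
report = present(knowledge)
```

In a real setting each stage would be far more involved; the point of the sketch is only the order of the stages and the fact that mining is one step among several.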
What is (not) Data Mining?
 What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/Machine Learning/AI, Pattern Recognition, and Database systems]
Data mining in the BI context
The complete DSS from BI perspective
Data Warehousing and Data Mining
Motivations
Motivation:
 Data explosion problem:
 Automated data collection tools and mature database
technology lead to large amounts of data stored in
databases and data warehouses
 We are drowning in data, but starving for
knowledge!
Do not believe it?
See the following for proof!
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/
grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
 Data collected and stored at
enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene
expression data
 scientific simulations
generating terabytes of data
Big Data Examples
Largest Databases in 2003
What tools do we have?
 Query processing
 Reporting tool
 Spreadsheet
 Statistics
 OLAP (On Line Analytical Processing)
Are there enough data analysts?
 Much of the data is never analyzed at all
[Figure: "The Data Gap": total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What we need is
New technology that can intelligently and automatically assist humans in analyzing and transforming the rapidly growing volume of digital data into useful information:
Data mining
Largest Database Data Mined (Jun’06)
Data Mining Tasks
Prediction Methods
 Use some variables to predict unknown or future
values of other variables.
Description Methods
 Find human-interpretable patterns that describe the
data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
 Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the
model.
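To make the role of the test set concrete, here is a small Python sketch that measures accuracy on held-out records. It borrows the ten tax-cheat records used as the training example in these slides. The 7/3 split and the "always predict the majority class" model are assumptions chosen only to keep the example short; they are not the method from the slides.

```python
from collections import Counter

# (refund, marital_status, taxable_income_k, cheat); the class attribute is last.
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Hypothetical split: first 7 records for training, last 3 held out for testing.
training_set, test_set = records[:7], records[7:]

def learn_majority_class(training):
    """Learn a trivial model that always predicts the most common class."""
    majority = Counter(r[-1] for r in training).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for r in test if model(r[:-1]) == r[-1])
    return correct / len(test)

model = learn_majority_class(training_set)
test_accuracy = accuracy(model, test_set)
```

Because the model never saw the held-out records, `test_accuracy` estimates how it would behave on previously unseen data. A baseline this naive mostly shows why a better model is needed.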
Illustrating Classification Task
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set]
Example of a Decision Tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
  Yes -> NO
  No  -> MarSt
    Married          -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
Apply Model to Test Data
Start from the root of tree.
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Model: Decision Tree
Refund
  Yes -> NO
  No  -> MarSt
    Married          -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
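Walking the tree from the root can be written as a few nested tests. The Python sketch below encodes the tree from the slide as a function and applies it to the test record (Refund = No, Married, 80K). One assumption is made: the slide only labels the income branches "< 80K" and "> 80K", so an income of exactly 80K is sent to the "> 80K" branch here.

```python
def classify(refund, marital_status, taxable_income_k):
    """Walk the slide's decision tree from the root and return the Cheat class."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K follows the "> 80K" branch; the slide leaves this open.
    return "No" if taxable_income_k < 80 else "Yes"

# Training data from the slide: (refund, marital_status, income_k, cheat).
training_data = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# The test record from the slide: Refund = No, Married, 80K.
test_record = ("No", "Married", 80)
prediction = classify(*test_record)

# Sanity check: count training records the tree gets wrong.
training_errors = sum(
    1 for *attributes, cheat in training_data if classify(*attributes) != cheat
)
```

For the test record the walk stops at the MarSt node (Married), so the income threshold is never consulted and the predicted class is "No". The tree also reproduces the class of every training record.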
Application: Credit card application
 Institution: a credit card company typically
receives thousands of applications for new cards.
The application contains information: annual
salary, any outstanding debts, age etc.
 The problem: A decision has to be taken whether
to accept or reject the applications.
 Data mining task: To categorize applications into
those who have good credit, bad credit, or fall
into a gray area (thus requiring further human
analysis).
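One possible way to obtain the three categories is to threshold a score produced by a classifier. The Python sketch below assumes the model outputs an estimated probability of default; the function name `triage` and both thresholds are hypothetical and would need to be set from business requirements.

```python
def triage(default_probability, accept_below=0.2, reject_above=0.6):
    """Map an estimated default probability to one of three categories."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if default_probability < accept_below:
        return "good credit"
    if default_probability > reject_above:
        return "bad credit"
    return "gray area"  # routed to a human analyst

# Hypothetical scores for three applications.
decisions = [triage(p) for p in (0.05, 0.40, 0.90)]
```

Widening the gap between the two thresholds sends more applications to human review; narrowing it automates more decisions at the cost of more mistakes.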
Application: Satellite image classification
Application: General image
Application: Biological image
Protein classes: nucleus, cytoplasm, and mitochondria.
RBC classes: discocyte, stomatocyte, and echinocyte
Clustering
 Groups data into meaningful classes/clusters
 Unsupervised learning
 Motivation:
 We do not know what to look for
 The first step in identifying useful patterns is to group
data by their similarity
 Once data are grouped (clustered), properties of each
cluster can be analyzed
 High quality clusters:
 the intra-class similarity is high
 the inter-class similarity is low
Clustering: Basic concept
 Given points in some space, group the points into a small number of clusters
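As an illustration, the Python sketch below groups six hypothetical 2-D points with k-means. The slides do not name a clustering algorithm, so k-means, the points, and the starting centroids are all assumptions. The last lines check the quality criterion stated earlier: distances inside a cluster should be small compared with distances between clusters.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

def mean_distance(pairs):
    """Average Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical data: two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])

# Intra-cluster: pairs within the same cluster. Inter-cluster: pairs across clusters.
intra = mean_distance((a, b) for c in clusters for a in c for b in c if a < b)
inter = mean_distance((a, b) for a in clusters[0] for b in clusters[1])
```

Here `intra` is much smaller than `inter`, which is what a high-quality clustering looks like when similarity is measured by distance. The result of k-means depends on the starting centroids, so real uses typically try several initializations.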
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same objects grouped either as Simpson's Family and School Employees, or as Females and Males]
Application: web clustering
Association Rule Discovery: Definition
 Given a set of records, each of which contains some number of items from a given collection:
 Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
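How strong are these two rules? The slides do not say how rules are scored, so the Python sketch below uses support and confidence, the two measures commonly used for association rules, as an assumption. It computes both for each rule over the five transactions above.

```python
# The five transactions from the table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# {Milk} --> {Coke}
milk_coke = (support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))

# {Diaper, Milk} --> {Beer}
diaper_milk_beer = (
    support({"Diaper", "Milk", "Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
```

Milk and Coke appear together in 3 of 5 transactions, and 3 of the 4 transactions with Milk also contain Coke, so the first rule has support 0.6 and confidence of about 0.75. The second rule has support 0.4 and confidence of about 0.67.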
Association Rule (Plane Form)
Sequential Pattern Discovery: Definition
 Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Sequence Data
Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7
[Figure: timeline from 10 to 35 showing the events of Object A, Object B, and Object C at their timestamps]
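The sequence database can be turned into one ordered sequence per object, after which a candidate pattern can be tested against each sequence. The Python sketch below does this for the rows shown on the slide. The pattern "event 2, followed later by event 1" and the helper names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# (object, timestamp, events) rows from the sequence database on the slide.
rows = [
    ("A", 10, {2, 3, 5}), ("A", 20, {6, 1}), ("A", 23, {1}),
    ("B", 11, {4, 5, 6}), ("B", 17, {2}), ("B", 21, {7, 8, 1, 2}), ("B", 28, {1, 6}),
    ("C", 14, {1, 8, 7}),
]

def build_sequences(rows):
    """Group rows by object and order each object's elements by timestamp."""
    timelines = defaultdict(list)
    for obj, timestamp, events in rows:
        timelines[obj].append((timestamp, events))
    return {
        obj: [events for _, events in sorted(timeline, key=lambda x: x[0])]
        for obj, timeline in timelines.items()
    }

def contains(sequence, pattern):
    """True if each pattern element is a subset of some sequence element,
    with the matching elements appearing in strictly increasing order."""
    position = 0
    for wanted in pattern:
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must occur strictly later
    return True

sequences = build_sequences(rows)
pattern = [{2}, {1}]  # hypothetical pattern: event 2, followed later by event 1
supporting = sorted(obj for obj, seq in sequences.items() if contains(seq, pattern))
pattern_support = len(supporting) / len(sequences)
```

Objects A and B contain the pattern (event 2 at timestamps 10 and 17, event 1 afterwards), while C has only a single element, so the pattern is supported by two of the three objects.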
Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
Sequential Pattern Discovery: Examples
Stock market
 (IBM_UP SUN_UP) --> (Microsoft_UP)
In point-of-sale transaction sequences,
 Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
 Athletic Apparel Store:
(Shoes) (Racket, Racquetball) --> (Sports_Jacket)
 Medical field
 If a patient underwent cardiac bypass surgery for
blocked arteries (blood vessel) and later developed
high blood urea within a year of surgery, he or she is
likely to suffer from kidney failure within the next 18
months.
Deviation/Anomaly Detection
 Detect significant deviations from normal behavior
 Applications:
 Credit Card Fraud Detection
 Network Intrusion Detection
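A very simple way to flag "significant deviations" is to measure how many standard deviations a value lies from the mean (its z-score). The Python sketch below is one such baseline, not a fraud-detection method from the slides; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts; one is far from the usual range.
amounts = [20, 25, 22, 19, 24, 21, 500]
suspicious = find_anomalies(amounts)
```

Only the 500 transaction is flagged here. Note that a single extreme value inflates both the mean and the standard deviation, so practical systems often prefer more robust statistics or learned models of normal behavior.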