Module 1:
Introduction
Topics
Definitions
Business intelligence
DW & OLAP
Data mining
Data Warehousing and Data Mining Motivation
Data mining tasks
Classification,
clustering,
association, etc.
Definitions
What is business intelligence?
The new technology for understanding the past and
predicting the future
A broad category of technologies that allows for
gathering, storing, accessing and analyzing data
to help business users make better decisions
Analyzing business performance through data-driven
insight
A broad category of applications, which includes the
activities of
Decision support systems
Query and reporting
OLAP
Statistical, forecasting and data mining
What is a data warehouse?
Barry Devlin, IBM Consultant
What is a data warehouse?
W. H. Inmon, Building the Data Warehouse
Data in OLTP and OLAP
What is data mining?
Many Definitions
Search for valuable information (knowledge) from large
volumes of data
Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns & rules
Alternative terms:
Data analysis, pattern analysis, data dredging, data
exploration, data understanding, data summarization
Data mining: a misnomer?
Knowledge Discovery Process
KDD process
Data cleaning: remove noise and inconsistent data
Data integration: from multiple sources -> data
warehouse
Data selection and transformation: transform data
into forms appropriate for data mining, select
relevant data
Data mining: extract patterns
Pattern evaluation/interpretation: using
interestingness measures
Knowledge presentation: visualization and
knowledge representation are used to present
mined knowledge to the user
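The KDD steps above can be sketched as a chain of small functions. Everything in this example is illustrative: the two toy sources, the record layout, the function names, and the choice of item frequency as the "pattern" and a minimum support of 0.5 as the interestingness measure are assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records from two sources.
store_a = [
    {"customer": "c1", "item": "Milk"},
    {"customer": "c1", "item": " bread "},
    {"customer": "c2", "item": None},  # noisy record: missing item
]
store_b = [
    {"customer": "c2", "item": "milk"},
    {"customer": "c3", "item": "MILK"},
    {"customer": "c3", "item": "beer"},
]


def clean(records):
    """Data cleaning: drop records with no item and normalize item names."""
    return [
        {"customer": r["customer"], "item": r["item"].strip().lower()}
        for r in records
        if r["item"] is not None
    ]


def integrate(*sources):
    """Data integration: combine the cleaned sources into one collection."""
    return [record for source in sources for record in source]


def transform(records):
    """Selection and transformation: one set of items per customer."""
    baskets = defaultdict(set)
    for r in records:
        baskets[r["customer"]].add(r["item"])
    return dict(baskets)


def mine(baskets):
    """Data mining: fraction of customers who bought each item."""
    counts = Counter(item for items in baskets.values() for item in items)
    return {item: n / len(baskets) for item, n in counts.items()}


def evaluate(patterns, min_support):
    """Pattern evaluation: keep only patterns that meet the threshold."""
    return {item: s for item, s in patterns.items() if s >= min_support}


def present(patterns):
    """Knowledge presentation: a readable, sorted report."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{item}: {s:.2f}" for item, s in ranked]


baskets = transform(integrate(clean(store_a), clean(store_b)))
report = present(evaluate(mine(baskets), min_support=0.5))
print(report)
```

Each function corresponds to one step of the process, so a step can be replaced (for example, a different mining method) without touching the others.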
What is (not) Data Mining?
What is not Data
Mining?
– Look up phone
number in phone
directory
– Query a Web
search engine for
information about
“Amazon”
What is Data Mining?
– Certain names are more
prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in
Boston area)
– Group together similar
documents returned by search
engine according to their context
(e.g. Amazon rainforest,
Amazon.com)
Origins of Data Mining
Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: data mining at the intersection of statistics/machine learning/AI, pattern recognition, and database systems]
Data mining in the BI context
The complete DSS from BI perspective
Data Warehousing and Data Mining
Motivations
Motivation:
Data explosion problem:
Automated data collection tools and mature database
technology lead to large amounts of data stored in
databases and data warehouses
We are drowning in data, but starving for
knowledge!
Do not believe it?
See the following for proof!
Why Mine Data? Commercial Viewpoint
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at
enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations
generating terabytes of data
Big Data Examples
Largest Databases in 2003
What tools do we have?
Query processing
Reporting tool
Spreadsheet
Statistics
OLAP (On Line Analytical Processing)
Are there enough data analysts?
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What we need is
New technology that can
intelligently and automatically
assist humans in
analyzing and transforming the
rapidly growing volume of
digital data into useful information
Data mining
Largest Database Data Mined (Jun’06)
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe the
data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes, one of the
attributes is the class.
Find a model for class attribute as a function
of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the
model.
Illustrating Classification Task
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Training Set -> Learn Classifier -> Model
Test Set -> Model
Example of a Decision Tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree
Splitting Attributes: Refund, MarSt, TaxInc
Refund
  Yes -> NO
  No -> MarSt
    Married -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
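The tree on this slide can be written as a few nested conditions and checked against the ten training records. This is only a sketch: the function name `classify` and the tuple layout are illustrative choices, income is expressed in thousands, and an income of exactly 80K is sent to the "> 80K" branch as an assumption because the slide does not say which side it belongs to.

```python
# Training records from the slide: (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
training_data = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def classify(refund, marital_status, income_k):
    """Predict the Cheat class by following the slide's tree from the root."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K is treated as the "> 80K" branch.
    return "No" if income_k < 80 else "Yes"


# Fraction of training records whose predicted class matches the recorded one.
correct = sum(
    classify(refund, marital, income) == cheat
    for _, refund, marital, income, cheat in training_data
)
training_accuracy = correct / len(training_data)
print(training_accuracy)
```

Note that this measures agreement with the training data only; as the definition slide says, accuracy should be judged on a separate test set.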
Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Start from the root of tree.
Refund
  Yes -> NO
  No -> MarSt
    Married -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
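Applying the model means walking from the root to a leaf, one attribute test at a time. The sketch below stores the tree as nested dictionaries and records the path taken for the test record on this slide. The dictionary layout and the names `apply_model`, `branch`, and `children` are hypothetical, and the ">= 80K" label reflects an assumption about where an income of exactly 80K goes.

```python
def income_branch(income_k):
    # Assumption: exactly 80K follows the upper branch; the slide leaves this open.
    return "< 80K" if income_k < 80 else ">= 80K"


def marital_branch(status):
    return "Married" if status == "Married" else "Single, Divorced"


# Internal nodes are dicts; leaves are class labels (strings).
taxinc_node = {
    "attribute": "TaxInc",
    "branch": income_branch,
    "children": {"< 80K": "NO", ">= 80K": "YES"},
}
marst_node = {
    "attribute": "MarSt",
    "branch": marital_branch,
    "children": {"Married": "NO", "Single, Divorced": taxinc_node},
}
root = {
    "attribute": "Refund",
    "branch": lambda value: value,
    "children": {"Yes": "NO", "No": marst_node},
}


def apply_model(node, record):
    """Walk from the root to a leaf; return the predicted class and the path taken."""
    path = []
    while isinstance(node, dict):
        label = node["branch"](record[node["attribute"]])
        path.append((node["attribute"], label))
        node = node["children"][label]
    return node, path


# Test record from the slide: Refund = No, Married, 80K, Cheat = ?
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
prediction, path = apply_model(root, test_record)
print(prediction, path)
```

For this record the walk stops at the Married leaf, so the income test is never reached.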
Application: Credit card application
Institution: a credit card company typically
receives thousands of applications for new cards.
The application contains information: annual
salary, any outstanding debts, age etc.
The problem: A decision has to be taken whether
to accept or reject the applications.
Data mining task: To categorize applications into
those who have good credit, bad credit, or fall
into a gray area (thus requiring further human
analysis).
Application: Satellite image classification
Application: General image
Application: Biological image
Protein classes: nucleus, cytoplasm, and mitochondria.
RBC classes: discocyte, stomatocyte, and echinocyte
Clustering
Groups data into meaningful classes/clusters
Unsupervised learning
Motivation:
We do not know what to look for
The first step in identifying useful patterns is to group
data by their similarity
Once data are grouped (clustered), properties of each
cluster can be analyzed
High quality clusters:
the intra-class similarity is high
the inter-class similarity is low
Clustering: Basic concept
Given points in some space, group the
points into a small number of clusters
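One common way to group points is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so k-means is an assumption, as are the sample points, the choice of k = 2, and the simple initialization from the first k points.

```python
import math

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]


def kmeans(data, k, max_iterations=100):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = list(data[:k])  # naive initialization: the first k points
    labels = []
    for _ in range(max_iterations):
        # Assignment step: index of the closest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


labels, centroids = kmeans(points, k=2)
print(labels)
```

Points that end up with the same label are close to each other (high intra-cluster similarity) and far from the other group (low inter-cluster similarity).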
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same objects grouped two ways: Simpson's family vs. school employees, or females vs. males]
Application: web clustering
Association Rule Discovery: Definition
Given a set of records each of which contains
some number of items from a given collection;
Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
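The strength of rules like these is usually summarized by support and confidence. The slide does not define these measures, so the sketch below uses their standard definitions as an assumption and computes them for the two rules over the five transactions shown; the helper names are illustrative.

```python
# Transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support =", round(support(lhs | rhs), 2),
          "confidence =", round(confidence(lhs, rhs), 2))
```

Here {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.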
Association Rule (Plain Form)
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object
associated with its own timeline of events, find
rules that predict strong sequential dependencies
among different events.
Sequence Data
Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7
[Figure: timeline from 10 to 35 showing the events of objects A, B, and C at their timestamps]
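To make the idea of a sequential dependency concrete, the sketch below counts how many objects in the sequence database above contain a given ordered pattern. The definition of containment used here (each pattern element must be a subset of the events at some timestamp, with timestamps strictly increasing) is a common one but is an assumption, as are the function names.

```python
# Sequence database from the slide: object -> list of (timestamp, set of events).
sequence_db = {
    "A": [(10, {2, 3, 5}), (20, {6, 1}), (23, {1})],
    "B": [(11, {4, 5, 6}), (17, {2}), (21, {7, 8, 1, 2}), (28, {1, 6})],
    "C": [(14, {1, 8, 7})],
}


def contains(timeline, pattern):
    """True if the pattern's elements occur, in order, at increasing timestamps."""
    position = 0
    for _, events in sorted(timeline, key=lambda entry: entry[0]):
        # Greedy match: take the earliest timestamp that covers the next element.
        if position < len(pattern) and pattern[position] <= events:
            position += 1
    return position == len(pattern)


def sequence_support(db, pattern):
    """Fraction of objects whose timeline contains the pattern."""
    return sum(contains(timeline, pattern) for timeline in db.values()) / len(db)


# Event 2 followed later by event 1: found in objects A and B, not in C.
print(sequence_support(sequence_db, [{2}, {1}]))
# Event 1 followed later by event 6: found only in object B.
print(sequence_support(sequence_db, [{1}, {6}]))
```

Order matters: event 6 followed by event 1 occurs in both A and B, while event 1 followed by event 6 occurs only in B.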
Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
Sequential Pattern Discovery: Examples
Stock market
(IBM_UP SUN_UP) --> (Microsoft_UP)
In point-of-sale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
Medical field
If a patient underwent cardiac bypass surgery for
blocked arteries (blood vessel) and later developed
high blood urea within a year of surgery, he or she is
likely to suffer from kidney failure within the next 18
months.
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion
Detection
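A very simple way to flag deviations from normal behavior is to measure how many standard deviations a value lies from the mean. The sketch below is one possible approach rather than a method the slides prescribe; the transaction amounts are hypothetical, and both the z-score rule and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical card transaction amounts; the last one is far from the rest.
amounts = [20, 22, 19, 21, 20, 500]


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


flagged = find_deviations(amounts)
print(flagged)
```

In practice a single extreme value inflates both the mean and the standard deviation, so fraud and intrusion detection systems typically rely on more robust models than this.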