Knowledge Discovery in Databases
T1: introduction
Knowledge Discovery in Databases
(Information Harvesting, Data Archeology, Data
Mining, Knowledge Distillery, ...)
• Non-trivial process of identifying valid, novel,
potentially useful and ultimately understandable
patterns from data (Fayyad et al., 1996)
• Data mining involves the use of sophisticated data
analysis tools to discover previously unknown, valid
patterns and relationships in large data sets
(Adriaans, Zantinge, 1999)
• Analysis of observational data sets to find
unsuspected relationships and summarize data in
novel ways that are both understandable and useful
to the data owner (Hand, Mannila, Smyth, 2001)
• Data mining is the process of analyzing hidden
patterns of data from different perspectives and
categorizing them into useful information
(techopedia.org, 2011)
Three sources
• databases (query languages, OLAP), statistics
(data analysis), artificial intelligence (machine
learning)
P. Berka, 2012
KDD Tasks
(Klosgen, Zytkow, 1997)
• classification/prediction: the task is to
find knowledge applicable to automatically
process new examples
• description: the task is to find dominant
structure or relationships
• search for "nuggets": the task is to find
partial novel and surprising knowledge
(Chapman et al., 2000)
• data description and summarisation:
concise description of characteristics of
the data, typically in elementary and
aggregated form
• segmentation: separation of the data into
interesting and meaningful subgroups or
classes
• concept description: understandable
description of concepts or classes to gain
insight
• classification: build classification models
(sometimes called classifiers) which assign
the correct class label to previously unseen
and unlabeled objects
• prediction: similar to classification, but the
target attribute (class) is not a qualitative
discrete attribute but a continuous one.
Prediction also often deals with time
dependent concepts
• dependency analysis: describe significant
dependencies (or associations) between data
items or events
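The difference between classification and prediction can be sketched with a toy example. The data values and the nearest-neighbour rule below are illustrative assumptions, not taken from the slides: the same lookup returns a discrete class label in one case and a continuous value in the other.

```python
def nearest(train, x):
    """Return the (input, target) pair whose input is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))

# Classification: the target attribute is a discrete class label.
# (Hypothetical data: an income figure labelled "low" or "high".)
labeled = [(20, "low"), (35, "low"), (60, "high"), (80, "high")]

def classify(x):
    return nearest(labeled, x)[1]

# Prediction: the target attribute is continuous.
# (Hypothetical data: the same inputs with a numeric target.)
measured = [(20, 1.5), (35, 2.0), (60, 3.5), (80, 5.0)]

def predict(x):
    return nearest(measured, x)[1]
```

Calling `classify(30)` yields a label, while `predict(62)` yields a number; only the type of the target attribute differs.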
Managerial viewpoint
Managerial problem
1. Project team
2. Problem specification
3. Data acquisition
4. Method selection
5. Data preprocessing
6. Data mining
7. Interpretation
Knowledge for solving the problem
Data processing viewpoint
Application areas of KDD
• Segmentation and classification (clients of a bank
or insurance company),
• Credit Risk Assessment,
• Fraud detection,
• Prediction of stock market prices,
• Prediction of energy consumption,
• Intrusion detection,
• Churn Analysis (telco services providers, internet
providers),
• Microarray data analysis (molecular biology),
• Targeted marketing,
• Medical diagnosis,
• Market Basket Analysis.
Market basket analysis: data exploration
Collected data – content of market baskets in transactional form
Basket_id   Item_id
10011       152
10011       37
10012       1
10012       152
10012       785
10012       6
10013       10
10014       15
10014       811
...         ...
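A minimal sketch of how such transactional rows can be turned into one item set per basket, a common first step in data exploration. The rows are copied from the listing above, pairing the Basket_id and Item_id values in order (an assumption about the flattened slide); the function and variable names are hypothetical.

```python
from collections import defaultdict

# (Basket_id, Item_id) rows in transactional form.
rows = [
    (10011, 152), (10011, 37),
    (10012, 1), (10012, 152), (10012, 785), (10012, 6),
    (10013, 10),
    (10014, 15), (10014, 811),
]

def to_baskets(rows):
    """Group (basket_id, item_id) pairs into one set of items per basket."""
    baskets = defaultdict(set)
    for basket_id, item_id in rows:
        baskets[basket_id].add(item_id)
    return dict(baskets)

baskets = to_baskets(rows)

# A simple exploratory summary: number of distinct items per basket.
basket_sizes = {basket_id: len(items) for basket_id, items in baskets.items()}
```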
Market basket analysis: dependency analysis
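The figure for this slide is not preserved in the transcript. As an illustration only, dependency analysis on basket data is often expressed as association rules scored by support and confidence; the sketch below assumes that approach and uses the four example baskets from the data exploration slide. The function names are hypothetical.

```python
# Example baskets: basket id -> set of item ids.
baskets = {
    10011: {152, 37},
    10012: {1, 152, 785, 6},
    10013: {10},
    10014: {15, 811},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of the whole rule divided by support of its antecedent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base

sup_152 = support({152}, baskets)
conf_37_152 = confidence({37}, {152}, baskets)
```

With so few baskets the numbers carry no statistical weight; the point is only how a rule such as "37 implies 152" would be scored.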
Market basket analysis: classification
KDD Standards
1. Methodologies
(Marban et al., 2009)
5A
Developed in the mid-1990s by SPSS. The name is an
acronym for the steps performed:
• Assess – assess the requirements of the project,
• Access – access the available data,
• Analyze – perform the analyses,
• Act – turn knowledge into actions,
• Automate – deploy the models in an automatic way.
SEMMA
Developed in the mid-1990s by SAS:
• Sample the data by creating one or more data
tables,
• Explore the data by searching for relationships,
trends or anomalies,
• Modify the data by creating, selecting, and
transforming the variables,
• Model the relationships between input and output
variables by using various data mining techniques,
• Assess the quality of the models.
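The five SEMMA steps can be walked through on a toy data set. This is only a sketch under stated assumptions: the data (y = 2x + 1), the least-squares line used as the "model" and the error measure are all chosen here for illustration and are not part of the SEMMA definition.

```python
import random
from statistics import mean

# Hypothetical data: the output depends linearly on the input (y = 2x + 1).
data = [(x, 2 * x + 1) for x in range(20)]

# Sample: draw a training table; the remaining rows are held out.
rng = random.Random(0)
train = rng.sample(data, 12)
holdout = [row for row in data if row not in train]

# Explore: simple summaries that help spot ranges, trends or anomalies.
x_mean = mean(x for x, _ in train)
y_mean = mean(y for _, y in train)

# Modify: create a transformed variable (x centered on its mean).
centered = [(x - x_mean, y) for x, y in train]

# Model: least-squares line relating the input to the output.
slope = (sum(cx * (y - y_mean) for cx, y in centered)
         / sum(cx * cx for cx, _ in centered))
intercept = y_mean - slope * x_mean

# Assess: mean absolute error of the model on the held-out rows.
mae = mean(abs((slope * x + intercept) - y) for x, y in holdout)
```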
CRISP-DM
Currently a de-facto standard supported by most
data mining systems
2. Standards to describe models
Predictive Model Markup Language (PMML)
An XML-based standard developed by the Data Mining
Group (www.dmg.org) for describing data, data
transformations and the resulting models. Main parts
of a PMML document:
• Header
• Data Dictionary
• Data Transformations
• Model
<?xml version="1.0" ?>
<PMML version="4.0">
  <Header copyright="P.B." description="An example decision tree model."/>
  <DataDictionary numberOfFields="5">
    <DataField name="income" optype="categorical">
      <Value value="low"/>
      <Value value="high"/>
    </DataField>
    <DataField name="account" optype="categorical">
      <Value value="low"/>
      <Value value="medium"/>
      <Value value="high"/>
    </DataField>
    <DataField name="sex" optype="categorical">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="unemployed" optype="categorical">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
    <DataField name="loan" optype="categorical">
      <Value value="A"/>
      <Value value="n"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="loan approval decision tree">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="account"/>
      <MiningField name="sex"/>
      <MiningField name="unemployed"/>
      <MiningField name="loan" usageType="predicted"/>
    </MiningSchema>
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
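A minimal sketch of how the tree part of such a document could be read and applied to a record, using only Python's standard library. It is not a PMML implementation: it handles only the True and SimplePredicate "equal" elements used in the example, embeds a shortened copy of the tree without DataDictionary or MiningSchema, assumes no XML namespace (real PMML files usually declare one), and, as a simplifying assumption, returns the score of the deepest matching node when no child matches. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example tree (no namespace, tree nodes only).
PMML_DOC = """<PMML version="4.0">
  <TreeModel modelName="loan approval decision tree">
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

def node_matches(node, record):
    """Evaluate a node's own predicate against a record (dict of field values)."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "equal":
        return False  # predicate types outside this sketch never match
    return record.get(pred.get("field")) == pred.get("value")

def score(pmml_text, record):
    """Descend from the root to the deepest matching node and return its score."""
    node = ET.fromstring(pmml_text).find("TreeModel/Node")
    if node is None or not node_matches(node, record):
        return None
    while True:
        for child in node.findall("Node"):
            if node_matches(child, record):
                node = child
                break
        else:
            # No child matched: fall back to the current node's score.
            return node.get("score")
```

For example, `score(PMML_DOC, {"income": "low", "account": "low", "unemployed": "no"})` follows the path income=low, account=low, unemployed=no and returns the score of that leaf.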
3. Programming standards (API)
SQL/MM Data Mining
Standard interface that gives access to data
mining algorithms from relational databases
OLE DB for Data Mining
API developed by Microsoft
CREATE MINING MODEL CreditRisk
(
    CustomerId long key,
    Income text discrete,
    Account text discrete,
    Sex text discrete,
    Unemployed boolean discrete,
    Loan text discrete predict
)
USING [Microsoft Decision Tree]
Java Data Mining
Data Mining Systems
• cover the whole KDD process (from data
preprocessing to model evaluation),
• offer more data mining algorithms (than single-purpose
machine learning systems),
• focus on visualization (both in how the system
is used and in how data and results are presented
and interpreted).
System                  Vendor                  URL
SPM                     Salford Systems         www.salford-systems.com
Clementine              SPSS                    www-01.ibm.com/software/analytics/spss/products/modeler/
Enterprise Miner        SAS Institute           www.sas.com/technologies/analytics/datamining/miner/
GhostMiner              Fujitsu                 www.fqs.pl/business_intelligence/products/ghostminer
Intelligent Miner       IBM                     www-01.ibm.com/software/data/infosphere/warehouse/enterprise.html
KnowledgeStudio         Angoss                  www.angoss.com
Oracle Data Mining      Oracle                  www.oracle.com/us/products/database/options/data-mining/index.html
PolyAnalyst             Megaputer               www.megaputer.com/
Statistica Data Miner   StatSoft                www.statsoft.com/products/datamining-solutions/
LISp Miner              VŠE                     lispminer.vse.cz
RapidMiner              Rapid-I                 rapid-i.com/
Weka                    University of Waikato   www.cs.waikato.ac.nz/ml/weka/index.html
Weka
Rapid Miner
SAS Enterprise Miner
IBM SPSS Modeler (Clementine)