Download 6 slides per page - DataBase and Data Mining Group

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
DB
MG
Data mining fundamentals
DataBase and Data Mining Group of Politecnico di Torino
Data analysis
Most companies own huge databases
containing
operational data
textual documents
experiment results
Data mining fundamentals
DB
MG
These databases are a potential
source of useful information
Data Base and Data Mining Group of Politecnico di Torino
Elena Baralis
Politecnico di Torino
DB
MG
Data analysis
Data mining
Information is “hidden” in huge datasets
not immediately evident
human analysts need a large amount of time for the
analysis
most data is never analyzed at all
The Data Gap
information from available data
Extraction is automatic
2,000,000
Disk space (TB)
since 1995
1,500,000
1,000,000
0
1995
1996
1997
1998
1999
From R. Grossman, C. Kamath, V. Kumar, “Data
Mining for Scientific and Engineering Applications”
Extracted information is represented by means of
abstract models
3
DB
MG
Example: profiling
Query keywords, searched topics and objects
Integration of different dimensions
E.g., location, time of the day, user interest
Text mining
5
User registration, fidelity cards
Context-aware data analysis
Localization, interesting locations for users
DB
MG
Correlated objects for cross selling
Facebook, google+ profiles
Dynamic data: posts on blogs, FB, tweets
Maps and georeferenced data
Recommendation systems
Advertisements
Market basket analysis
Social network data
User/service profiling
Selected products, requested information, …
Search engines and portals
4
Example: profiling
Consumer behavior in e-commerce sites
denoted as pattern
Analyst
number
500,000
performed by appropriate algorithms
2,500,000
implicit
previously unknown
potentially useful
3,000,000
DB
MG
Non trivial extraction of
4,000,000
3,500,000
2
DB
MG
Brand reputation, sentiment analysis, topic trends
6
Elena Baralis
Politecnico di Torino
1
DB
MG
Data mining fundamentals
DataBase and Data Mining Group of Politecnico di Torino
Example: biological data
Clinical analysis
detecting the causes of a pathology
monitoring the effect of a therapy
⇒ diagnosis improvement and definition of new specific
therapies
expression level of genes in a cellular tissue
various types (mRNA, DNA)
Patient clinical records
CLID
personal and demographic data
exam results
Microarray
Biological analysis objectives
PATIENT shx013: shv060: shq077: shx009: shx014: shq082:
ID
49A34 45A9 52A28 4A34 61A31 99A6
IMAGE:740604
ISG20 || interferon-1.02
stimulated-2.34
gene 20kDa
1.44
0.57 -0.13
0.12
IMAGE:767176
TNFSF13 || tumor-0.52
necrosis -4.06
factor (ligand)
-0.29superfamily,
0.71member1.03
13 -0.67
IMAGE:366315
LOC93343 || **hypothetical
-0.25 -4.08
protein BC011840
0.06
0.13
0.08
0.06
IMAGE:235135
ITGA4 || integrin,-1.375
alpha 4 (antigen
-1.605 CD49D,
0.155alpha -0.015
4 subunit of0.035
VLA-4 receptor)
-0.035
shq083:
46A15
0.34
0.22
-0.08
0.505
shx008:
41A31
-0.51
-0.09
-0.05
-0.865
Bio-discovery
Textual data in public collections
heterogeneous formats, different objectives
scientific literature (PUBMed)
ontologies (Gene Ontology)
Pharmacogenesis
DB
MG
7
gene network discovery
analysis of multifactorial genetic pathologies
lab design of new drugs for genic therapies
DB
MG
Knowledge Discovery Process
8
Preprocessing
data cleaning
selection
preprocessing
preprocessing
data integration
transformation
data
selected
data
preprocessed
data
transformed
data
selected
data
preprocessed
data
data mining
interpretation
Without good quality data, no good quality
pattern
knowledge
DB
MG
10
DB
MG
Data mining origins
statistics, artificial intelligence (AI)
pattern recognition, machine
learning
Statistics,
database systems
AI
Traditional techniques are not
appropriate because of
significant data volume
large data dimensionality
heterogeneous and distributed
nature of data
DB
MG
11
Analysis techniques
Draws from
• reconciles data extracted
from different sources
• integrates metadata
• identifies and solves data
value conflicts
• manages redundancy
Real world data is “dirty”
pattern
KDD = Knowledge Discovery from Data
• reduces the effect of noise
• identifies or removes outliers
• solves inconsistencies
Descriptive methods
Machine Learning,
Pattern
Recognition
Predictive methods
Data Mining
Database
systems
From: P. Tan, M. Steinbach, V. Kumar,
“Introduction to Data Mining”
12
Extract interpretable models describing data
Example: client segmentation
DB
MG
Exploit some known variables to predict
unknown or future values of (other) variables
Example: “spam” email detection
13
Elena Baralis
Politecnico di Torino
2
DB
MG
Data mining fundamentals
DataBase and Data Mining Group of Politecnico di Torino
Classification
Classification
• Approaches
Objectives
training data
training data
model
classified data
DB
MG
14
DB
MG
15
Classification
• Requirements
Applications
– accuracy
– interpretability
– scalability
– noise and outlier
management
training data
detection of customer propension to leave a company
(churn or attrition)
fraud detection
classification of different pathology types
…
dati di training
model
unclassified data
DB
MG
modello
classified data
dati classificati
dati non classificati
16
DB
MG
Clustering
17
Clustering
• Approaches
Objectives
classified data
unclassified data
Classification
decision trees
bayesian classification
classification rules
neural networks
k-nearest neighbours
SVM
model
unclassified data
–
–
–
–
–
–
prediction of a class label
definition of an interpretable model of a given
phenomenon
– partitional (K-means)
detecting groups of similar data objects
identifying exceptions and outliers
– hierarchical
– density-based (DBSCAN)
– SOM
• Requirements
– scalability
– management of
– noise and outliers
– large dimensionality
– interpretability
DB
MG
18
DB
MG
19
Elena Baralis
Politecnico di Torino
3
DB
MG
Data mining fundamentals
DataBase and Data Mining Group of Politecnico di Torino
Association rules
Clustering
Applications
Objective
customer segmentation
clustering of documents containing similar information
grouping genes with similar expression pattern
…
Tickets at a supermarket
counter
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diapers, Milk
4
Beer, Bread, Diapers, Milk
5
Coke, Diapers, Milk
…
DB
MG
20
market basket analysis
cross-selling
shop layout or catalogue design
Tickets at a supermarket
counter
TID
Items
1
Bread, Coca Cola, Milk
2
Beer, Bread
Beer, Coca Cola, Diapers, Milk
4
Beer, Bread, Diapers, Milk
5
Coca Cola, Diapers, Milk
…
2% of transactions contains
both items
30% of transactions
containing diapers also
contain beer
21
Sequence mining
2% of transactions contains
both items
30% of transactions
containing diapers also
contain beer
temporal and spatial information are considered
example: sensor network data
Regression
ordering criteria on analyzed data are taken into
account
example: motif detection in proteins
Time series and geospatial data
Association rule
diapers ⇒ beer
3
…
Association rule
diapers ⇒ beer
Other data mining techniques
Applications
DB
MG
Association rules
extraction of frequent correlations or pattern from a
transactional database
prediction of a continuous value
example: prediction of stock quotes
Sensor network
Outlier detection
example: intrusion detection in network traffic
analysis
…
DB
MG
22
DB
MG
23
Open issues
Scalability to
huge data volumes
Data dimensionality
Complex data structures, heterogeneous data
formats
Data quality
Privacy preservation
Streaming data
DB
MG
24
Elena Baralis
Politecnico di Torino
4