2013-05-22
DATA MINING
Concepts, Models and Methods.
Part I
Paweł Lula
Department of Computational Systems,
Cracow University of Economics
[email protected]
Outline
• Part I
– Data mining approach
– Types of data and the concept of similarity and distance
• Part II
– Classification of research problems,
– Data mining models and methods
– Software for data mining
Paweł Lula, Cracow University of Economics, Kragujevac, May 2013
DATA MINING APPROACH
Information deluge
Never before in human history have our brains had to process
as much information as they do today. We have a generation of
people who I call computer suckers because they are spending
so much time in front of a computer screen or on their mobile
phone or BlackBerry.
Edward Hallowell, Psychiatrist
The Sunday Times, December 13, 2009
Information overload
Information overload: a situation in which you get more
information than you can deal with at one time and become
tired and confused.
Flood of data
Computers have promised us a fountain of wisdom but
delivered a flood of data.
W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1992
Data mining definition
Data mining: the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data.
W. Frawley and G. Piatetsky-Shapiro and C. Matheus
Knowledge Discovery in Databases: An Overview
AI Magazine, Fall 1992: pp. 213–228. ISSN 0738-4602.
Database → data mining process → Knowledge
Data mining definition
Data mining: the science of extracting useful information from
large data sets or databases.
D. Hand, H. Mannila, P. Smyth
Principles of Data Mining. MIT Press,
Cambridge
2001
Database → data mining process → Knowledge
Data mining definition
Data mining: the statistical and logical analysis of large sets of
transaction data, looking for patterns that can aid decision
making.
Ellen Monk, Bret Wagner (2006).
Concepts in Enterprise Resource Planning,
Thomson Course Technology, Boston
2006
Database → data mining process → Knowledge → Decisions
Key properties of data mining approach
• data-based approach (data-driven approach):
– models are based on data, not on theory
– huge databases and warehouses can be analyzed,
– data mining methods belong to computational techniques
• outcomes: easy-to-understand and easy-to-use
• main field of application: business
• main goal: decision support
Data mining as an interdisciplinary field
Data mining draws on:
• Statistics,
• High-performance computing,
• Visualization,
• Mathematics,
• Machine learning,
• Databases,
• Artificial intelligence.
Data mining process
Gain knowledge about the process!
Define the goal of analysis!
DATABASE,
WAREHOUSE
• Selection
• Transformation
DATA SET
MODEL
KNOWLEDGE
• Model building
• Verification
• Evaluation
• Management
• Decision support
TYPES OF DATA AND THE CONCEPT OF SIMILARITY AND DISTANCE
Distance vs. similarity
• Distance – the measure which reflects how far from each
other two objects are.
• Similarity – the measure which reflects how close to each
other two objects are.
• A distance can often be transformed into a similarity, and vice versa.
• Examples of such transformations:
similarity = 1 / distance (for distance > 0)
similarity = 1 - distance (for distances scaled to [0, 1])
similarity = max(distance) - distance
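The three transformations above can be sketched in Python as follows. This is only an illustration: the function names are made up for this example, and returning infinity for a zero distance is an assumption, since the slide does not say how that case should be handled.

```python
def similarity_reciprocal(distance: float) -> float:
    """similarity = 1 / distance; a zero distance is mapped to infinity (assumption)."""
    if distance == 0:
        return float("inf")
    return 1.0 / distance


def similarity_complement(distance: float) -> float:
    """similarity = 1 - distance; meaningful when distances are scaled to [0, 1]."""
    return 1.0 - distance


def similarity_from_max(distance: float, max_distance: float) -> float:
    """similarity = max(distance) - distance; the farthest pair gets similarity 0."""
    return max_distance - distance


# Hypothetical distances between one object and three others.
distances = [0.5, 2.0, 4.0]
largest = max(distances)
sims = [similarity_from_max(d, largest) for d in distances]
```

All three keep the ordering reversed: the larger the distance, the smaller the similarity.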
The formal definition of distance
Let X be a set and x, y, z ∈ X. A function d(x, y) is called a
distance if:
• d(x, y) ≥ 0,
• d(x, y) = d(y, x),
• d(x, x) = 0.
A distance function d(x, y) which also satisfies the condition:
• d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
is called a metric.
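On a small finite set, the conditions above can be checked by brute force. The sketch below is illustrative only; the helper names and the sample points are assumptions made for this example.

```python
from itertools import product


def is_distance(points, d):
    """Check non-negativity, symmetry and d(x, x) = 0 on a finite set of points."""
    return all(
        d(x, y) >= 0 and d(x, y) == d(y, x) and d(x, x) == 0
        for x, y in product(points, repeat=2)
    )


def is_metric(points, d):
    """A distance that also satisfies the triangle inequality on the given points."""
    return is_distance(points, d) and all(
        d(x, y) <= d(x, z) + d(z, y)
        for x, y, z in product(points, repeat=3)
    )


def absolute(x, y):
    return abs(x - y)


def squared(x, y):
    return (x - y) ** 2


sample = [0, 1, 2, 5]  # hypothetical sample points
```

The squared difference is a distance in the sense above but not a metric: for the points 0, 1, 2 it gives d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2.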
Datum and data
• Datum (plural: data):
– something given,
– a piece of information,
– a single piece of information,
– a fact or proposition used to draw a conclusion or make a decision.
• Data – a collection of facts.
Classification of data according to the type of values
• quantitative = numerical, number-based
– discrete values (integer values),
– continuous values (real values).
• qualitative = not numerical, word-based data
– two-state data (logical data, True/False, Yes/No),
– many-state data (color of eyes).
Classification of data according to their structure
• Simple types of data (one object represents one value)
• Complex types of data (one object represents many values)
Distance for quantitative data
• x, y – numbers
• dist(x, y) = |x – y|
• For example:
dist(2, 6) = |2 – 6| = |-4| = 4
Distance for qualitative data
• Nominal values
X = {Kragujevac, Rome, London, New York}
Kragujevac = Rome → NO
Kragujevac ≠ Rome → YES
Example of distance:
dist(a, a) = 0
dist(a, b) = 1 for a ≠ b
Distance can also be calculated from additional knowledge:
distance by car (Kragujevac, Rome) = 1425 km
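Both ideas, the 0/1 distance and a distance taken from additional knowledge, can be sketched as below. The lookup table holds only the one value given on the slide; the names `ROAD_KM` and `road_distance` are made up for this example.

```python
def nominal_distance(a, b) -> int:
    """0/1 distance for nominal values: equal values are at distance 0, all others at 1."""
    return 0 if a == b else 1


# Additional knowledge: road distances in km, keyed by unordered pairs of cities.
ROAD_KM = {frozenset({"Kragujevac", "Rome"}): 1425}


def road_distance(a: str, b: str) -> int:
    """Look up the road distance; raises KeyError for pairs that are not in the table."""
    if a == b:
        return 0
    return ROAD_KM[frozenset({a, b})]
```

Using an unordered pair as the key makes the lookup symmetric, as the definition of distance requires.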
Distance for qualitative data
• Ordered values
X = {small, medium, big}
Operations: =, ≠, >, <
dist(small, medium) < dist(small, big)
dist(small, small) = 0
dist(small, medium) = dist(medium, big)? PROBLEM! The order alone does not say whether the steps are equal.
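One common workaround is to assign ranks to the ordered values and take the difference of ranks. The sketch below does this; note that the choice of ranks is an assumption, and equally spaced ranks silently answer the problematic question above with "yes".

```python
# Equally spaced ranks: an assumption, not something the data itself provides.
RANKS = {"small": 0, "medium": 1, "big": 2}


def ordinal_distance(a: str, b: str, ranks=RANKS) -> int:
    """Distance between ordered values as the absolute difference of their ranks."""
    return abs(ranks[a] - ranks[b])


# A different, equally arbitrary spacing changes the answer to the problem above.
UNEVEN_RANKS = {"small": 0, "medium": 1, "big": 5}
```

With `RANKS` the two steps are equal; with `UNEVEN_RANKS` they are not, although both respect the order small < medium < big.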
Types of complex data
• Matrices,
• Lists (sequences of elements),
• Records,
• Data frames (tables),
• Sets,
• Trees,
• Networks / Graphs,
• Texts (in natural languages).
Matrix
• a rectangular structure of elements,
• homogeneous,
• elements are arranged in rows and columns,
• the position of an element is described by indices.
Object representation in matrices
Rows represent objects; columns represent features.
Vector
• A matrix with one row (a 1 × m matrix) is called a row vector.
• A matrix with one column (an m × 1 matrix) is called a column
vector.
Record
• a complex structure with fields,
• fields store values,
• fields are identified by names,
• a record is a heterogeneous structure.
Data frame
• a table-based structure,
• row = record,
• column = field in the record,
• data frame = vector of records.
Very popular in data analysis problems!
Objects as points
Object   X     Y     Z
1        x1    y1    z1
2        x2    y2    z2
3        x3    y3    z3
4        x4    y4    z4
...      ...   ...   ...
N        xN    yN    zN
Distance between points
Assume that we have two points:
x(x1, x2, ..., xn)
y(y1, y2, ..., yn)
The distance can be calculated as:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$ (Manhattan distance)
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$ (Euclidean distance)
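The two formulas for the distance between points translate directly into code. A minimal sketch, with function names chosen for this example:

```python
import math


def manhattan(x, y) -> float:
    """Sum of absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y) -> float:
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For the points (0, 0) and (3, 4) the first function gives 7 and the second gives 5.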
The curse of dimensionality
• The curse of dimensionality – problems caused by a huge number of dimensions (features)
• Questions:
– Can distance be calculated? → YES
– Do dimensions have an interpretation? → YES (features)
– Can points be presented on a graph? → NO
– Which features are important? → PROBLEM!
– Which features have the strongest impact on the distance? → PROBLEM!
– Is it possible to order features according to their importance? → PROBLEM!
• Solution: Principal Component Analysis
The goal of Principal Component Analysis
Data set → Transformation → New data set
Aspects of PCA
Aspect | Original data set | New data set
Interpretation | easy | difficult
Importance | the importance of variables is difficult to predict | every subsequent variable has smaller importance
Correlation | variables are generally correlated | variables are uncorrelated
How to measure the importance of a feature (dimension)
The importance of the feature = the range (spread) of the feature; PCA measures this spread by the variance.
The idea of the PCA
1. Find a point in the center of the data set (it is the origin of the new coordinate system),
2. define the first axis to maximize the importance of the new feature,
3. define the second axis, which is perpendicular to the first,
4. ....
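For two-dimensional data the steps above can be followed by hand. The sketch below is a simplified illustration, not the algorithm behind `princomp`: it measures importance by variance, finds the first axis with the closed-form rotation angle for a 2 × 2 covariance matrix, and uses made-up sample points.

```python
import math


def pca_2d(points):
    """Return (first_axis, second_axis, scores) for a list of (x, y) points."""
    n = len(points)
    # Step 1: the center of the data set becomes the origin of the new system.
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Variances and covariance of the centered data.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Step 2: direction of maximal variance, tan(2 * theta) = 2 * sxy / (sxx - syy).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    # Step 3: the second axis is perpendicular to the first.
    second = (-math.sin(theta), math.cos(theta))
    # Coordinates of every point in the new system (the new features).
    scores = [
        (x * first[0] + y * first[1], x * second[0] + y * second[1])
        for x, y in centered
    ]
    return first, second, scores


# Hypothetical data lying exactly on the line y = x.
first_axis, second_axis, new_features = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For these points the first axis runs along the diagonal and the second new feature is zero for every point, so one dimension carries all the variation.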
PCA
> pca <- princomp(iris[-5])
> summary(pca)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000
>
New features
> pca$scores
            Comp.1       Comp.2       Comp.3        Comp.4
 [1,] -2.684125626 -0.319397247 -0.027914828  0.0022624371
 [2,] -2.714141687  0.177001225 -0.210464272  0.0990265503
 [3,] -2.888990569  0.144949426  0.017900256  0.0199683897
 [4,] -2.745342856  0.318298979  0.031559374 -0.0755758166
 [5,] -2.728716537 -0.326754513  0.090079241 -0.0612585926
 [6,] -2.280859633 -0.741330449  0.168677658 -0.0242008576
 [7,] -2.820537751  0.089461385  0.257892158 -0.0481431065
 [8,] -2.626144973 -0.163384960 -0.021879318 -0.0452978706
 [9,] -2.886382732  0.578311754  0.020759570 -0.0267447358
[10,] -2.672755798  0.113774246 -0.197632725 -0.0562954013
[11,] -2.506947091 -0.645068899 -0.075318009 -0.0150199245
The importance of new components
> screeplot(pca)
New components
Singular Value Decomposition
Expenditures:
object    Food/Zywnosc  Books/Ksiazki  Travels/Podroze  Health/Zdrowie
Janek             1300            200               25             500
Agata             1140            870              450             120
Wacek              900             30             2300             400
Krysia             890            700              500               0
Andrzej           2500            200             4500             200
Wojtek             700              0                0            3100
Jacek             1300            500              900             300
Zygmunt           5000           4000                0             100
Marysia            500            300              400             200
Teresa             300            300              300             300
Viola             2000              0             3400            2500
The goal of SVD
• definition of the new coordinate system,
• new dimensions form new features/components/latent
variables,
• new coordinate system is common for objects represented by
rows and by columns,
• new features are not correlated,
• every subsequent feature has smaller importance,
• new features are hard to interpret.
SVD
List (sequence)
List – an ordered collection of:
• values,
• events,
• tasks,
• goods,
• cities,
• ...
The sentence is a sequence of words.
The word is a sequence of letters.
Distance between sequences
• Editing operations:
– Substitution – replacing one element in the sequence by another,
– Deletion – removing a given element from the sequence,
– Insertion – inserting a new element.
Distance between sequences
• Assumption:
cost(substitution) = cost(deletion) = cost(insertion) = 1
• Edit distance between two sequences is the minimum
number of editing operations required to change one
sequence into another.
• Example:
d(phone, bone) = 2
phone → hone → bone
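With unit costs, the edit distance can be computed by dynamic programming (the usual Levenshtein scheme). The sketch below is one possible implementation; the function name is chosen for this example.

```python
def edit_distance(source, target) -> int:
    """Minimum number of substitutions, deletions and insertions turning source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # turning source[:i] into an empty sequence takes i deletions
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute s by t, free if they are equal
            ))
        previous = current
    return previous[-1]
```

The function works on any sequences, so it can compare words letter by letter or sentences word by word.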
Distance between sequences
• Assumption:
cost(substitution), cost(deletion), cost(insertion) are defined separately
• Edit distance between two sequences is the cost of the cheapest
sequence of editing operations required to change one sequence into
another.
• Example:
dist(“This building is big”, “This building is huge”) < dist(“This building is
big”, “This building is small”)
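The sentence example can be reproduced with a weighted edit distance over words in which the substitution cost depends on how close the two words are in meaning. In the sketch below the cost table is purely hypothetical (0.2 for big/huge, 1.0 for big/small); in practice such costs might come from a thesaurus or an ontology.

```python
def weighted_edit_distance(source, target, sub_cost, del_cost=1.0, ins_cost=1.0) -> float:
    """Cost of the cheapest sequence of edit operations turning source into target."""
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * del_cost]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + del_cost,                                 # delete s
                current[j - 1] + ins_cost,                              # insert t
                previous[j - 1] + (0.0 if s == t else sub_cost(s, t)),  # substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical substitution costs reflecting closeness of meaning.
WORD_COST = {
    frozenset({"big", "huge"}): 0.2,
    frozenset({"big", "small"}): 1.0,
}


def word_substitution_cost(a: str, b: str) -> float:
    """Look up the cost of replacing one word by another; unknown pairs cost 1.0."""
    return WORD_COST.get(frozenset({a, b}), 1.0)


base = "This building is big".split()
near = weighted_edit_distance(base, "This building is huge".split(), word_substitution_cost)
far = weighted_edit_distance(base, "This building is small".split(), word_substitution_cost)
```

Under these assumed costs `near` is smaller than `far`, which matches the inequality in the example.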
Tree
The best model for hierarchy representation
Distance between nodes
Distance based on the length of the path between nodes
dist(A, B) = 1
dist(A, H) = 5
dist(G, G) = 0
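The path-length distance can be computed with a breadth-first search. The tree drawn on the original slide is not available here, so the edge list below is a hypothetical tree chosen only so that it reproduces the three values above.

```python
from collections import deque

# Hypothetical tree: A-B, and the chain A-C-D-E-G-H.
EDGES = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "G"), ("G", "H")]


def build_adjacency(edges):
    """Undirected adjacency lists built from a list of edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def path_length(adjacency, start, goal) -> int:
    """Number of edges on the shortest path between two nodes (breadth-first search)."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError("nodes are not connected")


tree = build_adjacency(EDGES)
```

In a tree there is exactly one path between any two nodes, so the shortest path found by the search is that path.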
Similarity between classes
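C0 – the most specific class that contains both C1 and C2.
Dekang Lin:
$$\mathrm{sim}(C_1, C_2) = \frac{2 \cdot I(C_0)}{I(C_1) + I(C_2)}$$
$$\mathrm{sim}(C_1, C_2) = \frac{2 \cdot \log P(C_0)}{\log P(C_1) + \log P(C_2)}$$
where I(C) = −log P(C) is the information content of class C.
Similarity based on information theory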
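Lin's measure needs only three class probabilities. The sketch below assumes these probabilities are already known (for example estimated from corpus frequencies); the function name and the numbers in the usage note are illustrative.

```python
import math


def lin_similarity(p_common: float, p_first: float, p_second: float) -> float:
    """2 * log P(C0) / (log P(C1) + log P(C2)), where C0 subsumes both C1 and C2."""
    for p in (p_common, p_first, p_second):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    denominator = math.log(p_first) + math.log(p_second)
    if denominator == 0.0:
        raise ValueError("both classes have probability 1; similarity is undefined")
    return 2.0 * math.log(p_common) / denominator
```

With P(C0) = 0.1 and P(C1) = P(C2) = 0.01 the similarity is 0.5; if the common class is as specific as the classes themselves, the similarity reaches 1.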
WordNet
WordNet – a lexical database
for the English language.
It contains more than 150,000 words.
Ontology
Ontology - a model of domain knowledge. A set of concepts within a domain,
and the relationships between pairs of concepts.
Ontology-based distance = distance between concepts.
Distance between trees – tree edit distance
• Editing operations:
– Substitution/Relabel – changing the label of a node,
– Deletion – removing a given node from the tree,
– Insertion – inserting a new node.
• Cost for editing operations:
– assume that cost(relabel), cost(deletion) and cost(insertion) are
defined
• Assume that we have
– two trees: T1 and T2
– the sequence of operations which turns T1 into T2 with minimal cost
• The cost of this sequence is the tree edit distance.
Graph / Network
• Graph – a set of nodes (vertices) connected by edges (links).
Network modelling
Network model – a formal representation of a group of real
objects and relationships between them.
APPLICATION PERSPECTIVE:
• network,
• real objects,
• real relationships.
MATHEMATICAL PERSPECTIVE:
• graph,
• vertices,
• edges, arcs.
Examples of networks
• Web networks,
• Social networks – persons (organisations) and relationships
between them,
• Communication networks (phone networks, airline connections),
• Computer networks,
• Trade networks (export/import),
• Terrorist networks,
• ...
Similarity of nodes in the network
Types of node similarities
• attribute-based similarity (based on the values of node attributes) ,
• taxonomy similarity (based on the type of nodes)
• relationship similarity (based on the connections between nodes).
Relationship similarity
• Two objects are similar if they have similar relationships with
other objects.
[Figure: examples of similar objects and dissimilar objects]
Relationships dissimilarity measures
Network: weighted links A → B (1.7), A → C (6.3), D → A (7.1), D → C (3.0)
Adjacency matrix (rows: source node, columns: target node):
     A    B    C    D
A   0.0  1.7  6.3  0.0
B   0.0  0.0  0.0  0.0
C   0.0  0.0  0.0  0.0
D   7.1  0.0  3.0  0.0
Euclidean-like dissimilarity:
$$d_1(u, v) = \sqrt{\sum_{s=1,\ s \neq u, v}^{n} \left[ (q_{us} - q_{vs})^2 + (q_{su} - q_{sv})^2 \right]}$$
Manhattan-like dissimilarity:
$$d_2(u, v) = \sum_{s=1,\ s \neq u, v}^{n} \left( |q_{us} - q_{vs}| + |q_{su} - q_{sv}| \right)$$
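Both dissimilarities can be computed directly from the adjacency matrix. The sketch below uses the matrix from this slide and assumes that rows are source nodes and columns are target nodes; the function names are made up for this example.

```python
import math

NODES = ["A", "B", "C", "D"]
# Q[u][s] is the weight of the link from node u to node s (assumed orientation).
Q = [
    [0.0, 1.7, 6.3, 0.0],  # A
    [0.0, 0.0, 0.0, 0.0],  # B
    [0.0, 0.0, 0.0, 0.0],  # C
    [7.1, 0.0, 3.0, 0.0],  # D
]


def other_nodes(q, u, v):
    """Indices of all nodes except the two being compared."""
    return [s for s in range(len(q)) if s not in (u, v)]


def euclidean_like(q, u, v) -> float:
    """d1: compares outgoing links (q[u][s] vs q[v][s]) and incoming links (q[s][u] vs q[s][v])."""
    return math.sqrt(sum(
        (q[u][s] - q[v][s]) ** 2 + (q[s][u] - q[s][v]) ** 2
        for s in other_nodes(q, u, v)
    ))


def manhattan_like(q, u, v) -> float:
    """d2: the same comparison with absolute differences instead of squares."""
    return sum(
        abs(q[u][s] - q[v][s]) + abs(q[s][u] - q[s][v])
        for s in other_nodes(q, u, v)
    )


b, c = NODES.index("B"), NODES.index("C")
d1_bc = euclidean_like(Q, b, c)
d2_bc = manhattan_like(Q, b, c)
```

For B and C the only differences are in the incoming links from A (1.7 vs 6.3) and from D (0 vs 3), so d2(B, C) is about 7.6 and d1(B, C) is the square root of about 30.16.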
Distance between graphs
A graph can be transformed into another one by a finite sequence
of graph edit operations, which may be defined differently in
various algorithms. The graph edit distance (GED) is the cost of the
least-cost sequence of edit operations.
Set
Set – a collection of objects without any particular order.
Distance/similarity of sets
The Jaccard index (similarity measure):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard distance (distance measure):
$$d_J(A, B) = 1 - J(A, B)$$
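The Jaccard index is the size of the intersection divided by the size of the union, and the corresponding distance is one minus the index. A minimal sketch; treating two empty sets as identical (similarity 1) is a convention assumed here.

```python
def jaccard_similarity(a, b) -> float:
    """|A intersection B| / |A union B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b) -> float:
    """1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

For the sets {1, 2, 3} and {2, 3, 4} the intersection has 2 elements and the union has 4, so both the similarity and the distance equal 0.5.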
Text
• Text – representation of written language.
• Text can carry information, opinions or feelings.
Frequency matrix as a tool for text representation
• Pieces of information are represented by words,
• Stages:
– cutting text into words,
– calculation of word occurrence frequencies,
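– forming the frequency matrix (rows: words, columns: documents):
$$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{bmatrix}$$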
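The three stages can be sketched as below. The tokenization is deliberately naive (lower-casing and splitting on whitespace), and the two sample documents are made up for this example.

```python
from collections import Counter


def frequency_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word i in document j."""
    # Stages 1 and 2: cut each text into words and count the occurrences.
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Stage 3: one row per word, one column per document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


docs = ["data mining finds patterns", "mining data is data work"]
vocabulary, matrix = frequency_matrix(docs)
```

Real text would also need punctuation handling and possibly stemming or stop-word removal, which this sketch leaves out.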
Distance between words
Each word corresponds to one row of the frequency matrix (rows: words, columns: documents).
Distance between words = distance between the corresponding row vectors.
Distance between documents
Each document corresponds to one column of the frequency matrix (rows: words, columns: documents).
Distance between documents = distance between the corresponding column vectors.
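Distances between words and between documents use the same vector distance, applied to rows and to columns respectively. In the sketch below the small frequency matrix is hypothetical and the Euclidean distance is only one possible choice.

```python
import math


def euclidean(x, y) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def column(matrix, j):
    """Extract column j (one document) from a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


# Hypothetical frequency matrix: 3 words (rows) x 3 documents (columns).
X = [
    [1, 2, 0],
    [1, 0, 0],
    [0, 1, 3],
]

word_distance = euclidean(X[0], X[1])                        # rows 1 and 2
document_distance = euclidean(column(X, 0), column(X, 1))    # columns 1 and 2
```

Here the first two words differ only in the second document, so their distance is 2, while the first two documents differ by one occurrence in each word.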
Distance between words and documents
Words (rows) and documents (columns) of the frequency matrix are mapped into a common space:
SVD – Latent Semantic Analysis
Part I
Data mining approach
Types of data and the concept of similarity and distance
THANK YOU!