2013-05-22

DATA MINING
Concepts, Models and Methods. Part I

Paweł Lula
Department of Computational Systems, Cracow University of Economics
[email protected]

Outline
• Part I
  – Data mining approach
  – Types of data and the concept of similarity and distance
• Part II
  – Classification of research problems
  – Data mining models and methods
  – Software for data mining

Paweł Lula, Cracow University of Economics, Kragujevac, May 2013

DATA MINING APPROACH

Information deluge
"Never before in human history have our brains had to process as much information as they do today. We have a generation of people who I call computer suckers because they are spending so much time in front of a computer screen or on their mobile phone or BlackBerry."
Edward Hallowell, psychiatrist, The Sunday Times, December 13, 2009

Information overload
Information overload: a situation in which you get more information than you can deal with at one time and become tired and confused.

Flood of data
"Computers have promised us a fountain of wisdom but delivered a flood of data."
W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1992

Data mining definition
Data mining: the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine, Fall 1992, pp. 213–228, ISSN 0738-4602.
Database → Data mining process → Knowledge

Data mining definition
Data mining: the science of extracting useful information from large data sets or databases.
D. Hand, H. Mannila, P. Smyth, Principles of Data Mining.
MIT Press, Cambridge 2001
Database → Data mining process → Knowledge

Data mining definition
Data mining: the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.
Ellen Monk, Bret Wagner, Concepts in Enterprise Resource Planning, Thomson Course Technology, Boston 2006
Database → Data mining process → Knowledge → Decisions

Key properties of data mining approach
• data-based approach (data-driven approach):
  – models are based on data, not on theory,
  – huge databases and warehouses can be analyzed,
  – data mining methods belong to computational techniques
• outcomes: easy to understand and easy to use
• main field of application: business
• main goal: decision support

Data mining as an interdisciplinary field
Data mining draws on:
• Statistics
• High performance computing
• Visualization
• Mathematics
• Machine learning
• Databases
• Artificial intelligence

Data mining process
Gain knowledge about the process! Define the goal of analysis!
Stages: DATABASE, WAREHOUSE → DATA SET → MODEL → KNOWLEDGE
• Selection
• Transformation
• Model building
• Verification
• Evaluation
• Management
• Decision support

TYPES OF DATA AND THE CONCEPT OF SIMILARITY AND DISTANCE

Distance vs. similarity
• Distance – a measure which reflects how far from each other two objects are.
• Similarity – a measure which reflects how close to each other two objects are.
• Very often a transformation between distance and similarity exists.
• Examples of the transformation:
  – similarity = 1 / distance (defined only for distance > 0),
  – similarity = 1 − distance (for distances scaled to the interval [0, 1]),
  – similarity = max(distance) − distance.

The formal definition of distance
Let X be a set and x, y ∈ X. Then a function d(x, y) is called a distance if:
• d(x, y) ≥ 0,
• d(x, y) = d(y, x),
• d(x, x) = 0.
A distance function d(x, y) which also satisfies the condition:
• d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
is called a metric.

Datum and data
• Datum (plural: data):
  – something given,
  – a piece of information,
  – a single piece of information,
  – a fact or proposition used to draw a conclusion or make a decision.
• Data – a collection of facts.

Classification of data according to the type of values
• quantitative = numerical, number-based data
  – discrete values (integer values),
  – continuous values (real values).
• qualitative = non-numerical, word-based data
  – two-state data (logical data, True/False, Yes/No),
  – many-state data (e.g. color of eyes).
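The conditions in the formal definition of distance can be checked numerically on a finite sample of objects. The sketch below is only an illustration: the helper names (`is_distance`, `is_metric`, `to_similarity`) and the two example functions are assumptions made for this example, not part of the lecture. A check on a sample can refute the metric property but cannot prove it for the whole set X.

```python
from itertools import product


def is_distance(d, sample, tol=1e-12):
    """Check d(x, y) >= 0, d(x, y) = d(y, x) and d(x, x) = 0 on a finite sample."""
    for x, y in product(sample, repeat=2):
        if d(x, y) < 0 or abs(d(x, y) - d(y, x)) > tol:
            return False
    return all(abs(d(x, x)) <= tol for x in sample)


def is_metric(d, sample, tol=1e-12):
    """A distance that additionally satisfies the triangle inequality on the sample."""
    if not is_distance(d, sample, tol):
        return False
    return all(d(x, y) <= d(x, z) + d(z, y) + tol
               for x, y, z in product(sample, repeat=3))


def absolute(x, y):
    """Example distance (hypothetical choice): absolute difference."""
    return abs(x - y)


def squared(x, y):
    """Example distance (hypothetical choice): squared difference."""
    return (x - y) ** 2


def to_similarity(distance, max_distance):
    """One of the listed transformations: similarity = max(distance) - distance."""
    return max_distance - distance


sample = [0, 1, 2, 5]
print(is_metric(absolute, sample))   # True
print(is_distance(squared, sample))  # True
print(is_metric(squared, sample))    # False: squared(0, 2) = 4 > squared(0, 1) + squared(1, 2) = 2
```

The squared difference shows why the two notions are kept apart: it satisfies the three distance conditions but violates the triangle inequality, so it is a distance without being a metric.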
Classification of data according to their structure
• Simple types of data (one object represents one value)
• Complex types of data (one object represents many values)

Distance for quantitative data
• x, y – numbers
• dist(x, y) = |x – y|
• For example: dist(2, 6) = |2 – 6| = |-4| = 4

Distance for qualitative data
• Nominal values
X = {Kragujevac, Rome, London, New York}
Only equality can be tested:
  – Kragujevac = Rome? NO
  – Kragujevac ≠ Rome? YES
Example of distance:
  – dist(a, a) = 0
  – dist(a, b) = 1 for a ≠ b
We can also calculate distance based on additional knowledge:
distance by car(Kragujevac, Rome) = 1425 km

Distance for qualitative data
• Ordered values
X = {small, medium, big}
Operations: =, ≠, >, <
  – dist(small, small) = 0
  – dist(small, medium) < dist(small, big)
  – dist(small, medium) = dist(medium, big)? PROBLEM! The order alone does not tell us whether the steps between values are equal.

Types of complex data
• Matrices,
• Lists (sequences of elements),
• Records,
• Data frames (tables),
• Sets,
• Trees,
• Networks / Graphs,
• Texts (in natural languages).

Matrix
• a rectangular structure of elements,
• homogeneous,
• elements are arranged in rows and columns,
• the position of an element is described by indices.

Objects representation in matrices
[Figure: a matrix whose rows represent objects and whose columns represent features]

Vector
• A matrix with one row (a 1 × m matrix) is called a row vector.
• A matrix with one column (an m × 1 matrix) is called a column vector.
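The distances for simple values described above can be written down directly. This is a minimal sketch: the function names are illustrative, and `dist_ordinal` rests on the assumption that neighbouring categories are equally spaced, which is exactly the point marked as a PROBLEM for ordered values.

```python
def dist_quantitative(x, y):
    """Distance between two numbers: |x - y|."""
    return abs(x - y)


def dist_nominal(a, b):
    """Nominal values can only be compared for equality: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def dist_ordinal(a, b, order):
    """Rank difference between ordered values.

    ASSUMPTION: neighbouring categories are equally far apart,
    which the ordering itself does not guarantee.
    """
    return abs(order.index(a) - order.index(b))


sizes = ["small", "medium", "big"]
print(dist_quantitative(2, 6))                # 4
print(dist_nominal("Kragujevac", "Rome"))     # 1
print(dist_nominal("Rome", "Rome"))           # 0
print(dist_ordinal("small", "medium", sizes)) # 1
print(dist_ordinal("small", "big", sizes))    # 2
```

When additional knowledge is available, such as road distances between cities, a lookup table can replace the 0/1 rule for nominal values.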
Record
• a complex structure with fields,
• fields store values,
• fields are identified by names,
• a record is a heterogeneous structure.

Data frame
• a table-based structure,
• row = record,
• column = field in the record,
• data frame = vector of records.
Very popular in data analysis problems!

Objects as points
       X    Y    Z
  1    x1   y1   z1
  2    x2   y2   z2
  3    x3   y3   z3
  4    x4   y4   z4
  ...  ...  ...  ...
  N    xN   yN   zN

Distance between points
Assume that we have two points:
x = (x1, x2, ..., xn)
y = (y1, y2, ..., yn)
The distance can be calculated as:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$ (Manhattan distance)
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$ (Euclidean distance)

The curse of dimensionality
• The curse of dimensionality – problems caused by a huge number of dimensions (features)
• Questions:
  – Can distance be calculated? YES
  – Do dimensions have an interpretation? YES (features)
  – Can points be presented on a graph? NO
  – Which features are important? PROBLEM!
  – Which features have the strongest impact on the distance? PROBLEM!
  – Is it possible to order features according to their importance? PROBLEM!
• Solution: Principal Component Analysis

The goal of Principal Component Analysis
Data set → Transformation → New data set

Aspects of PCA
Aspect          Original data set                     New data set
Interpretation  easy                                  difficult
Importance      difficult to predict                  every subsequent variable has smaller importance
Correlation     variables are generally correlated    variables are uncorrelated

How to measure the importance of a feature (dimension)
The importance of the feature = the range of the feature, i.e. how widely its values are spread (PCA measures this spread by the variance).

The idea of the PCA
1. Find a point in the center of the data set (it is the origin of the new coordinate system),
2. define the first axis to maximize the importance of the new feature,
3. define the second axis which is perpendicular to the first,
4. ....
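The steps above can be sketched in a few lines for two-dimensional data. This is an illustrative sketch under stated assumptions, not the procedure used in the lecture: importance is measured by variance, the first axis is found by power iteration on the covariance matrix (one of several possible techniques), and the data points and function names are made up for the example.

```python
import math


def center(points):
    """Step 1: move the origin to the center (mean) of the data set."""
    n, dims = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dims)]
    return [[p[j] - mean[j] for j in range(dims)] for p in points]


def covariance(centered):
    """Sample covariance matrix of already centered data (divisor n - 1)."""
    n, dims = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def first_axis(cov, iterations=200):
    """Step 2: direction of largest variance, found by power iteration.

    ASSUMPTION: the covariance matrix is non-zero and the start vector
    is not perpendicular to the sought direction.
    """
    dims = len(cov)
    v = [1.0] + [0.5] * (dims - 1)  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D data set, stretched along the line y = x.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (2, 3), (3, 2)]
centered = center(points)
axis1 = first_axis(covariance(centered))
axis2 = [-axis1[1], axis1[0]]  # step 3: perpendicular axis (valid in 2-D only)

# New features: coordinates of every object in the new coordinate system.
scores = [[sum(p[j] * axis[j] for j in range(2)) for axis in (axis1, axis2)]
          for p in centered]
variances = [sum(s[k] ** 2 for s in scores) / (len(scores) - 1) for k in range(2)]

print(axis1)      # close to [0.7071, 0.7071]
print(variances)  # close to [2.0, 0.2]: the first new feature dominates
```

For this data the first new feature carries about 91% of the total variance (2.0 out of 2.2), which mirrors the observation that every subsequent component is less important. Library routines may use different conventions: for example, R's `princomp` divides by n rather than n − 1.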
PCA
    > pca <- princomp(iris[-5])
    > summary(pca)
    Importance of components:
                              Comp.1     Comp.2     Comp.3      Comp.4
    Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
    Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
    Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000

New features
    > pca$scores
                Comp.1       Comp.2       Comp.3        Comp.4
     [1,] -2.684125626 -0.319397247 -0.027914828  0.0022624371
     [2,] -2.714141687  0.177001225 -0.210464272  0.0990265503
     [3,] -2.888990569  0.144949426  0.017900256  0.0199683897
     [4,] -2.745342856  0.318298979  0.031559374 -0.0755758166
     [5,] -2.728716537 -0.326754513  0.090079241 -0.0612585926
     [6,] -2.280859633 -0.741330449  0.168677658 -0.0242008576
     [7,] -2.820537751  0.089461385  0.257892158 -0.0481431065
     [8,] -2.626144973 -0.163384960 -0.021879318 -0.0452978706
     [9,] -2.886382732  0.578311754  0.020759570 -0.0267447358
    [10,] -2.672755798  0.113774246 -0.197632725 -0.0562954013
    [11,] -2.506947091 -0.645068899 -0.075318009 -0.0150199245
    (first 11 of 150 rows shown)

The importance of new components
    > screeplot(pca)
[Figure: scree plot of the components]

New components
[Figure: New components]

Singular Value Decomposition
Expenditures (rows and columns both represent objects):
Person    Food   Books  Travel  Health
Janek     1300   200    25      500
Agata     1140   870    450     120
Wacek     900    30     2300    400
Krysia    890    700    500     0
Andrzej   2500   200    4500    200
Wojtek    700    0      0       3100
Jacek     1300   500    900     300
Zygmunt   5000   4000   0       100
Marysia   500    300    400     200
Teresa    300    300    300     300
Viola     2000   0      3400    2500

The goal of SVD
• definition of the new coordinate system,
• new dimensions form new features/components/latent variables,
• the new coordinate system is common for objects represented by rows and by columns,
• new features are not correlated,
• every subsequent feature has smaller importance,
• new features are hard to interpret.

SVD
[Two figure-only slides titled SVD]

List (sequence)
List – an ordered collection of:
• values,
• events,
• tasks,
• goods,
• cities,
• ...
A sentence is a sequence of words. A word is a sequence of letters.

Distance between sequences
• Editing operations:
  – Substitution – replacing one element in the sequence by another,
  – Deletion – removing a given element from the sequence,
  – Insertion – inserting a new element.

Distance between sequences
• Assumption: cost(substitution) = cost(deletion) = cost(insertion) = 1
• Edit distance between two sequences is the minimum number of editing operations required to change one sequence into another.
• Example: d(phone, bone) = 2
  phone → hone → bone

Distance between sequences
• Assumption: cost(substitution), cost(deletion), cost(insertion) are defined separately
• Edit distance between two sequences is the minimal total cost of a sequence of editing operations required to change one sequence into another.
• Example (sentences treated as sequences of words, with a substitution cost that depends on the words involved):
  dist("This building is big", "This building is huge") < dist("This building is big", "This building is small")

Tree
The best model for hierarchy representation
[Figure: example tree]

Distance between nodes
Distance based on the length of the path between nodes (values refer to the tree shown on the slide):
• dist(A, B) = 1
• dist(A, H) = 5
• dist(G, G) = 0

Similarity between classes
Let C0 be the most specific class that contains both C1 and C2.
Dekang Lin:
$$sim(C_1, C_2) = \frac{2 \cdot I(C_0)}{I(C_1) + I(C_2)} = \frac{2 \cdot \log P(C_0)}{\log P(C_1) + \log P(C_2)}$$
where $I(C) = -\log P(C)$ is the information content of class C and P(C) is the probability of class C.
A similarity measure based on information theory.

WordNet
WordNet – a lexical database for the English language. It contains more than 150,000 words.

Ontology
Ontology – a model of domain knowledge: a set of concepts within a domain and the relationships between pairs of concepts.
Ontology-based distance = distance between concepts.

Distance between trees – tree edit distance
• Editing operations:
  – Substitution/Relabel – changing the label of a node,
  – Deletion – removing a given node from the tree,
  – Insertion – inserting a new node.
• Costs of editing operations:
  – assume that cost(relabel), cost(deletion) and cost(insertion) are defined
• Assume that we have
  – two trees: T1 and T2,
  – the sequence of operations which turns T1 into T2 with minimal cost.
• The cost of this sequence is the tree edit distance.

Graph / Network
• Graph – a set of nodes (vertices) connected by edges (links).
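Sequence and tree edit distances rest on the same idea: the minimal total cost of substitutions, deletions and insertions. For the simpler sequence case this can be sketched with the classic dynamic-programming table. The function names are illustrative, and the word-level cost function `word_cost` with its value 0.2 is a hypothetical choice made only to reproduce the inequality from the "big / huge / small" example.

```python
def unit_cost(x, y):
    """Substitution cost 1 for different elements, 0 for identical ones."""
    return 0 if x == y else 1


def edit_distance(a, b, sub_cost=unit_cost, del_cost=1, ins_cost=1):
    """Minimal total cost of edit operations turning sequence a into sequence b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = cost of turning the first i elements of a into the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost,                           # delete a[i-1]
                d[i][j - 1] + ins_cost,                           # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # substitute or keep
            )
    return d[-1][-1]


def word_cost(x, y):
    """HYPOTHETICAL word-level cost: near-synonyms are cheap to substitute."""
    if x == y:
        return 0
    near_synonyms = {frozenset(("big", "huge"))}  # made-up example data
    return 0.2 if frozenset((x, y)) in near_synonyms else 1


print(edit_distance("phone", "bone"))  # 2

big = "This building is big".split()
huge = "This building is huge".split()
small = "This building is small".split()
print(edit_distance(big, huge, sub_cost=word_cost))   # 0.2
print(edit_distance(big, small, sub_cost=word_cost))  # 1
```

With unit costs the function returns the number of operations; passing a different `sub_cost` corresponds to the case where the costs of operations are defined separately. In practice the word-level costs could come from a lexical resource rather than a hand-written table.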
Network modelling
Network model – a formal representation of a group of real objects and relationships between them.
APPLICATION PERSPECTIVE:
• network,
• real objects,
• real relationships.
MATHEMATICAL PERSPECTIVE:
• graph,
• vertices,
• edges, arcs.

Examples of networks
• Web networks,
• Social networks – persons (organisations) and relationships between them,
• Communication networks (telephone networks, airline connections),
• Computer networks,
• Trade networks (export/import),
• Terrorist networks,
• ...

Similarity of nodes in the network
Types of node similarities:
• attribute-based similarity (based on the values of node attributes),
• taxonomy similarity (based on the type of nodes),
• relationship similarity (based on the connections between nodes).

Relationship similarity
• Two objects are similar if they have similar relationships with other objects.
[Figure: examples of similar objects and dissimilar objects]

Relationship dissimilarity measures
Network: four nodes A, B, C, D with weighted ties 1.7 (A, B), 6.3 (A, C), 7.1 (D, A) and 3 (D, C).
Adjacency matrix:
      A    B    C    D
A    0.0  1.7  6.3  0.0
B    0.0  0.0  0.0  0.0
C    0.0  0.0  0.0  0.0
D    7.1  0.0  3.0  0.0
Let $q_{us}$ denote the element of the adjacency matrix in row u and column s.
Euclidean-like dissimilarity:
$$d_1(u, v) = \sqrt{\sum_{s=1,\; s \neq u,v}^{n} \left[ (q_{us} - q_{vs})^2 + (q_{su} - q_{sv})^2 \right]}$$
Manhattan-like dissimilarity:
$$d_2(u, v) = \sum_{s=1,\; s \neq u,v}^{n} \left( |q_{us} - q_{vs}| + |q_{su} - q_{sv}| \right)$$

Distance between graphs
A graph can be transformed into another one by a finite sequence of graph edit operations, which may be defined differently in various algorithms. The graph edit distance (GED) is the cost of the least-cost sequence of edit operations.
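The two relationship dissimilarity measures can be tried out on the four-node adjacency matrix given above. This is a sketch under one assumption: the formulas compare the outgoing ties (rows) and incoming ties (columns) of nodes u and v over all other nodes s, skipping s = u and s = v. The function names are illustrative.

```python
import math

nodes = ["A", "B", "C", "D"]
# Adjacency matrix from the slide: q[u][s] is the entry in row u, column s.
q = [
    [0.0, 1.7, 6.3, 0.0],  # A
    [0.0, 0.0, 0.0, 0.0],  # B
    [0.0, 0.0, 0.0, 0.0],  # C
    [7.1, 0.0, 3.0, 0.0],  # D
]


def tie_differences(q, u, v):
    """Yield q_us - q_vs and q_su - q_sv for every node s other than u and v."""
    for s in range(len(q)):
        if s in (u, v):
            continue
        yield q[u][s] - q[v][s]  # row entries of u and v
        yield q[s][u] - q[s][v]  # column entries of u and v


def euclidean_like(q, u, v):
    """Square root of the sum of squared tie differences."""
    return math.sqrt(sum(d * d for d in tie_differences(q, u, v)))


def manhattan_like(q, u, v):
    """Sum of absolute tie differences."""
    return sum(abs(d) for d in tie_differences(q, u, v))


b, c = nodes.index("B"), nodes.index("C")
print(manhattan_like(q, b, c))  # about 7.6 = |1.7 - 6.3| + |0 - 3|
print(euclidean_like(q, b, c))  # about 5.49 = sqrt(4.6**2 + 3**2)
```

In the matrix, B and C have no non-zero entries in their own rows and both have entries in the rows of A and D, so their dissimilarity comes only from how strongly A and D are tied to each of them.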
Set
Set – a collection of objects without any particular order.

Distance/similarity of sets
The Jaccard index (similarity measure):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard index (distance measure):
$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

Text
• Text – a representation of written language.
• Text can carry information, opinions or feelings.

Frequency matrix as a tool for text representation
• Pieces of information are represented by words,
• Stages:
  – cutting text into words,
  – calculation of word occurrence frequencies,
  – forming the frequency matrix (rows: words, columns: documents):
$$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{pmatrix}$$

Distance between words
Each word is a row of the frequency matrix: distance between words = distance between row vectors.

Distance between documents
Each document is a column of the frequency matrix: distance between documents = distance between column vectors.

Distance between words and documents
Words and documents are placed in a common coordinate system: SVD – Latent Semantic Analysis.

Part I
Data mining approach
Types of data and the concept of similarity and distance
THANK YOU!
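The set and text representations from the last slides can be illustrated with a short sketch. The helper names and the two tiny documents are assumptions made for this example. Words are obtained by simple whitespace splitting, which is a simplification of real text preprocessing, and the value returned for two empty sets is a convention chosen here.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


def frequency_matrix(documents):
    """Return (words, matrix) with rows = words and columns = documents."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    words = sorted(set().union(*counts))
    return words, [[c[w] for c in counts] for w in words]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


docs = ["big data", "data mining data"]  # hypothetical mini-corpus
words, matrix = frequency_matrix(docs)
columns = list(zip(*matrix))             # one column vector per document

print(words)                                # ['big', 'data', 'mining']
print(matrix)                               # [[1, 0], [1, 2], [0, 1]]
print(euclidean(matrix[0], matrix[2]))      # distance between the words 'big' and 'mining'
print(euclidean(columns[0], columns[1]))    # distance between the two documents
print(jaccard_similarity({"big", "data"}, {"data", "mining"}))  # 1 shared word out of 3
```

Any vector distance can be used in place of `euclidean`. Latent Semantic Analysis would additionally apply SVD to the frequency matrix before measuring distances.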