Business Intelligence and Process Modelling
F.W. Takes, Universiteit Leiden
Lecture 4: Data Mining for BI, Part 1

Visual Analytics (“last week’s leftovers” or: “how it’s not done”)

Visualization
- Visualization: mapping data properties to visual attributes
- Good visualization: “proper” mapping of data attributes to visual attributes and properly “balancing” the number of data properties and visual attributes used
- Bad visualization:
  - False data input
  - Misleading visual attributes
  - Abusing human background knowledge

“Unbiased” data (figure)

Rainbow colors (figure; source: http://poynter.org/uncategorized/224413)

Parts and sums (figure; source: https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts)

2D bars and icons (figure)

2D bars explained (figures; source: http://en.wikipedia.org/wiki/Misleading_graph)

3D pies (figures; source: http://en.wikipedia.org/wiki/Misleading_graph)

Color-coding geographic regions (figures; source: https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts)

Axis ranges (figures; source: https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts)

Who understands? (figure; source: http://www.multimension.com/project/upgrading-clinical-infographics/)

Recap
Business Intelligence: anything that aims at providing actionable information that can be used to support business decision making
- Business Analysis
- Business Analytics
- Visual Analytics (last week)
- Descriptive Analytics
- Predictive Analytics
- Data → Information → Knowledge
- Process Modelling (April and May)

Data Mining

Overview
- Data warehouse
- Data preparation
- Data Mining theory recap
- Data Mining case studies
- Data Mining evaluation techniques
- Data Mining in a service oriented architecture

Data warehouse
- Data warehouse: a copy of transaction data specifically structured for query and analysis (R. Kimball)
- Data warehouse: a system used for reporting and data analysis (Wikipedia)
- Data warehouse: a subject oriented, integrated, nonvolatile, timestamped collection of data designed to support management’s decision support needs (B. Inmon)

Data warehouse data
- In a data warehouse, data is organized around subjects (whereas information systems are organized around applications)
- Data is collected from heterogeneous sources and may already be aggregated (for example from an ERP or CRM system)
- Data is timestamped
- Data is nonvolatile

Data warehouse (figure; source: http://savis.vn/)

Transactional system vs. Data warehouse

  Transactional system            Data warehouse
  ------------------------------  ------------------------------
  Holds current data              Current and historic data
  Detailed data                   Detailed and aggregated data
  Volatile data                   Nonvolatile data
  High transaction frequency      Medium-low frequency
  Oriented on daily operations    Oriented on data analysis
  Support for daily decisions     Support for strategic decisions
  Many operational users          Few decision-making users
  Availability very important     Availability not so important
  Data storage focus              Information acquisition focus

Source: https://www.fer.unizg.hr/ (Business Intelligence)

Data mining
- Data mining: the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems (Wikipedia)
- Data mining: the practice of examining large pre-existing databases in order to generate new information (Oxford)
- Data mining: knowledge discovery from data (or information) in an automated way (DIKW pyramid)

DIKW Pyramid (figure)

DIKW Gaps (figure; source: ZPR FER Zagreb - Business Intelligence 20113)

Data mining . . .
- KDD: Knowledge Discovery in Databases
- Data archeology
- Information harvesting
- Knowledge extraction
- Machine learning
- Big data techniques?
- Data science?
- Business intelligence?

Data mining (figure; source: http://blogs.sas.com/content/subconsciousmusings/2014/08/22)

KDD
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.
Fayyad et al., Advances in knowledge discovery and data mining, MIT Press, 1996.

KDD (figure)

Why data mining now?
- Data flood / data explosion
- Cloud computing power
- Cheap storage
- Algorithms have matured
- Software is available
- Competition is killing

Data mining in businesses
- Process management
- Market basket analysis
- Marketing
- Customer loyalty
- Fraud detection
- Trend analysis

Data mining in practice
1 Learn about the problem domain
2 Data selection
3 Data cleaning, preprocessing and reduction
4 Data mining
5 Interpretation of information
6 Apply knowledge in domain

Data preprocessing
- Sampling
- Normalization
- Missing data
- Data conflicts
- Duplicate data
- Ambiguity in data

Guidelines for successful data mining
- The data must be available
- The data must be relevant, adequate and clean
- There must be a well-defined problem
- The problem should not be solvable by means of ordinary query or OLAP tools
- The results must be actionable

Successful data mining in businesses
- Use a small team with a strong internal integration and a loose management style
- Carry out a small pilot project before a major data mining project
- Identify a clear problem owner responsible for the project, e.g., from sales or marketing
- Try to realize a positive return on investment within 6 to 12 months
- Have top management back the project up

Break? (figure; source: http://xkcd.com/539/)

Categories of techniques
Machine learning
- Supervised learning: learning on labeled data
- Semi-supervised learning: partially labeled data
- Unsupervised learning: learning/mining on unlabeled data
- Reinforcement learning: agents learning to act in an environment
Data mining
- Predictive
- Descriptive

Supervised learning
- Regression
- Classification
- Bayesian Networks
- Support Vector Machines
- Link prediction

Example dataset
2 attributes and a Class attribute, 50 datapoints

  x    y    Class
  ---  ---  -----
  2    3    Blue
  3    2    Green
  3    4    Blue
  ...  ...  ...

Regression as a model (figure)

Classification: Regression
Linear Regression
- Given $n$ variables $x_1, \dots, x_n$
- Find weights $w_0, \dots, w_n$ such that $w_0 + w_1 x_1 + \dots + w_n x_n \geq 0$ for datapoints of one class (and $< 0$ for the other)
- Example: $n = 2$, $w_0 + w_1 x + w_2 y \geq 0$

Regression disclaimer (figure; source: http://en.wikipedia.org/wiki/Linear_regression)

Correlation
Pearson correlation $r \in [-1, 1]$ describing the extent to which the relation between two variables can be described in a linear way. The sign of $r$ gives the direction of the relation; $|r| = 1$ means a perfectly linear relation and $r = 0$ means no linear relation.
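
The Pearson correlation described above can be computed directly from its definition (covariance divided by the product of the standard deviations). The following is a minimal sketch in plain Python; the function name `pearson_r` and the small example sequences are illustrative assumptions and do not come from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example data, chosen only to show the range of r.
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # points on an increasing line
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # points on a decreasing line
print(pearson_r(xs, [1, 4, 9, 16, 25]))  # increasing, but not linear
```

The first two calls give the extreme values of r (positive and negative), while the quadratic example yields a value strictly between 0 and 1: the relation is strong, yet not perfectly linear.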

Correlation
How do we perceive correlations?
Study by University of Cambridge (gamification): http://guessthecorrelation.com

Classification: Decision trees
Decision Tree (d = 0)
    return MAJORITY-CLASS();

Decision Tree (d = 1)
    if (X > 5) return BLUE;
    else return GREEN; // oops!

Decision Tree (d = 2)
    if (X > 5) return BLUE;
    else if (Y > 3) return BLUE;
    else return GREEN;

Decision Tree (d = 3)
    if (X > 5) return BLUE;
    else if (Y > 3) return BLUE;
    else if (X > 2) return GREEN;
    else return BLUE;

Classification: Neural networks
Neural Networks
- Perceptron
- Multi-level
- Backpropagation
- Deep learning

Semi-supervised learning
- Semi-supervised learning: learning from both labeled and unlabeled data
- Smoothness assumption: data points close to each other are more likely to share the same label
- Cluster assumption: data tends to form discrete clusters, and points in the same cluster are more likely to share a label
- Lower dimensionality assumption: the effective dimensionality of the data is probably much lower than the number of input attributes

Semi-supervised learning (figures; source: http://en.wikipedia.org/wiki/Semi-supervised_learning)

Reinforcement learning
- States, actions, transitions and rewards
- Perceptions and beliefs
- Single-agent or multi-agent
- Goal: maximize reward
- Monte Carlo methods
- Temporal difference learning

Reinforcement learning (figure; source: https://www.cs.utexas.edu/~eladlieb/rl_interaction.png)

Google Deepmind
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis, Human-level control through deep reinforcement learning, Nature 518, 529–533, 2015. http://dx.doi.org/10.1038/nature14236

Google Deepmind
Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature 529, 484–489, 2016. http://dx.doi.org/10.1038/nature16961

Deep learning (figure)

Watson wins Jeopardy (video: https://www.youtube.com/watch?v=YgYSv2KSyWg)

AlphaGo beats human (figure)

Self-driving cars (figure)

Lab session February 24
- Continue with dashboard and data integration
- Error reporting in PHP and other handy tricks: http://liacs.leidenuniv.nl/ict
- Answer the BI questions
- Report issues and questions!

Extrapolating (figure; source: http://xkcd.com/605/)

Credits
Slides partially based on “From Data Mining to Knowledge Discovery: An Introduction” by Gregory Piatetsky-Shapiro (KDnuggets.com)
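
Supplement to the "Classification: Decision trees" slides: the trees of depth 1, 2 and 3 can be written as small runnable functions and compared on data. This is only a sketch. The thresholds and class names are those shown on the slides, but the function names, the `accuracy` helper and the choice of evaluation data are assumptions; `SAMPLE` contains just the three rows that are visible in the "Example dataset" slide, not the full 50 datapoints.

```python
# The three datapoints (x, y, class) visible on the "Example dataset" slide.
# The remaining 47 datapoints are not given in the slides.
SAMPLE = [(2, 3, "Blue"), (3, 2, "Green"), (3, 4, "Blue")]


def tree_depth_1(x, y):
    """Decision tree with a single split on x."""
    return "Blue" if x > 5 else "Green"


def tree_depth_2(x, y):
    """Adds a second split on y for points with x <= 5."""
    if x > 5:
        return "Blue"
    if y > 3:
        return "Blue"
    return "Green"


def tree_depth_3(x, y):
    """Adds a third split on x for points with x <= 5 and y <= 3."""
    if x > 5:
        return "Blue"
    if y > 3:
        return "Blue"
    if x > 2:
        return "Green"
    return "Blue"


def accuracy(tree, rows):
    """Fraction of rows whose class label the tree predicts correctly."""
    hits = sum(1 for x, y, label in rows if tree(x, y) == label)
    return hits / len(rows)


for tree in (tree_depth_1, tree_depth_2, tree_depth_3):
    print(tree.__name__, accuracy(tree, SAMPLE))
```

On these three rows each extra level classifies one more point correctly. With so few points this says nothing about how the trees would perform on unseen data, where deeper trees can overfit.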