Hochschule Düsseldorf University of Applied Sciences
Fachbereich Wirtschaftswissenschaften / Faculty of Business Studies (HSD W)
Business Analytics (M.Sc.)
IT in Business Analytics
IT Applications in Business Analytics
SS 2016 / Lecture 05 – Introduction to KNIME
Thomas Zeutschler

Let's get started…

HSD Faculty of Business Studies, Thomas Zeutschler, Associate Lecturer
SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME

Introduction

KNIME Analytics Platform
KNIME is an open source platform for analytical data modelling and processing (www.knime.org).
KNIME was developed at the University of Konstanz in 2004-2006 and initially focused on pharmaceutical research.
Today KNIME is a modular, highly scalable data processing platform that allows easy integration of different modules for:
- data loading, processing, transformation
- data analysis
- visual data exploration

KNIME Analytics Platform – Workflows
An analysis is defined by a graphical workflow.
Interlinked nodes define the individual steps of a workflow.
Hundreds of predefined nodes are available for various purposes:
- data loading, processing, transformation and data delivery
- data analysis and visualization
- interaction with other tools (e.g. running an R script)

KNIME Analytics Platform – Frontend

KNIME Analytics Platform – Real World Example

KNIME – Installation
Register, download and install KNIME from http://knime.org

KNIME – Let's get started…

KNIME – First Data Analysis
"Sleep in Mammals: Ecological and Constitutional Correlates"
Description: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
Dataset: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
"Titanic Survival Status"
Description: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
Dataset: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

KNIME – Essential Nodes
Problem: Too many nodes…
Solution 1: Search directly in the Node Repository.
Solution 2: Search https://tech.knime.org/forum for your problem.

Reading Data

KNIME – Essential Nodes: Data Preparation
- Partitioning: The input table is split row-wise into two partitions, e.g. train and test data. The two partitions are available at the two output ports.
- Missing Value: This node helps handle missing values found in cells of the input table.
- Row Filter / Column Filter: These nodes filter rows or columns according to certain criteria.

KNIME – Essential Nodes: First Statistical Data Analysis
- Statistics: Calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.
- Crosstab: Creates a cross table (also referred to as a contingency table or cross tab). It can be used to analyze the relation between two columns with categorical data, and it displays the frequency distribution of the categorical variables in a table.

KNIME – Data Mining Cheating…
http://scikit-learn.org/stable/tutorial/machine_learning_map/
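To make the summary statistics and cross table described for the first statistical analysis more concrete, here is a minimal plain-Python sketch of the same kind of computation. It is not KNIME code and does not claim to reproduce the nodes' exact behavior. The table, its column names (`group`, `flag`, `value`) and its rows are made up for illustration, and the choice of the sample standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical toy table (made-up rows, not taken from any lecture dataset).
rows = [
    {"group": "a", "flag": "yes", "value": 1.0},
    {"group": "a", "flag": "no", "value": 3.0},
    {"group": "b", "flag": "yes", "value": None},  # a missing cell
    {"group": "b", "flag": "yes", "value": 5.0},
]


def summarize(table, column):
    """Summary statistics for one numeric column, ignoring missing cells."""
    values = [r[column] for r in table if r[column] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "std_dev": stdev(values),  # sample standard deviation (assumption)
        "sum": sum(values),
        "missing": len(table) - len(values),
        "row_count": len(table),
    }


def crosstab(table, row_col, col_col):
    """Count how often each (row category, column category) pair occurs."""
    return Counter((r[row_col], r[col_col]) for r in table)


stats = summarize(rows, "value")
table = crosstab(rows, "group", "flag")
print(stats)
print(dict(table))
```

A pair that never occurs, such as `("b", "no")` here, simply has a count of zero in the `Counter`.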
KNIME – Data Mining Cheating…

Algorithm: Linear regression
Pros:
- Very fast (runs in constant time)
- Easy to understand the model
- Less prone to overfitting
Cons:
- Unable to model complex relationships
- Unable to capture nonlinear relationships without first transforming the inputs
Good at:
- The first look at a dataset
- Numerical data with lots of features

Algorithm: Decision trees
Pros:
- Fast
- Robust to noise and missing values
- Accurate
Cons:
- Complex trees are hard to interpret
- Duplication within the same sub-tree is possible
Good at:
- Star classification
- Medical diagnosis
- Credit risk analysis

Algorithm: Neural networks
Pros:
- Extremely powerful
- Can model even very complex relationships
- No need to understand the underlying data
- Almost works by "magic"
Cons:
- Prone to overfitting
- Long training time
- Requires significant computing power for large datasets
- Model is essentially unreadable
Good at:
- Images
- Video
- "Human-intelligence" type tasks like driving or flying
- Robotics

Algorithm: Support Vector Machines
Pros:
- Can model complex, nonlinear relationships
- Robust to noise (because they maximize margins)
Cons:
- Need to select a good kernel function
- Model parameters are difficult to interpret
- Sometimes numerical stability problems
- Requires significant memory and processing power
Good at:
- Classifying proteins
- Text classification
- Image classification
- Handwriting recognition

Algorithm: K-Nearest Neighbors
Pros:
- Simple
- Powerful
- No training involved ("lazy")
- Naturally handles multiclass classification and regression
Cons:
- Expensive and slow to predict new instances
- Must define a meaningful distance function
- Performs poorly on high-dimensionality datasets
Good at:
- Low-dimensional datasets
- Computer security: intrusion detection
- Fault detection in semiconductor manufacturing
- Video content retrieval
- Gene expression
- Protein-protein interaction

KNIME – Data Mining Cheating…
http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf

Exercise in KNIME

First Exercise in KNIME
"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)
Description: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
Dataset: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/
Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0

1. What is the average lifespan of the animals?
2. Which species lives longest?
3. Can we have a histogram of lifespan?
4. What is the correlation between lifespan and size of an animal?
5. Can we have a full correlation matrix of all variables (see figure 1)?
6. Can we have a scatter plot of species size vs. danger factor (see figure 2)?
7. Split the dataset (train, test) and answer the following question: Can we predict "total-sleep"?

[Figure 1: correlation matrix of all variables]
[Figure 2: scatter plot of species size vs. danger factor]

Lecture Summary & Homework

Lessons Learned
KNIME is an easy path towards analytics.
A workflow-oriented way of working dramatically simplifies the data analysis and modelling process.
Combine CRISP-DM and KNIME and you are able to solve complex analytical problems in a well-organized and repeatable format.
First, try to understand which algorithm fits which problem, how the algorithms behave, and what influences their behavior.
Second (if you are willing), try to understand how the algorithms work.

Resources
KNIME
- KNIME Forum: https://tech.knime.org/forum
- KNIME Training Video: https://www.youtube.com/user/KNIMETV
Data Mining Literature
- Data Mining for the Masses: http://docs.rapidminer.com/downloads/DataMiningForTheMasses.pdf
- Machine Learning Cheat Sheet: https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf
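Questions 4 and 7 of the first exercise (correlation, then a train/test split followed by a prediction) can be sketched outside KNIME in a few lines of plain Python. This is only an illustration under stated assumptions. The `(size, lifespan)` pairs are made up and deliberately exactly linear, they are not values from sleep.csv. The helper names `pearson`, `train_test_split` and `fit_line` are hypothetical, and the 70/30 split ratio is an arbitrary choice.

```python
import math
import random

# Made-up (size, lifespan) pairs with lifespan = 2 * size; NOT from sleep.csv.
rows = [(float(x), 2.0 * x) for x in range(1, 11)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def train_test_split(data, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test partitions."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(data):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((x - mx) * (y - my) for x, y in data) / sum(
        (x - mx) ** 2 for x, _ in data
    )
    return slope, my - slope * mx


r = pearson([x for x, _ in rows], [y for _, y in rows])
train, test = train_test_split(rows)
slope, intercept = fit_line(train)

# Evaluate on the held-out partition only.
errors = [abs((slope * x + intercept) - y) for x, y in test]
print(r, slope, intercept, max(errors))
```

On real data the relationship will not be exactly linear, so the error on the test partition is the number to look at when judging whether a target such as "total-sleep" can be predicted.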
Get Prepared (Homework)
Homework: Titanic Survival Status
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
Answer the following question: "What was the probability of survival per class (1, 2, 3) and sex (male, female)?"
Create a KNIME workflow that answers the question based on the original Titanic data set.
Submit your results as a KNIME archive file to [email protected].
Hint: [figure]

Any Questions?
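The homework question about survival probability per class and sex comes down to a grouped ratio: survivors divided by passengers within each (class, sex) group. The sketch below shows that computation in plain Python, as a way to sanity-check a KNIME workflow rather than replace it. The records are made up and are not rows from titanic3.xls. The tuple layout `(pclass, sex, survived)` and the function name `survival_rate_by_group` are assumptions for illustration.

```python
from collections import defaultdict

# Made-up passenger records: (pclass, sex, survived). NOT real Titanic data.
passengers = [
    (1, "female", 1),
    (1, "female", 1),
    (1, "male", 0),
    (1, "male", 1),
    (3, "female", 1),
    (3, "female", 0),
    (3, "male", 0),
    (3, "male", 0),
]


def survival_rate_by_group(records):
    """Share of survivors within each (class, sex) group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for pclass, sex, survived in records:
        totals[(pclass, sex)] += 1
        survivors[(pclass, sex)] += survived
    return {group: survivors[group] / totals[group] for group in totals}


rates = survival_rate_by_group(passengers)
for (pclass, sex), rate in sorted(rates.items()):
    print(f"class {pclass}, {sex}: {rate:.2f}")
```

Groups that do not occur in the input (class 2 in this toy data) are absent from the result rather than reported as zero.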