Hochschule Düsseldorf
University of Applied Sciences
Fachbereich Wirtschaftswissenschaften
Faculty of Business Studies
HSD W
Business Analytics (M.Sc.)
IT in Business Analytics
IT Applications in Business Analytics
SS2016 / Lecture 05 – Introduction to KNIME
Thomas Zeutschler
Let’s get started…
HSD
Faculty of Business Studies
Thomas Zeutschler
Associate Lecturer
SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME
2
Introduction
KNIME Analytics Platform
- KNIME is an open-source platform for analytical data modelling and processing (www.knime.org).
- KNIME was developed at the University of Konstanz in 2004-2006 and initially focused on pharmaceutical research.
- Today KNIME is a modular, highly scalable data processing platform that allows easy integration of different modules for:
  - data loading, processing, transformation
  - data analysis
  - visual data exploration
KNIME Analytics Platform – Workflows
- An analysis is defined by a graphical workflow.
- Interlinked nodes define the individual steps of a workflow.
- Hundreds of predefined nodes are available for various purposes:
  - data loading, processing, transformation and data delivery
  - data analysis and visualization
  - interaction with other tools (e.g. running an R script)
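To make the idea of interlinked nodes more concrete, here is a minimal Python sketch of a workflow as a chain of steps, where each step receives the table produced by the previous one. It is only an analogy, not how KNIME is implemented: the node functions, the column names and the three data rows are hypothetical and were made up for this illustration.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, float]
Table = List[Row]
Node = Callable[[Table], Table]


def reader_node(_: Table) -> Table:
    """Stand-in for a data loading node; returns a small made-up table."""
    return [
        {"weight_kg": 2.0, "sleep_h": 12.5},
        {"weight_kg": 60.0, "sleep_h": 8.0},
        {"weight_kg": 500.0, "sleep_h": 3.0},
    ]


def filter_node(table: Table) -> Table:
    """Stand-in for a filtering node: keep rows below 100 kg."""
    return [row for row in table if row["weight_kg"] < 100.0]


def mean_node(table: Table) -> Table:
    """Stand-in for an analysis node: one output row with the mean sleep time."""
    mean_sleep = sum(row["sleep_h"] for row in table) / len(table)
    return [{"mean_sleep_h": mean_sleep}]


def run_workflow(nodes: List[Node]) -> Table:
    """Execute the nodes in order, feeding each output into the next node."""
    return reduce(lambda table, node: node(table), nodes, [])


result = run_workflow([reader_node, filter_node, mean_node])
print(result)
```

Because every step has the same shape (table in, table out), steps can be added, removed or reordered without touching the others, which is the property that makes a node-based workflow easy to repeat and modify.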
KNIME Analytics Platform - Frontend
KNIME Analytics Platform – Real World Example
Knime – Installation
- Register, download and install Knime from http://knime.org
Knime – Let's get started…
Knime – First Data Analysis
"Sleep in Mammals: Ecological and Constitutional Correlates"
- Description: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
- Dataset: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
"Titanic Survival Status"
- Description: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
- Dataset: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
Knime - Essential Nodes
Problem: Too many nodes…
Solution 1: You can search directly in the Node Repository.
Solution 2: Search https://tech.knime.org/forum for your problem.
Reading Data
Knime - Essential Nodes
Data Preparation
- The input table is split into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.
- This node helps handle missing values found in cells of the input table.
- The node allows for row / column filtering according to certain criteria.
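The three preparation steps described above (partitioning, missing value handling, filtering) can be sketched in plain Python as follows. This is an illustrative sketch only, not the logic of the KNIME nodes themselves: the function names, the column names, the choice of mean imputation and the five data rows are assumptions made up for this example.

```python
import random
from statistics import mean
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def partition(rows: List[Row], train_fraction: float, seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Split the rows into two partitions (row-wise) after a seeded shuffle."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fill_missing_with_mean(rows: List[Row], column: str) -> List[Row]:
    """Replace missing cells (None) in one column by the mean of the known cells."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def filter_rows(rows: List[Row], column: str, minimum: float) -> List[Row]:
    """Keep only the rows whose value in `column` is at least `minimum`."""
    return [r for r in rows if r[column] is not None and r[column] >= minimum]


# Hypothetical example table with two missing cells.
rows: List[Row] = [
    {"lifespan": 4.5, "sleep": 12.0},
    {"lifespan": None, "sleep": 8.0},
    {"lifespan": 38.0, "sleep": 3.0},
    {"lifespan": 14.0, "sleep": None},
    {"lifespan": 7.5, "sleep": 10.0},
]

filled = fill_missing_with_mean(rows, "lifespan")
kept = filter_rows(filled, "lifespan", 5.0)
train, test = partition(filled, 0.8)
print(len(train), len(test), len(kept))
```

Fixing the seed makes the split repeatable, so a model trained on `train` is always evaluated on the same `test` rows.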
Knime - Essential Nodes
First Statistical Data Analysis
- Calculates summary statistics such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.
- Creates a cross table (also referred to as a contingency table or cross tab). It can be used to analyze the relation between two columns with categorical data and displays the frequency distribution of the categorical variables in a table.
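As a rough illustration of what these two analysis steps compute, the following Python sketch derives the summary statistics for one numeric column and a cross table for two categorical columns. The values and category labels are made up for this example, and the sample (n-1) variance is an assumption; the sketch does not claim to reproduce the exact output of the KNIME nodes.

```python
from collections import Counter
from statistics import mean, median, stdev, variance

# Hypothetical numeric column with one missing cell (None).
raw_values = [3.0, 8.0, None, 10.0, 12.0, 12.0]
values = [v for v in raw_values if v is not None]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "std_dev": stdev(values),      # sample standard deviation
    "variance": variance(values),  # sample variance
    "median": median(values),
    "sum": sum(values),
    "missing": sum(v is None for v in raw_values),
    "row_count": len(raw_values),
}

# Hypothetical pairs of categorical values (group, outcome), one pair per row.
records = [
    ("female", "yes"), ("female", "yes"),
    ("male", "no"), ("male", "yes"), ("male", "no"),
]
counts = Counter(records)
row_labels = sorted({group for group, _ in records})
col_labels = sorted({outcome for _, outcome in records})

# Cross table: frequency of every (group, outcome) combination.
cross_table = {
    group: {outcome: counts[(group, outcome)] for outcome in col_labels}
    for group in row_labels
}

print(summary)
print(cross_table)
```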
Knime – Data Mining Cheating…
http://scikit-learn.org/stable/tutorial/machine_learning_map/
Knime – Data Mining Cheating…
Algorithm: Linear regression
Pros
- Very fast to train and to predict
- Easy to understand the model
- Less prone to overfitting
Cons
- Unable to model complex relationships
- Unable to capture nonlinear relationships without first transforming the inputs
Good at
- The first look at a dataset
- Numerical data with lots of features

Algorithm: Decision trees
Pros
- Fast
- Robust to noise and missing values
- Accurate
Cons
- Complex trees are hard to interpret
- Duplication within the same sub-tree is possible
Good at
- Star classification
- Medical diagnosis
- Credit risk analysis

Algorithm: Neural networks
Pros
- Extremely powerful
- Can model even very complex relationships
- No need to understand the underlying data
- Almost works by "magic"
Cons
- Prone to overfitting
- Long training time
- Requires significant computing power for large datasets
- Model is essentially unreadable
Good at
- Images
- Video
- "Human-intelligence" type tasks like driving or flying
- Robotics

Algorithm: Support Vector Machines
Pros
- Can model complex, nonlinear relationships
- Robust to noise (because they maximize margins)
Cons
- Need to select a good kernel function
- Model parameters are difficult to interpret
- Sometimes numerical stability problems
- Requires significant memory and processing power
Good at
- Classifying proteins
- Text classification
- Image classification
- Handwriting recognition

Algorithm: K-Nearest Neighbors
Pros
- Simple
- Powerful
- No training involved ("lazy")
- Naturally handles multiclass classification and regression
Cons
- Expensive and slow to predict new instances
- Must define a meaningful distance function
- Performs poorly on high-dimensionality datasets
Good at
- Low-dimensional datasets
- Computer security: intrusion detection
- Fault detection in semiconductor manufacturing
- Video content retrieval
- Gene expression
- Protein-protein interaction
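Some of the K-Nearest Neighbors properties listed above ("lazy", needs a distance function, slow to predict) are easier to see in code. The following is a minimal Python sketch under stated assumptions: Euclidean distance, majority voting with k = 3, and a tiny made-up training set with hypothetical labels. It is not the implementation used by KNIME or any particular library.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Example = Tuple[Point, str]


def euclidean(a: Point, b: Point) -> float:
    """One possible distance function; choosing a meaningful one is up to the analyst."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    train: List[Example],
    query: Point,
    k: int = 3,
    distance: Callable[[Point, Point], float] = euclidean,
) -> str:
    """Classify `query` by majority vote among its k nearest training examples.

    There is no training step ("lazy"): all the work happens at prediction
    time, because every training example is compared with the query.
    """
    neighbors = sorted(train, key=lambda example: distance(example[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data with two classes.
train_data: List[Example] = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"), ((9.0, 7.5), "large"), ((8.5, 9.0), "large"),
]

print(knn_predict(train_data, (2.0, 2.0)))
print(knn_predict(train_data, (8.0, 8.5)))
```

Each prediction sorts the whole training set by distance, which is why prediction becomes expensive as the dataset grows.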
Knime – Data Mining Cheating…
http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheatsheet/
https://github.com/soulmachine/machine-learning-cheatsheet/raw/master/machine-learning-cheat-sheet.pdf
Exercise in Knime
First Exercise in Knime
"Sleep in Mammals: Ecological and Constitutional Correlates"
by Allison, T. and Cicchetti, D. (1976)
https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
…/sleep.csv
Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/
Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0
https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
First Exercise in Knime
"Sleep in Mammals: Ecological and Constitutional Correlates"
by Allison, T. and Cicchetti, D. (1976)
1. What is the average lifespan of the animals?
2. Which species lives the longest?
3. Can we have a histogram of lifespan?
4. What is the correlation between the lifespan and the size of an animal?
5. Can we have a full correlation matrix of all variables (see figure 1)?
6. Can we have a scatter plot of species size vs. danger factor (see figure 2)?
7. Split the dataset (train, test) and answer the following question: Can we predict "total-sleep"?
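For orientation, the calculations behind questions 1, 2, 4 and 7 can be sketched in plain Python before building the Knime workflow. This is a sketch under loud assumptions: the six rows, the species labels and the column names are made up and are NOT taken from sleep.csv, and a one-variable least-squares line is only one of many possible ways to predict total sleep.

```python
from math import sqrt
from statistics import mean
from typing import List, Tuple

# Made-up rows for illustration only; NOT taken from sleep.csv.
animals = [
    {"species": "A", "body_kg": 1.0, "lifespan": 5.0, "total_sleep": 14.0},
    {"species": "B", "body_kg": 10.0, "lifespan": 12.0, "total_sleep": 11.0},
    {"species": "C", "body_kg": 50.0, "lifespan": 20.0, "total_sleep": 9.0},
    {"species": "D", "body_kg": 200.0, "lifespan": 30.0, "total_sleep": 6.0},
    {"species": "E", "body_kg": 500.0, "lifespan": 40.0, "total_sleep": 4.0},
    {"species": "F", "body_kg": 60.0, "lifespan": 25.0, "total_sleep": 8.0},
]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares line y = intercept + slope * x; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


lifespans = [a["lifespan"] for a in animals]
sizes = [a["body_kg"] for a in animals]

average_lifespan = mean(lifespans)                                    # question 1
longest_lived = max(animals, key=lambda a: a["lifespan"])["species"]  # question 2
size_lifespan_r = pearson(sizes, lifespans)                           # question 4

# Question 7: fit on the first four rows (train), evaluate on the rest (test).
train, test = animals[:4], animals[4:]
slope, intercept = fit_line(
    [a["lifespan"] for a in train], [a["total_sleep"] for a in train]
)
mean_abs_error = mean(
    abs(intercept + slope * a["lifespan"] - a["total_sleep"]) for a in test
)

print(average_lifespan, longest_lived, size_lifespan_r, mean_abs_error)
```

Evaluating the error only on the held-out test rows is what makes the answer to "can we predict?" meaningful; an error measured on the training rows would look better than it is.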
Figure 1
Figure 2
Lecture Summary & Homework
Lessons Learned
- Knime is an easy path towards analytics.
- A workflow-oriented way of working dramatically simplifies the data analysis and modelling process.
- Combine CRISP-DM and Knime and you are able to solve complex analytical problems in a well-organized and repeatable format.
- First, try to understand which algorithm fits which problem, how the algorithms behave and what influences their behavior.
- Second (if you are willing), try to understand how the algorithms work.
Resources
Knime
Knime Forum: https://tech.knime.org/forum
Knime Training Video: https://www.youtube.com/user/KNIMETV
Data Mining Literature
Data Mining for the Masses:
http://docs.rapidminer.com/downloads/DataMiningForTheMasses.pdf
Machine Learning Cheat Sheet
https://github.com/soulmachine/machine-learning-cheatsheet/raw/master/machine-learning-cheat-sheet.pdf
Get Prepared (Homework)
Homework: Titanic Survival Status
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
- Answer the following question: "What was the probability of survival per class (1, 2, 3) and sex (male, female)?"
- Create a Knime workflow that answers the question based on the original Titanic data set.
- Submit your results as a Knime archive file to [email protected].
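The quantity asked for is a grouped rate: the number of survivors divided by the number of passengers within each (class, sex) group. The Python sketch below shows that calculation on a handful of made-up records. The tuples are hypothetical and are NOT taken from titanic3.xls, so the printed rates say nothing about the real data; the workflow still has to be built in Knime on the original data set.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (passenger class, sex, survived flag) -- made-up records for illustration only.
Passenger = Tuple[int, str, int]

passengers: List[Passenger] = [
    (1, "female", 1), (1, "female", 1), (1, "male", 0), (1, "male", 1),
    (3, "female", 1), (3, "female", 0), (3, "male", 0), (3, "male", 0),
]


def survival_rates(records: List[Passenger]) -> Dict[Tuple[int, str], float]:
    """Share of survivors within every (class, sex) group."""
    totals: Dict[Tuple[int, str], int] = defaultdict(int)
    survivors: Dict[Tuple[int, str], int] = defaultdict(int)
    for pclass, sex, survived in records:
        totals[(pclass, sex)] += 1
        survivors[(pclass, sex)] += survived
    return {group: survivors[group] / totals[group] for group in totals}


rates = survival_rates(passengers)
for (pclass, sex), rate in sorted(rates.items()):
    print(f"class {pclass}, {sex}: {rate:.2f}")
```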
Any Questions?