Data Mining
Day One
November 5-6, 2015
Kristofer Still
Session 1
Session 2
Session 3
Session 4
Day One
• Overview
• The Data Mining Process
• Hands on examples
Day Two
• Case Study
• Data Mining for Unstructured Data
• Demos of other Helpful Data Mining Tools and Resources
Learning objectives
• Motivate you to approach data mining like any other managed
project or process.
• Gain a set of tools that provide a systematic process by which you
can understand the nature of your data and how to get the most
out of it
• Understand how to evaluate models and some ways to potentially
improve a model’s performance
Why R?
Not the answer for everyone
Pros and cons
Recent developments and trends
• Data is always involved
• Usually more data than people can keep track of
• Terabytes of data – now petabytes
– Example “A Million Model in Minutes”
• Data is more complex
Questions About Your Data
• How much data do I have and at what rate do I expect it
to grow?
• How is it stored?
• Is it secure and recoverable?
• What’s important?
• How can I convert data into insights?
Finding Insights
• What is the chance that an event will occur and what will be the
magnitude of that event?
• What patterns are there in my database and which are significant?
• How can I group and classify the entities in my data?
• What relationships exist in my data?
• Can I detect anomalies in my data?
• What do I expect to happen to measures over time?
Sample Data
• Example Database of Customers
What is Data Mining?
• “the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns
in data.” --- Fayyad
• “finding interesting structure (patterns, statistical models,
relationships) in data bases”.--- Fayyad, Chaduri and
• “a knowledge discovery process of extracting previously
unknown, actionable information from very large data
bases”--- Zornes
• “ a process that uses a variety of data analysis tools to
discover patterns and relationships in data that may be
used to make valid predictions.” ---Edelstein
What is Data Mining?
da·ta min·ing
noun: data mining; noun: datamining
the practice of examining data using various modeling techniques in
order to generate new information or insight, detect patterns and
relationships, and ultimately make valid predictions. -Shaffer (2014)
The process by which an organization seeks to utilize its data assets to
generate value for its stakeholders.-Still (2014) (see also 1.)
Statistics vs. Data Mining
• Statistics is part of data mining – ex. Determining the
signal from the noise, significance of findings (inference),
estimating probabilities. In statistics data is often
collected to answer a specific question.
• Data Mining – much broader, entire process of data
analysis, including data cleaning, preparation and
visualization. Data has typically been collected in some
Statistics vs. Data Mining
Statistics
• Particular model with specific parameters and assumptions about the
model errors
• In addition to accuracy, often equally concerned with interpretation
and range of results
• Generally not computationally intensive
Data Mining
• Models are flexible and often better suited for non-linear relationships
• Prediction accuracy is most important
• Computationally intensive
Data Mining Process Models
• Six Sigma (Define, Measure, Analyze, Improve, Control)
• KDD (Knowledge Discovery in Databases)
• SEMMA (Sample, Explore, Modify, Model, Assess)
• CRISP-DM (Cross Industry Standard Process for Data Mining)
Six Sigma Model - DMAIC
KDD Model
What do these have in common?
Business understanding
Understanding the data through exploration
Data preparation
Like most relationships things go well when needs are met.
“We have to protect our phoney, baloney jobs here, gentlemen!”
-Governor William J. Le Petomane, Blazing Saddles (1974)
Business Understanding Overview
• Assess situation
• Set goals
• Create Plan
Business Understanding – Assess
Current state/desired state
Players (governors, partners, gatekeepers, advocates)
Resources (hardware/software, data, expertise, time, budget)
Business Understanding – Set goals
• Business and data mining goal?
• Measuring business success?
• Measuring data mining success?
Business goals:
• Reduce customer churn by 5% in 6 months among customers
with a profit margin of 10% or more.
• Reduce wire fraud in the commercial bank by 10 percent
within three months of deploying new anomaly detection
algorithm, 15 percent within six months, and 25 percent
within one year.
Understanding your data
• Does your data have what it takes?
– Suitable?
– Sufficient?
– High information content?
– Challenges
• Quantity
• Veracity (surveys, social media/web, 3rd party source,
deception, temporal, missing)
• Measurement/Collection
• How are systems, databases, entities related
– IDs
– Attributes and dimensions
– Aggregation
Characterizing your data
Too little?
Too much?
Date/time considerations
Geographic considerations
Redundancy (duplication and naming)
Value labels (single system)
Change in definition or measurement
Operational changes or changes in external environment
Leaks from the future
Duplicate records
Invalid values
• Global v. Local
• Causes include
– Poor data quality / contamination
– Low quality measurements, malfunctioning equipment,
manual error
– Correct but exceptional data
Missing data
• No data for a field or entire record
• Why missing?
• All permissible values for a variable
• Conditional
– Influenced by other variable
– Influenced by business rules
Default values
• Usually related to missing or empty values, but could be
– E.g. 9999, 0, -1, >N
• What are the potential concerns if you treated them as valid?
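One way to see the concern is to compute a summary measure with and without the placeholder codes. The sketch below is only illustrative: the `ages` list and the `SENTINELS` set are hypothetical, and which codes count as defaults depends on the source system.

```python
from statistics import mean

# Hypothetical placeholder codes that a source system might store
# when the real value is unknown.
SENTINELS = {9999, -1}

# Hypothetical customer ages, two of which are default values.
ages = [34, 41, 9999, 29, -1, 38]

# Treating the defaults as valid badly distorts the mean.
naive_mean = mean(ages)

# Dropping them first gives a plausible figure.
cleaned = [a for a in ages if a not in SENTINELS]
cleaned_mean = mean(cleaned)

print(naive_mean, cleaned_mean)
```

A default such as 0 is harder to handle this way, because 0 can also be a legitimate value for many variables.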
• Inputs usually related to categorical inputs
• Target e.g. bankruptcy, medical studies, insurance, fraud
detection, payment, security
How to manage sparsity
• Inputs – transform the input
• Transform the data
– Sampling
– Introduce bias
Data Exploration
• Dimension
• Data types
• Summary measures
– Centrality - mean, median, mode
– Dispersion – range, variation, standard deviation
– Skewness and kurtosis
– Relationship – correlation
• Plots – box plots, histograms, pie charts, scatterplots, parallel
coordinates, heat maps
Data Types
• Categorical – data as named classes or levels of an attribute
– Nominal – differentiates between items and subjects based
on their names e.g. gender, race, style, form
– Ordinal – allows for a rank order but nothing can be said
about degree of difference between them. e.g. true/false
(binary), rankings, income or class
Data Types
• Numeric (continuous) – has numeric value and a natural order
– Interval – has interpretable differences but no true zero
and can’t be multiplied or divided e.g. Dates, temperature
(Yes, Kelvin would be an exception, but resist the urge to
raise your hand and out yourself as a BIG nerd.)
– Ratio – specifies “how much” (magnitude) or “how many”
(count) of something. Unlike interval, it has a non-arbitrary
zero point, so you can make comparisons like “twice as” e.g.
age, length, mass, elapsed time
Measures of Central Tendency
• Mean – the “average”
• Median – the “middle”
• Mode – the “most frequent”
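The three measures can be compared on a small list using Python's standard `statistics` module. The `scores` values are hypothetical and include one deliberate outlier to show how differently the mean and median react.

```python
from statistics import mean, median, mode

# Hypothetical scores; the last value is a deliberate outlier.
scores = [2, 3, 3, 4, 5, 6, 40]

print(mean(scores))    # the "average", pulled upward by the outlier
print(median(scores))  # the "middle" value of the sorted data
print(mode(scores))    # the "most frequent" value

# Dropping the outlier moves the mean far more than the median.
trimmed = scores[:-1]
print(mean(trimmed), median(trimmed))
```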
Measures of Central Tendency
Remove Outlier:
mean - 6.3
median - 4.5
range - 17
Measures of Dispersion
• Range – Max minus Min
• Variance – the average squared difference of the scores from the mean:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Standard deviation – the square root of the variance:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
• Variance vs. standard deviation?
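A minimal sketch of both measures, assuming the sample (n − 1) denominator. The function names and the `data` values are illustrative. The standard deviation is in the same units as the data, while the variance is in squared units.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Square root of the sample variance (same units as the data)."""
    return sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(sample_variance(data), sample_std(data))
```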
Two Types of Variation
• Common Cause
• Special Cause
Measures of Variability
Skewness and Kurtosis
• Skewness – measures how symmetric a distribution is:
$skewness = \frac{\sum (x - \bar{x})^3}{(n - 1) s^3}$
• Kurtosis – indicates how peaked or flat a
distribution is compared to a normal distribution:
$kurtosis = \frac{\sum (x - \bar{x})^4}{(n - 1) s^4}$
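The two shape measures can be computed directly from the definitions, with the (n − 1) denominator and the sample standard deviation s. This is a sketch on hypothetical data; statistical packages often apply additional small-sample corrections, so their output may differ slightly.

```python
from math import sqrt

def skewness_and_kurtosis(xs):
    """Return (skewness, kurtosis) using the (n - 1) * s**k denominators."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    skew = sum((x - x_bar) ** 3 for x in xs) / ((n - 1) * s ** 3)
    kurt = sum((x - x_bar) ** 4 for x in xs) / ((n - 1) * s ** 4)
    return skew, kurt

symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric
right_tailed = [1, 1, 1, 2, 10]  # hypothetical, long right tail

print(skewness_and_kurtosis(symmetric))     # skewness is zero
print(skewness_and_kurtosis(right_tailed))  # skewness is positive
```

With this definition a normal distribution has kurtosis close to 3 in large samples, which is the usual reference point for "peaked" versus "flat".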
Skewness and kurtosis
Graphical techniques
Humans are better at seeing things than reading and interpreting lists of numbers.
Therefore, graphical representations of your dataset can sometimes be the
shortest path to insight
Useful for:
– Identifying relationships and/or patterns
– Revealing interactions
– Diagnosing biases
– Showing where data is missing
– Identifying which predictors to use
– Indicating transformations or other operations to perform on the data prior
to modeling
– Detecting outliers
– Suggesting model(s) to use
• A histogram divides the levels of a variable into equal-sized
bins and then counts the number of points in the dataset
that belong in each bin.
• Great tool to summarize data. Can see center, spread,
as well as issues with skew, outliers, or bimodality.
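The binning step can be sketched in a few lines. The function name, the `ages` values and the choice of four bins are hypothetical, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

ages = [21, 22, 25, 31, 33, 34, 35, 47, 58, 60]  # hypothetical
counts = histogram_counts(ages, 4)

# A crude text histogram: one '#' per observation in the bin.
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```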
Box and Whiskers or Box Plot (Tukey)
The Box:
– Median – the line drawn inside the box
– Lower quartile – the number that 25% of the data lies below;
the median of the values between the minimum and the overall median
– Upper quartile – the number that 75% of the data lies below;
the median of the values between the overall median and the maximum
Box and Whiskers or Box Plot (Tukey)
The whiskers variation 1:
– Draw line from the top of the box (Q3) to the max and
the bottom of the box (Q1) to the minimum
Box and Whiskers or Box Plot (Tukey)
The whiskers variation 2:
– Calculate the interquartile range: IQR = Q3 – Q1
– Then calculate:
• L1 = Q1 – 1.5 * IQR
• L2 = Q1 – 3.0 * IQR
• U1 = Q3 + 1.5 * IQR
• U2 = Q3 + 3.0 * IQR
– Whiskers drawn from Q1 to the smallest point larger than L1 and from Q3
to the largest point smaller than U1
– Points between L1 and L2 and between U1 and U2 are drawn as small
circles
– Points beyond L2 and U2 are drawn as large circles
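The fence calculation can be sketched as follows, taking the quartiles as medians of the lower and upper halves of the sorted data and the upper outer fence as U2 = Q3 + 3.0 × IQR. The function names, the labels and the `data` values are hypothetical, and other quartile conventions give slightly different fences.

```python
from statistics import median

def tukey_fences(values):
    """Inner (1.5 * IQR) and outer (3.0 * IQR) fences; needs >= 2 values."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])   # median of the lower half
    q3 = median(xs[-half:])  # median of the upper half
    iqr = q3 - q1
    return {
        "L2": q1 - 3.0 * iqr,
        "L1": q1 - 1.5 * iqr,
        "U1": q3 + 1.5 * iqr,
        "U2": q3 + 3.0 * iqr,
    }

def classify(x, fences):
    """Label a point by where it falls relative to the fences."""
    if x < fences["L2"] or x > fences["U2"]:
        return "far outlier"  # drawn as a large circle
    if x < fences["L1"] or x > fences["U1"]:
        return "outlier"      # drawn as a small circle
    return "inside"           # covered by the box or whiskers

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 14, 30]  # hypothetical
fences = tukey_fences(data)
print(fences)
print([(x, classify(x, fences)) for x in data if classify(x, fences) != "inside"])
```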
Box Plot vs. PDF
• Allows you to see potential associations between two or
more variables
• You can also see the direction and shape of that relationship
• Finally, you can identify if that relationship changes as
one of the variables changes (homo/heteroscedastic)
Scatterplot (examples)
Scatterplot Matrix
Linear Regression
Data preparation
• Measure quality
• Test assumptions
• Validate! Validate! Validate!
Choose modeling technique(s)
Fit model(s)
Evaluate model(s)
Tune model(s)
Wash, rinse, repeat until you have the “best” model or
collection of models
Choose wisely….
• Suitability
– Type of prediction
– Types of observations
– Shape
– Interaction
• Assumptions
• Missing data
• Scalability
• Interpretability
• Audience
Linear Regression
$y = mx + b$
$m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}$
$b = \bar{y} - m\bar{x}$
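The slope and intercept can be computed from four averages: the mean of x, the mean of y, the mean of the products xy, and the mean of x squared. The sketch below assumes that reading of the formulas. The function name and data are hypothetical, and the data were chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    x2_bar = sum(x * x for x in xs) / n
    m = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    b = y_bar - m * x_bar
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b)
```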
Pattern of Data Not Linear
• More predictors than just one can be used.
– Multiple Regression
• Transformations can be applied to the predictors.
• Predictors can be multiplied together and used as terms
in the equation.
• Modifications can be made to accommodate response
predictions that just have yes/no or 0/1 values.
Logistic Regression
Pay/No Pay, Bankruptcy, Re-Admittance
Estimation no longer least squares
Now likelihood approach
MLE (maximum likelihood estimation) of logit regression
• Mean Residual Deviance – compares models while accounting for model
complexity (compare to adjusted R2)
• Residual Deviance – won’t account for model complexity
(compare to R2)
• Smaller Mean Residual Deviance is better
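To make the likelihood approach concrete, the sketch below fits a one-predictor logistic regression by simple gradient ascent on the log-likelihood and then reports the residual deviance (−2 × log-likelihood). The data, learning rate and step count are hypothetical. Dividing the deviance by the residual degrees of freedom (n minus the number of fitted parameters) is an assumed definition of mean residual deviance; tools may define it differently.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(y | x) under the logit model p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

def fit_logistic(xs, ys, rate=0.01, steps=20000):
    """Maximize the log-likelihood by plain gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += rate * g0
        b1 += rate * g1
    return b0, b1

# Hypothetical data: x is a risk score, y is 1 for "pay" and 0 for "no pay".
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
residual_deviance = -2 * log_likelihood(b0, b1, xs, ys)
null_deviance = -2 * log_likelihood(0.0, 0.0, xs, ys)  # every p = 0.5
mean_residual_deviance = residual_deviance / (len(xs) - 2)  # assumed definition
print(b0, b1, residual_deviance, mean_residual_deviance)
```

Production libraries usually use faster optimizers than gradient ascent, but the quantity being maximized is the same.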
Cluster Analysis
• Algorithm that will take a dataset and attempt to divide its
entities into n groups based on their attribute values.
• Determines an optimal set of groups (the solution may not be unique)
that maximizes both the within-group similarity
and the distance between groups
• High school, The sorting hat, laundry
• e.g. customer types, fraud detection, location selection
• “Sorting the laundry”
– White clothes vs. color clothes (easy)
– White shirt with color stripes?
– Gray Shirt?
• Clustering in business applications much more difficult
– Very dynamic
– Ever changing
• How many clusters?
– This is key
• Also used to detect outliers.
– Which records stand out from the clusters
A sale on men’s suits is being held in all branches of a
department store for southern California. All stores with
these characteristics have seen at least a 100% jump in
revenue since the start of the sale except one. It turns out
that this store had, unlike the others, advertised via radio
rather than television.
Cluster Analysis
Traditional Clustering
• Goal is to identify similar groups of objects
• Groups (clusters, new classes) are discovered
• Dataset consists of attributes
• Unsupervised (class label has to be learned)
• Important: Similarity assessment which derives a “distance function”
is critical, because clusters are discovered based on distance
Classification
• Pre-defined classes
• Datasets consist of attributes and a class label
• Supervised (class label is known)
• Goal is to predict classes from the object properties/attribute values
• Classifiers are learnt from sets of classified examples
• Important: classifiers need to have a high accuracy
• Happy medium between homogeneous groups and the
fewest number of clusters.
• How useful is a cluster of one?
• Or a cluster for each individual point?
Two types of Clustering
• Hierarchical
– Tree
• Smallest clusters merge together
• Agglomerative vs. Divisive
– Clusters defined by the data
• Non-Hierarchical
– Single pass method
– Reallocation method
• User defines 10 clusters, but data is clearly 13
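A reallocation method can be sketched with k-means, used here as one common example; the slides do not name a specific algorithm. The user supplies the starting centers (and therefore the number of clusters), and points are repeatedly reassigned to the nearest center until nothing moves. The points and starting centers are hypothetical.

```python
from math import dist

def k_means(points, centers, max_iter=100):
    """Reassign points to the nearest center until the centers stop moving."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # New center = coordinate-wise mean of the group (kept if the group is empty).
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Deliberately poor starting centers, both inside the first group.
centers, groups = k_means(points, [(1, 1), (2, 1)])
print(centers)
print(groups)
```

If the number of starting centers does not match the number of natural groups in the data, the method still returns that many clusters, which is the risk the last bullet describes.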
Nearest Neighbor
• Your next door neighbor’s income is $100,000
– How much do you make?
• Your next door neighbor’s income is $30,000
– How much do you make?
• Assumptions are being made
• Consider other variables (broader definition of neighbor):
– School attended and degree
– Job title
– Length of job
Nearest Neighbor
• Apple
– Closer to Orange or Banana?
• Toyota Corolla
– Closer to a Honda Civic or a Porsche?
• Simply Stated
– Objects that are “near” to each other will have similar
prediction values as well. Therefore if you know the
prediction value of one of the objects you can predict
it for its nearest neighbors.
Nearest Neighbor
• Applications
– Text Retrieval
– Search Algorithms
– Stock Market Data
– “Customers who bought this also bought”
– Movie preferences
K Nearest Neighbor (KNN)
• Let’s vote on it
– Many is better than one
• All of your neighbors have income > $100,000
– How much do you make?
– Are you a little more confident in your guess?
• A vote of ¾ of your neighbors compared to a single
neighbor would be more accurate.
• How confident are we?
• Can we measure this?
K Nearest Neighbor (KNN)
• The distance to the nearest neighbor provides a level of
confidence.
• If the neighbor is very close or an exact match then there
is much higher confidence in the prediction than if the
nearest record is a great distance from the unclassified
record.
• The degree of homogeneity amongst the predictions
within the K nearest neighbors can also be used. If all
the nearest neighbors make the same prediction then
there is much higher confidence in the prediction than if
half the records made one prediction and the other half
made another prediction.
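Both confidence signals, the distance to the nearest neighbor and the agreement among the K votes, can be returned alongside the prediction. The sketch below uses plain Euclidean distance; the training records (years of education, years in job) and the income labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Return (predicted label, share of the k votes, distance to nearest)."""
    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k, dist(neighbors[0][0], query)

# Hypothetical records: (years of education, years in job) -> income band.
train = [
    ((12, 1), "low"), ((12, 3), "low"), ((14, 2), "low"),
    ((16, 5), "high"), ((18, 6), "high"), ((18, 10), "high"),
]

# All three neighbors agree and the nearest one is close: high confidence.
print(knn_predict(train, (17, 6), k=3))
# The neighbors disagree: the same kind of answer, but with less confidence.
print(knn_predict(train, (15, 3), k=3))
```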
N-Dimensional Space
• In order to determine near vs. far we need to define a
space where distance can be calculated
– Neighborhoods for Income
• If we have 5 predictors then we have a 5 dimensional space
• Imagine 1,000 or 50,000 predictors
• Clustering – typically 1 predictor to each dimension
• Nearest Neighbor – dimensions are stretched
– Basically weighting one more than another when
calculating the distance
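"Stretching" a dimension can be expressed as a weighted Euclidean distance, where a larger weight makes differences along that predictor count for more. The function name and the weights below are hypothetical.

```python
from math import sqrt

def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by a weight."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

a, b = (0, 0), (3, 4)
print(weighted_distance(a, b, (1, 1)))  # ordinary Euclidean distance
print(weighted_distance(a, b, (4, 1)))  # first dimension stretched
```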
Clustering vs. Nearest Neighbor
Decision Trees
Predictive Model viewed as a tree
Each branch of the tree is a classification method
Divides up data at each branch without losing data
Very easy to understand and interpret
– Opposite of the Neural Network (black box)
• Good at handling raw data and minimizes preprocessing
• Excel at complex real world problems, computationally
• Used for Exploration, Data Processing, Prediction
Decision Trees
• Over fitting is when your tree (or any data mining
algorithm for that matter) pays attention to parts of the
data that are irrelevant (i.e. fits noise)
• Over fitting can cause your model to make less accurate
predictions on new data. (i.e. less robust)
• Can use statistical tests to detect over fitting. In this case
a chi-square test. Would this result have happened by
chance?
Decision Trees
• Start at the bottom of your tree and do a chi-square test
on the terminal nodes to determine:
If there was no relationship between the input and target,
what’s the chance I would have the same result?
• Remove (prune) those nodes
• Finding the simplest (parsimonious) tree for your data
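The pruning check can be sketched for a single binary split with a yes/no target. The table holds the counts in the two terminal nodes, and the statistic is compared with 3.841, the chi-square critical value for 1 degree of freedom at the 5% level. The counts and the 5% threshold are hypothetical choices for illustration.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if input and target were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_05_DF1 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

def keep_split(table):
    """Keep the split only if it is unlikely to have arisen by chance."""
    return chi_square_2x2(table) > CRITICAL_05_DF1

# Rows are the two terminal nodes; columns are (target yes, target no).
strong_split = [[30, 10], [12, 28]]  # hypothetical counts
weak_split = [[21, 19], [20, 20]]    # hypothetical counts

print(chi_square_2x2(strong_split), keep_split(strong_split))
print(chi_square_2x2(weak_split), keep_split(weak_split))
```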
Random Forest
• Grow many trees varying the sample and variables used
to grow the tree randomly
• Prediction chosen is the mode of the predictions of all
the individual trees in the “forest”
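The voting step can be shown on its own. The sketch below does not grow any trees; the three rules are hand-written stand-ins for trained trees, and the field names and thresholds are hypothetical.

```python
from collections import Counter

def forest_predict(trees, row):
    """Return the most common (modal) prediction across all trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

# Hand-written stand-ins for trees grown on different samples and variables.
trees = [
    lambda r: "pay" if r["income"] > 40000 else "no pay",
    lambda r: "pay" if r["late_payments"] < 2 else "no pay",
    lambda r: "pay" if r["years_customer"] >= 3 else "no pay",
]

customer = {"income": 55000, "late_payments": 3, "years_customer": 5}
print(forest_predict(trees, customer))  # two of the three trees vote "pay"
```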
Neural Networks
• Approximate representation of how our brains are
organized and how we “learn.”
• They “learn” and adapt, but so do other models
Neural Networks
• Our brain is made up of tens of billions of neurons
Neural Networks
The nodes represent the neurons and the links
represent the system of axons, dendrites, and synapses
Neural Networks
• Requires lots of pre-processing of the data
– Standardizing variables can be very important
• Very powerful predictive modeling techniques
– But at a cost
• Ease of use
• Ease of deployment
• Over fitting – they are exceptional at training noise
Evaluating Models
• Measure quality
• Test assumptions
• Validate! Validate! Validate!
Accuracy vs.
Is this Process Accurate?
Accuracy Depends on the Specs!
Question the Specs
“If the facts don’t fit the
theory, change the facts.”
- Albert Einstein
Control Charts
This Process is IN CONTROL
Is this process Accurate?
HINT – What are the Specs?
Is this process Accurate?
Yes, it’s accurate!!!
Is this process Accurate?
NO, it’s NOT accurate!!!
Measuring Success
• Regression
– “Regression toward the mean”
– Error is normal
– “Independent” is an important assumption
– “OLS” (ordinary least squares)
• Why is it ordinary? Because it’s linear (not weighted)
• Minimize the sum of the squared residuals
– Non-constant variation is called…
Measuring Success
Non-constant variation is called…
Measuring Success
Bigger is better (unless it’s too good!)
R2 – measures goodness of fit
Adjusted R2 – adjusts for number of explanatory terms. R2 never
decreases when variables are added, so the adjustment penalizes terms
that add complexity without improving the fit.
Small p-value – reject the null hypothesis
f-test and t-test equivalent
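The two fit measures can be computed from the actual and predicted values. The sketch uses the standard definitions, where n is the number of observations and p the number of explanatory terms; the data values are hypothetical.

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R2 for the number of explanatory terms p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

actual = [3, 5, 7, 9, 11]                # hypothetical observations
predicted = [3.2, 4.8, 7.1, 9.1, 10.8]   # hypothetical model output

r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n=len(actual), p=1)
print(r2, adj)
```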
Measuring Success
• MSE – Mean Squared Error (lower the better)
– Risk Function
– Quantify difference between implied values of estimator and
true values
– “Squared Error Loss” (quadratic loss) – average of the squares of
the error
– RMSE is the square root of MSE (same unit as y-axis)
• Greatest reduction in MSE or RMSE often determines the winners
of analytics competitions on sites like Kaggle
– e.g. reduction in RMSE of Netflix’s recommendation engine
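Both error measures take only a few lines. The values below are hypothetical.

```python
from math import sqrt

def mse(actual, predicted):
    """Average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, in the same units as the target."""
    return sqrt(mse(actual, predicted))

actual = [3, 5, 7, 9]      # hypothetical true values
predicted = [2, 5, 9, 9]   # hypothetical predictions

print(mse(actual, predicted), rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss raises MSE more than several small ones.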
Model Selection Criteria
Complexity Parameter
Variable Selection
• Forward – one at a time
– Use f-test to rank
– One by one procedure (look at p-value)
• Backwards – remove one at a time
– Examine model performance after each decision
– Once removed, never comes back
– Need rule to stop
• Stepwise – combination of forward and backward
– Might have one variable in, then out, then back in
• “All possible” “Best subset” “Exhaustive search”
– Fit model with all possible combination of variables and
compare performance measures
Variable Selection
• First three methods are one dimensional
• Can only use complete cases (i.e. must have a value for each variable)
• Low ratio of cases to variables and excessive collinearity can
disrupt selection
• Can disrupt logical groupings
• Don’t ignore your own judgment and intuition about your data
• Can’t make something out of nothing (GIGO)
End product
Maintenance and management
Monitor and measure business outcomes
Best practices