CSCI E-84
A Practical Approach to Data
Science
Ramon A. Mata-Toledo, Ph.D.
Professor of Computer Science
Harvard Extension School
Unit 1 - Lecture 2
Wednesday, February 3, 2016
Business Problems and Data Science
Solutions
This lecture is based primarily on Chapters 2 and 3 of the book “Data Science for Business”
by Foster Provost and Tom Fawcett, 2013.
Thanks also to Professor P. Adamopoulos (Stern School of Business of New York University)
and Professor Tomer Geva (The Tel Aviv University School of Management)
Figures are used with the authors’ permission.
Lecture Objectives
At the end of this lecture, the student should be able to identify and define concepts
such as:
• Data Mining Process
• Data Mining Models
• Data Mining Tasks
• Supervised Segmentation (purity, information gain, entropy)
In addition, this lecture will introduce you to:
• Supervised versus Unsupervised Methods
• Data Warehousing
• Database Querying
• On-line Analytical Processing (OLAP)
• Categorical and Quantitative data
Business Scenario
• You just landed a great analytical job with MegaTelCo, one of the largest
telecommunication firms in the US.
• They are having a major problem with customer retention in their wireless
business.
• In the mid-Atlantic region, 20% of cell phone customers leave when their
contracts expire. Communications companies are now engaged in battles to
attract each other’s customers while retaining their own.
• Marketing has already designed a special retention offer.
• Your task is to devise a precise, step-by-step plan for how the data science
team should use MegaTelCo’s vast data resources to solve the problem.
MegaTelCo: Predicting Customer Churn
• What data might you use?
• How would they be used?
• How should MegaTelCo choose a set of customers to receive their offer in order
to best reduce churn for a particular incentive budget?
Terminology
Model:
• A simplified representation of reality created to serve a purpose
Predictive Model:
• A formula for estimating the unknown value of interest: the target
• The formula can be mathematical, logical statement (e.g., rule), etc.
Prediction:
• Estimate an unknown value (i.e. the target)
Instance / example:
• Represents a fact or a data point
• Described by a set of attributes (fields, columns, variables, or features)
Terminology
(Continuation)
Model induction:
• The creation of models from data. The procedure that creates the model
is called the induction algorithm or learner
Training data:
• The input data for the induction algorithm
Individual:
• Any entity about which we have data (customer, business, industry)
Terminology
(Continuation)
Informative variables (attributes): provide information about the characteristics of the entity.
It is desirable to choose those variables that correlate well with the target variable.
What is a model?
A simplified* representation of reality created for a specific purpose
*based on some assumptions deemed important for the target
• Examples: a map (abstraction of the physical world), engineering prototypes, the Black-Scholes option pricing model, etc.
• Data Mining example: a “formula” for predicting the probability of customer attrition at contract expiration, i.e., a “classification model” or “class-probability estimation model”
Data Mining Process
Data mining is a process that breaks up the overall task of finding patterns in data into a set of well-defined subtasks. The solutions to these subtasks can often be put together to solve the overall problem.
Data Mining is a process: science + craft + creativity + common sense.
Cross-Industry Standard Process for Data Mining
CRISP-DM
Phases and their generic tasks:
• Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
• Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
• Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
• Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
• Evaluation: Evaluate Results; Review Process; Determine Next Steps
• Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project
Data Mining versus Data Warehousing/Storage
Data Warehousing / Storage
Data warehouses coalesce data from across an enterprise, often from multiple transaction-processing systems
Data Mining versus Database Querying and Reporting
Database Querying / Reporting (SQL, Excel, QBE, other GUI-based querying)
• Queries are requests posed to a database for particular subsets of the data
• Very flexible interface to produce reports and summaries of the data
• No modeling or sophisticated pattern finding
• Most of the “cool” visualizations
Data Mining versus On-line Analytical Processing
OLAP – On-line Analytical Processing
• OLAP provides an easy-to-use GUI to explore large data collections
• Exploration is manual; no modeling
• Dimensions of analysis are preprogrammed into the OLAP system
Data Mining versus…
• Traditional statistical analysis
• Mainly based on hypothesis testing or estimation / quantification of uncertainty
• Should be used to follow up on data mining’s hypothesis generation
• Automated statistical modeling (e.g., advanced regression)
• This is one type of data mining, usually based on linear models
• Massive databases allow non-linear alternatives
Answering business questions with these techniques…
• Who are the most profitable customers?
• Database querying
• Is there really a difference between profitable customers and the average customer?
• Statistical hypothesis testing
• But who really are these customers? Can I characterize them?
• OLAP (manual search), Data mining (automated pattern finding)
• Will some particular new customer be profitable? How much revenue should I expect
this customer to generate?
• Data mining (predictive modeling)
Common Data Mining Tasks
• Classification
Given a sample population, predict for each individual the class to which it belongs. The set of classes is
generally small and disjoint (non-overlapping).
• Class probability estimation (scoring)
A model produces, for each individual, a value or score that is the probability that the individual belongs
to a particular class.
• How likely is this consumer to respond to our campaign?
• Regression
Given a sample population, regression attempts to estimate the relationship between a selected
dependent variable and one or more independent variables.
• How much will she use the service?
Common Data Mining Tasks
(continuation)
• Similarity Matching
Process that attempts to identify similar individuals based on particular data about them.
• Can we find consumers similar to my best customers?
• Clustering
Process that attempts to group individuals in a population based on similarities, without a specific target.
• Do my customers form natural groups?
• Co-Occurrence Grouping (known also as frequent itemset mining, association rule discovery, market-basket analysis)
Process that attempts to find associations between entities based on their transactions.
• “Customers that bought this item also bought…”
Common Data Mining Tasks
(continuation)
• Profiling
Process that attempts to characterize the typical behavior of an individual, group, or population. Often
used to establish a baseline, and also to detect behavior that is “out of the normal” (e.g., for fraud detection).
• What items do my best customers buy most frequently?
• Link Prediction
Process that attempts to predict connections between nodes of a graph based on the information of the
individual nodes. Often used in social networks such as Facebook or LinkedIn.
• People that you may know
• Data Reduction
• Process that transforms a large set of data to replace it with a smaller set that contains most of the
relevant information.
• What items are purchased together?
Common Data Mining Tasks
(continuation)
• Causal Modeling
Process that attempts to help us understand the actions or events that influence other events.
• Did the advertising campaign increase the volume of purchases?
This type of technique may include randomized controlled experiments with two variants (A/B tests). One
of the variants serves as the control; the other is the variation of the experiment.
• Which of the two promotional codes increased the volume of sales? (assuming an email advertisement sent
to two otherwise identical populations, with variations only in the promotional code)
Supervised versus Unsupervised Data Mining
• Unsupervised (no specific target)
• “Do our customers naturally fall into different groups?”
• No guarantee that the results are meaningful or will be useful for any particular purpose
• Supervised (specific target)
• “Can we find groups of customers who have particularly high likelihoods of canceling
their service soon after contracts expire?”
• Results are generally more useful than those of unsupervised methods. Supervised data mining requires different techniques than
unsupervised.
• Supervised Data Mining requires that the target information be in the data.
• Example: you cannot use a customer’s historical behavior over the last year (the target) if data has only been kept for 6
months.
Common Data Mining Task – Supervised vs. Unsupervised
Supervised Data Mining & Predictive Modeling
• Is there a specific, quantifiable target that we are interested in or trying to predict?
• Think about the decision
• Do we have data on this target?
• Do we have enough data on this target?
• As a rule of thumb, you need a minimum of roughly 500 examples of each class
• Do we have relevant data prior to decision?
• Think timing of decision and action
• The result of supervised data mining is a model that predicts some quantity
• A model can either be used to predict or to understand
Subclasses of Supervised Data Mining
Two main subclasses of Supervised data mining can be distinguished based upon the
type of their target:
• Classification
• Categorical target
• Often binary
• Includes “class probability estimation”
• “Will this customer purchase service 𝑺𝟏 if given incentive 𝑰𝟏?”
• Classification problem
• Binary target (the customer either purchases or does not)
• Regression
• Numeric target (preferable for business applications)
• “How much will this customer use the service?”
• Regression problem
• Numeric target
• Target variable: amount of usage per customer
Data Mining Model in Use
[Diagram: a new data item with characteristics X, Y, Z is fed into the probability estimation model, which outputs a prediction of the probability of the class, e.g., p(C | X, Y, Z) = 0.85, answering “What is the probability of attrition of this customer with characteristics X, Y, Z?”]
Data Mining versus Use of the Model
[Diagram: the model in use answers “What is the probability of attrition of this customer with characteristics X, Y, Z?”]
Supervised Segmentation
• How can we divide (segment) the population into groups that differ from each other
with respect to some quantity that we would like to predict?
• Which customers will churn after their contracts expire?
• Which customers are likely to default?
• Informative attributes
• Find knowable attributes that correlate with the target of interest
• Increase accuracy
• Alleviate computational problems
• E.g., tree induction
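As a concrete illustration of tree induction, the sketch below fits a small decision tree with scikit-learn; it is not from the lecture, and the feature values, labels, and the choice of library are assumptions for illustration only.

```python
# Minimal sketch: tree induction on a toy churn-like dataset (hypothetical data).
from sklearn.tree import DecisionTreeClassifier

# Rows = customers; columns = informative attributes (e.g., monthly minutes, tenure).
X = [[120, 3], [40, 24], [95, 2], [60, 30]]
y = [1, 0, 1, 0]  # 1 = churned, 0 = stayed (made-up labels)

# criterion="entropy" makes the tree choose splits by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

print(tree.predict([[100, 4]]))  # predicted class for a new customer
```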
Supervised Segmentation
(Continuation)
• How can we judge whether a variable contains important information about the
target variable?
• How much?
• Which single variable gives the most information about the target?
• Based on this attribute, how can we partition the customers into subgroups that are as pure as
possible with respect to the target (i.e., such that in each group as many instances as possible
belong to the same class)?
• A subgroup is pure if every member (instance) of the group has the same value for the target;
otherwise it is impure.
Selecting Informative Attributes
• Example: A binary (Yes, No) classification problem with the following attributes:
• Head shape (square, circular)
• Body shape (rectangular, oval)
• Body color (red, blue, purple)
• Target variable: Yes, No
[Figure: eight example people described by these attributes, labeled with their target values: No, Yes, Yes, Yes, Yes, No, Yes, No.]
Purity Measure – Information Gain – Entropy
• Difficulties encountered in segmenting the population due to the impurity of the
subgroups can be addressed by creating a formula that evaluates how well each attribute
separates the entities into segments. The formula is called a purity measure.
• The most common criterion for segmenting the population is called information gain
(IG).
• Information gain is based on a purity measure called entropy.
• Entropy is a measure of the degree of disorder that can be applied to a set.
• The entropy is defined as follows:
entropy = -Σ_i p_i lg(p_i)
p_i: the probability (the relative percentage) of property i within the set
p_i = 0 when no member of the set satisfies the property
p_i = 1 when all the members of the set satisfy the property
lg(p_i) stands for the logarithm of base 2: log₂(p_i)
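As a quick illustration (not part of the lecture materials), the definition above can be written directly in Python; the function name and inputs are my own choices:

```python
# Minimal sketch of the entropy formula above.
from math import log2

def entropy(proportions):
    """Entropy of a set, given the relative proportion p_i of each property.

    Proportions of 0 or 1 contribute no disorder, matching the notes above.
    """
    return sum(-p * log2(p) for p in proportions if 0 < p < 1)

print(entropy([0.5, 0.5]))   # 1.0: a 50/50 split is maximally disordered
print(entropy([1.0, 0.0]))   # 0: a pure set has no disorder
```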
Calculating the Information Gain
IG (parent, children) = entropy(parent)−[p(c1)×entropy(c1)+p(c2)×entropy(c2) +…]
[Diagram: a parent set split by an attribute into child subsets c1, c2, …]
Note: Higher IG indicates a more informative split by the variable.
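A small Python sketch of this calculation (my own illustration, not the lecture's code); it assumes each set is described by its list of class proportions and each child by its share p(c_i) of the parent's instances:

```python
# Minimal sketch of IG(parent, children) = entropy(parent) - sum_i p(c_i) * entropy(c_i).
from math import log2

def entropy(proportions):
    # Same definition as the earlier entropy sketch.
    return sum(-p * log2(p) for p in proportions if 0 < p < 1)

def information_gain(parent, children, weights):
    """parent / children: lists of class proportions; weights: each child's p(c_i)."""
    return entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))

# Hypothetical two-way split, for illustration only:
print(information_gain(parent=[0.5, 0.5],
                       children=[[0.9, 0.1], [0.1, 0.9]],
                       weights=[0.5, 0.5]))   # ≈ 0.53: an informative split
```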
Selecting Informative Attributes
• Example: a set S of 10 people divided into two classes: will default (write-off) and will not default (non-write-off). 7 people will not default; p(n-w-o) = 7/10 = 0.7 and p(w-o) = 3/10 = 0.3
• entropy(S) = -[0.7 x lg(0.7) + 0.3 x lg(0.3)] ≈ 0.88
lg(x) stands for the logarithm of base 2.
log(x) is the logarithm of base 10.
The logarithm of a number x in base 2 equals the logarithm of x in base 10 divided by the logarithm of 2 in base 10.
Remember: logarithms of numbers between 0 and 1 are negative.
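The arithmetic can be checked directly, for instance with Python's math.log2 (a sketch of my own, not part of the slides):

```python
from math import log2

entropy_S = -(0.7 * log2(0.7) + 0.3 * log2(0.3))
print(round(entropy_S, 2))   # 0.88
```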
Information Gain
• Information gain measures the change in entropy due to any amount of new
information being added
Information Gain
(continuation)
[Figure: a 30-instance parent set is split into two children of 13 and 17 instances; their relative weights are 13/30 ≈ 0.43 and 17/30 ≈ 0.57.]
The relative percentage of each subset with respect to the parent is calculated by dividing the cardinality
of the child set by the cardinality of the parent set.
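To make the weighting concrete, the sketch below plugs the 13/30 and 17/30 proportions into the information-gain formula; the class proportions inside the parent and the two children are purely hypothetical, chosen only to illustrate the calculation:

```python
from math import log2

def H(p):
    """Binary entropy for a class proportion p (same formula as above)."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

# Hypothetical class proportions, for illustration only:
parent_entropy = H(16 / 30)   # 30-instance parent
child1_entropy = H(11 / 13)   # 13-instance child, weight 13/30 ≈ 0.43
child2_entropy = H(5 / 17)    # 17-instance child, weight 17/30 ≈ 0.57

ig = parent_entropy - (13 / 30 * child1_entropy + 17 / 30 * child2_entropy)
print(round(ig, 3))           # information gain of this hypothetical split
```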
Attribute Selection
Reasons for selecting only a subset of attributes:
• Better insights and business understanding
• Better explanations and more tractable models
• Reduced cost
• Faster predictions
• Better predictions!
Example: Attribute Selection with Information Gain
• This dataset includes descriptions of hypothetical samples corresponding to 23 species
of gilled mushrooms in the Agaricus and Lepiota Family
• Each species is identified as definitely edible, definitely poisonous, or of unknown
edibility and not recommended
• This latter class was combined with the poisonous one
• The Guide clearly states that there is no simple rule for determining the edibility of a
mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
Example: Attribute Selection with Information Gain
(continuation)
[Figures illustrating attribute selection by information gain on the mushroom dataset.]
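A sketch of how such an attribute ranking could be computed with pandas; it is not the lecture's code, and the file name and column names ('class', etc.) are assumptions about how the mushroom data might be loaded:

```python
# Minimal sketch: rank categorical attributes by information gain w.r.t. a binary target.
import math
import pandas as pd

def entropy(series):
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, attribute, target):
    parent = entropy(df[target])
    children = sum((len(g) / len(df)) * entropy(g[target])
                   for _, g in df.groupby(attribute))
    return parent - children

# Hypothetical usage, assuming a CSV with a 'class' column (edible / poisonous):
# df = pd.read_csv("mushrooms.csv")
# gains = {c: information_gain(df, c, "class") for c in df.columns if c != "class"}
# print(sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:5])
```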
Multivariate Supervised Segmentation
• If we select the single variable that gives the most information gain, we create a very
simple segmentation
• If we select multiple attributes each giving some information gain, how do we put
them together?
Categorical and Quantitative Variables
A categorical variable places an observation in one (and only one) category chosen from two or more possible categories.
• If there is no ordering among the categories, the variable is nominal.
• If there is some intrinsic order that can be assigned to the categories, the variable is ordinal.
Examples of categorical data
• Your gender (male, female)
• Your class in school (Freshman, Sophomore, Junior, Senior, Graduate)
• Your performance status (Probation, Regular, Honors)
• Your political party (Democrat, Republican, Independent)
• Your hair color (blonde, brown, red, black, white, other)
• Your type of pet (cat, dog, ferret, rabbit, other)
• Your race (Hispanic, Asian, African American, Caucasian)
• Machine settings (Low, Medium, High)
• Method of payment (Cash, Credit)
• The color of some object (red, orange, yellow, green, blue, purple)
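As a small illustration of the nominal/ordinal distinction in code (my own sketch, assuming pandas is available):

```python
import pandas as pd

# Nominal: no intrinsic ordering among the categories.
pet = pd.Categorical(["cat", "dog", "ferret", "dog"])

# Ordinal: an intrinsic order can be assigned (Low < Medium < High).
setting = pd.Categorical(["Low", "High", "Medium"],
                         categories=["Low", "Medium", "High"], ordered=True)

print(pet.ordered)       # False
print(setting.ordered)   # True
print(setting.min())     # 'Low': comparisons only make sense for ordinal data
```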
Variables and the Case format
Before analyzing the data, preparing charts or graphs, or doing any statistical tests, it is important to gain
a sense of what the data is all about.
Why is this necessary?
The type of data available dictates what we can do with it.
Do you want to prepare a scatterplot or perform a linear regression to generate a predictive model?
Fine, as long as you have two quantitative variables, preferably both continuous.
Do you want to do a one-way Analysis of Variance (ANOVA)?
No problem, as long as you have quantitative measurements that you can
split into groups using one of the categorical variables you have collected.
Do you want to construct a bar plot?
OK, but you will need categorical data or else you should be preparing a histogram instead.
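A small matplotlib sketch of these rules of thumb (my own illustration with made-up data, not part of the lecture):

```python
import matplotlib.pyplot as plt

heights = [160, 172, 168, 181, 175, 158]                  # quantitative
weights = [55, 70, 65, 85, 78, 52]                        # quantitative
payment = ["Cash", "Credit", "Cash", "Credit", "Cash"]    # categorical

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(heights, weights)        # two quantitative variables -> scatterplot
axes[1].hist(heights, bins=5)            # one quantitative variable -> histogram
counts = {c: payment.count(c) for c in sorted(set(payment))}
axes[2].bar(list(counts), list(counts.values()))   # categorical counts -> bar plot
plt.show()
```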
Questions?