Transcript
```
Chapter 9
DATA MINING
PAULA JENSEN
SDSM&T
ENGM 745
McGraw-Hill/Irwin
DATA MINING
• DATA-DATA
• Extracting useful information from large databases
• Tools of data mining
• Looking at where to find the data
TOOLS OF DATA MINING
• Prediction
• Classification
• Clustering
• Association

PREDICTION
• Predict the value of a numeric variable
• Customer’s expenditure
• Will they purchase?
• What are their interests?
• Do their interests predict a purchase?

CLASSIFICATION
• Classes of objects or actions
• Reliability of customer
• Income
• Location

CLUSTERING
• Analysis tools analyze objects viewed as a class
• Where is the cutoff for income or size?
• How do I group the information?

ASSOCIATION
• Patterns based on likes
• Netflix

CLASSIFICATION
• k-nearest neighbor
• Naïve Bayes
• Classification/regression trees
• Logistic regression
DATA MINING TERMINOLOGY
K-NEAREST NEIGHBOR
• Use a subset of the total data, called the training data
• Select the closest neighbors using the Euclidean distance shown on the previous slide; other metrics are available to define neighbors
• Validation data is a separate set of data
• The test statistic on the validation data is more important than the one on the training data
• A split of 60% training data and 40% validation data is an acceptable mix
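
# Illustrative sketch in Python (not part of the original slides): a 60/40
# training/validation split followed by k-nearest-neighbor classification
# using Euclidean distance. The toy records, the helper names split_60_40
# and knn_predict, and the choice k=3 are assumptions made for this example.
import math
import random
from collections import Counter

def split_60_40(records, seed=0):
    """Shuffle a copy of the records and return (training, validation)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

def knn_predict(training, point, k=3):
    """Return the majority class among the k training records nearest to point."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Each hypothetical record is ((attribute_1, attribute_2), class_label).
records = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0), ((1.1, 1.3), 0), ((0.8, 0.9), 0),
    ((5.0, 5.0), 1), ((5.2, 4.8), 1), ((4.9, 5.1), 1), ((5.1, 5.3), 1), ((4.8, 4.9), 1),
]
training, validation = split_60_40(records)

# Accuracy is judged on the validation data, which the classifier has not seen.
accuracy = sum(knn_predict(training, x) == y for x, y in validation) / len(validation)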

K-NEAREST NEIGHBOR ANALYSIS
• Multidimensional
• The program computes a distance associated with each attribute
• Continuous variables are measured on different scales
• Categorical attributes use a weighting mechanism
• Example: will a customer respond to marketing to take a loan?
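
# Illustrative sketch in Python (not part of the original slides): because
# continuous variables are measured on different scales, they are usually
# rescaled before distances are computed. Min-max scaling is an assumption
# here; the slides do not say which method the program uses. The helper name
# min_max_scale and the (age, income) records are hypothetical.
def min_max_scale(rows):
    """Rescale every column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

customers = [(25, 30000.0), (35, 60000.0), (45, 90000.0)]  # (age, income)
scaled = min_max_scale(customers)
# After scaling, age and income contribute comparably to a distance.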

K=3 means 3 neighbors were used to classify all records
Type 1 would take a loan; Type 0 would not take a loan
TERMS
• Lift: measures the change in concentration of a particular class when the model is used to select a group from the general population. The example shows significant lift.
• Decile-wise lift chart: if we pick the top 10% of our records as ranked by our model, our selection would include approximately 7 times as many correct classifications as a random selection of the same size.
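
# Illustrative sketch in Python (not part of the original slides): lift for
# the top decile, computed as the rate of the target class among the
# highest-scored 10% of records divided by the rate in all records. The
# helper name top_decile_lift and the scores and outcomes are hypothetical.
def top_decile_lift(scores, actuals):
    """Lift of the top 10% of records, ranked by model score, over the base rate."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

scores = [i / 20 for i in range(20, 0, -1)]   # 20 records, highest score first
actuals = [1, 1, 0, 1, 0, 0, 1] + [0] * 13    # 4 of the 20 records are class 1
lift = top_decile_lift(scores, actuals)
# Here the top 2 records are both class 1 (rate 1.0) against a base rate of 0.2.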

Classification Trees
CLASSIFICATION TREES
• Decision rules are easy
• Easy to understand
• Can overfit the data
• Correlated attributes will cause multicollinearity

NAÏVE BAYES
• Statistical classification
• Bayes’ theorem: predicts the probability of a prior event given that a certain subsequent event has taken place
• Called naïve because each attribute is assumed to be independent of the others

BAYES’ THEOREM
P(A|B) = P(B|A) * P(A) / P(B)
• P(A) is the prior probability of A
• P(A|B) is the conditional probability of A given B
• P(B|A) is the conditional probability of B given A
• P(B) is the prior probability of B
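
# Illustrative sketch in Python (not part of the original slides): a worked
# instance of P(A|B) = P(B|A) * P(A) / P(B). The events and all probabilities
# are hypothetical: A = "customer takes a loan", B = "customer has an online
# account". P(B) is obtained from the law of total probability.
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A), and P(B)."""
    return p_b_given_a * p_a / p_b

p_a = 0.10               # prior probability of A
p_b_given_a = 0.60       # conditional probability of B given A
p_b_given_not_a = 0.20   # conditional probability of B given not A
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # 0.06 + 0.18 = 0.24
posterior = bayes_posterior(p_b_given_a, p_a, p_b)      # 0.06 / 0.24 = 0.25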
APPLYING BAYES’ THEOREM
REGRESSION
• Logistic regression, or logit analysis
• The difference between logistic regression and ordinary regression is that the dependent variable in logistic regression is categorical, not continuous
• The dependent variable is dichotomous: either yes or no
• The predicted value of the dependent variable is limited to values between 0 and 1
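
# Illustrative sketch in Python (not part of the original slides): the
# logistic function maps any linear combination of attributes to a value
# between 0 and 1, which is then compared with a cutoff to give a yes/no
# class. The intercept, coefficient, attribute value, and 0.5 cutoff are
# hypothetical, not fitted to any data from the slides.
import math

def logistic(z):
    """Numerically stable logistic function; the result lies in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(intercept, coefficients, features):
    """Probability of the 'yes' class for one record."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return logistic(z)

def classify(probability, cutoff=0.5):
    """Return 1 (yes) if the probability reaches the cutoff, else 0 (no)."""
    return 1 if probability >= cutoff else 0

p = predict_probability(-4.0, (0.05,), (100.0,))   # z = -4.0 + 0.05 * 100 = 1.0
label = classify(p)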

WHERE DO I FIND THE DATA???

Current Customer Activity
