Data Mining
Dr. Chang Liu
What is Data Mining

Data mining has been known by many different terms
• Knowledge Discovery in Databases (KDD)
• Predictive Analytics
• Machine Learning
• Business Analytics

It is the process of finding hidden patterns in data
• For example, what is the profile of people who buy from us?

Usage of data mining has become widespread recently for
various reasons

Typically, businesses find huge increases in profitability as a
result of applying data mining
Some Common Problems

Growing business by cross-selling
• A retailer can use buying patterns of customers to generate
recommendations for new customers

Determine risk of giving a loan to a particular
customer
• Profiles of customers who have defaulted in the past are
learned and used with new customers

Forecast the likely unemployment level based on its
past trend

Is a credit card transaction likely to be fraudulent?

Is this tumor in a patient’s breast likely malignant?
Data Mining Tasks

Data mining problems are solved by performing a
specific task:
• Given a problem, an analyst should first determine the data
mining task that should be performed.
 Example: I need to determine whether a customer is likely to
default on a loan. I can solve this by performing a
classification task

There are a number of tasks:
• Classification
• Association or market basket analysis
• Forecasting
• Deviation analysis
• Clustering or segmentation
• Sequence analysis
• Regression
Data Mining Tasks (cont.)

Classification is used to predict which of a few
known outcomes a case is likely to be
• Is this customer likely to default? This has two known outcomes:
“Yes” or “No”

Association is used to analyze transaction tables and
determine which items in the transaction table tend
to go together. Example?

Forecasting is used to generate new data points in a
time series. Example?

Deviation analysis is used to determine anomalous
data points or outliers
• Used by security experts to detect network intrusion attacks
• Used by insurance companies and credit card companies to detect fraud
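
As a minimal sketch of deviation analysis, the Python below flags values that lie far from the mean, measured in standard deviations. The transaction amounts, the `find_outliers` helper, and the threshold of 2.0 are all made up for illustration; real fraud or intrusion detection relies on far richer models.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [25.0, 30.0, 27.5, 22.0, 31.0, 28.0, 950.0, 26.0]
print(find_outliers(amounts))
```

A single extreme value inflates the standard deviation itself, so a simple rule like this one can miss outliers in small samples.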
Data Mining Tasks (cont.)

Clustering or segmentation is used to discover
natural grouping in data

Sequence analysis discovers sequence patterns in
events
• E.g., purchase of a computer is followed by purchase of a
printer, then webcam …
• Used by marketing folks to understand and exploit buying
habits
• Used to analyze web clickstream data

Regression is used to predict numerical values
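
To illustrate regression as the prediction of a numerical value, here is a small sketch of a least-squares straight-line fit in plain Python. The `fit_line` helper and the experience and salary figures are invented for this example and are not part of the course data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [40, 45, 50, 55, 60]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predict the salary at 6 years
print(slope, intercept, predicted)
```

Unlike classification, the output here is a number on a continuous scale rather than one of a few known outcomes.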
Data Mining Algorithms

Microsoft SSAS provides the following data
mining algorithms:
• Microsoft Decision Trees
• Microsoft Neural Network
• Microsoft Naïve Bayes
• Microsoft Association Rules
• Microsoft Time Series
• Microsoft Clustering
• Microsoft Sequence Clustering
• Microsoft Linear Regression
• Microsoft Logistic Regression
Case

The thing you are mining or asking questions
about is called a case
• The case is often a row in a table; e.g., when
studying which customers are likely to default on
a loan, each row in the customer table is a case
• Transaction tables are an example of nested cases
Attributes / Case Key

Attributes are the variables that are used in
the data mining analysis.
• Attributes are often columns in the case table

An attribute can be an input or an output
• At modeling time, both input and output attributes are
provided
• At prediction time, input attributes are used to predict
output attributes

Case Key indicates the identity of the case
• This is often the primary key or a row index
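
The terms case, attribute, and case key can be made concrete with a small Python sketch. The customer table, its column names, and the `split_case` helper are hypothetical and chosen only to mirror the loan-default example.

```python
# Hypothetical customer table: each dict is one case (one row).
customers = [
    {"customer_id": 101, "income": 52000, "years_employed": 4, "defaulted": "No"},
    {"customer_id": 102, "income": 18000, "years_employed": 1, "defaulted": "Yes"},
    {"customer_id": 103, "income": 75000, "years_employed": 9, "defaulted": "No"},
]

CASE_KEY = "customer_id"                        # identifies the case
INPUT_ATTRIBUTES = ["income", "years_employed"]
OUTPUT_ATTRIBUTE = "defaulted"

def split_case(case):
    """Separate one case into its key, input values, and output value."""
    key = case[CASE_KEY]
    inputs = {name: case[name] for name in INPUT_ATTRIBUTES}
    output = case.get(OUTPUT_ATTRIBUTE)  # None when the outcome is unknown
    return key, inputs, output

# Modeling time: inputs and output are both available.
for case in customers:
    print(split_case(case))

# Prediction time: a new case has inputs but no known output yet.
new_case = {"customer_id": 104, "income": 30000, "years_employed": 2}
print(split_case(new_case))
```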
Mining Structure / Mining Model

A mining structure is a table that contains
the columns to be analyzed. It also contains
the data mining models used to analyze the data.

A mining model defines how the problem is to
be modeled.
• Specifies which columns are included in the model
• Specifies the algorithm to be used
• Defines which columns are input and which are
output
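
The relationship between a structure and its models can be sketched in Python as follows. This is not the SSAS API: the `MiningStructure` and `MiningModel` classes, the `add_model` method, and all names and columns are assumptions made purely to show that a model picks an algorithm and assigns input and output roles to columns that the structure already holds.

```python
from dataclasses import dataclass, field

@dataclass
class MiningModel:
    name: str
    algorithm: str
    input_columns: list
    output_columns: list

@dataclass
class MiningStructure:
    name: str
    columns: list
    models: list = field(default_factory=list)

    def add_model(self, model):
        """Attach a model, checking that it only uses the structure's columns."""
        used = set(model.input_columns) | set(model.output_columns)
        unknown = used - set(self.columns)
        if unknown:
            raise ValueError(f"columns not in structure: {sorted(unknown)}")
        self.models.append(model)

# Hypothetical structure for the loan-default problem.
structure = MiningStructure(
    name="LoanRisk",
    columns=["customer_id", "income", "years_employed", "defaulted"],
)
structure.add_model(MiningModel(
    name="LoanRiskTree",
    algorithm="Microsoft Decision Trees",
    input_columns=["income", "years_employed"],
    output_columns=["defaulted"],
))
print([m.name for m in structure.models])
```

One structure can hold several models, for example the same columns analyzed with a different algorithm, which is why the two concepts are kept separate.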
Training Models

Many data mining algorithms require
historical data to learn patterns from

Training the model is also known as
processing the model

Typically, not all available historical data is
used to train the model
• A percentage is left for testing purposes. This set is
called the testing set
• The data used to train the model is called the
training set
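
A holdout split of this kind can be sketched in a few lines of Python. The `split_train_test` helper, the 30% test fraction, and the fixed seed are illustrative assumptions, not values prescribed by the course or by SSAS.

```python
import random

def split_train_test(cases, test_fraction=0.3, seed=42):
    """Shuffle the cases and hold out a fraction of them for testing."""
    shuffled = list(cases)                  # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical historical data: ten cases identified by their case keys.
cases = list(range(1, 11))
training_set, testing_set = split_train_test(cases)
print(len(training_set), len(testing_set))
```

Shuffling before splitting guards against any ordering in the historical data (for example, rows sorted by date) leaking into the split.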
Class Activity_1

High school student historical data –
CollegePlan table from DB661

You are asked to find out what factors
influence a high school student to go to college
(or not)

What data mining task would you perform?

What is the case in this case?

What is the case key?

What algorithm(s) is/are applicable for this task?

Which attribute(s) is/are input?

Which attribute(s) is/are output?
Class Activity_2

Explore vmMSFTYear2008 data in DB661

Predict Microsoft stock values in the first
week of 2009 (the real data is available at
vmMSFTFirstWeek2009)

Can you make money from MSFT based
on your data mining knowledge?
QUESTIONS??
With a new student table
Results