Tommy Wei
Cory Hutchinson
ISDS 4180
• What is CRISP-DM (CRoss Industry Standard Process
for Data Mining)
• Blueprint
• Phases and Tasks
• Summary
What is CRISP-DM?
• A guide or blueprint for how to conduct a data mining project
• Breaks down the life cycle of a data mining project into 6 phases
• Developed to give a standardized approach to data mining projects
• Intended to produce better, faster results from data mining
Why a Standard Process?
• There was a clear need for data mining, but no sense of direction as to how organizations should launch their own data mining projects
• Before CRISP-DM, data mining practice was very scattered
• Encourages good habits and best practices
• Makes data mining reliable and repeatable, even for people who have little data mining experience
• Makes monitoring and maintenance easier
CRISP-DM Creation
• Created by 4 data mining veterans
• A special interest group (SIG) was created to develop a standard data mining process
• An association of data mining enthusiasts came together, with broad input from a wide range of people
• Data miners, data warehousing vendors, management consultants
• The group refined and improved the model and ran live trials on data mining projects
CRISP-DM: Process Flow
• Data Mining Methodology
• For all businesses
• Complete Outline
• Life Cycle: 6 Phases
CRISP-DM: 6 Phases
• Business understanding
  ▪ Understand the business objectives and goals, and how data mining can help achieve them
• Data understanding
  ▪ Start with a data set, become familiar with it, gain some insight, and identify any data quality issues
• Data preparation
  ▪ All activities needed to build the final data set that will be used for modeling
• Modeling
  ▪ Choose a modeling technique, design the model, and test it
• Evaluation
  ▪ Thoroughly evaluate the model and its results to see whether they meet the business objectives, and review the process
• Deployment
  ▪ Can range from using the model to create a dashboard or report to rolling out the data mining process across the entire organization
Phases and Tasks
(Figure: overview of the generic tasks in each of the six phases. The tasks are covered phase by phase in the sections that follow.)
Phase 1: Business Understanding
• Summary: Focuses on the project objectives and requirements from a business perspective, then converts that knowledge into a problem that can be solved with data mining, along with a rough outline of what to do to achieve the objectives.
• Determine the business objectives:
  ▪ Develop a deep understanding of what the client wants from a business perspective: what they REALLY want accomplished
  ▪ Understand any business-related questions associated with it
• Assess the situation:
  ▪ Build a more detailed understanding of the resources you need, as well as any constraints, potential obstacles, and assumptions you might need to make
  ▪ More specific details are worked out here
• Determine the data mining goals:
  ▪ Determine the data mining objectives that must be met in order to achieve the business goal
  ▪ EX. Business goal: Increase our overall restaurant sales in the northeast and southeast regions of the US
    ▪ Data mining goal: Predict how well people from those regions embrace our flavor of food, given data from several franchises over the past 3 years, demographic information, item prices, and intangible factors such as culture, brand recognition, and reputation
• Produce a project plan
  ▪ Lay out what the data miners have to achieve in order to move closer to the business goals
  ▪ EX. Business goal: Reduce the churn rate for our internet provider company
    ▪ Data mining goals:
      ▪ Identify the characteristics of high-value customers based on the most recent 5 years of data
      ▪ Identify which customers left after 1 year of service
      ▪ Build a mathematical model (logistic regression) to determine which customers are most likely to leave within 3 years of service
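To make the churn example more concrete, here is a minimal sketch of how the second data mining goal might be computed. The record layout (`customer_id`, `months_of_service`, `churned`) and the sample values are hypothetical, and "left after 1 year of service" is read here, as an assumption, to mean "churned within the first 12 months".

```python
# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"customer_id": 1, "months_of_service": 8,  "churned": True},
    {"customer_id": 2, "months_of_service": 40, "churned": False},
    {"customer_id": 3, "months_of_service": 12, "churned": True},
    {"customer_id": 4, "months_of_service": 30, "churned": True},
    {"customer_id": 5, "months_of_service": 60, "churned": False},
]

def churn_rate(records):
    """Fraction of customers who have left (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r["churned"] for r in records) / len(records)

def left_within(records, months):
    """IDs of customers who churned within the given number of months."""
    return [
        r["customer_id"]
        for r in records
        if r["churned"] and r["months_of_service"] <= months
    ]

overall = churn_rate(customers)
early_leavers = left_within(customers, 12)
print(f"Overall churn rate: {overall:.0%}")
print(f"Customers who left within 1 year: {early_leavers}")
```

A measurable baseline like the overall churn rate also gives the business goal ("reduce churn") something concrete to be compared against later.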
Phase 2: Data Understanding
• Summary: Starts with some data already collected and proceeds with activities to become more familiar with the data set:
  ▪ Identify data quality problems
  ▪ Discover insights into the data
  ▪ Detect subsets
  ▪ Extract hidden information
• Collect the initial data
  ▪ Acquire the data necessary to complete the data mining goals and the entire project
  ▪ Load the data, and possibly integrate it if you are taking data from multiple data sources
• Describe the data
  ▪ Examine the properties of the acquired data: do you have everything you need?
  ▪ EX. Data format, quantity of data, number of records, fields within each table, data type of each field
• Explore the data
  ▪ Start to tackle the data mining questions using querying, visualization, and reporting
  ▪ Aggregations, relationships between data, subsets of data
• Verify data quality
  ▪ Examine the quality of the data: Is everything you need there? Are there any gaps or missing values? Does the data make sense? Is the spelling consistent? Is there any ambiguity?
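The "describe" and "verify" tasks above can be sketched in a few lines. This is only an illustration using the Python standard library; the records, the field names (`id`, `region`, `sales`), and the helper functions `describe` and `quality_report` are all hypothetical.

```python
from collections import Counter

# Hypothetical raw records, as they might arrive from a source system.
records = [
    {"id": 1, "region": "Northeast", "sales": 1200.0},
    {"id": 2, "region": "Southeast", "sales": None},
    {"id": 3, "region": "northeast", "sales": 950.5},
    {"id": 3, "region": "northeast", "sales": 950.5},
]

def describe(rows):
    """Describe the data: record count, field names, value types per field."""
    fields = sorted({key for row in rows for key in row})
    types = {
        f: sorted({type(row.get(f)).__name__ for row in rows}) for f in fields
    }
    return {"n_records": len(rows), "fields": fields, "types": types}

def quality_report(rows, key="id"):
    """Verify quality: missing values per field and duplicated key values."""
    missing = Counter(f for row in rows for f, v in row.items() if v is None)
    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {"missing": dict(missing), "duplicate_keys": duplicates}

summary = describe(records)
report = quality_report(records)
# Distinct spellings of a categorical field expose inconsistencies.
region_values = sorted({row["region"] for row in records})

print(summary)
print(report)
print(region_values)
```

On this sample the report flags one missing `sales` value, a duplicated `id`, and two spellings of the same region, which are the kinds of issues the data preparation phase then has to resolve.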
Phase 3: Data Preparation
• Summary: 50% to 70% of the project time will be spent on this phase. It covers all the activities used to construct the final data set from the original raw data, and it takes many steps.
• Selecting tables, records, and attributes; converting and transforming data; cleaning data
• Select data
  ▪ Decide on the data to be used for analysis
  ▪ Define which attributes, records, and tables are selected
  ▪ Data types and data volume that you want
  ▪ Relevance to data mining goals
• Clean data
  ▪ Make sure data quality is at a high level
  ▪ Remove corrupt, inaccurate, or duplicate data from a table, record, or database
• Construct data
  ▪ This is where you start preparing the final data set
  ▪ Create derived attributes and new records; transform and format data (dates, for example)
• Integrate data
  ▪ This is where you combine information from multiple tables into one and create new records or values
  ▪ May involve joining multiple data sources
  ▪ Perform mathematical calculations on the data and group it in a certain way
• Format data
  ▪ This is any extra formatting required for the data set to be accepted by the modeling tool
  ▪ EX. The layout of the data, removing illegal characters
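As a rough illustration, the five preparation tasks (select, clean, construct, integrate, format) can be chained in one small pipeline. The `orders` and `stores` tables, their field names, and the helpers `prepare` and `to_csv` are hypothetical; a real project would more likely use a database or a data-frame library.

```python
import csv
import io
from datetime import date

# Hypothetical inputs: an orders table and a store-to-region lookup table.
orders = [
    {"order_id": 1, "store_id": "A", "date": "2023-01-15", "amount": 20.0, "note": "x"},
    {"order_id": 2, "store_id": "B", "date": "2023-02-03", "amount": None, "note": "y"},
    {"order_id": 1, "store_id": "A", "date": "2023-01-15", "amount": 20.0, "note": "x"},
    {"order_id": 3, "store_id": "A", "date": "2023-03-22", "amount": 35.5, "note": "z"},
]
stores = {"A": "Northeast", "B": "Southeast"}

def prepare(orders, stores):
    seen, rows = set(), []
    for o in orders:
        # Clean: drop duplicate orders and records with a missing amount.
        if o["order_id"] in seen or o["amount"] is None:
            continue
        seen.add(o["order_id"])
        # Select: keep only the attributes relevant to the goal (drops "note").
        row = {k: o[k] for k in ("order_id", "store_id", "amount")}
        # Construct: derive a month attribute from the date string.
        row["month"] = date.fromisoformat(o["date"]).month
        # Integrate: join the region in from the stores table.
        row["region"] = stores[o["store_id"]]
        rows.append(row)
    return rows

def to_csv(rows):
    """Format: emit the rows as CSV text in a fixed column order."""
    columns = ["order_id", "store_id", "amount", "month", "region"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

prepared = prepare(orders, stores)
csv_text = to_csv(prepared)
print(csv_text)
```

Dropping records with missing values is only one possible cleaning policy; imputing a value or flagging the record are alternatives, and the choice should be documented.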
Phase 4: Modeling
• Summary: Select a modeling technique for the finalized data set, based on the data mining goals and objectives. You will have to tune the parameter settings to optimize results, and compare the results if you used several modeling techniques.
• Select the modeling technique
  ▪ Select the actual modeling technique you will use on your data
  ▪ Examples: decision trees, sequential patterns, linear/logistic regression, clustering, categorical analysis, segmentation
• Generate test design
  ▪ Make sure you have a way to test the model's quality and validity
  ▪ Build the model on a training data set, then run it on a separate test data set to measure its accuracy
  ▪ EX. For categorical analysis, run the model on a test data set and compare its predictions to the real results. Did it categorize everything correctly? What was the error rate?
• Build the model
  ▪ Run the modeling technique on the prepared data set and see the results
• Assess the model
  ▪ Judge the success of the data mining model based on the results, the data mining success criteria, and the desired test design
  ▪ Contact business analysts and domain experts to discuss the results in a business context and see whether they make sense
  ▪ Consider whether it is a good model that can be given to others in the organization
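The test design, build, and assess tasks can be sketched as follows. This is a minimal illustration, not a recommended model: the data is synthetic (customers with 19 or fewer months of service are simply labeled as churned), the model is a one-threshold rule (a depth-one decision tree, sometimes called a decision stump) standing in for any of the techniques listed above, and all function names are hypothetical.

```python
import random

# Synthetic labeled data: (months_of_service, churned).
# By construction, customers with 19 or fewer months are labeled as churned.
data = [(m, m <= 19) for m in range(1, 41)]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Test design: shuffle a copy of the rows and hold out a fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def error_rate(threshold, rows):
    """Share of rows where 'churn if months <= threshold' is wrong."""
    wrong = sum((months <= threshold) != churned for months, churned in rows)
    return wrong / len(rows)

def fit_stump(rows):
    """Build the model: pick the threshold with the lowest training error."""
    candidates = sorted({months for months, _ in rows})
    return min(candidates, key=lambda t: error_rate(t, rows))

train, test = train_test_split(data)
threshold = fit_stump(train)
train_error = error_rate(threshold, train)
test_error = error_rate(threshold, test)  # Assess on data the model has not seen.

print(f"Threshold: churn if months_of_service <= {threshold}")
print(f"Training error rate: {train_error:.1%}")
print(f"Test error rate: {test_error:.1%}")
```

The error rate on the held-out test set, not on the training set, is the figure to bring to business analysts and domain experts, since it estimates how the model behaves on customers it was not built from.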
Phase 5: Evaluation
• Summary: Thoroughly evaluate the model. Review
the steps that were executed to construct the
model to make sure it properly aligns with the
business objectives. Make sure all important
business issues have been considered. At the end,
you should decide whether you want to keep this
data mining model and the results or not.
• Evaluate results
  ▪ Assess whether the model and results meet the business requirements
  ▪ Is there any reason this data mining model is deficient? Did it give you everything you wanted?
  ▪ Test the model multiple times in the real world
  ▪ Document any challenges, useful tips, information, and hints for future projects
• Review process
  ▪ Did we build the model correctly?
  ▪ Is there any important factor or task that we left out or overlooked?
• Determine next steps
  ▪ Decide how to proceed: move to deployment, run the model a few more times with new data sets, or set up new data mining projects
  ▪ Includes analysis of the remaining resources and budget
Phase 6: Deployment
• Summary: In this phase, you determine how the results will be used: who will use them, and how often? The model and the knowledge gained need to be presented in a way that clients can understand and that lets other people run the model throughout the organization. Deployment can be as simple as producing a report or as involved as implementing a repeatable data mining process across the enterprise.
• Plan deployment
  ▪ Take the results and develop a strategy for how they will be distributed throughout the organization
• Plan monitoring and maintenance
  ▪ Teach people how to independently operate and maintain the data mining model if it becomes part of the day-to-day business
  ▪ Teach people how to use the data mining results correctly
• Produce final report
  ▪ The project leader and team write up a final report
  ▪ Can be a summary of the project and experiences
  ▪ Can be a comprehensive presentation of the data mining results
Summary
• A way to design a data mining model that is reliable and repeatable by people with few data mining skills
• Provides a uniform framework
• Flexible enough to account for differences in data and in business problems and objectives