Top 10 Data Mining Mistakes by
John Elder
You’ve made a mistake if you…
0. Lack Data
1. Focus on Training
2. Rely on One Technique
3. Ask the Wrong Question
4. Listen (only) to the Data
5. Accept Leaks from the Future
6. Discount Pesky Cases
7. Extrapolate
8. Answer Every Inquiry
9. Sample Casually
10. Believe the Best Model
Why business mistakes?
•  Not all data mining projects are successful
•  Approximately 90% of projects meet their
technical goals
•  Approximately 65% of solutions are actually
deployed at the client organization
Who benefits?
•  Business leaders: need to establish the best
environment and culture to ensure technical
and business success
•  Technical leaders: need to understand the
business obstacles that their business clients
are facing, and need to avoid these mistakes
both directly and indirectly
You might be making a data mining business mistake if you…
#1: Fail to define an objective
•  Without a clear objective, data mining can
be an exercise in futility
•  Find a pain point and identify the best
approach to solve it
#2: Start too big
•  Developing a transformational data mining
service can be a major undertaking
requiring extensive energy and resources
•  Unless there is complete organizational
investment, the task can be overwhelming
and quickly result in frustration and failure
•  Starting small allows the organization to get
a feel for what it takes to succeed
•  Shoot for an early “small success”
#3: Lack support from the keepers of the data
•  Modelers not only need timely access to
data, but also information about the data
•  Access is needed to the people who are
familiar with the data and know:
–  how it is collected and maintained
–  why it is messy and/or incomplete
–  what each data field means
–  how the data is used
–  how to ensure that the transformed data properly
represents the business use and understanding
#4: Wait for perfect data
•  No matter how long one works at it, data
will never be perfect
•  Good modelers expect to work with messy
data and have tools to deal with it
•  Give them the data you have and let them
go to work
#5: Believe you have perfect data
•  Understanding, cleansing, and preparing
data accounts for 65-80% of the time on data
mining engagements
•  Even with relatively clean data, this
preparation is a necessary step that takes
time for effective modeling
Recent Client with Perfect Data
•  Early (unexpected) insights
–  Call volumes are greater when HCP info is inconsistent, suggesting
that some outbound calls may have a primary purpose of data
verification, rather than order generation
•  Data anomalies
–  Sometimes Ship_Date precedes Call_Date
–  Some orders have multiple call dates, sometimes many months
apart.
•  Modeling decision
–  Use call rather than ship date, since this is causal to an order
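A minimal sketch of what such anomaly checks might look like in practice, in Python with pandas. The column names Call_Date and Ship_Date come from the slide; the file name, the Order_ID key, and everything else are illustrative assumptions, not part of the original material:

    import pandas as pd

    # Hypothetical orders extract; parse the two date fields named on the slide.
    orders = pd.read_csv("orders.csv", parse_dates=["Call_Date", "Ship_Date"])

    # Anomaly 1: Ship_Date sometimes precedes Call_Date.
    shipped_early = orders[orders["Ship_Date"] < orders["Call_Date"]]
    print(len(shipped_early), "orders shipped before the recorded call")

    # Anomaly 2: some orders carry multiple call dates, sometimes months apart.
    spans = orders.groupby("Order_ID")["Call_Date"].agg(["nunique", "min", "max"])
    multi = spans[spans["nunique"] > 1]
    print(len(multi), "orders with more than one call date;",
          "largest gap:", (multi["max"] - multi["min"]).max())

Surfacing these issues before modeling is what motivates the decision to key the model on the call date, the event causally upstream of the order, rather than the ship date.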
#6: Rely too heavily on software
•  Lots of good software on the market
•  Even the best software requires expert
users (data miners) to make it work
•  Software is a tool for building models
that produce valuable outputs
•  Expecting the tool to do it all results in
wasted money and shelf space
The 9 Levels of Analytics
Descriptive Techniques:
1 – Standard Reporting
2 – Custom Reporting or “Slicing and Dicing” the Data (Excel)
3 – Queries/drilldowns (SQL, OLAP)
4 – Dashboards/alerts (Business Intelligence)
5 – Statistical Analysis
6 – Clustering (Unsupervised Learning)
Predictive Techniques:
7 – Predictive Modeling
8 – Optimization & Simulation
9 – Next Generation Analytics – Text Mining & Link Analysis
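To make the descriptive/predictive split concrete, here is a minimal sketch contrasting level 6 (clustering) with level 7 (predictive modeling), using scikit-learn on synthetic data; none of the names or figures in it come from the slides:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a customer table: X holds behavioral features,
    # y holds an outcome flag such as churn.
    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    # Level 6 (descriptive): unsupervised clustering groups similar records
    # without reference to any outcome.
    segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Level 7 (predictive): supervised modeling learns to predict a specific
    # outcome and is judged on held-out data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("Holdout accuracy:", model.score(X_test, y_test))

The step from level 6 to level 7 is the shift from describing what the data already contains to making testable predictions about cases not yet seen.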
#7: Don’t understand the different levels of analytics
•  A lot of people are caught up in the buzz of
data mining and analytics in general
•  A lack of understanding of the different
levels of analytics can result in wasted
money and time
•  Understanding the different levels of
analytics can help an organization to better
focus resources on the right solution
#8: Exclude the domain SMEs
•  Having expert data miners is not enough
•  SMEs are needed to:
–  Provide business understanding
–  Provide common sense checks to the modeling
process
–  Maximize use of the models
•  Buy-in from the internal SME helps to make
believers out of the non-SMEs
#9: Don’t plan for deployment
•  Building the models is only the start
•  Deployment within the organization’s
infrastructure can be a larger effort in terms
of resources and time
•  Need to decide on the deliverable format
and work with IT to figure out how it will be
accomplished
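As one possible deliverable format among many, here is a minimal sketch of a batch scoring job that IT could schedule; it assumes a scikit-learn model, and every file and column name is hypothetical rather than taken from the slides:

    import joblib
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Modeling side: fit and persist the model as the agreed hand-off artifact
    # (synthetic data stands in for the real project data).
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    joblib.dump(RandomForestClassifier(random_state=0).fit(X, y), "model.joblib")

    # Deployment side: a batch job run against the production extract.
    def score_batch(input_path: str, output_path: str) -> None:
        model = joblib.load("model.joblib")
        records = pd.read_csv(input_path)
        records["score"] = model.predict_proba(records)[:, 1]
        records.to_csv(output_path, index=False)

Even a sketch this small surfaces the deployment questions early: where the model artifact lives, what format the production extract arrives in, and who owns the schedule.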
#10: Rush the process
•  Good modeling is a methodical, iterative
process.
•  Working with the data can take 65-80% of
the project time
•  Cutting corners will result in weak or
incorrect models
CRISP-DM (the Cross-Industry Standard Process for Data Mining: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment)
Summary
1.  Fail to define an objective
2.  Start too big
3.  Lack support from the keepers of the data
4.  Wait for perfect data
5.  Believe you have perfect data
6.  Rely too heavily on software
7.  Don’t understand the different levels of analytics
8.  Exclude the domain SMEs
9.  Don’t plan for deployment
10.  Rush the process
Copyright © 2012 Elder Research, Inc.