Data Mining
With SQL Server Data Tools
Mining Data Using Tools You Already Have
Introductions
 Annelies Beaty, Manager Enterprise Data Strategy at US Xpress
 Played many roles at US Xpress over the years
 My current role is to architect how Enterprise level data is managed and
presented to the organization as a whole.
 Development Practices and Guidelines
 Tool evaluation
 Step in and get my hands ‘dirty’ whenever possible or needed.
Data Mining (or Data Science)
 What do we mean by Data Mining?
 Process by which large sets of data can be analyzed for actionable information
 Look for answers to questions over data sets so large you can get lost in them.
 Find relationships that are too complex to be seen.
 Types of Data Mining Scenarios
 Forecasting – Predict future outcomes based on past experience
 Risk and Probability – Based on past results, which factors lead to the results we want?
 Recommendations – Based on ‘experience’, what else do we think goes with this set?
 Finding Sequences – What are the frequent paths or steps taken through a system of
possible steps?
 Grouping – Separating the dataset into clusters of ‘like’ objects; determining affinity
High Level Data Management
New business question is asked → Answer is delivered
THIS TAKES TIME AND RESOURCES
Challenge with ‘Best Practice’ EDW
 EDW development is methodical. Designed to answer a specific related set of
questions around a business process.
 Time to Deliver Results
 Sometimes an ‘Overkill’ solution.
 Sometimes an Incomplete solution.
 Interesting Fact from a TDWI conference I attended about 2 years ago on
Operational Intelligence:
“50% of traditional data warehouses are not used in daily decision making”
 What if the lifespan of the current ‘question of the day’ is very short? OR what if you don’t even know the question?
So A Real Challenge
Data mining over (potentially) incomplete data, fast enough to get the business the
results it needed yesterday to make a strategic business decision.
And can we do it without significant investment in new tools?
Gartner Magic Quadrant - Microsoft
 Business Intelligence
 Leader Quadrant for Completeness of Vision and Ability To Execute
 SQL Server/SSIS/SSRS is a complete solution.
 Product Quality, availability of skills, low implementation costs, alignment with existing infrastructure
 Lacking a true Metadata Management solution and visualizations are not as good as some other
vendors (but improving)
 Advanced Analytics
 Perhaps not so much – still a Niche Player
 Product Quality, availability of skills, low implementation costs, alignment with existing infrastructure
 Availability of analytics gives MS great reach into organizations that can serve as a springboard for
future development
 SSAS still lacks depth, breadth, and usability when compared to the leaders.
 However – MS is expected to put a lot of energy into this space and has the means to do so.
Source: Gartner Research (early/mid 2014)
Demo 1 – Setting up a project
 Availability of Skills
 Don’t need to wait for SSAS
 Easily works with existing infrastructure
 Cost – If you have a SQL Server installation, you can do this now.
 Set up and briefly explore a Decision Tree model. Show the new objects
on the back-end SSAS server and run a single predictive query.
Predictive Algorithms
 Decision Tree:
 Presents the data as a series of ‘decisions’ used to reach the conclusion. A new
branch is added when a significant correlation is found between the input and
predicted variables.
 Clustering:
 Presents the input data as groups of entities with a high correlation of common
attributes. Can also be used simply to profile the data; prediction is optional.
 Naïve Bayes:
 Quick method to analyze relationships between input and predictable columns.
Less computationally intensive, but also less accurate. However, it can be used to
help define inputs for more accurate, but costlier, solutions.
 Neural Network:
 Complex algorithm that evaluates every possible combination of inputs and
outcome(s).
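As a rough illustration of how a decision tree chooses where to add a branch, the sketch below scores each discrete input by information gain against the predicted column and picks the strongest one. Information gain is a common textbook scoring method used here as a stand-in; it is not claimed to be the scoring SSAS applies. The column names and rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one discrete attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(rows, attributes, target):
    """Return the attribute most strongly related to the target."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy data: 'buyer' is the predicted column.
ROWS = [
    {"commute": "short", "owns_car": "no", "buyer": 1},
    {"commute": "short", "owns_car": "yes", "buyer": 1},
    {"commute": "short", "owns_car": "no", "buyer": 1},
    {"commute": "long", "owns_car": "yes", "buyer": 0},
    {"commute": "long", "owns_car": "no", "buyer": 0},
    {"commute": "long", "owns_car": "yes", "buyer": 0},
]

if __name__ == "__main__":
    for attr in ("commute", "owns_car"):
        print(attr, round(information_gain(ROWS, attr, "buyer"), 3))
    print("first branch on:", best_split(ROWS, ["commute", "owns_car"], "buyer"))
```

In this toy set the commute column separates buyers from non-buyers completely, so it would become the first branch; owns_car carries almost no signal and would not earn a branch of its own.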
Results – Singleton Query
Query Inputs → RESULTS
[Slide figure: one result grid per model (Decision Tree, Cluster, Naïve Bayes, Neural Network). The transcript does not preserve which grid belongs to which model.]

Buyer   % Chance   Support
0       57.79      951
1       42.20      694

Buyer   % Chance   Support
0       61.74      975
1       38.25      604

Buyer   % Chance   Support
0       67.44      12465.68
1       32.55      6017.32

Buyer   % Chance   Support
0       74.79      7478.95
1       25.20      2520.04
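One way to read the singleton-query numbers: in each result set, % Chance sits very close to each Buyer state's share of the total Support. The sketch below checks that arithmetic using the values from the slide. This is only an observation about these figures, not a statement of how SSAS computes probability; the small gaps may come from rounding or from how the engine treats missing values, which is an assumption.

```python
def support_share(supports):
    """Convert per-state support counts into percentage shares."""
    total = sum(supports.values())
    return {state: 100.0 * n / total for state, n in supports.items()}


# (Support by Buyer state, reported % Chance) copied from the slide's four result sets.
REPORTED = [
    ({0: 951, 1: 694}, {0: 57.79, 1: 42.20}),
    ({0: 975, 1: 604}, {0: 61.74, 1: 38.25}),
    ({0: 12465.68, 1: 6017.32}, {0: 67.44, 1: 32.55}),
    ({0: 7478.95, 1: 2520.04}, {0: 74.79, 1: 25.20}),
]


def max_gap(reported=REPORTED):
    """Largest difference (in percentage points) between share and reported value."""
    gaps = []
    for supports, chances in reported:
        shares = support_share(supports)
        gaps.extend(abs(shares[s] - chances[s]) for s in supports)
    return max(gaps)


if __name__ == "__main__":
    for supports, chances in REPORTED:
        print({s: round(v, 2) for s, v in support_share(supports).items()}, chances)
    print("largest gap:", round(max_gap(), 3))
```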
Demo
 Exploration of 4 predictive data mining models.
Strengths of the predictive models
 Naïve Bayes – Simplest computationally. May be used up front to start the
analysis since it processes faster. Use the results to refine the criteria for
additional analysis with more complex tools. Note: cannot use continuous data
as an input.
 Decision Tree – Used to predict outcomes based on past data, both
discrete and continuous.
 Clustering – Used to segment the dataset. A predictable outcome is not
required, which makes it useful for detecting anomalies in the data.
 Neural Network – Most complex. Can detect rules and relationships other
methods can’t. Good use cases include those with a large number of inputs
and relatively few outputs: text mining, (stock) market analysis,
manufacturing processes.
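Because Naïve Bayes cannot take continuous inputs, a continuous column has to be bucketed into discrete ranges first. The sketch below shows one simple approach, equal-width bucketing. It is illustrative only: SSAS has its own discretization options, and the function name and sample incomes are hypothetical.

```python
def equal_width_bins(values, bucket_count):
    """Map each continuous value to a bucket index from 0 to bucket_count - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bucket_count

    def label(v):
        if width == 0:  # every value is identical, so one bucket is enough
            return 0
        # The maximum value lands exactly on the upper edge; clamp it into the last bucket.
        return min(int((v - lo) / width), bucket_count - 1)

    return [label(v) for v in values]


if __name__ == "__main__":
    incomes = [20000, 35000, 50000, 65000, 80000, 120000]  # hypothetical values
    print(equal_width_bins(incomes, 4))
```

Equal-width buckets are easy to explain but can leave some buckets nearly empty when the data is skewed; equal-count buckets are a common alternative in that case.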
Other Data Mining Algorithms
 Time Series: Allows us to use historical data to extrapolate a likely value at some
point in the future.
 Example: Predict Expected Sales By Region
 Association Algorithm: Used to detect associations between items or events –
the more frequently items or events occur together, the higher the correlation
and the probability that if one occurs, the other will too.
 Example – customers who bought this also bought this (Amazon)
 Sequence Algorithm: Clusters sequences of events; Similar to cluster algorithm.
 Example: Common paths through a website, or application.
 Linear Regression Algorithm: Allows us to explore the linear relationship between
variables. In SSAS, a variation on the Decision Trees algorithm.
 Example: Compute a trend line over sales and marketing data.
 Logistic Regression: In SSAS, a variation of the Neural Network algorithm, used to model binary outcomes.
 Example: Use demographics to determine the likelihood of a predicted outcome, such as disease.
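The "customers who bought this also bought this" idea behind the association algorithm can be sketched as a co-occurrence count: for each pair of items, how often does the second appear in baskets that contain the first? The code below computes that ratio (often called confidence) over a handful of hypothetical baskets. It is a minimal sketch of the concept, not the algorithm SSAS implements.

```python
from collections import Counter
from itertools import combinations


def pair_confidence(baskets):
    """confidence(a -> b) = baskets containing both a and b / baskets containing a."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)  # ignore duplicate items within one basket
        item_counts.update(items)
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical purchase history.
BASKETS = [
    ["bike", "helmet"],
    ["bike", "helmet", "lock"],
    ["bike", "lock"],
    ["helmet"],
]

if __name__ == "__main__":
    for (a, b), c in sorted(pair_confidence(BASKETS).items()):
        print(f"bought {a} -> also bought {b}: {c:.2f}")
```

Note that the ratio is directional: every basket with a lock also has a bike, but only two of the three baskets with a bike have a lock.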
Resources
 MSDN has substantial documentation and tutorials to bring you up to speed
on each algorithm
 SQL Server Central (a Redgate community site) has a step-by-step series of Data
Mining articles that takes you all the way through the MSDN tutorials
on basic Data Mining and then how to leverage them…
 SSIS packages to build them, exploration via Excel data mining tools, Power BI
suite.