Data Mining With SQL Server Data Tools
Mining Data Using Tools You Already Have

Introductions
  Annelies Beaty, Manager, Enterprise Data Strategy at US Xpress
  Played many roles at US Xpress over the years
  My current role is to architect how enterprise-level data is managed and presented to the organization as a whole.
  Development practices and guidelines
  Tool evaluation
  Step in and get my hands ‘dirty’ whenever possible or needed.

Data Mining (or Data Science)
What do we mean by Data Mining?
  The process by which large sets of data can be analyzed for actionable information
  Look for answers to questions over data sets so large you can get lost in them.
  Find relationships that are too complex to be seen.

Types of Data Mining Scenarios
  Forecasting – Predict future outcomes based on past experience.
  Risk and Probability – Based on past results, what factors lead to the results we want?
  Recommendations – Based on ‘experience’, what else do we think goes with this set?
  Finding Sequences – What are the frequent paths or steps taken through a system of possible steps?
  Grouping – Separating the dataset into clusters of ‘like’ objects; determining affinity.

High Level Data Management
  New business question is asked
  Answer is delivered
  THIS TAKES TIME AND RESOURCES

Challenge with ‘Best Practice’ EDW
  EDW development is methodical.
  Designed to answer a specific, related set of questions around a business process.
  Time to deliver results
  Sometimes an ‘overkill’ solution.
  Sometimes an incomplete solution.

Interesting fact from a TDWI conference on Operational Intelligence that I attended about 2 years ago:
  “50% of traditional data warehouses are not used in daily decision making”
  What if the lifespan of the current ‘question of the day’ is very short?
  Or what if you don’t even know the question?
So A Real Challenge
  Data mining over (potentially) incomplete data, fast enough to get the results needed by the business yesterday to make a strategic business decision
  And can we do it without significant investment in new tools?

Gartner Magic Quadrant - Microsoft
  Business Intelligence
    Leader Quadrant for Completeness of Vision and Ability To Execute
    SQL Server/SSIS/SSRS is a complete solution.
    Product quality, availability of skills, low implementation costs, alignment with existing infrastructure
    Lacking a true Metadata Management solution, and visualizations are not as good as those of some other vendors (but improving)
  Advanced Analytics
    Perhaps not so much – still a Niche Player
    Product quality, availability of skills, low implementation costs, alignment with existing infrastructure
    Availability of analytics gives MS great reach into organizations that can serve as a springboard for future development.
    SSAS still lacks depth, breadth, and usability when compared to the leaders.
    However, MS is expected to put a lot of energy into this space and has the means to do so.
  Source: Gartner Research, early/mid 2014

Demo 1 – Setting up a project
  Availability of skills
  Don’t need to wait for SSAS
  Easily works with existing infrastructure
  Cost – If you have a SQL Server installation, you can do this now.
  Set up and cursorily explore a Decision Tree model.
  Show the new objects on the backend SSAS server, single predictive query.

Predictive Algorithms
  Decision Tree: Presents the data as a series of ‘decisions’ used to reach the conclusion. A new branch is added when a significant correlation is found between the input and predicted variables.
  Clustering: Presents the input data as groups of entities with a high correlation of common attributes. ** Can be used to simply profile the data. Prediction is optional. **
  Naïve Bayes: Quick method to analyze relationships between input and predictable columns; less intense, but also less accurate.
  However, it can be used to help define inputs for more accurate, but costly, solutions.
  Neural Network: Complex algorithm that evaluates every possible combination of inputs and outcome(s).

Results – Singleton Query
  Query Inputs
  RESULTS

  Decision Tree, Cluster
  Buyer  % Chance  Support
  0      57.79     951
  1      42.20     694

  Naïve Bayes
  Buyer  % Chance  Support
  0      61.74     975
  1      38.25     604

  Neural Network
  Buyer  % Chance  Support     Buyer  % Chance  Support
  0      67.44     12465.68    0      74.79     7478.95
  1      32.55     6017.32     1      25.20     2520.04

Demo
  Exploration of 4 predictive data mining models.

Strengths of the predictive models
  Naïve Bayes – Simplest computationally. May use up front to start the analysis since it processes faster. Use the results to refine the criteria for additional analysis with more complex tools. ** Cannot use continuous data as an input **
  Decision Tree – Used to predict outcomes based on past data, both discrete and continuous.
  Clustering – Used to segment the dataset. Use of a predictable outcome is not required, which makes it useful for detecting anomalies in the data.
  Neural Network – Most complex; can detect rules and relationships other methods can’t. Good use cases include those with a large number of inputs and relatively few outputs: text mining, (stock) market analysis, manufacturing processes.

Other Data Mining Algorithms
  Time Series: Allows us to use historical data to extrapolate a likely value at some point in the future. Example: Predict expected sales by region.
  Association Algorithm: Used to detect associations between items or events – the more frequently items or events occur together, the higher the correlation and the probability that if one occurs, the other will too. Example: Customers who bought this also bought this (Amazon).
  Sequence Algorithm: Clusters sequences of events; similar to the clustering algorithm. Example: Common paths through a website or application.
  Linear Regression Algorithm: Allows us to explore the linear relationship between variables. Variation on Decision Trees.
  Example: Compute a trend line over sales and marketing data.
  Logistic Regression: Variation of Neural Network used to model binary outcomes. Example: Use demographics to determine the likelihood of a predicted outcome, such as disease.

Resources
  MSDN has substantial documentation and tutorials to bring you up to speed on each algorithm.
  SQL Server Central (a Redgate community site) has a step-by-step Data Mining series of articles that takes you all the way through the MSDN tutorials on basic data mining and then how to leverage them… SSIS packages to build them, exploration via Excel data mining tools, Power BI suite.
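
Appendix sketch: trend line over sales and marketing data
  The linear regression example in this deck (a trend line over sales and marketing data) can be illustrated outside SQL Server with a few lines of Python. This is only a minimal sketch of an ordinary least-squares fit, not the Microsoft Linear Regression algorithm itself. The function name fit_trend_line and the spend and sales figures are hypothetical, made up so that the fitted line is easy to check by hand.

```python
from statistics import mean


def fit_trend_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Hypothetical helper for illustration; not part of any SQL Server tooling.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean; zero means every x is identical.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Co-variation of x and y around their means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data: marketing spend and resulting sales (same arbitrary units).
marketing_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]

slope, intercept = fit_trend_line(marketing_spend, sales)
print(f"sales ~= {slope:.2f} * spend + {intercept:.2f}")
```

  With real data the points will not sit exactly on a line, so the slope is better read as an average effect of spend on sales than as an exact rule.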