Drew Minkin
◦ Past
    Analytics Architect at Zilliant
    Senior Consultant, Fujitsu
    6+ years Microsoft Services
        Escalation Engineer
        Dedicated Field Engineer (“Alliance”)
    Local speaker for SQL and BI
    OLAP Lecturer, SMU’s BI Graduate Certificate Program
◦ Present
    Business Intelligence Architect at FiServ ISV
    Part time data miner for hire

◦ Data Mining Intro
◦ DM Methodology
◦ Data Concepts
◦ Validating and Testing Models
◦ Applying Output with Scorecards

http://archive.ics.uci.edu/ml/
http://www.kdnuggets.com/

◦ Methodology
◦ Architecture
◦ Information Flow
◦ Technologies

◦ Problem Definition
◦ Data Modeling
◦ Data Discovery
◦ Analytics Modeling
◦ Applied Analytics
◦ Model Validation

Business case and non-technical details of predictive analytics inquiry
◦ Business objectives and success criteria
◦ Requirements, assumptions and constraints
◦ Project plan, risks and contingencies
◦ Data mining goals and success criteria
◦ Terminology, tools and techniques

Analysis of source data for structural and content gaps
◦ Data collection report
◦ Data description report
◦ Data exploration report
◦ Data quality report

Selection and manipulation of source data into a conformed entity input ready for formal exploration
◦ Dataset and dictionary and rationale
◦ Data cleansing report
◦ Derived attributes
◦ Generated, merged and reformatted data

Research and analysis of patterns and creation of data mining models
◦ Model
◦ Modeling technique
◦ Modeling assumptions
◦ Test design
◦ Parameter settings
◦ Model description

Testing data mining models using different algorithms and validation of statistical significance
◦ Revised parameter settings
◦ Model validation plan
◦ Model assessment

Integration of models with new data
◦ Deployment plan
◦ Monitoring and maintenance plan
◦ Final report
◦ Final presentation
◦ Experience documentation

Case – set of columns you want to analyze
◦ Age, Gender, Region, Annual Spending
Case Key – unique ID of a case
A column has:
◦ Data Type
◦ Content Type
◦ And optionally:
    Distribution
    Discretization
    Related Columns
    Flags (e.g. NOT NULL)

We don’t care about detailed low-level types
DM only uses:
◦ Text
◦ Long
◦ Boolean
◦ Double
◦ Date
◦ and by some 3rd party algorithms: Time and Sequence

Common:
◦ DISCRETE – Red, Blue
◦ CONTINUOUS – $6,511.49
◦ DISCRETIZED – 1-5, 6-20, 21+
Denotes a key:
◦ KEY
For special purposes:
◦ KEY SEQUENCE
◦ KEY TIME
◦ ORDERED
◦ CYCLICAL

Some algorithms interpret this in different ways, but in general, columns are for:
Input
◦ For predicting another column
PREDICT
◦ These columns are both predicted and act as inputs for predicting others
PREDICT_ONLY
◦ Not used as input
Columns can be input or predictable or both

When you don’t need to analyze the full continuous range, DM automatically converts data into buckets
◦ By default, into 5
Techniques:
◦ AUTOMATIC
◦ CLUSTERS
◦ EQUAL_AREAS
◦ THRESHOLDS

If you know the distribution of your data (you should), indicate it:
◦ NORMAL – Typical Gaussian bell-curve
◦ LOG NORMAL – Most values at the “beginning” of the scale
◦ UNIFORM – Flat line – equally likely or perfectly random
Other distributions can exist, but you cannot indicate them – the algorithm will work fine

Nested Case – case containing a table column
◦ Purchases of a Customer
Used for analyzing patterns in a relationship
It has a Nested Key
◦ Not a “relational” foreign key!
◦ Normally, the Nested Key is a column you want to analyze
    E.g.: Product Name or Model

Algorithms and Use Cases
◦ Association Rules
◦ Clustering
◦ Decision Trees
◦ Linear Regression
◦ Logistic Regression
◦ Naïve Bayes
◦ Neural Nets
◦ Sequence Clustering
◦ Time Series

Algorithm             Drillthrough   PMML   DM Dimension
Association           Yes            No     Yes
Clustering            Yes            Yes    Yes
Decision Trees        Yes            Yes    Yes
Linear Regression     Yes            No     No
Logistic Regression   No             No     No
Naïve Bayes           Yes            Yes    No
Neural Network        No             No     No
Sequence Clustering   Yes            No     Yes
Time Series           Yes            No     No

Field      Description
AVGGIFT    Average dollar amount of gifts to date
INCOME     Household income
LASTGIFT   Last donation amount
MAXRAMNT   Dollar amount of largest gift to date
MINRAMNT   Dollar amount of smallest gift to date
RAMNTALL   Dollar amount of lifetime gifts to date
WEALTH1    Wealth Rating
WEALTH2    Wealth Rating
STATE      State abbreviation (a nominal/symbolic field)

Donor Rank
DOMAIN/Cluster code. A nominal or symbolic field. It can be broken down by bytes as explained below.
◦ 1st byte = Urbanicity level of the donor's neighborhood
    U = Urban, C = City, S = Suburban, T = Town, R = Rural
◦ 2nd byte = Socio-economic status (SES) of the neighborhood
    1 = Highest SES, 2 = Average SES, 3 = Lowest SES
    Except for Urban communities: 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES

http://dejasu.wordpress.com/2008/01/28/knowledge-wisdom-other/question_mark.jpg

www.crisp-dm.org
www.sqlserverdatamining.com
◦ Masao Okada
◦ Rafal Lukawiecki
◦ Eugene A. Asahara

Data Mining in Action: A Case Study
Drew Minkin (madmanminkin)
◦ Evaluation
◦ Links
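
The discretization slide names EQUAL_AREAS as one technique and 5 as the default bucket count. A minimal Python sketch of that idea follows, assuming "equal areas" means each bucket holds roughly the same number of cases. It is an illustration only, not the product's actual algorithm; the function names and the spending values (apart from the $6,511.49 quoted on the content-type slide) are hypothetical.

```python
from bisect import bisect_right


def equal_area_cut_points(values, buckets=5):
    """Return buckets-1 cut points so each bucket holds about the same number of values.

    Assumption: a simple quantile split stands in for EQUAL_AREAS here.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("values must not be empty")
    # Cut point k is the value found k/buckets of the way through the sorted data.
    return [ordered[(k * n) // buckets] for k in range(1, buckets)]


def bucket_index(value, cut_points):
    """Map a value to a 0-based bucket; a value equal to a cut point goes to the upper bucket."""
    return bisect_right(cut_points, value)


# Hypothetical Annual Spending column (illustrative values only).
spending = [120.0, 340.5, 75.25, 6511.49, 980.0, 45.0, 2100.0, 310.0, 15.0, 560.0]
cuts = equal_area_cut_points(spending)
labels = [bucket_index(v, cuts) for v in spending]
```

With ten cases and five buckets, each bucket receives two cases, which is the point of the technique: skewed columns such as spending still produce evenly populated buckets.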
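
The DOMAIN/Cluster code described above packs two attributes into one nominal field, so it usually helps to split it into two derived attributes before modeling. The sketch below shows one way to do that in Python, assuming the code is exactly two characters; the function name `decode_domain` and the choice to return None for unrecognized codes are assumptions made for illustration.

```python
# Lookup tables taken from the byte definitions on the DOMAIN slide.
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}
SES_DEFAULT = {"1": "Highest SES", "2": "Average SES", "3": "Lowest SES"}
SES_URBAN = {
    "1": "Highest SES",
    "2": "Above average SES",
    "3": "Below average SES",
    "4": "Lowest SES",
}


def decode_domain(code):
    """Split a two-character DOMAIN code into (urbanicity, SES) labels.

    Returns None when the code is blank, malformed, or not in the tables
    (a hypothetical convention; a real pipeline might flag these instead).
    """
    code = code.strip().upper()
    if len(code) != 2:
        return None
    urbanicity = URBANICITY.get(code[0])
    # Urban neighborhoods use a four-level SES scale, all others three levels.
    ses_table = SES_URBAN if code[0] == "U" else SES_DEFAULT
    ses = ses_table.get(code[1])
    if urbanicity is None or ses is None:
        return None
    return urbanicity, ses
```

For example, `decode_domain("U4")` yields the Urban, Lowest SES pair, while `decode_domain("R4")` yields None because level 4 exists only on the Urban scale.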