Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Understanding Data Mining Craig A. Stevens, PMP, CC [email protected] www.westbrookstevens.com Examples of Classical Statistical Methods Latitude 36.19N and Longitude -86.78W Nashville, TN, USA Yi = a + bxi + e Multiple Regression http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm Multiple Regression http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm Multiple Regression http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm Multiple Regression http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm Multiple Regression http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm Data Mining http://datamining.typepad.com/photos/uncat egorized/livejournal.png What is Data Mining? • The process of identifying hidden patterns, trends, and relationships in large quantities of data. Why Do Data Mining? • To discover useful information for making decisions. • Too many variables for Classical Statistical methods to work. – Large Number of Records 108 - 1012 • Gigabyte – Terabyte – High Dimensional Data • Lots of Variables (10 – 104 attributes) The Huber-Wegman Taxonomy of Data Set Sizes Descriptor Tiny Small Data Set Size in Bytes 10^2 10^4 Medium 10^6 Large Huge Massive 10^8 10^10 10^12 Super Massive 10^15 Storage Mode Piece of Paper A few Pieces of Paper A Floppy Disk Hard Disk Multiple Hard Disks Robotic Magnetic Tape Storage Silos Distributed Data Archives Name BAD Model Role Target Measurement Level Binary Description CLAGE Input Interval Age of oldest trade line in months CLNO Input Interval Number of trade lines DEBTINC Input Interval Debt-to-income ratio DELINQ Input Interval Number of trade lines DEROG Input Interval Number of major derogatory reports JOB Input Nominal LOAN Input Interval MORTDUE Input Interval NINQ Input Interval REASON Input Binary VALUE Input Interval Six occupational categories Amount of the loan request Amount due on existing mortgage Number of recent credit inquiries DebtCon=debt consolidation, HomeImp=home improvement Value of current property YOJ Input Interval Years at present job 1=client defaulted on loan 0=loan repaid SAS Enterprise Miner Objects Shows the Cut off Point is 6 Variables Small Number of Useful Variables Comparing Methods and Profit vs Marketing Cost Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 Clustering As in Different Brands 2 0 1 - 1 0 1 0 - 1 0 1 - 1 0 1 0 1 H 4 H 4 P CR2 _ 1 M O IS _I 9B O 0 - 1 J - 1 _ 1 2 O 0 L J A L - 1 0 1 - 1 0 1 2 3 - 1 0 1 2 - 1 0 1 _ C A 1 C - 1 0 0 2 Z Z M O IS _I 9B S S - 1 B B _ R R 0 A A _ C C 1 M O IS _I 9B Q G - 1 H Q _ G I H D _ O S I 1 M O IS _I 9B D 2 3 - 1 6 O S 6 D O O 0 J D _ J H S _ 1 S H A A 2 M O IS _I 9B J - 1 L C J F C L _ F 0 _ T 2 F A 1 T 3 - 1 1 A F 0 M O IS _I 9B 2 - 1 3 R 0 3 T _ O 1 T R P P CR3 _ 1 P CR1 _ 1 0 0 0 0 0 P R O T_TR 3 1 P R O T_TR 3 1 P R O T_TR 3 1 P R O T_TR 3 1 P R O T_TR 3 1 2 2 2 2 2 M O IS _I 9B P R O T_TR 3 FA T_FC LJ A S H _JO D 6 S O D I _H G QC A R B _S Z0 C A L_JO H 4 1 2 - 1 0 1 - 1 0 1 2 3 - 1 0 1 2 4 H - 1 O 0 J _ L A C 0 Z S _ B R A C Q G H _ I D O S 6 D O J _ H S A - 1 - 1 - 1 - 1 0 1 1 1 1 FA T_FC LJ 0 FA T_FC LJ 0 FA T_FC LJ 0 FA T_FC LJ 2 2 2 2 3 3 3 3 1 2 - 1 0 1 - 1 0 1 2 4 H - 1 O 0 J _ L A C 0 Z S _ B R A C Q G H _ I D O S 3 - 1 - 1 - 1 1 1 1 A S H _JO D 6 0 A S H _JO D 6 0 A S H _JO D 6 0 2 2 2 1 2 - 1 0 1 4 H - 1 O 0 J _ L A C 0 Z S _ B R A C - 1 - 1 0 0 1 S O D I _H G Q 1 S O D I _H G Q 2 2 3 3 1 2 4 H - 1 O 0 J _ L A C - 1 0 C A R B _S Z0 1 Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/ Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/ National Energy Research Scientific Computing Center SurfStat A Matlab toolbox for the statistical analysis of univariate and multivariate surface and volumetric data using linear mixed effects models and random field theory Keith J. Worsley Latitude 36.19N and Longitude -86.78W Nashville, TN, USA Genealogical Tree On You Tube http://www.youtube.com/watch?v=CnniJR5Ah7g