Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database Systems Group, Department of Computer Science University of Houston Advisor: Dr. Carlos Ordonez 1 Motivation • Naïve Bayes Classifier(NB) – One of the most popular and important classifiers in Machine Learning – Robust, Powerful, Fast to Compute And Easy to Understand • Programming Inside A DBMS – SQL can easily handle complex computations – UDFs can use arrays and processed in memory 2 Data Mining Inside A DBMS •Avoids Exporting the data outside the DBMS •Major overhead •Data Security •Scales Linearly with large data sets •Exploit parallelism provided by a DBMS •Use optimized queries with simple database operations •Objective: Push computations involving large data sets inside the DBMS Bayesian Classifier Based On K-Means (BKM) • A Generalization Of Naïve Bayes(NB) • The Algorithm – Initialization: Randomly initialize k clusters per class from the data set. – E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics. – M-Step: Re-compute cluster centers and radii. Check Convergence. • The E-Step and M-Step are repeated until model converges i.e clusters do not move 4 BKM: Finding the clusters per class Database Optimizations • Five different query optimization techniques for distance computation were introduced. • User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF. • Using CASE statement instead of aggregations. • Sufficient Statistics of the clusters were computed in a single table scan. 6 Comparing Accuracy – NB Vs BKM Vs DT •Global Accuracy: BKM better than NB and worse than DT(Decision Tree) in most cases Data Set Algorithm Global Class-0 Class-1 pima NB 76% 80% 68% BKM 76% 87% 53% DT 68% 76% 53% NB 70% 87% 45% BKM 73% 91% 43% DT 80% 85% 72% NB 50% 51% 30% BKM 59% 59% 60% DT 89% 96% 0% NB 93% 91% 95% BKM 93% 84% 97% DT 95% 94% 96% Spam •Class Breakdown Accuracy: BKM better than NB except 2 cases proving class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here and really worse in case of the bscale. Bscale Wbcancer 7 BKM Scalability- Varying n,d,k Times per Iteration. Defaults: d=4,k=4,n=100k 8 Comparing DBMS with MapReduce MapReduce: A distributed non-transactional high performance data intensive processing framework. Incremental Mining • An UDF performing incremental data mining exploiting data parallelism • Minimizing the number of scans(1-3) on the data set • Provides an approximation of the model before we scan through the complete data set • Requires thread safe sharing of the model without affecting performance Papers • Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220 • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado : Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010 • Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling, DKE 2010 • Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian Classifiers Programmed in SQL, TKDE 2008 • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado – Efficient Distance Computation Using SQL Queries and UDFs, ICDM 2008