Database Systems Group
Research Overview 2010

OLAP Statistical Tests
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Goal: Isolate factors that cause significant changes in a measured value
  – Ex: Increase in age causes increase in risk for heart disease
• Combined OLAP with Means Comparison Parametric Test
  – Used to pair similar groups and determine if they are significantly different
  – Want to reject the hypothesis that the two groups have the same mean
• Developed a GUI for easy user interaction

OLAP Statistical Tests
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Association Rules (AR): technique used to detect patterns within the items of a dataset
  – High Age, High Cholesterol => Heart Disease
• Compare results from both techniques
• OLAP Statistical Test discovered more rules than Association Rules
  – p-value is more reliable than confidence (considers the pdf)
  – OLAP is affected less by the distribution than AR
• AR better when performance is the priority and data is skewed
• OLAP Statistical Test better when data is distributed

OLAP Statistical Test versus Association Rules
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Blue and red lines represent the location of the averages of the two groups
  – Averages are fairly different from one another
• Confidence says that the two groups are similar
  – Many blue points above 50
  – Many red points above 50
  – Confidence is low

OLAP Exploration with UDF
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• On-Line Analytical Processing (OLAP)
  – Set of techniques allowing users to explore various aggregations of a dataset
  – Ex: dataset with day, month, year, sales
    • What were the average sales for Sundays?
    • Solve by grouping on day and then extracting Sunday
• Normally done outside the database or with OLAP servers
  – We want to study how to perform the same techniques inside the DBMS (SQL or UDF)
  – Found that users can efficiently perform OLAP exploration using UDFs

Digital Libraries in a DBMS
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
Information retrieval techniques have traditionally been applied outside relational database systems because of storage overhead, the complexity of fitting them into a relational model, and slower performance in SQL implementations. Searching and querying documents under information retrieval models in relational database systems can be performed with optimized SQL. We explore three phases:
• Document preprocessing.
• Document storage.
• Document retrieval (VSM, OPM, DPLM).

Keyword Search Across Documents and Databases
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
• Sometimes the meaning and structure of a database are unknown.
• There are external semi-structured sources that can help to describe it.
• We found that we can link these two worlds to identify relationships between the structured data and the semi-structured data.
• We believe the right approach is to do this inside the database.
• We implemented a prototype entirely in SQL.

Bayesian Statistics
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
• Latest trend in advanced statistics; very demanding in CPU time and on large data sets.
• Applied to microarray data in the DBMS.
• The problem involves high-dimensional data with few samples.
• Variable selection is the first issue that we have been trying to solve.
• Searching for the best model is computationally expensive (2^d candidate models, where d is the number of dimensions).
• With SQL optimizations and data layout modifications, selections over > 1 M dimensions take less than 3 seconds, but this is still not enough.
• Current work: Gibbs Sampler Variable Selection.

PCA
Mario Navas
Advisor: Dr. Carlos Ordonez
• Black-box
• Rotation of the input space
• Make the representative components evident
• No covariance between attributes
• Variance represented by the eigenvalues
• Deal with high dimensionality

DB Implementation
• Summary matrices n, L, Q
• Correlation matrix
• Eigenvalue decomposition problem

Outlier detection in microarray data
• Deal with high dimensionality
• Redundancy minimized
• Find distance-based outliers in a reduced space
[Figure panels: PCA-based Outliers [2D] vs. Distance-based Outliers [7D]; PCA-based Outliers [2D] vs. Distance-based Outliers [126]; Matching top 10]

Bayesian Classification Based On Decomposition via Clustering
Sasi Kumar Pitchaimalai
Advisor: Dr. Carlos Ordonez
• An extension of Naïve Bayes.
• Class decomposition of the Gaussians using clustering
• Using K-Means and E-M
• Scalability: query optimizations for computationally and memory intensive computations
• Incremental learning of the classifier

Computing Distance & Sufficient Statistics Using SQL & UDFs
Sasi Kumar Pitchaimalai
Advisor: Dr. Carlos Ordonez
• Five different SQL optimizations and one User Defined Function (UDF) to compute Euclidean distance in K-Means
• Sufficient statistics (count, linear sum and quadratic sum) for multiple clusters and multiple classes, computed in a single scan of the data set using SQL or a UDF.

Fast Bayesian Classifier Based on FREM
• The Algorithm
  – Initialization: Randomly initialize k clusters per class from the data set.
  – E-step: Compute the Mahalanobis distance, find the nearest cluster and then compute sufficient statistics.
  – M-step: Recompute the means, variances and weights of the clusters per class. Mixture parameters are updated in this step.
  – SplitClusters: Split heavy clusters to reach higher quality solutions and reseed low-weight clusters.
  – The E-step and M-step are iterated until the model converges.

Constrained Association Rules in SQL
Kai Zhao
Advisor: Dr. Carlos Ordonez
• Association rules are a data mining technique used to discover frequent patterns in a data set.
• Real-world applications of this technique are broad and include fields such as medicine and commerce.
• We can automatically generate efficient SQL queries for discovering association rules.

Comparison between CAR and DT
Kai Zhao
Advisor: Dr. Carlos Ordonez
• CAR perform an exhaustive combinatorial search, whereas DT recursively partition the input attribute space.
• CAR aim to find all rules above the given thresholds, whereas DT find regions in space where most records belong to the same class.
• CAR analyze item combinations, whereas DT select only one input attribute at a time.

Frequent Subgraph Mining
Kai Zhao
Advisor: Dr. Carlos Ordonez
• Frequent subgraph
  – A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
[Figure: graphs (A), (B), (C); frequent patterns (min support is 2)]
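
Illustrative sketch for the "DB Implementation" and "Computing Distance & Sufficient Statistics" slides: the deck names the summary matrices n, L, Q and the sufficient statistics (count, linear sum, quadratic sum) but gives no formulas. The Python below assumes the common definitions n = number of rows, L = sum of the points, Q = sum of the outer products x xᵀ, and derives a correlation matrix from them after a single scan. The function names and the sample rows are hypothetical, and the code is not the group's SQL or UDF implementation. The eigenvalue decomposition step is left out because it would need a linear algebra routine.

```python
import math


def summary_matrices(rows):
    """One scan over the data: count n, linear sum L, sum of outer products Q.

    Assumed definitions (not spelled out in the slides):
    n = number of rows, L[a] = sum of x[a], Q[a][b] = sum of x[a] * x[b].
    """
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def correlation_matrix(n, L, Q):
    """Pearson correlations computed from n, L, Q only (no second data scan).

    Raises ZeroDivisionError if some dimension has zero variance.
    """
    d = len(L)
    # n * Q[a][a] - L[a]^2 equals n^2 times the population variance of dimension a.
    scale = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [
        [(n * Q[a][b] - L[a] * L[b]) / (scale[a] * scale[b]) for b in range(d)]
        for a in range(d)
    ]


# Hypothetical sample: the second dimension is exactly twice the first.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 3.0), (3.0, 6.0, 4.0), (4.0, 8.0, 1.0)]
n, L, Q = summary_matrices(rows)
rho = correlation_matrix(n, L, Q)
```

Because n, L and Q are plain sums, they can be accumulated per cluster or per class in the same scan and merged afterwards, which is presumably what makes them attractive for SQL aggregates and UDFs.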
