Join size estimation via correlated sampling
Type: Research Paper
Authors: David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, Sunil P. Chakkappen (Oracle)
Presented by: Siddhant Kulkarni, Fall 2015

Query cost: estimating the "size" of the resultant table if a join is performed on two or more tables, while accommodating the predicates. Related work for this paper focuses on sampling techniques such as bifocal sampling and end-biased sampling, and on sketch-based methods for join size estimation. The paper proposes correlated sampling, a novel sketch-based join size estimation technique.

Reformulation-based RDF query answering
Type: Demonstration Paper
Authors: Damian Bursztyn (University of Paris-Sud), Francois Goasdoue (University of Rennes), Ioana Manolescu (University of Paris-Sud)
Presented by: Siddhant Kulkarni, Fall 2015

RDF stands for Resource Description Framework, and RDF data can be queried; processing queries and displaying their results is called query answering. There are two ways of query answering: saturation-based (SAT) and reformulation-based (REF), the latter being the problematic one. The demo showcases a large set of REF techniques, along with one the authors presented in another paper. The demo system lets you: pick an RDF graph and visualize its statistics; choose a query and an answering method; observe the evaluation in real time; and modify the RDF data and re-evaluate.

DATAXRAY
Authors: Xiaolan Wang, Mary Feng, Yue Wang, Xin Luna Dong, Alexandra Meliou (University of Massachusetts, University of Iowa, Google Inc.)
Presented by: Omar Alqahtani, Fall 2015

Retrieving high-quality datasets from voluminous and diverse sources is crucial for many data-intensive applications, but the retrieved datasets often contain noise and other discrepancies. Traditional data-cleaning tools mostly try to answer "Which data is incorrect?" This demonstration presents DATAXRAY, a general-purpose, highly scalable tool that explains why and how errors happen in a data generative process. It answers questions such as "Why are there errors in the data?" and "How can I prevent further errors?" by finding groupings of errors that may be due to the same cause.
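Stepping back to the first summary for a moment: correlated sampling keeps a tuple only when a shared hash of its join key falls below a threshold p, so tuples with matching keys survive in both samples together. A toy sketch of the estimator (a simplification; the paper's actual sketches differ, and all names here are mine):

```python
import hashlib
from collections import Counter

def h(key):
    """Hash a join-key value to a pseudo-random point in [0, 1)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest[:8], 16) / 2**32

def correlated_sample(rows, key, p):
    # The same h() is used on both tables, so tuples with matching
    # join keys are kept or dropped together -- the "correlated" part.
    return [r for r in rows if h(key(r)) < p]

def estimate_join_size(r_rows, s_rows, r_key, s_key, p):
    r_sample = correlated_sample(r_rows, r_key, p)
    s_sample = correlated_sample(s_rows, s_key, p)
    # A key survives in *both* samples with probability p (not p^2),
    # so dividing the sampled join size by p is unbiased.
    s_counts = Counter(s_key(s) for s in s_sample)
    sampled_join = sum(s_counts[r_key(r)] for r in r_sample)
    return sampled_join / p
```

With p = 1.0 every tuple is kept and the estimate is exact; smaller values of p shrink the samples at the cost of variance.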
How does it work? DATAXRAY identifies these groups based on their common characteristics (features). Features are organized in a hierarchical structure based on simple containment relationships, and DATAXRAY uses a top-down algorithm to explore the feature hierarchy and identify the set of features that best summarizes all erroneous data elements. It uses a cost function based on Bayesian analysis to derive the set of features with the highest probability of being associated with the causes of the mistakes in a dataset.

WADaR
Presented by: Ranjan_KY, Fall 2015

Web scraping (or wrapping) is a popular means of acquiring data from the web. Modern techniques have made scalable wrapper generation possible and enabled data-acquisition processes involving thousands of sources, yet no scalable tools exist to support these tasks. Modern wrapper-generation systems leverage a number of features ranging from HTML and visual structure to knowledge bases and microdata. Nevertheless, automatically generated wrappers often suffer from errors, resulting in under- or over-segmented data together with missing or spurious content. Under- and over-segmentation of attributes is commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node; incorrect column types are instead associated with a lack of domain knowledge, supervision, or microdata during wrapper generation. The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions produce cleaner data.

WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper's execution. A key observation is that errors in the extracted relations are likely to be systematic, since wrappers are often generated from templated websites.
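Because the errors are systematic, a single induced pattern can repair every record produced by the same template. A toy sketch of the re-segmentation step, with a hard-coded pattern standing in for what WADaR induces from entity annotations (all names and data here are hypothetical):

```python
import re

def induce_pattern():
    # In WADaR this regular expression is induced automatically; here it is
    # hard-coded: a product name and a dollar price fused into one field,
    # the kind of error caused by two attributes sharing one DOM node.
    return re.compile(r"^(?P<name>.*?)\s*\$(?P<price>\d+(?:\.\d{2})?)$")

def repair(rows, pattern):
    """Re-segment each one-column row into (name, price) using the pattern."""
    fixed = []
    for (value,) in rows:
        m = pattern.match(value)
        if m:
            fixed.append((m.group("name"), m.group("price")))
        else:
            fixed.append((value, None))  # leave unmatched rows untouched
    return fixed
```

One learned pattern, applied uniformly, repairs all rows emitted by the same faulty wrapper.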
WADaR's repair process:
(i) annotating the extracted relations with standard entity recognizers;
(ii) computing Markov chains describing the most likely segmentation of attribute values in the records; and
(iii) inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.

Related work (not evaluated in detail in this paper):
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. TEGRA: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.

ConfSeer
Authors: Rahul Potharaju, Joseph Chan, Luhui Hu, Mingshi Wang, Liyuan Zhang (Microsoft); Cristina Nita-Rotaru (Purdue University); Navendu Jain (Microsoft Research)
Presented by: Zohreh Raghebi

Configuration errors have a significant impact on system performance and availability. For instance, a misconfiguration in the user-authentication system caused login problems for several Google services, including Gmail and Drive, and a software misconfiguration in Windows Azure caused a 2.5-hour outage in 2012. Many configuration errors are due to faulty patches, e.g., changed file paths causing incompatibility with other applications, or empty Registry fields left behind by failed uninstallations. Unfortunately, troubleshooting misconfigurations is time consuming, hard, and expensive.
First, today's software configurations are becoming increasingly complex and large, comprising hundreds of parameters and their settings. Second, many of these errors manifest as silent failures that leave users clueless: they either search online or contact customer service and support (CSS), at a loss of productivity, time, and effort.

Several research efforts have proposed techniques to identify, diagnose, and fix configuration problems, and some commercial tools are available to manage system configurations or automate certain configuration tasks. However, many of these approaches either assume the presence of a large set of configurations to apply statistical testing (e.g., PeerPressure), periodically checkpoint disk state (e.g., Chronus), risking high overheads, or use data-flow analysis for error tracing (e.g., ConfAid).

This paper presents the design, implementation, and evaluation of ConfSeer, a system that aims to proactively find misconfigurations on user machines using a knowledge base (KB) of technical solutions. ConfSeer focuses on parameter-related misconfigurations, as they account for a majority of user configuration errors. The key idea behind ConfSeer is to enable configuration-diagnosis-as-a-service by automatically matching configuration problems to their solutions described in free-form text:
- First, ConfSeer takes snapshots of configuration files from a user machine as input; these are typically uploaded by agents running on the machines.
- Second, it extracts the configuration parameter names and value settings from the snapshots and matches them against a large set of KB articles, which are published and actively maintained by many vendors.
- Third, after a match is found, ConfSeer automatically pinpoints the configuration error with its matching KB article so users can apply the suggested fix.

ConfSeer is the first approach to combine traditional IR and NLP techniques (e.g., indexing, synonyms) with new domain-specific techniques (e.g., constraint evaluation, synonym expansion with named-entity resolution) into an end-to-end practical system for detecting misconfigurations. It is part of a larger system-building effort to automatically detect software errors and misconfigurations by leveraging a broad range of data sources, such as knowledge bases, technical help articles, and question-and-answer forums, which contain valuable yet unstructured information for diagnosis.

Live programming in the database (LogicBlox)
Author: Dan Olteanu (LogicBlox, Inc.)
Presented by: Zohreh Raghebi

An increasing number of self-service enterprise applications require live programming in the database: the traditional edit-compile-run cycle is abandoned in favor of a more interactive user experience with live feedback on a program's runtime behavior. In retail-planning spreadsheets backed by scalable, full-fledged database systems, users can define and change schemas of pivot tables, and formulas over those schemas, on the fly. These changes trigger updates to the application code on the database server, and the challenge is to quickly update the user's spreadsheets in response. To achieve interactive response times in the real world, changes to application code must be quickly compiled and hot-swapped into the running program, and the effects of those changes must be computed efficiently, in an incremental fashion.

This paper discusses the technical challenges of supporting live programming in the database. The workhorse architectural component is a "meta-engine" that incrementally maintains metadata representing the application code, guides its compilation into an internal representation in the database kernel, and orchestrates maintenance of materialized results of the application code based on those changes. In contrast, the engine proper works on application data and can incrementally maintain materialized results in the face of data updates.
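This division of labor can be sketched in a drastically simplified form: the engine maintains a materialized result incrementally under data updates, while a meta-engine reacts to code changes by swapping the view definition and triggering a targeted recompute (a toy illustration only; LogicBlox's actual meta-engine maintains Datalog metadata, not Python closures):

```python
class Engine:
    """Maintains a materialized result incrementally under *data* updates."""
    def __init__(self, rows, view_fn):
        self.rows = list(rows)
        self.view_fn = view_fn              # application "code": the view definition
        self.materialized = view_fn(self.rows)

    def insert(self, row):
        self.rows.append(row)
        # Toy delta maintenance, valid only for additive aggregates:
        # apply the view to just the new row and add it in.
        self.materialized += self.view_fn([row])

class MetaEngine:
    """Reacts to *code* changes: swaps in the new view definition and tells
    the engine to recompute only the affected materialized result."""
    def swap_view(self, engine, new_view_fn):
        engine.view_fn = new_view_fn
        engine.materialized = new_view_fn(engine.rows)  # targeted recompute
```

Without the meta-engine, a code change would force every materialized result to be rebuilt from scratch; here only the one affected view is recomputed.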
The meta-engine instructs the engine which materialized results need to be (partially or completely) recomputed. Without the meta-engine, the engine would unnecessarily recompute all materialized results from scratch every time the application code changed, rendering the system unusable for live programming. The paper presents the meta-engine solution implemented in the LogicBlox commercial system. LogicBlox offers a unified runtime for the enterprise software stack: applications are written in an extension of Datalog called LogiQL, a declarative programming model unifying OLTP, OLAP, and prescriptive and predictive analytics, with rich language constructs for expressing derivation rules. The meta-engine itself uses rules expressed in a Datalog-like language called MetaLogiQL, which operate on metadata representing LogiQL rules. Outside of the database context, this design may even provide a novel means of building incremental compilers for general-purpose programming languages.

Barcelos
Type: Demonstration. Conference: VLDB.
Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik (Microsoft Corporation)
Presented by: Dardan Xhymshiti, Fall 2015

The paper addresses data discovery: finding relevant information in relational databases, e.g., for generating reports. To find relevant information, users must locate the database tables that are relevant to the task and, for each of them, understand its content to determine whether it is truly relevant. The schema's table and column names are often not very descriptive of the content. Example: of 412 data columns in 639 tables from 29 databases used by Microsoft's IT organization, 28% of all columns had very generic names such as name, id, description, field, or code.

[Figure: a typical corporate database table with generic column names.]

Such non-descriptive column names make it difficult to search and understand a table. One solution is to use data stewards to enrich the database tables and columns with textual descriptions.
That approach is time consuming, however, and ignores databases that are used less frequently.

Barcelos automatically annotates the columns of database tables, including tables that are not frequently used. How does it work? It mines spreadsheets, many of which are generated by queries, and uses a spreadsheet's column names as candidate annotations for the corresponding database columns. For the table above, Barcelos produces the annotations TeamID and Team for the first column, Delivery Team and Team for the second, and Line of Business and Business for the third.

Contributions: a method for extracting relevant tables from an enterprise database; a method for identifying and ranking relevant column annotations; and an implementation of Barcelos with an experimental evaluation showing its efficiency and effectiveness.

Oracle Database In-Memory
Type: Industry paper. Conference: VLDB.
Authors: Dinesh Das, Jiaqi Yan, Mohamed Zait, Satyanarayana R. Valluri, Nirav Vyas, Ramarajan Krishnamachari, Prashant Gaharwar, Jesse Kamp, Niloy Mukherjee
Presented by: Dardan Xhymshiti, Fall 2015

Oracle Database 12c In-Memory is the industry's first dual-format database, supporting both an in-memory column-oriented format and an on-disk row-oriented format. Problem: optimizations designed for on-disk query processing are not efficient for in-memory query processing. Motivation: modify the query optimizer to generate execution plans optimized for the specific format, row-major or columnar, that will be scanned during query execution. Various vendors have taken different approaches to generating execution plans for in-memory columnar tables:
- make no change to the query optimizer, expecting that queries over the different data format will simply perform better;
- use heuristic methods to allow the optimizer to generate different plans;
- limit optimizer enhancements to specific workloads such as star queries.
The authors instead provide a comprehensive optimizer redesign that handles a variety of workloads on databases with varied schemas and different data formats. Related work: column-major tables date back to the 1980s (Sybase IQ), with MonetDB and C-Store appearing around the 2000s.

AIDE
Type: Demonstration Paper. Publication: VLDB 2015.
Presented by: Shahab Helmi, Fall 2015

Data analysts often engage in data exploration to discover interesting patterns without knowing exactly what they are looking for (exploratory analysis). Users try to make sense of the underlying data space by navigating through it. The process involves a great deal of experimentation with queries, backtracking on the basis of query results, and revision of results at various points. When the data is huge, finding the relevant subspace and the relevant results takes a long time. AIDE is an automated data exploration system that:
- steers the user toward interesting data areas based on her relevance feedback on database samples;
- aims to identify all database objects that match the user's interest, with high efficiency;
- relies on a combination of machine-learning techniques and sample-selection algorithms to provide effective exploration results as well as highly interactive performance over databases of large sizes.

Datasets: AuctionMark (information on auction items and their bids; 1.77 GB); Sloan Digital Sky Survey (a scientific dataset generated by digital surveys of stars and galaxies, with a large data size and complex schema; 1 GB-100 GB); and US housing and used-car datasets available through the DAIDEM Lab. System implementation: Java (machine learning, clustering, and classification algorithms such as SVM, k-means, and decision trees) on top of PostgreSQL.

SAP HANA Scale-out Extension
Type: Industry Paper. Publication: VLDB 2015.
Presented by: Shahab Helmi, Fall 2015

This paper presents the work on the SAP HANA Scale-out Extension, a novel distributed database architecture designed to support large-scale analytics over real-time data.
Its contributions:
- high-performance OLAP with massive scale-out capabilities, while concurrently allowing OLTP workloads;
- a new design of core database components such as query processing, concurrency control, and persistence, using high-throughput, low-latency networks and storage devices;
- analytics over real-time changing data, with fine-grained, user-specified service-level agreements (SLAs) on data freshness.

Two fundamental paradigm shifts are happening in enterprise data management: a dramatic increase in the amount of data being produced and persisted by enterprises, and the need for businesses to have analytical access to up-to-date data in order to make critical business decisions. Enterprises want real-time insights from their data to make critical, time-sensitive decisions, so ETL pipelines for offline analytical processing of day-, week-, or even month-old data no longer work. On one hand, a system must provide online transaction processing (OLTP) support so that real-time changes to data are reflected in queries; on the other, it must scale to very large data sizes and provide online analytical processing (OLAP) over these large and changing datasets.

Key design points:
- mixed transactional and analytical workloads;
- the ability to take advantage of emerging hardware: high core-count processors, SIMD instructions, large processor caches, large memory capacities, storage-class memories, and high-bandwidth, low-latency network interconnects;
- support for cloud data storage;
- heterogeneous scale-out of OLTP and OLAP workloads;
- decoupling query processing from transaction management;
- the ability to improve performance by scheduling snapshots for read-only OLAP transactions according to fine-grained SLAs;
- a scalable distributed log providing durability, fault tolerance, and asynchronous update dissemination to compute engines;
- support for different compute engines, e.g., SQL engines, R, Spark, graph, and text.

Related work: mixed OLTP/OLAP systems (HyPer, ConuxDB); scale-out OLTP systems (Calvin, H-Store);
shared-log systems (CORFU, Kafka, BookKeeper).

LDV
Type: Demonstration paper. Publication: VLDB 2015.
Authors: Quan Pham, Severin Thaler, Tanu Malik, Ian Foster, Boris Glavic
Presented by: Ashkan Malekloo, Fall 2015

Recently, application virtualization (AV) has emerged as a lightweight alternative for sharing applications and repeating their executions efficiently. AV approaches include Linux Containers and CDE (which uses system-call interposition to automatically create portable software packages). Generally, application-virtualization techniques can also be applied to DB applications, but these techniques treat a database system as a black-box application process, oblivious to the query statements or the database model supported by the database system.

LDV is a tool for creating packages of DB applications. An LDV package encapsulates the application, its relevant dependencies, and the relevant data. LDV relies on data provenance for its ability to create self-contained packages of a DB application that can be shared and run on different machine configurations, without the need to install a database system and set up a database: it extracts a slice of the database accessed by the application. The demonstration also shows how LDV's execution traces can be used to understand how the files, processes, SQL operations, and database content of an application are related to each other.
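A toy illustration of the database-slicing idea: from a trace of the SQL statements an application actually executed, keep only the tables it touched when building the package. A naive regex "parser" stands in for LDV's provenance tracing here, and all names are hypothetical:

```python
import re

def tables_referenced(trace):
    """Scan a query trace for table names appearing after FROM/JOIN (toy parser)."""
    pat = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)
    tables = set()
    for stmt in trace:
        tables.update(pat.findall(stmt))
    return tables

def build_package(app_files, database, trace):
    """Package = application files + only the slice of the DB the app accessed."""
    used = tables_referenced(trace)
    db_slice = {t: rows for t, rows in database.items() if t in used}
    return {"files": list(app_files), "data": db_slice}
```

Tables never touched by the application (e.g., internal logs) are left out of the package, so the recipient can re-run the application without installing or loading the full database.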