Presented by: Omar Alqahtani
Spring 2016
Authors:
Publication: ICDE 2015
Type: Research Paper

Data exploration platforms help users discover interesting objects within large volumes of scientific and business data. Data diversification is similar to top-k and skyline queries, but what is it? Data diversification extracts from a query result a small set of non-redundant points that are diverse among themselves according to some distance measure. The current approach is process-first-diversify-next. Drawback? Motivation: the need to efficiently provide users with effective insights during data exploration.

Progressive Data Diversification (pDiverse) scheme. The main idea is to detect and prune those data points in the query result that cannot be included in the final diverse set. Partial distance computation reduces the amount of CPU and I/O incurred during query diversification. Also, the Progressive Greedy (pGreedy) heuristic, which forms the core of the pDiverse scheme. Extending pGreedy to work with column stores. An integrated model that combines the range query with the diversification. Optimizing pDiverse by incorporating novel techniques for ordering of dimensions and approximation of diversity.

Mostly, there are three categories of diversification: content based, novelty based, and semantic coverage based.
Formal definition:
It is an NP-hard problem, so greedy heuristics are the ones most widely used.

Presented by: Omar Alqahtani
Spring 2016
Authors:
Publication: ICDE 2015
Type: Research Paper

Query execution performance of database systems depends heavily on query optimization decisions. Finding the best possible plan mostly requires a cost model to estimate the performance of viable alternatives. Cost models rely on statistics about the data. But? As a result, commercial DBMSs often assume uniform data distributions and attribute value independence, which is hardly the case in reality.
Suboptimal plans lead to subpar performance.

They define robustness in the context of query processing as the ability of a system to efficiently cope with unexpected and adverse conditions and deliver near-optimal performance for all query inputs.

Based on: Understanding the data distributions is a continuous process. The distribution may also develop throughout the execution of a query plan, so one execution strategy might not be optimal over the entire data set.
They propose: A new class of morphable operators that continuously and seamlessly adjust their execution strategy as the understanding of the data evolves. The Smooth Scan operator morphs between an index look-up and a full table scan and achieves near-optimal performance regardless of the operator's selectivity, obliviously to the existing data statistics.

Some works focus on dealing with the problem at the optimizer level, but in dynamic environments they bring only partial benefits, because the environment keeps changing even after optimization. Orthogonal approaches to run-time adaptivity, however, lack flexibility at the level of access paths and remain sensitive to the accuracy of statistics.

Presented by: Zohreh Raghebi
Spring 2016
Authors:
Publication: ICDE 2015
Type: Research Paper

Rapid growth of event-based social network services such as Meetup and Plancast. They connect people through events, allow users to form online groups, and let users publish and announce events to other group members.

1) Which groups would a particular user like to join?
2) Which tags might a group choose when constructing its profile?
3) Who will attend an upcoming event?
To design recommendation systems for three specific tasks: tags to groups, groups to users, and events to users.

[1] Proposed a factorization model to exploit social and location features for event-based group recommendation.
[2] Introduced a topic model to solve the tag recommendation problem for groups.
[3] Used a simple graph-based approach to recommend users for an event; it performs information diffusion over the user network.
Lack of a general solution.

To model the interactions between multiple entities: users, events, groups, and tags. Analyzing the data to extract useful temporal patterns of user behavior. Convert the recommendation problem into a node proximity calculation problem.

To evaluate the node proximity: the heterogeneous graph contains multiple types of entities that influence each other via different types of interactions. The importance of these influences has to be balanced for the proximity calculation, and it may vary from one recommendation problem to another.

Random Walk with Restart (RWR) is used to calculate node proximity for recommendations. RWR is developed on a univariate Markov chain for homogeneous graphs. As a generalization, a multivariate Markov chain (MMC) models the random walk process in a heterogeneous graph. MMC is able to explicitly model the influences between different entities.

Existing MMC-based methods need the influence weights between different types of entities to be set manually. Multiple types of entities exist. A learning scheme tries to find the optimal set of weights.

A general model to handle multiple recommendation problems in an event-based social network. To avoid the issue of manual parameter assignment, propose a learning framework to find appropriate parameters for the model. The values of the learned parameters indicate the importance of different types of entities in different recommendation tasks. Better understanding of user behavior in an event-based social network.

Presented by: Zohreh Raghebi
Spring 2016
Authors:
Publication: ICDE 2015
Type: Research Paper

Knowledge is represented as a graph. There is uncertainty in the presence of each edge in the graph. Uncertain graphs have been used extensively in communication networks, social networks, and protein interaction networks.

Identification of dense substructures within a graph. A clique is a completely connected subgraph. A maximal clique is a clique that is not contained within any other clique. Enumerating all maximal cliques: finding overlapping communities in social networks, finding overlapping protein complexes, and analysis of email networks.

A clique in an uncertain graph is a set of vertices that has a high probability of being a completely connected subgraph. Applications: finding such sets of vertices helps to unearth robust communities within an uncertain graph, for example a group of proteins such that each protein is likely to interact with every other protein.

A set of vertices U is an α-maximal clique if U is a clique with probability at least α and there does not exist a vertex set S such that U ⊂ S and S is a clique with probability at least α. When α = 1, we have the notion of a maximal clique in a deterministic graph.

Prior work studies the problem of finding reliable subgraphs, that is, subgraphs that are connected with a high probability. In contrast, the interest here is in subgraphs that are not just connected but fully connected with a high probability. Prior work also enumerates the k cliques with the highest probability of existence; the focus here is on enumerating all α-maximal cliques in a graph.

Let f(n, α) be the maximum number of α-maximal cliques in a graph on n vertices. Proofs…

Using depth-first search (DFS) with backtracking. The algorithm starts with a set of vertices C that is an α-clique and incrementally adds vertices to C while retaining the property of C being an α-clique. It backtracks to explore other possible vertices until all possible search paths have been explored.

First, to save the effort of checking whether a new vertex v can be used to extend C, consider only those vertices that are already connected to every vertex within C. This leads to incrementally tracking the vertices that can still be used to extend C.

Second, not all vertices that extend C into a clique preserve the property of C being an α-clique. Adding a new vertex v to C decreases the clique probability by a factor equal to the product of the edge probabilities between v and every vertex in C. This factor is maintained incrementally for each vertex v.
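
The DFS with the two ideas above can be illustrated with a minimal Python sketch. It is not the authors' pseudocode: the function name alpha_maximal_cliques, the dictionary-based graph representation, and the candidate/excluded bookkeeping (borrowed from the classic Bron-Kerbosch scheme) are assumptions made for illustration, as is the assumption that edges exist independently, so that the clique probability is the product of its edge probabilities.

```python
def alpha_maximal_cliques(vertices, edge_prob, alpha):
    """Enumerate alpha-maximal cliques of an uncertain graph (illustrative sketch).

    vertices  : iterable of hashable vertex ids
    edge_prob : dict mapping (u, v) pairs to the edge's existence probability
    alpha     : threshold with 0 < alpha <= 1
    Assumes independent edges: clique probability = product of edge probabilities.
    """
    def p(u, v):
        # Missing edges have probability 0.
        return edge_prob.get((u, v), edge_prob.get((v, u), 0.0))

    def narrow(factors, v, new_q):
        # Update each tracked factor with the edge to v and keep only the
        # vertices that still extend the grown clique into an alpha-clique.
        kept = {}
        for u, r in factors.items():
            r_new = r * p(u, v)
            if new_q * r_new >= alpha:
                kept[u] = r_new
        return kept

    result = []

    def extend(clique, q, cand, excl):
        # q is the probability of `clique`. cand and excl map a vertex to the
        # product of edge probabilities between it and every member of `clique`.
        # cand: vertices still to try; excl: vertices already tried on this path.
        if not cand and not excl:
            result.append(frozenset(clique))  # nothing can extend it: maximal
            return
        for v in list(cand):
            r = cand.pop(v)
            new_q = q * r  # adding v shrinks the probability by its factor
            extend(clique + [v], new_q,
                   narrow(cand, v, new_q), narrow(excl, v, new_q))
            excl[v] = r  # backtrack: v is now explored at this level

    extend([], 1.0, {v: 1.0 for v in vertices}, {})
    return result


# Hypothetical example graph: a triangle 1-2-3 plus a pendant edge 3-4.
example_edges = {(1, 2): 0.9, (1, 3): 0.9, (2, 3): 0.9, (3, 4): 0.8}
cliques = alpha_maximal_cliques([1, 2, 3, 4], example_edges, alpha=0.7)
```

With alpha = 0.7 the triangle {1, 2, 3} has probability 0.9 × 0.9 × 0.9 = 0.729 and qualifies, while raising alpha to 0.75 breaks it into its three edges. The excluded set is what prevents a non-maximal set such as {1, 3} from being reported after {1, 2, 3} has already been found.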