CPT-S 580-06 Advanced Databases
Yinghui Wu
EME 49
ADB (ln30)

CPT-S 580-08 Advanced Databases
Advanced Database: Summary
– Course summary
– Future of database research (the Beckman report)
– Suggestions and tips

Database Research
Database research community: less than 40 years old
Business-type applications that have the following demands:
– Efficiency in access and modification of very large amounts of data
– Resilience in surviving hardware and software errors without losing data
– Access control to support simultaneous access by multiple users and ensure consistency
– Persistence of the data over long time periods regardless of the programs that access the data
Research has centered on methods for designing systems with efficiency, resilience, access control, and persistence, and on the languages and conceptual tools that help users access, manipulate, and design databases.

Overview of topics
DBMS beyond relational databases (week 2-3)
  • noSQL and newSQL
  • Data stream management
Main-memory DBMS (week 4)
  • Architecture and design principles
  • Query and indexing strategy
Advanced query techniques (week 5-6)
  • Indexing, query optimization
  • Approximate querying
Parallel and distributed DBMS (week 7-8)
  • Parallel/distributed computation models
  • Partition, fault tolerance and concurrency control
  • Distributed stream processing
DBMS and knowledge discovery (week 9-11)
  • DBMS and IR
  • DBMS and scalable DM/ML
Data Quality (week 12-13)
  • Dirty data: issues and problems
  • Data cleaning and repairing (dependencies)
DBMS in cloud, and data warehouse (week 14)
  • OLAP
  • Scalable data warehouse
Privacy and security (week 15)
  • Access control
  • Data confidentiality
  • GhostDB

Data models, noSQL & newSQL

Data Models
– Relational model
– Entity-Relationship data model (mainly for database design)
– Object-based data models (Object-oriented and Object-relational)
– Semistructured data model (XML and graphs)
– Other older models:
  – Network model
  – Hierarchical model
“What goes around comes around”, by Michael Stonebraker

oldSQL vs. noSQL
ACID vs. EASE
• noSQL: concept and theory
  • CAP theory
  • ACID vs EASE
  • noSQL vs RDBMS
• noSQL databases
  • Key-value stores
  • Document DBs
  • Column family
  • Graph databases
noSQL:
  • Cheap, easy to implement (open source)
  • Data are replicated to multiple nodes (fault-tolerant)
  • Easy to distribute
  • Don't require a schema
  • Can scale up and down
  • Relax the data consistency requirement (CAP)
oldSQL:
  • Joins, ACID transactions
  • SQL as a sometimes frustrating but still powerful query language
  • Easy integration with other applications that support SQL

oldSQL vs. noSQL vs. NewSQL
“A DBMS that delivers the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.”
SQL + ACID + performance and scalability through modern innovative software architecture
– Principle 1: minimize or stay away from locking
– Principle 2: rely on main memory
– Principle 3: try to avoid latching
– Principle 4: cheaper solutions for HA

Disk-based vs. Main-Memory DBMS
– Disk bottleneck is removed as database is kept in main memory
  → Access to main memory becomes new bottleneck
– Row store or column store?
– Execution: tuple-at-a-time, operator-at-a-time, vectorized execution

DBMS vs. DSMS
Traditional DBMS:
– static records with no predefined notion of time
– persistent data storage and complex querying
DSMS:
  • on-line analysis of rapidly changing data streams
  • data stream: sequence of items, too large to store entirely, not ending
  • continuous queries
[Figure: DBMS (SQL query, query processing over main memory and disk, result) vs. DSMS (continuous query (CQ), query processing in main memory over data stream(s), result)]

Scalable database query processing

Approximate query evaluation
[Figure: an exact query over “Big Data” gives an exact answer with long response times; compression, sketches, and summaries reduce the data to “Small Data” (KB/MB), over which an approximate query gives an approximate answer fast]
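The trade-off described for approximate query evaluation, answering from a small summary of the data instead of the full data set, can be sketched with uniform random sampling. This is a minimal illustration only, not the course's method: the function name `approximate_sum`, the synthetic `sales` data, and the 5% sample fraction are all assumptions made for the example.

```python
import random


def approximate_sum(values, sample_fraction, seed=0):
    """Estimate sum(values) from a uniform random sample.

    The sample total is scaled by N/n, where N is the data size and
    n is the sample size (hypothetical helper for illustration).
    """
    rng = random.Random(seed)
    n = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)
    return sum(sample) * len(values) / n


# Hypothetical "big data": synthetic sales amounts.
data_rng = random.Random(42)
sales = [data_rng.uniform(0, 100) for _ in range(200_000)]

exact = sum(sales)                                     # scans everything
approx = approximate_sum(sales, sample_fraction=0.05)  # scans 5% of the data
relative_error = abs(approx - exact) / exact
print(f"exact={exact:.0f} approx={approx:.0f} relative error={relative_error:.4f}")
```

The estimate touches only a fraction of the rows, so it trades a small, quantifiable error for a shorter response time. Histograms and sketches follow the same pattern with a different summary structure.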
Approximate query evaluation
  • query driven: approximate query models
  • data driven: synopses, histogram, sampling, sketches, spanners…
Making big data small: Resource bounded search

Parallel query processing
– Parallel DBMS architectures
– Parallelism: intraquery, interquery; intraoperation, interoperation
– Parallel models: PRAM, BSP, LogP
– Programming model: MapReduce
[Figure: a query Q(D) evaluated in parallel as Q(D1), Q(D2), …, Q(Dn)]
[Figure: MapReduce data flow: <k1, v1> pairs go to mappers, which emit <k2, v2> pairs; reducers consume them and emit <k3, v3> pairs]

Query processing: Make it distributed
Parallel graph programming models
– MapReduce for BFS for distance queries, PageRank..
– Vertex Centric Programming: GraphLab and Pregel
– Graph Centric Programming: Giraph++
– GRAPE: Hybrid models
[Figure: virtual processors]

DBMS and knowledge discovery

A case study: approximate IR for graph queries
“find information about the patients with eye tumor, and doctors who cured them.”
(IBM Watson, Facebook Graph Search, Apple Siri, Wolfram Alpha Search…)
[Figure: a query graph (patient, eye tumor, doctor) and a data graph (Jane (patient), choroid neoplasm, Alex Smith (primary care provider)). Exact matching fails (“does not match”); matching succeeds (“match!”) through SameAs and superclassOf relations over terms such as eye neoplasm and primary care physician]

More than one way to pick a leaf…
Transformation     Category   Example
First/Last token   String     “Barack Obama” -> “Obama”
Abbreviation       String     “Jeffrey Jacob Abrams” -> “J. J. Abrams”
Prefix             String     “Doctor” -> “Dr”
Acronym            String     “Bank of America” -> “BOA”
Synonym            Semantic   “tumor” -> “neoplasm”
Ontology           Semantic   “teacher” -> “educator”
Range              Numeric    “1980” -> “~30”
Unit Conversion    Numeric    “3 mi” -> “4.8 km”
Distance           Topology   “Pine” - “M:I” -> “Pine” - “J.J. Abrams” - “M:I”
…                  …          …

Turn Web into Knowledge Base
[Figure: Web, knowledge acquisition, Knowledge, intelligent interpretation; more knowledge, analytics, insight]
  • Entity resolution
  • Relation learning

How to make DM/ML scale?
Platform choices
Platform           Communication Scheme   Data size
Peer-to-Peer       TCP/IP                 Petabytes
Virtual Clusters   MapReduce / MPI        Terabytes
HPC Clusters       MPI / MapReduce        Terabytes
Multicore          Multithreading         Gigabytes
GPU                CUDA                   Gigabytes
FPGA               HDL                    Gigabytes

Data quality

Data quality: The No.1 problem for data management
Real-life data are dirty; dirty data are costly
– The quest for a principled approach
– Critical issues:
  • Data consistency
  • Data accuracy
  • Entity resolution (record matching)
  • Information completeness
  • Data currency
Many challenges remain
– certain fixes (minimum user interaction), information completeness, data currency
– interaction between central issues of data quality
telecommunication, life sciences, finance, e-government, …
Data quality: A rich source of questions and vitality

Dependencies for improving data quality
Conditional functional dependencies (CFDs)
– Syntax and semantics
– Static analysis: consistency and implication, axiom system
Conditional inclusion dependencies (CINDs)
– Syntax and semantics
– Static analysis: consistency and implication
Matching dependencies for record matching (MDs)
– Syntax and semantics
– Relative candidate keys

A platform for improving data quality
[Figure: data cleaning platform turning dirty data into clean data. Labels: business rules, master data, profiling, validating, validation, error detecting, dependencies, data repairing, automatically discover rules, record matching, certain fixes, standardization, auditing, enrichment, monitoring, data explorer]
Develop practical data cleaning system

DBMS: special topics

DBMS in the Cloud
Design principles
  • Separate systems & application
  • Limited interaction to a single node
  • Decouple ownership
  • Limited synchronization

DBMS and data warehouse
[Figure: data cube of sales by time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), item (TV, PC, VCR, sum), and country (U.S.A, Canada, Mexico, sum)]
[Figure: star schema]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
– A decision support database that is maintained separately from the organization’s operational database
– subject-oriented, integrated, time-variant, and nonvolatile

DBMS and Crowdsourcing
Use crowd to answer DB queries
– Where to use crowd?
– How to use crowd?
– How to support SQL?
– How to devise a system?
– Quality?
Task management, Pricing, Trustworthy, Scalability

DBMS: privacy and security
Key problems
– Access control
– Data anonymization: k-anonymity, l-diversity, t-closeness, differential privacy, secure query processing
– Balancing performance and security: partition indexing; onion encryption

Future of Database Research

The Beckman report
http://cacm.acm.org/magazines/2016/2/197411-the-beckman-report-on-database-research/fulltext
Research challenges
– Challenge 1: Scalable big/fast data infrastructures
  – parallel and distributed processing (volume)
  • Query processing and optimization (process monitoring)
    – Integrate data mining, sampling, machine learning
  • New hardware
  • Cost-efficient storage
  • High-speed data streams
  • Late-bound schemas
  • Consistency
  • Metrics and benchmarks
– Challenge 2: Diversity in data management
  • No-one-size-fits-all
  • Cross-platform integration
  • Programming models
  • Data processing workflows
– Challenge 3: End-to-end processing of data
  • Data-to-knowledge pipeline
  • Tool-diversity and customizability
  • Open source
  • Understanding data/knowledge bases
– Challenge 4: Cloud Service
  • Elasticity
  • Data replication
  • System administration and tuning
  • Multitenancy
  • Data sharing
  • Hybrid clouds (cyber-physical systems)
– Challenge 5: Roles of humans in the data life cycle
  • Data producer (meta-data)
  • Data curators (crowdsourcing)
  • Data consumers (fuzzy queries)
  • Online communities (data community)

Future of Big Data techs (NSF National Priorities)

Suggestions and tips

Survey presentation/writing
Presentation (18 minutes + 2-3 minutes Q&A)
– Background and motivation
  • why the problem set is important
  • application of the solutions
  • Challenges
– Problem formulation:
  • Input and output
  • Objective function, if any
– Techniques
  • For each method you surveyed
    – high level idea
    – a summary of key techniques, and major results (performance guarantees, time/storage cost, speedup, correctness guarantee, error bound, etc.)
– Evaluation
  • Evaluation metric/categorization
  • A comparison of algorithms/techniques; pros and cons
  • Summary of experimental results
– Conclusion and Vision
  • Give your opinion on how this work can be improved
  • Make a connection to your own research project

General tips
– Every talk motivates a problem
– A talk is about an idea
– Simple slides are better
– A picture is worth a thousand words
– Keep the logic flowing
– Prepare for questions
– Practice makes perfect

CPT_S 580-06 Advanced Databases
I hope you enjoyed this course and found it useful!
Thank you

Course evaluation
Reminder:
– VCEA course evaluations will be open April 18th through May 6th
– For students: go to the myWSU portal. The center of the page includes a BLUE COURSE EVALUATIONS window.
– You will receive an initial email announcement and two reminders. Reminders will only be sent if you have incomplete evaluations.

What is a Survey Paper?
CSE594 Fall 2009
Jennifer Wong
Oct. 14, 2009

A survey paper is…
"a paper that summarizes and organizes recent research results in a novel way that integrates and adds understanding to work in the field.
A survey article assumes a general knowledge of the area; it emphasizes the classification of the existing literature, developing a perspective on the area, and evaluating trends."
As described by ACM Computing Surveys

Goals of a Survey
Provide the reader with a view of existing work that is well organized and comprehensive
– Not all details must be included; which ones should be, and which should not?
– Make sure to cover all relevant material completely
– Logical structure of organization
– State-of-the-art view

Your survey paper should …
– Summarize the research in 5-8 papers on a particular topic
– Include your own commentary on the significance of the approach and the solutions presented in each paper
– Provide a critical assessment of the work that has been done
– Include a discussion on future research directions
REMEMBER
– Everything you write in this survey paper has to be in your own words
– All ideas and paraphrases of other people's words must be correctly attributed in the body of the paper and in the references
– Any evidence of plagiarism in the survey paper will result in a failing grade

How To Find Articles
Search various digital libraries
– ACM
– IEEE
– Google Scholar
Try to identify research groups/faculty in the area
– Dig into their work and pointers

How To Pick Articles – In General
When picking papers to read, try to:
– Pick a recent survey of the field so you can quickly gain an overview,
– Pick a paper that you can easily understand: book chapters often give more easily understood material and lengthy explanations that may give you a head start, although they may not be as up-to-date as papers,
– Pick papers that are related to each other in some way and/or that are in the same field so that you can write a meaningful survey out of them,
– Favour papers from well-known journals and conferences,
– Favour “first” or “foundational” papers in the field (as indicated in other people’s survey papers),
– Favour more recent papers,
– Once you have identified an interesting technology to report upon, follow developments in that strand of technology (e.g. time-wise and technology-wise developments).
– Find relationships with respect to each other and to your topic area (classification scheme/categorization)

Article Structure
It should not be just a concatenation of paper reviews
A typical structure of a paper includes:
– Title
– Abstract
– Introduction
– Body of paper
– Conclusion/Future Work
– References

Introduction
– Importance and significance of the topic
– Discuss the background and target audience
– Summarize the surveyed research area and explain why the surveyed area has been studied
– Summarize the classification scheme you used to do the survey
– Summarize the surveyed techniques with the above classification scheme

Survey details/Body of paper
– Present the surveyed techniques using the classification scheme in detail
– Identify the trends in the surveyed area. Give evidence for your decision
– Identify some leading research/products/companies/websites
– Identify the unresolved problems/difficulties, and future research issues

Conclusions/Future work
– Summarize the conclusions of your survey

References
– List all the citations referenced in your paper

Figures
– Can be taken from papers as long as appropriate credit is given, e.g. “Figure taken from [28]”.
– Draw your own figures to show classification or structure of the survey
– Use tables to organize comparisons between applications/systems/etc.

How to Cite a Reference
Cite the full info about the paper
– Author names
– Paper title
– Publication details
– Page numbers
– Year, etc.
[1] Adomavicius G, Tuzhilin A., “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6. (June 2005), pp. 734-749.
In the text, use "[1]" to refer to it
There are many bibliography formats. Select one and stick to it.
– http://standards.ieee.org/guides/style/2009_Style_Manual.pdf (Chap 19)
– http://sgs.umkc.edu/pdfs/ACM-STYLE-EXAMPLES.pdf

General Rules for Bibliography
– Avoid use of et al. in a bibliography unless the list is very long (five or more authors).
– Internet drafts must be marked "work in progress".
– Book citations include publication years, but no ISBN number.
– It is now acceptable to include URLs to material, but it is probably bad form to include a URL pointing to the author's web page for papers published in IEEE and ACM publications, given the copyright situation. Use it for software and other non-library material.
– Avoid long URLs; it may be sufficient to point to the general page and let the reader find the material. General URLs are also less likely to change.
– Leave a space between first names and last name, i.e., "J. P. Doe", not "J.P.Doe".

What not to do…