CPT-S 580-06
Advanced Databases
Yinghui Wu
EME 49
ADB (ln30)
CPT-S 580-08 Advanced Databases
Advanced Database: Summary
▪ Course summary
▪ Future of database research (the Beckman report)
▪ Suggestions and tips
Database Research
▪ The database research community is less than 40 years old
▪ Driven by business-type applications with the following demands:
– Efficiency in accessing and modifying very large amounts of data
– Resilience in surviving hardware and software errors without losing data
– Access control to support simultaneous access by multiple users and ensure consistency
– Persistence of the data over long time periods, regardless of the programs that access the data
▪ Research has centered on methods for designing systems with efficiency, resilience, access control, and persistence, and on the languages and conceptual tools that help users access, manipulate, and design databases.
Overview of topics
DBMS beyond relational databases (week 2-3)
• noSQL and newSQL
• Data stream management
Main-memory DBMS (week 4)
• Architecture and design principles
• Query and indexing strategy
Advanced query techniques (week 5-6)
• Indexing, query optimization
• Approximate querying
Parallel and distributed DBMS (week 7-8)
• Parallel/distributed computation models
• Partition, fault tolerance and concurrency control
• Distributed stream processing
Overview of topics
DBMS and knowledge discovery (week 9-11)
• DBMS and IR
• DBMS and scalable DM/ML
Data quality (week 12-13)
• Dirty data: issues and problems
• Data cleaning and repairing (dependencies)
DBMS in cloud, and data warehouse (week 14)
• OLAP
• Scalable data warehouse
Privacy and security (week 15)
• Access control
• Data confidentiality
• GhostDB
Data models, noSQL & newSQL
Data Models
▪ Relational model
▪ Entity-Relationship data model (mainly for database design)
▪ Object-based data models (object-oriented and object-relational)
▪ Semistructured data model (XML and graphs)
▪ Other older models:
– Network model
– Hierarchical model
“What goes around comes around”, by Michael Stonebraker
oldSQL vs. noSQL
ACID vs. BASE
noSQL: concept and theory
• CAP theory
• ACID vs. BASE
• noSQL vs. RDBMS
noSQL databases
• Key-value stores
• Document DBs
• Column family
• Graph databases
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (fault-tolerant)
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
oldSQL (RDBMS)
• Joins, ACID transactions
• SQL as a sometimes frustrating but still powerful query language
• Easy integration with other applications that support SQL
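The relaxed consistency requirement of noSQL stores can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store (`ReplicatedKV` is an illustrative name, not any real system's API) in which a write is acknowledged after reaching one replica and reaches the others only when `sync()` runs:

```python
class ReplicatedKV:
    """Toy key-value store: schema-free values, copied to several replicas."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to every replica

    def put(self, key, value):
        # Acknowledge after updating replica 0 only (availability over consistency).
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def get(self, key, replica=0):
        # A read may hit any replica, so it can return stale data.
        return self.replicas[replica].get(key)

    def sync(self):
        # Propagate pending writes in order; afterwards all replicas agree.
        for key, value in self.pending:
            for rep in self.replicas[1:]:
                rep[key] = value
        self.pending.clear()


store = ReplicatedKV()
store.put("user:1", {"name": "Jane"})   # no schema required
stale = store.get("user:1", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.get("user:1", replica=2)  # visible once the replicas converge
```

Between `put` and `sync` a reader of replica 2 sees nothing, which is the kind of temporary inconsistency an ACID system would not allow.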
oldSQL vs. noSQL vs. NewSQL
“A DBMS that delivers the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.”
SQL + ACID + performance and scalability through modern innovative software architecture
Principle 1: minimize or stay away from locking
Principle 2: rely on main memory
Principle 3: try to avoid latching
Principle 4: cheaper solutions for HA
Disk-based vs. Main-Memory DBMS
• The disk bottleneck is removed because the database is kept in main memory
→ Access to main memory becomes the new bottleneck
• Execution models: tuple-at-a-time, operator-at-a-time, vectorized execution
• Row store or column store?
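The difference between tuple-at-a-time processing over a row store and batch (vectorized) processing over a column store can be sketched as follows. The table, the function names, and the batch size are made up for illustration; a real engine would operate on cache-sized vectors in compiled code.

```python
# Same three-tuple table in two layouts (toy data, made up for illustration).
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 5.0, "qty": 4},
    {"id": 3, "price": 8.0, "qty": 1},
]
columns = {
    "id": [1, 2, 3],
    "price": [10.0, 5.0, 8.0],
    "qty": [2, 4, 1],
}


def revenue_tuple_at_a_time(table):
    """Row store: iterate tuple by tuple, fetching whole rows."""
    total = 0.0
    for row in table:
        if row["qty"] >= 2:  # selection
            total += row["price"] * row["qty"]
    return total


def revenue_vectorized(cols, batch_size=2):
    """Column store: process batches of values from only the needed columns."""
    total = 0.0
    n = len(cols["qty"])
    for start in range(0, n, batch_size):
        qty = cols["qty"][start:start + batch_size]
        price = cols["price"][start:start + batch_size]
        total += sum(p * q for p, q in zip(price, qty) if q >= 2)
    return total
```

Both functions compute the same answer; the column layout never touches the `id` column, which is the usual argument for column stores on analytic queries.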
DBMS vs. DSMS
▪ Traditional DBMS:
– static records with no predefined notion of time
– persistent data storage and complex querying
▪ DSMS:
• on-line analysis of rapidly changing data streams
• data stream: sequence of items, too large to store entirely, not ending
• continuous queries
[Figure: DBMS (SQL Query, Query Processing over Main Memory and Disk, Result) vs. DSMS (Continuous Query (CQ), Data Stream(s), Query Processing in Main Memory, Result)]
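A continuous query can be sketched as a generator that emits a new result for every arriving item while keeping only a bounded window in memory. The window size and the `sensor_readings` source are assumptions for illustration; the source is finite here only so that the example terminates.

```python
from collections import deque


def sliding_average(stream, window=3):
    """Continuous query: emit the average of the last `window` items per arrival."""
    buf = deque(maxlen=window)  # only the window is kept, never the whole stream
    for item in stream:
        buf.append(item)
        yield sum(buf) / len(buf)


def sensor_readings():
    # Stand-in for an unbounded data stream.
    yield from [10, 20, 30, 40]


results = list(sliding_average(sensor_readings(), window=3))
```

Unlike a one-shot SQL query, the query is registered once and its result is updated as data flows past.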
Scalable database query processing
Approximate query evaluation
[Figure: an Exact Query over “Big Data” yields an Exact Answer with long response times; an Approximate Query over “Small Data” (KB/MB; compression, sketches, summaries) yields an Approximate Answer fast]
▪ Approximate query evaluation
• query driven: approximate query models
• data driven: synopses, histogram, sampling, sketches, spanners…
▪ Making big data small:
▪ Resource bounded search
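As one example of a data-driven synopsis, the sketch below implements a small Count-Min sketch for approximate frequency counts. The width, depth, and SHA-256-based hashing are illustrative choices, not parameters taken from the course material.

```python
import hashlib


class CountMinSketch:
    """Fixed-size summary that never underestimates an item's frequency."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only add to counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


data = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
cms = CountMinSketch()
for x in data:
    cms.add(x)
```

The table size is independent of the number of items seen, which is what makes the data "small"; the price is that `estimate` may overcount when items collide.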
Parallel query processing
▪ Parallel DBMS architectures
[Figure: Q(D) evaluated in parallel as Q(D1), Q(D2), …, Q(Dn)]
▪ 4 types of parallelism: intraquery, interquery, intraoperation, interoperation
▪ Parallel models: PRAM, BSP, LogP
▪ Programming model: MapReduce
[Figure: <k1, v1> pairs → mappers → <k2, v2> pairs → reducers → <k3, v3> pairs]
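The MapReduce key/value flow can be sketched on a single machine with word count. The `map_reduce` driver below is a toy stand-in for the framework (no parallelism or fault tolerance), and all names are illustrative.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """<k1, v1> = (doc_id, text)  ->  list of <k2, v2> = (word, 1)."""
    return [(word, 1) for word in text.lower().split()]


def reducer(word, counts):
    """<k2, [v2]>  ->  <k3, v3> = (word, total)."""
    return (word, sum(counts))


def map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for k1, v1 in inputs:                 # map phase
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)         # shuffle: group by intermediate key
    # reduce phase: one reducer call per intermediate key
    return dict(reducer(k2, vs) for k2, vs in sorted(groups.items()))


docs = [(1, "big data small data"), (2, "big graphs")]
counts = map_reduce(docs, mapper, reducer)
```

In a real deployment the mapper calls run on different partitions of the input and the shuffle moves data across the network; the user still writes only `mapper` and `reducer`.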
Query processing: Make it distributed
▪ Parallel graph programming models
– MapReduce for BFS for distance queries, PageRank, …
– Vertex-centric programming: GraphLab and Pregel
– Graph-centric programming: Giraph++
– GRAPE: hybrid models
[Figure: virtual processors]
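Vertex-centric ("think like a vertex") programming can be sketched as a sequence of supersteps in which each vertex reads its messages, updates its own state, and messages its neighbours. The function below is a sequential simulation under that assumption, not the Pregel or GraphLab API; it computes hop distances from a source.

```python
def vertex_centric_bfs(graph, source):
    """Pregel-style supersteps computing hop distance from `source`.

    graph: dict mapping each vertex to a list of out-neighbours.
    """
    dist = {v: None for v in graph}
    inbox = {source: [0]}            # messages delivered at the next superstep
    while inbox:                     # halt when no messages are in flight
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)
            if dist[v] is None or best < dist[v]:
                dist[v] = best       # the vertex updates its own state ...
                for u in graph[v]:   # ... and messages its neighbours
                    outbox.setdefault(u, []).append(best + 1)
        inbox = outbox
    return dist


g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
distances = vertex_centric_bfs(g, "a")
```

Vertex `e` is unreachable from `a`, so its distance stays `None`. In a distributed setting, the per-vertex loop body would run in parallel on the workers that own each vertex.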
DBMS and knowledge discovery
A case study: approximate IR for graph queries
“find information about the patients with eye tumor, and doctors who cured them.”
(IBM Watson, Facebook Graph Search, Apple Siri, Wolfram Alpha Search…)
[Figure: query graph (patient, eye tumor, doctor) vs. data graph (Jane (patient), choroid neoplasm, Alex Smith (primary care provider)). Exact matching fails: “eye tumor” does not match “choroid neoplasm”. Using SameAs and superclassOf relations among eye tumor, eye neoplasm, and choroid neoplasm, and among doctor, primary care physician, and primary care provider, the query matches.]
More than one way to pick a leaf…
[Figure: Query, Data Graph]
Transformation     Category    Example
First/Last token   String      “Barack Obama” -> “Obama”
Abbreviation       String      “Jeffrey Jacob Abrams” -> “J. J. Abrams”
Prefix             String      “Doctor” -> “Dr”
Acronym            String      “Bank of America” -> “BOA”
Synonym            Semantic    “tumor” -> “neoplasm”
Ontology           Semantic    “teacher” -> “educator”
Range              Numeric     “1980” -> “~30”
Unit Conversion    Numeric     “3 mi” -> “4.8 km”
Distance           Topology    “Pine” - “M:I” -> “Pine” - “J.J. Abrams” - “M:I”
…                  …           …
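A few of the string, semantic, and numeric transformations can be sketched as small functions that decide whether a query label matches a data label. The function names and the one-entry `SYNONYMS` table are hypothetical; a real system would draw synonyms and ontology relations from a knowledge base.

```python
def last_token(s):
    """First/last token transformation: keep the final word."""
    return s.split()[-1]


def acronym(s):
    """Acronym transformation: upper-cased initials of every word."""
    return "".join(w[0].upper() for w in s.split())


def miles_to_km(miles):
    """Unit conversion, rounded to one decimal place."""
    return round(miles * 1.609344, 1)


SYNONYMS = {"tumor": {"neoplasm"}}  # hypothetical mini-thesaurus


def matches(query_label, data_label):
    """Return the name of the first transformation mapping query to data label."""
    if query_label == data_label:
        return "exact"
    if last_token(query_label) == data_label:
        return "last token"
    if acronym(query_label) == data_label:
        return "acronym"
    if data_label in SYNONYMS.get(query_label, set()):
        return "synonym"
    return None
```

For instance, `matches("Bank of America", "BOA")` reports an acronym match, while a label pair covered by no transformation returns `None`.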
Turn Web into Knowledge Base
[Figure: Web and Knowledge, linked by knowledge acquisition and intelligent interpretation; more knowledge, analytics, insight]
• Entity resolution
• Relation learning
How to make DM/ML scale? platform choices
Platform           Communication Scheme   Data size
Peer-to-Peer       TCP/IP                 Petabytes
Virtual Clusters   MapReduce / MPI        Terabytes
HPC Clusters       MPI / MapReduce        Terabytes
Multicore          Multithreading         Gigabytes
GPU                CUDA                   Gigabytes
FPGA               HDL                    Gigabytes
Data quality
Data quality
▪ Data quality: the No. 1 problem for data management
▪ Real-life data are dirty, and dirty data are costly
– The quest for a principled approach
– Critical issues:
• Data consistency
• Data accuracy
• Entity resolution (record matching)
• Information completeness
• Data currency
▪ Many challenges remain
– certain fixes (minimum user interaction), information completeness, data currency, interaction between central issues of data quality
– Applications: telecommunication, life sciences, finance, e-government, …
Data quality: a rich source of questions and vitality
Dependencies for improving data quality
▪ Conditional functional dependencies (CFDs)
– Syntax and semantics
– Static analysis: consistency and implication, axiom system
▪ Conditional inclusion dependencies (CINDs)
– Syntax and semantics
– Static analysis: consistency and implication
▪ Matching dependencies for record matching (MDs)
– Syntax and semantics
– Relative candidate keys
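Error detection with a conditional functional dependency can be sketched as follows. The example assumes a made-up customer relation and the CFD "for tuples with country = UK, zip determines street"; the helper handles only this simple form (constants on the left-hand side of the pattern), not the full CFD syntax.

```python
from collections import defaultdict


def cfd_violations(rows, lhs, rhs, condition):
    """Find violations of the dependency lhs -> rhs among rows matching `condition`.

    rows: list of dicts; lhs: list of attribute names; rhs: attribute name;
    condition: dict of attribute -> required constant (the pattern's constants).
    Returns the lhs-value tuples that map to more than one rhs value.
    """
    seen = defaultdict(set)
    for row in rows:
        if all(row[a] == c for a, c in condition.items()):
            seen[tuple(row[a] for a in lhs)].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Made-up data for illustration.
customers = [
    {"country": "UK", "zip": "EH4 1DT", "street": "Mayfield Rd"},
    {"country": "UK", "zip": "EH4 1DT", "street": "Crichton St"},  # conflict
    {"country": "US", "zip": "99163", "street": "Main St"},
    {"country": "US", "zip": "99163", "street": "Grand Ave"},      # not covered
]
violations = cfd_violations(
    customers, ["country", "zip"], "street", {"country": "UK"}
)
```

The two US tuples disagree on street as well, but the dependency is conditional on `country = UK`, so only the UK pair is flagged. A plain functional dependency would either miss this rule or wrongly flag the US tuples.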
A platform for improving data quality
[Figure: data quality platform. Inputs: business rules, master data, dirty data. Components: profiling (automatically discover rules), validating/validation, error detecting (dependencies), data repairing, record matching, certain fixes, standardization, auditing, enrichment, monitoring, data explorer. Output: clean data.]
Develop practical data cleaning system
DBMS: special topics
DBMS in the Cloud
Design principles
• Separate systems & application
• Limited interaction to a single node
• Decouple ownership
• Limited synchronization
DBMS and data warehouse
A decision support database that is maintained separately from the organization’s operational database
Subject-oriented, integrated, time-variant, and nonvolatile
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (U.S.A, Canada, Mexico, sum), and item (TV, PC, VCR, sum)]
[Figure: star schema]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
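An OLAP roll-up over the cube dimensions can be sketched as a group-by on a subset of dimensions. The fact rows and their numbers are made up, and the dimension attributes are stored inline rather than joined from dimension tables, to keep the example short.

```python
from collections import defaultdict

# Toy fact rows (made-up numbers) with quarter, country, and item dimensions.
sales = [
    {"quarter": "1Qtr", "country": "U.S.A", "item": "TV", "dollars_sold": 100},
    {"quarter": "1Qtr", "country": "Canada", "item": "TV", "dollars_sold": 40},
    {"quarter": "2Qtr", "country": "U.S.A", "item": "PC", "dollars_sold": 250},
    {"quarter": "2Qtr", "country": "Mexico", "item": "VCR", "dollars_sold": 30},
]


def roll_up(facts, dims, measure="dollars_sold"):
    """Aggregate `measure` over the given dimensions (a GROUP BY on `dims`)."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


by_quarter = roll_up(sales, ["quarter"])
by_country_item = roll_up(sales, ["country", "item"])
grand_total = roll_up(sales, [])
```

Dropping a dimension from `dims` corresponds to moving toward the "sum" faces of the cube; the empty dimension list gives the single grand-total cell.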
DBMS and Crowdsourcing

▪ Use crowd to answer DB queries
▪ Where to use crowd?
▪ How to use crowd?
▪ How to support SQL?
▪ How to devise a system?
▪ Quality?
▪ Task management
▪ Pricing
▪ Trustworthy
▪ Scalability
DBMS: privacy and security
▪ Key problems
– Access control
– Data anonymization: k-anonymity, l-diversity, t-closeness, differential privacy, secure query processing
– Balancing performance and security: partition indexing; onion encryption
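Two of the anonymization notions can be checked with a few lines of code. The sketch below assumes a made-up table whose quasi-identifiers (`zip`, `age`) are already generalized, and it uses the simplest "distinct values" reading of l-diversity.

```python
from collections import Counter, defaultdict


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


# Made-up, already generalized records.
patients = [
    {"zip": "991**", "age": "20-29", "disease": "flu"},
    {"zip": "991**", "age": "20-29", "disease": "cancer"},
    {"zip": "992**", "age": "30-39", "disease": "flu"},
    {"zip": "992**", "age": "30-39", "disease": "flu"},
]
```

This table is 2-anonymous but not 2-diverse: everyone in the second group has the same disease, so knowing that someone belongs to that group reveals the sensitive value. That gap is the motivation for l-diversity.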
Future of Database Research
The Beckman report
http://cacm.acm.org/magazines/2016/2/197411-the-beckman-report-on-databaseresearch/fulltext
▪ Research challenges
– Challenge 1: Scalable big/fast data infrastructures – parallel and distributed processing (volume)
• Query processing and optimization (process monitoring)
  – Integrate data mining, sampling, machine learning
• New hardware
• Cost-efficient storage
• High-speed data streams
• Late-bound schemas
• Consistency
• Metrics and benchmarks
The Beckman report
▪ Research challenges
– Challenge 2: Diversity in data management
• No-one-size-fits-all
• Cross-platform integration
• Programming models
• Data processing workflows
– Challenge 3: End-to-end processing of data
• Data-to-knowledge pipeline
• Tool-diversity and customizability
• Open source
• Understanding data/knowledge bases
The Beckman report
▪ Research challenges
– Challenge 4: Cloud Service
• Elasticity
• Data replication
• System administration and tuning
• Multitenancy
• Data sharing
• Hybrid clouds (cyber-physical systems)
– Challenge 5: Roles of humans in the data life cycle
• Data producer (meta-data)
• Data curators (crowdsourcing)
• Data consumers (fuzzy queries)
• Online communities (data community)
Future of Big Data techs (NSF National Priorities)
Suggestions and tips
Survey presentation/writing

▪ Presentation (18 minutes + 2-3 minutes Q&A)
– Background and motivation
• Why the problem set is important
• Application of the solutions
• Challenges
– Problem formulation:
• Input and output
• Objective function, if any
– Techniques
• For each method you surveyed
  – high-level idea
  – a summary of key techniques and major results (performance guarantees, time/storage cost, speedup, correctness guarantee, error bound, etc.)
– Evaluation
• Evaluation metric/categorization
• A comparison of algorithms/techniques; pros and cons
• Summary of experimental results
– Conclusion and Vision
• Give your opinion on how this work can be improved
• Make a connection to your own research project
General tips
▪ Every talk motivates a problem
▪ A talk is about ideas
▪ Simple slides are better
▪ A picture is worth a thousand words
▪ Keep a logical flow
▪ Prepare for questions
▪ Practice makes perfect
CPT-S 580-06
Advanced Databases
I hope you enjoyed this course and found it useful!
Thank you ☺
Course evaluation
▪ Reminder:
– VCEA course evaluations will be open April 18th through May 6th
– For students: go to the myWSU portal. The center of the page includes a BLUE COURSE EVALUATIONS window.
– You will receive an initial email announcement and two reminders. Reminders will only be sent if you have incomplete evaluations.
What is a Survey Paper??
CSE594 Fall 2009
Jennifer Wong
Oct. 14, 2009
A survey paper is…
"a paper that summarizes and organizes recent research results in
a novel way that integrates and adds understanding to work in
the field. A survey article assumes a general knowledge of the
area; it emphasizes the classification of the existing literature,
developing a perspective on the area, and evaluating trends."
As described by ACM Computing Surveys
Goals of a Survey
▪ Provide the reader with a view of existing work that is well organized and comprehensive
– Not all details must be included; which ones should or shouldn’t be?
– Make sure to cover all relevant material completely
– Logical structure of organization
– State-of-the-art view
Your survey paper should …
▪ Summarize the research in 5-8 papers on a particular topic
▪ Include your own commentary on the significance of the approach and the solutions presented in each paper
▪ Provide a critical assessment of the work that has been done
▪ Include a discussion of future research directions
▪ REMEMBER
– Everything you write in this survey paper has to be in your own words
– All ideas and paraphrases of other people's words must be correctly attributed in the body of the paper and in the references
– Any evidence of plagiarism in the survey paper will result in a failing grade
How To Find Articles
▪ Search various digital libraries
– ACM
– IEEE
– Google Scholar
▪ Try to identify research groups/faculty in the area
– Dig into their work and pointers
How To Pick Articles – In General
▪ When picking papers to read, try to:
– Pick a recent survey of the field so you can quickly gain an overview,
– Pick a paper that you can easily understand. Book chapters often give more accessible material and lengthy explanations that may give you a head start, although they may not be as up-to-date as papers,
– Pick papers that are related to each other in some way and/or that are in the same field, so that you can write a meaningful survey out of them,
– Favour papers from well-known journals and conferences,
– Favour “first” or “foundational” papers in the field (as indicated in other people’s survey papers),
– Favour more recent papers,
– Once you have identified an interesting technology to report on, follow developments in that strand of technology (e.g. time-wise and technology-wise developments),
– Find how the papers relate to each other and to your topic area (classification scheme/categorization).
Article Structure
▪ It should not be just a concatenation of paper reviews
▪ A typical structure of a paper includes:
– Title
– Abstract
– Introduction
– Body of paper
– Conclusion/Future Work
– References
Article Structure
▪ Introduction
– Importance and significance of the topic
– Discuss the background and target audience
– Summarize the surveyed research area and explain why the surveyed area has been studied
– Summarize the classification scheme you used to do the survey
– Summarize the surveyed techniques with the above classification scheme
Article Structure
▪ Survey details/Body of paper
– Present the surveyed techniques in detail using the classification scheme
– Identify the trends in the surveyed area. Give evidence for your decision
– Identify some leading research/products/companies/websites
– Identify the unresolved problems/difficulties and future research issues
Article Structure
▪ Conclusions/Future work
– Summarize the conclusions of your survey
▪ References
– List all the citations referenced in your paper
Figures
▪ Can be taken from papers as long as appropriate credit is given
– “Figure taken from [28]”.
▪ Draw your own figures to show classification or structure of the survey
▪ Use tables to organize comparisons between applications/systems/etc.
How to Cite a Reference
▪ Cite the full info about the paper
– Author names
– Paper title
– Publication details
– Page numbers
– Year, etc.
[1] Adomavicius G, Tuzhilin A., “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6. (June 2005), pp. 734-749.
In the text, use "[1]" to refer to it.
There are many bibliography formats. Select one and stick to it.
http://standards.ieee.org/guides/style/2009_Style_Manual.pdf (Chap 19)
http://sgs.umkc.edu/pdfs/ACM-STYLE-EXAMPLES.pdf
General Rules for Bibliography
▪ Avoid use of et al. in a bibliography unless the list is very long (five or more authors).
▪ Internet drafts must be marked "work in progress".
▪ Book citations include publication years, but no ISBN number.
▪ It is now acceptable to include URLs to material, but it is probably bad form to include a URL pointing to the author's web page for papers published in IEEE and ACM publications, given the copyright situation. Use it for software and other non-library material. Avoid long URLs; it may be sufficient to point to the general page and let the reader find the material. General URLs are also less likely to change.
▪ Leave a space between first names and last name, i.e., "J. P. Doe", not "J.P.Doe".
What not to do….