Analytics: SQL or NoSQL?
Richard Taylor
Chair Business Intelligence SIG
The NoSQL Movement
• 2004 BigTable (Google)
• 2007 Dynamo (Amazon)
• 2008 Cassandra (Facebook)
• Hadoop/HBase (Yahoo)
• Project Voldemort (LinkedIn)
• Meetup June 11 2009 in San Francisco
– NoSQL name proposed by Eric Evans
• NoSQL Conferences
Relational Database/SQL
Database Timeline
1969 CODASYL
- Network database
- Schema
- DDL/DML
1970 Codd: Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray: Transaction
1981 Bernstein and Goodman: Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Berenson et al: Critique of ANSI SQL Isolation Levels
1999 SQL:1999: Object Relational
2003 SQL:2003: Analytics extensions
Relational Model
• Table – n-tuple
– Column
– Row
• Normalized data
– “Atomic”
• Operations on tables:
– select, project, join
– Relationship on key
• Key
– Multi-column Key
– Primary Key
– Foreign Key
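As a rough illustration of the three table operations listed on this slide, here is a minimal Python sketch. The `stores` and `sales` tables, their columns, and the helper functions are hypothetical and are not taken from the talk; a real database would also remove duplicate rows on projection, which this sketch does not.

```python
# Tables are lists of rows; each row maps a column name to an atomic value.
stores = [
    {"store_key": 1, "city": "San Francisco"},
    {"store_key": 2, "city": "San Jose"},
]
sales = [
    {"receipt": 100, "store_key": 1, "revenue": 20.0},
    {"receipt": 101, "store_key": 2, "revenue": 35.0},
    {"receipt": 102, "store_key": 1, "revenue": 5.0},
]

def select(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Keep only the named columns of every row (duplicates are not removed)."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Equi-join: combine rows whose values agree on the shared key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Relationship on key: sales.store_key is a foreign key to stores.store_key.
sf_sales = project(
    select(join(sales, stores, "store_key"), lambda r: r["city"] == "San Francisco"),
    ["receipt", "revenue"],
)
print(sf_sales)
```

The join follows the foreign key from each sale to its store, the select filters rows, and the project drops the columns that are not wanted.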
SQL
• Designed for Transaction Processing
• Data access language (not Turing complete)
• Declarative Language (4GL)
• Good
– Easily handles simple cases
– Everyone has a Query Language
• Bad
– Impedance mismatch with procedural languages
– Complicated cases get repetitive
Normalization
• Refine design of structured data
– “Atomic”
– No repeating groups
– Data item depends on key (and nothing else)
• Avoid modification anomalies
– Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
– Allow data to be accessed from every angle
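A modification anomaly is easiest to see with a small example. The sketch below uses hypothetical sales and store data (not from the talk) to contrast a design that repeats a data item with one that stores it only once.

```python
# Denormalized: the store's city is repeated on every sale row.
denormalized = [
    {"receipt": 100, "store_key": 1, "city": "San Francisco"},
    {"receipt": 102, "store_key": 1, "city": "San Francisco"},
]
# Updating only one copy leaves the data inconsistent (a modification anomaly).
denormalized[0]["city"] = "Oakland"
cities_for_store_1 = {row["city"] for row in denormalized if row["store_key"] == 1}

# Normalized: city depends on the store key and nothing else, so it is stored once.
stores = {1: {"city": "San Francisco"}}
sales = [{"receipt": 100, "store_key": 1}, {"receipt": 102, "store_key": 1}]
stores[1]["city"] = "Oakland"  # a single update is seen by every sale
normalized_cities = {stores[s["store_key"]]["city"] for s in sales}

# The denormalized design now reports two cities for one store; the normalized one reports one.
print(len(cities_for_store_1), len(normalized_cities))
```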
Denormalization
Star Schema Example
Fact Table
– Date_key
– Store_key
– Promotion_key
– Product_key
– Receipt_number
– Quantity
– Revenue
– Unit_price
Date dimension table (joined on Date_key)
– Date_key
– Day_in_week
– Day_in_month
– Day_in_year
– Day_name
– Week_in_month
– Week_in_year
– Month_nbr
– Month_name
– Quarter
– Year
– Holiday
– Holiday_desc
– …
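A typical star schema query joins the fact table to a dimension and aggregates a measure. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the two tables; the table names `sales_fact` and `date_dim` and the sample rows are assumptions made for illustration, since the slide does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (          -- hypothetical name for the date dimension
    Date_key   INTEGER PRIMARY KEY,
    Day_name   TEXT,
    Month_name TEXT,
    Year       INTEGER
);
CREATE TABLE sales_fact (        -- hypothetical name for the fact table
    Date_key       INTEGER REFERENCES date_dim(Date_key),
    Store_key      INTEGER,
    Product_key    INTEGER,
    Receipt_number INTEGER,
    Quantity       INTEGER,
    Revenue        REAL
);
INSERT INTO date_dim VALUES (1, 'Friday', 'June', 2009), (2, 'Saturday', 'June', 2009);
INSERT INTO sales_fact VALUES
    (1, 10, 500, 9001, 2, 20.0),
    (1, 10, 501, 9001, 1, 5.0),
    (2, 11, 500, 9002, 3, 30.0);
""")

# Revenue per day name: join fact rows to the date dimension, then aggregate.
rows = conn.execute("""
    SELECT d.Day_name, SUM(f.Revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.Date_key = f.Date_key
    GROUP BY d.Day_name
    ORDER BY d.Day_name
""").fetchall()
print(rows)
```

Because attributes such as Day_name are precomputed in the dimension table, the query needs no date arithmetic, which is the point of denormalizing.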
Database Summary
• Costs
– Fixed schema
– Normalization
– Transform data on load
– Cost of scaling
– Problems with large objects
– Complicated software
• Benefits
– Mature technology
– Precise querying
– Star Schema – historic data
Tuple Store/NoSQL
Tuple Storage Systems
• Google Database System
– Chubby – Lock/metadata manager
– Google File System – Distributed file system
– Bigtable – Tuple storage on GFS
– Map Reduce – Data processing on tuples
• Other tuple stores
– Voldemort – modeled on Amazon Dynamo
– Cassandra
– HBase
– Hypertable
Tuple Store Model
• Tuple Store
– One Table
– Set of (Key, Value)
– Operate on Map
• Key
– Structured Key (Key, Column, Timestamp)
• Value
– Unstructured Value
• Operations:
– select, project
– Map Reduce
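The tuple store model can be sketched as a single Python dictionary. This is only an illustration under the assumption that the structured key is a (row key, column, timestamp) triple, loosely in the style of Bigtable; the function names and sample data are hypothetical.

```python
# One table: a map from a structured key (row key, column, timestamp)
# to an unstructured value (opaque bytes).
table = {}

def put(row_key, column, timestamp, value):
    table[(row_key, column, timestamp)] = value

def select(row_key):
    """All (key, value) pairs stored under one row key."""
    return {k: v for k, v in table.items() if k[0] == row_key}

def project(column):
    """All (key, value) pairs for one column, across every row."""
    return {k: v for k, v in table.items() if k[1] == column}

def latest(row_key, column):
    """Value with the highest timestamp for a row key and column, or None."""
    versions = [(k[2], v) for k, v in table.items()
                if k[0] == row_key and k[1] == column]
    return max(versions, key=lambda tv: tv[0])[1] if versions else None

put("example.com/a", "contents", 1, b"<html>old</html>")
put("example.com/a", "contents", 2, b"<html>new</html>")
put("example.com/a", "language", 1, b"en")
put("example.com/b", "contents", 1, b"<html>b</html>")

print(latest("example.com/a", "contents"))
print(len(select("example.com/a")), len(project("contents")))
```

There is no join: anything beyond selecting rows and projecting columns is left to a processing framework such as Map Reduce.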
Map Reduce
• Define two functions
– Map
  • Input: tuple
  • Output: list of tuples
– Reduce
  • Input: key, list of values
  • Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { list(v3) }
                         -> { (k2, v3) }
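What "the framework does the rest" means can be sketched in a few lines of single-process Python. The function `run_map_reduce` and the sample job are hypothetical; a real framework such as Hadoop distributes the same three phases across a cluster and reads from and writes to tuple stores.

```python
from collections import defaultdict

def run_map_reduce(input_tuples, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group by k2, then reduce each group."""
    # Map phase: { Map(k1, v1) } -> { list(k2, v2) }
    intermediate = []
    for k1, v1 in input_tuples:
        intermediate.extend(map_fn(k1, v1))
    # Grouping phase: { list(k2, v2) } -> { (k2, list(v2)) }
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: { Reduce(k2, list(v2)) } -> { (k2, v3) }
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Hypothetical job: total length of the values stored under each key prefix.
def map_fn(key, value):
    return [(key.split("/")[0], len(value))]

def reduce_fn(key, values):
    return sum(values)

result = run_map_reduce(
    [("a/1", "xx"), ("a/2", "yyy"), ("b/1", "z")], map_fn, reduce_fn
)
print(result)
```

The programmer writes only `map_fn` and `reduce_fn`; grouping by intermediate key is the framework's job.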
Map Reduce Example
For each web page count the number of pages that reference that page
Input tuple store is WWW
{ (URL, Web Page) }
Map Function:
for each anchor on web page,
emit (anchorURL, 1)
Reduce Function:
emit (anchorURL, sum(list))
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
Output tuple store is
{ (URL, count) }
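The example can be run on a toy input as follows. The three pages and the regular expression used to find anchors are assumptions made for illustration. The map function also de-duplicates anchors within a page, so that a page linking twice to the same URL is counted once, which is one reading of "number of pages that reference that page".

```python
import re
from collections import defaultdict

# Hypothetical input tuple store: (URL, web page contents).
www = [
    ("a.example", '<a href="b.example">b</a> <a href="c.example">c</a>'),
    ("b.example", '<a href="c.example">c</a>'),
    ("c.example", '<a href="a.example">a</a> <a href="a.example">a again</a>'),
]

def map_fn(url, page):
    # For each anchor on the web page, emit (anchorURL, 1).
    # set() counts a referencing page once even if it links twice.
    return [(anchor, 1) for anchor in set(re.findall(r'href="([^"]+)"', page))]

def reduce_fn(anchor_url, counts):
    # emit (anchorURL, sum(list))
    return (anchor_url, sum(counts))

# The grouping step that the framework would perform between map and reduce.
grouped = defaultdict(list)
for url, page in www:
    for k2, v2 in map_fn(url, page):
        grouped[k2].append(v2)

# Output tuple store: { (URL, count) }
output = dict(reduce_fn(k2, v2s) for k2, v2s in grouped.items())
print(sorted(output.items()))
```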
Example in SQL
For each web page count the number of pages that reference that page
CREATE TABLE links
( page URL NOT NULL,      -- the referencing page; URL is assumed to be a user-defined type
ref_page URL NOT NULL,    -- the page it references
PRIMARY KEY (page, ref_page)
);
SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;
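The SQL version can be tried with Python's built-in `sqlite3` module. In this sketch `TEXT` stands in for the `URL` column type, and the four sample links are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        page     TEXT NOT NULL,   -- the referencing page
        ref_page TEXT NOT NULL,   -- the page it references
        PRIMARY KEY (page, ref_page)
    )
""")
conn.executemany(
    "INSERT INTO links (page, ref_page) VALUES (?, ?)",
    [("a.example", "b.example"), ("a.example", "c.example"),
     ("b.example", "c.example"), ("c.example", "a.example")],
)

# For each referenced page, count the distinct pages that link to it.
counts = conn.execute("""
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
""").fetchall()
print(counts)
```

Note that this assumes the links have already been extracted from the pages and loaded into the table, which is the "transform data on load" cost of the relational approach.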
Tuple Store Summary
• Semi-structured data
– No need to normalize data
• Simple implementations
– Cheap, fast, scalable
• Map Reduce Processing
– Simple programming (for geeks)
• Issues
– No guidance from schema
– No model for historic data
Hadoop wins Sort Benchmark
Synthesis
Summary
• SQL
– Structured data
– Precise
– Historic data
– Needs transformation
– Scalability issues
• NoSQL
– Cheap
– Scalable
– Handles large data
Enterprise Model
Diagram labels: Money, Content, NoSQL, Relational DB, Analytics?, Metadata?
Issues:
- Data volume
- Query requirements
Analytics Architecture
Diagram labels: TB+/day; Tuple Store; Map Reduce Processing; Reports; RDB Data Warehouse; Cubes, Reports, etc.
Summary
It is all about structured data
How much do we want?
How much can we afford?