Analytics: SQL or NoSQL?
Richard Taylor
Chair, Business Intelligence SIG

The NoSQL Movement
• Meetup June 11 2009 in San Francisco
• NoSQL name proposed by Eric Evans
• Hadoop/HBase (Yahoo)
• 2004 BigTable (Google)
• 2007 Dynamo (Amazon)
• Project Voldemort (LinkedIn)
• 2008 Cassandra (Facebook)
• NoSQL Conferences

Relational Database/SQL

Database Timeline
• 1969 CODASYL: network database, schema, DDL/DML
• 1970 Codd: Relational Model
• 1974 SEQUEL
• 1979 Oracle
• 1980 Gray: Transaction
• 1981 Bernstein and Goodman: Multi-version Concurrency Control
• 1989 SQL-89
• 1992 SQL-92
• 1995 Bernstein et al: Critique of ANSI SQL Isolation Levels
• 1999 SQL:1999
• 2003 SQL:2003: Analytics extensions
• Object Relational

Relational Model
• Table: n-tuple
• Column
• Row
• Normalized data
• Operations on tables: select, project, join
• Relationship on key
• Key
  – “Atomic”
  – Multi-column Key
  – Primary Key
  – Foreign Key

SQL
• Designed for Transaction Processing
• Good
  – Easily handles simple cases
  – Everyone has a Query Language
• Bad
  – Data access language (not Turing complete)
  – Declarative Language (4GL)
  – Impedance mismatch with procedural languages
  – Complicated cases get repetitive

Normalization
• Refine design of structured data
• Avoid modification anomalies
• Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
• “Atomic”
• No repeating groups
• Data item depends on key (and nothing else)
• Allow data to be accessed from every angle

Denormalization

Star Schema Example
• Fact Table
  – Date_key
  – Store_key
  – Promotion_key
  – Product_key
  – Receipt_number
  – Quantity
  – Revenue
  – Unit_price
• Dimension table (keyed by Date_key)
  – Date_key
  – Day_in_week
  – Day_in_month
  – Day_in_year
  – Day_name
  – Week_in_month
  – Week_in_year
  – Month_nbr
  – Month_name
  – Quarter
  – Year
  – Holiday
  – Holiday_desc
  – …

Database Summary
• Costs
  – Fixed schema
  – Normalization
  – Transform data on load
  – Cost of scaling
  – Problems with large objects
  – Complicated software
• Benefits
  – Mature technology
  – Precise querying
  – Star Schema: historic data

Tuple Store/NoSQL

Tuple Storage Systems
• Google Database System
  – Chubby: Lock/metadata manager
  – Google File System: Distributed file system
  – Bigtable: Tuple storage on GFS
  – Map Reduce: Data processing on tuples
• Other tuple stores
  – Voldemort: Amazon Dynamo
  – Cassandra
  – HBase
  – Hypertable

Tuple Store Model
• Tuple Store: One Table
• Operate on a Map: Key → Value
• Set of (Key, Value)
• Structured Key: Key, Column, Timestamp
• Unstructured Value
• Operations: select, project, Map Reduce

Map Reduce
• Define two functions
  – Map
    • Input: tuple
    • Output: list of tuples
  – Reduce
    • Input: key, list of values
    • Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest

  { Map(k1, v1) } -> { list(k2, v2) }
  { list(k2, v2) } -> { (k2, list(v2)) }
  { Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }

Map Reduce Example
For each web page, count the number of pages that reference that page.
• Input tuple store is WWW: { (URL, Web Page), … }
• Map Function: for each anchor on web page, emit (anchorURL, 1)
• Reduce Function: emit (anchorURL, sum(list))
• Output tuple store is { (URL, count) }

  { Map(k1, v1) } -> { list(k2, v2) }
  { list(k2, v2) } -> { (k2, list(v2)) }
  { Reduce(k2, list(v2)) } -> { (k2, v3) }

Example in SQL
For each web page, count the number of pages that reference that page.

  CREATE TABLE links (
    page URL NOT NULL,
    ref_page URL NOT NULL,
    PRIMARY KEY (page, ref_page)
  )

  SELECT ref_page, count(DISTINCT page)
  FROM links
  GROUP BY ref_page

Tuple Store Summary
• Semi-structured data
  – No need to normalize data
• Simple implementations
  – Cheap, fast, scalable
• Map Reduce Processing
  – Simple programming (for geeks)
• Issues
  – No guidance from schema
  – No model for historic data

Hadoop wins Sort Benchmark

Synthesis

Summary
• SQL
  – Structured data
  – Precise
  – Historic data
  – Needs transformation
  – Scalability issues
• NoSQL
  – Cheap
  – Scalable
  – Handles large data

Enterprise Model
[Diagram labels: Money, Content, NoSQL, Relational DB, Analytics ?]
Issues:
- Data volume
- Query requirements
Metadata?
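The Map Reduce Example slide earlier in the deck can be sketched in a few lines of Python. This is only a single-process illustration under stated assumptions, not the framework the slides describe: the names `AnchorCollector`, `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical, the "WWW" tuple store is a small in-memory dict, and the `*.example` URLs are made up.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_fn(url, page_html):
    """Map: for each anchor on the web page, emit (anchorURL, 1)."""
    parser = AnchorCollector()
    parser.feed(page_html)
    return [(href, 1) for href in parser.hrefs]


def reduce_fn(anchor_url, counts):
    """Reduce: emit (anchorURL, sum(list))."""
    return (anchor_url, sum(counts))


def map_reduce(tuple_store, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for k1, v1 in tuple_store.items():        # { Map(k1, v1) } -> { list(k2, v2) }
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)             # { list(k2, v2) } -> { (k2, list(v2)) }
    # { Reduce(k2, list(v2)) } -> { (k2, v3) }
    return dict(reduce_fn(k2, values) for k2, values in groups.items())


# Assumed three-page "WWW" tuple store: URL -> page HTML.
www = {
    "a.example": '<a href="b.example">b</a> <a href="c.example">c</a>',
    "b.example": '<a href="c.example">c</a>',
    "c.example": '<a href="a.example">a</a>',
}

link_counts = map_reduce(www, map_fn, reduce_fn)
print(link_counts)
```

One difference from the SQL version is worth noting: the Map function as stated on the slide emits one tuple per anchor, so a page that links to the same target twice is counted twice, whereas `count(DISTINCT page)` counts each referencing page once.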
Analytics Architecture
[Diagram labels: TB+/day; Tuple Store; Map Reduce Processing; Reports; RDB Data Warehouse; Cubes; Reports etc.]

Summary
• It is all about structured data
• How much do we want?
• How much can we afford?
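One possible reading of the Analytics Architecture slide is that Map Reduce output is loaded into a relational warehouse, where reports are produced with SQL. The sketch below illustrates that reading only. The direction of the flow is an assumption (the diagram's arrows are not recoverable here), the in-memory SQLite database merely stands in for an RDB data warehouse, and the `link_counts` table, its columns, and the sample tuples are hypothetical.

```python
import sqlite3

# Assumed output of a Map Reduce job: a tuple store of (URL, count).
reduced = [("a.example", 1), ("b.example", 1), ("c.example", 2)]

# In-memory SQLite database as a stand-in for the RDB data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link_counts ("
    " url TEXT PRIMARY KEY,"
    " ref_count INTEGER NOT NULL)"
)

# Load step: the reduced tuples become rows with a fixed schema.
conn.executemany(
    "INSERT INTO link_counts (url, ref_count) VALUES (?, ?)", reduced
)
conn.commit()

# Report step: a precise SQL query over the structured result,
# most-referenced pages first.
report = conn.execute(
    "SELECT url, ref_count FROM link_counts ORDER BY ref_count DESC, url"
).fetchall()
print(report)

conn.close()
```

The split mirrors the trade-off in the summary slides: the bulk, semi-structured data stays in the tuple store, and only the much smaller structured result pays the cost of a fixed schema.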