* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Introduction to SQL
Microsoft Jet Database Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Concurrency control wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Clusterpoint wikipedia , lookup
10/26/2015 WORKSHOP: Introduction to DB and SQL • Overview of database systems – What is behind this Web Site? – Database Management Systems – What is DBMS? – Where are RDBMS used ? • Problems without an DBMS • How the Programmer Sees the DBMS • SQL Fundamentals of Website Development CSC 2320, Fall 2015 The Department of Computer Science 10/26/2015 What is behind this Web Site? -2- ©2015 http://cs.gsu.edu/~mhan7 Database Management Systems • Search on a large database Database Management System = DBMS • Specify search conditions • A collection of files that store the data • Many users • A big C program written by someone else that accesses and updates those files for you • Updates Relational DBMS = RDBMS • Access through a web interface • Data files are structured as relations (tables) 10/26/2015 -3- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -4- ©2015 http://cs.gsu.edu/~mhan7 1 10/26/2015 Where are RDBMS used ? Example of a Traditional Database Application • Backend for traditional “database” applications Suppose we are building a system – EPFL administration to store the information about: • Backend for large Websites • students – Google • courses • Backend for Web services • professors – Amazon 10/26/2015 • who takes what, who teaches what -5- ©2015 http://cs.gsu.edu/~mhan7 What is DBMS? 10/26/2015 -6- ©2015 http://cs.gsu.edu/~mhan7 Why Use a DBMS? • Need for information management • A very large, integrated collection of data. • Models real-world enterprise. – Entities (e.g., students, courses) – Relationships (e.g., John is taking CSC2320) • A Database Management System (DBMS) is a software package designed to store and manage databases. • Data independence and efficient access. • Data integrity and security. • Uniform data administration. • Concurrent access, recovery from crashes. • Replication control • Reduced application development time. 10/26/2015 -7- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -8- ©2015 http://cs.gsu.edu/~mhan7 2 10/26/2015 Why Study Databases?? ? • Shift from computation to information – at the “low end”: access to physical world – at the “high end”: scientific applications • Datasets increasing in diversity and volume. – Digital libraries, interactive video, Human Genome project, e-commerce, sensor networks – ... need for DBMS/data services exploding • DBMS encompasses several areas of CS – OS, languages, theory, AI, multimedia, logic -9- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Data Models • A data model is a collection of concepts for describing data. • A schema is a description of a particular collection of data, using the a given data model. • The relational model of data is the most widely used model today. – Main concept: relation, basically a table with rows and columns. – Every relation has a schema, which describes the columns, or fields. -10- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Can we do it without a DBMS ? Levels of Abstraction • Many views, single conceptual (logical) schema and physical schema. – – – Sure we can! Start by storing the data in files: View 1 View 2 View 3 students.txt courses.txt professors.txt Conceptual Schema Views describe how users see the data. Conceptual schema defines logical structure Physical schema describes the files and indexes used. Physical Schema Now write C or Java programs to implement specific tasks Schemas are defined using DDL; data is modified/queried using DML. 10/26/2015 -11- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -12- ©2015 http://cs.gsu.edu/~mhan7 3 10/26/2015 Doing it without a DBMS... Problems without an DBMS... Read ‘students.txt’ Read ‘courses.txt’ Find&update the record “Mary Johnson” Find&update the record “CSE444” Write “students.txt” Write “courses.txt” • Enroll “Mary Johnson” in “CSC2320”: Write a C/Java program to do the following: Read ‘students.txt’ Read ‘courses.txt’ Find&update the record “Mary Johnson” Find&update the record “CSC2320” Write “students.txt” Write “courses.txt” 10/26/2015 -13- ©2015 http://cs.gsu.edu/~mhan7 Enters a DBMS • System crashes: – What is the problem ? • Large data sets (say 500GB) CRASH ! – Why is this a problem ? • Simultaneous access by many users – Lock students.txt – what is the problem ? 10/26/2015 -14- ©2015 http://cs.gsu.edu/~mhan7 Functionality of a DBMS The programmer sees SQL, which has two components: • Data Definition Language - DDL • Data Manipulation Language - DML “Two tier system” or “client-server” – query language connection (ODBC, JDBC) Data files 10/26/2015 Database server (someone else’s C program) -15- Applications ©2015 http://cs.gsu.edu/~mhan7 Behind the scenes the DBMS has: • Query engine • Query optimizer • Storage management • Transaction Management (concurrency, recovery) 10/26/2015 -16- ©2015 http://cs.gsu.edu/~mhan7 4 10/26/2015 How the Programmer Sees the DBMS How the Programmer Sees the DBMS • Start with DDL to create tables: CREATE TABLE Students ( Name CHAR(30) SSN CHAR(9) PRIMARY KEY NOT NULL,, Category CHAR(20) ) ... • Tables: Students: SSN 123-45-6789 234-56-7890 • Continue with DML to populate tables: -17- Category undergrad grad … SSN 123-45-6789 123-45-6789 234-56-7890 Courses: CID CSC2444 CSC2541 INSERT INTO Students VALUES(‘Charles’, ‘123456789’, ‘undergraduate’) . . . . 10/26/2015 Takes: Name Charles Dan … Name Databases Operating systems CID CSC2444 CSC2541 CSC2142 … Quarter fall winter • Still implemented as files, but behind the scenes can be quite complex “data independence” = separate logical view from physical implementation ©2015 http://cs.gsu.edu/~mhan7 Transactions 10/26/2015 -18- ©2015 http://cs.gsu.edu/~mhan7 Transactions • Enroll “Mary Johnson” in “CSC2320”: • A transaction = sequence of statements that either all succeed, or all fail • Transactions have the ACID properties: BEGIN TRANSACTION; INSERT INTO Takes SELECT Students.SSN, Courses.CID FROM Students, Courses WHERE Students.name = ‘Mary Johnson’ and Courses.name = ‘CSC2320’ A = atomicity (a transaction should be done or undone completely ) C = consistency (a transaction should transform a system from one consistent state to another consistent state) I = isolation (each transaction should happen independently of other -- More updates here.... transactions ) IF everything-went-OK THEN COMMIT; ELSE ROLLBACK D = durability (completed transactions should remain permanent) If system crashes, the transaction is still either committed or aborted 10/26/2015 -19- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -20- ©2015 http://cs.gsu.edu/~mhan7 5 10/26/2015 Queries Queries, behind the scene • Find all courses that “Mary” takes SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid Declarative SQL query Imperative query execution plan: sname SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid • What happens behind the scene ? – Query processor figures out how to answer the query efficiently. cid=cid sid=sid name=“Mary” Students Takes Courses The optimizer chooses the best execution plan for a query 10/26/2015 -21- ©2015 http://cs.gsu.edu/~mhan7 Database Systems ©2015 http://cs.gsu.edu/~mhan7 • Accessing databases through web interfaces – Java programming interface (JDBC) – Embedding into HTML pages (JSP) – Access through http protocol (Web Services) • Using Web document formats for data definition and manipulation – XML, Xquery, Xpath – XML databases and messaging systems Oracle IBM (with DB2) Microsoft (SQL Server) Sybase • Some free database systems (Unix) : – Postgres – MySQL – Predator 10/26/2015 -22- Databases and the Web • The big commercial database vendors: – – – – 10/26/2015 -23- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -24- ©2015 http://cs.gsu.edu/~mhan7 6 10/26/2015 Database Integration Other Trends in Databases • Industrial • Combining data from different databases – collection of data (wrapping) – Object-relational databases – combination of data and generation of new views on – Main memory database systems the data (mediation) – Data warehousing and mining • Problem: heterogeneity • Research – access, representation, content – Peer-to-peer data management – Stream data management – Mobile data management 10/26/2015 -25- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -26- ©2015 http://cs.gsu.edu/~mhan7 AGAIN: 3 types of SQL commands A simplified schematic of a typical SQL environment • 1. Data Definition Language (DDL) commands - that define a database, including creating, altering, and dropping tables and establishing constraints • 2. Data Manipulation Language (DML) commands - that maintain and query a database • 3. Data Control Language (DCL) commands - that control a database, including administering privileges and committing data 10/26/2015 -27- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -28- ©2015 http://cs.gsu.edu/~mhan7 7 10/26/2015 SQL Data types DDL, DML, DCL, and the database development process • CHAR(n) – fixed-length character data, n characters long Maximum length = 2000 bytes • VARCHAR2(n) – variable length character data, maximum 4000 bytes • LONG – variable-length character data, up to 4GB. Maximum 1 per table • NUMBER(p,q) – general purpose numeric data type • INTEGER(p) – signed integer, p digits wide • FLOAT(p) – floating point in scientific notation with p binary digits precision • DATE – fixed-length date/time in dd-mm-yy form 10/26/2015 -29- ©2015 http://cs.gsu.edu/~mhan7 Syntax used in these notes -31- -30- ©2015 http://cs.gsu.edu/~mhan7 SQL Database Definition • Capitals = command syntax (may not be required by the RDBMS) • Lowercase = values that must be supplied by user • Brackets = enclose optional syntax • Ellipses (...) = indicate that the accompanying syntactic clause may be repeated as necessary • Each SQL command ends with a semicolon ‘;’ • In interactive mode, when the user presses the RETURN key, the SQL command will execute 10/26/2015 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 • Data Definition Language (DDL) has these major CREATE statements: – CREATE SCHEMA – defines a portion of the database owned by a particular user. Schemas are dependent on a catalog and contain schema objects, including base tables and views, domains, constraints, assertions, character sets, collations etc. – CREATE TABLE – defines a new table and its columns. The table may be a base table or a derived table. Tables are dependent on a schema. Derived tables are created by executing a query that uses one or more tables or views. 10/26/2015 -32- ©2015 http://cs.gsu.edu/~mhan7 8 10/26/2015 SQL Database Definition SQL database definition – CREATE VIEW – defines a logical table from one or more tables or views. There are limitations on updating data through a view. Where views can be updated, those changes can be transferred to the underlying base tables originally referenced to create the view. 10/26/2015 -33- ©2015 http://cs.gsu.edu/~mhan7 Creating tables 10/26/2015 -34- ©2015 http://cs.gsu.edu/~mhan7 Creating tables • Once data model is designed and normalized, the columns needed for each table can be defined using the CREATE TABLE command. The syntax for this is shown in the following Fig. These are the seven steps to follow: • 1. Identify the appropriate datatype for each , including length and precision • 2. Identify those columns that should accept null values. Column controls that indicate a column cannot be null are established when a table is created and are enforced for every update of the table 10/26/2015 • Each of the previous create commands may be reversed using a DROP command, so DROP TABLE will destroy a table (including its definitions, contents and any schemas or views associated with it) • Usually only the table creator may delete the table • DROP SCHEMA and DROP VIEW will also delete the named schema or view. • ALTER TABLE may be used to change the definition of an existing table -35- ©2015 http://cs.gsu.edu/~mhan7 • 3. Identify those columns that need to be UNQUE - when the data in that column must have a different value (no duplicates) for each row of data within that table. Where a column or set of columns is designated as UNIQUE, this is a candidate key. Only one candidate key may be designated as a PRIMARY KEY • 4. Identify all primary key-foreign key mates. Foreign keys can be established immediately or later by altering the table. The parent table in such a parent-child relationship should be created first. The column constraint REFERENCES can be used to enforce referential integrity 10/26/2015 -36- ©2015 http://cs.gsu.edu/~mhan7 9 10/26/2015 Creating tables Table creation • 5. Determine values to be inserted into any columns for which a DEFAULT value is desired - can be used to define a value that is automatically inserted when no value is provided during data entry. • 6. Identify any columns for which domain specifications may be stated that are more constrained than those established by data type. Using CHECK it is possible to establish validation rules for values to be inserted into the database • 7. Create the table and any desired indexes using the CREATE TABLE and CREATE INDEX statements 10/26/2015 -37- ©2015 http://cs.gsu.edu/~mhan7 General syntax for CREATE TABLE 10/26/2015 -38- ©2015 http://cs.gsu.edu/~mhan7 SQL database definition commands for Pine Valley Furniture Table creation • The following Fig. Shows SQL database definition commands • Here some additional column constraints are shown, and primary and foreign keys are given names • For example, the CUSTOMER table’s primary key is CUSTOMER_ID • The primary key constraint is named CUSTOMER_PK, without the constraint name a system identifier would be assigned automatically and the identifier would be difficult to read 10/26/2015 -39- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -40- ©2015 http://cs.gsu.edu/~mhan7 10 10/26/2015 STEP 1 STEP2 Defining attributes and their data types Non-nullable specifications Note: primary keys should not be null 10/26/2015 -41- ©2015 http://cs.gsu.edu/~mhan7 STEP 3 10/26/2015 -42- ©2015 http://cs.gsu.edu/~mhan7 STEP 4 Identifying foreign keys and establishing relationships Identifying primary keys This is a composite primary key 10/26/2015 -43- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -44- ©2015 http://cs.gsu.edu/~mhan7 11 10/26/2015 STEPS 5 and 6 STEP 7 Default values and domain constraints 10/26/2015 -45- ©2015 http://cs.gsu.edu/~mhan7 The Relational Model 10/26/2015 -46- ©2015 http://cs.gsu.edu/~mhan7 The Relational Model •E.F. Codd: (1923-2003) – Developed the relational model while at IBM San Jose Research Laboratory – IBM Fellow 1976 – Turing Award 1981 – ACM Fellow 1994 – British, by birth •“A Relational Model of Data for Large Shared Data Banks,” E.F. Codd, Communications of the ACM, Vol. 13, No. 6, June, 1970. •“Further Normalization of the Data Base Relational Model,” E.F. Codd, Data Base Systems, Proceedings of 6th Courant Computer Science Symposium, May, 1971. •“Relational Completeness of Data Base Sublanguages,” E.F. Codd, Data Base Systems, Proceedings of 6th Courant Computer Science Symposium, May, 1971. •Associations: – Raymond F. Boyce – Hugh Darwen – C.J. Date – Nikos Lorentzos – David McGoveran – Fabian Pascal 10/26/2015 Overall table definitions •Plus others… -47- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -48- ©2015 http://cs.gsu.edu/~mhan7 12 10/26/2015 Relational Database Management Systems (RDBMS) The Relational Model •The basic data model: – Relations, tuples, attributes, domains “Employee” – Primary & foreign keys ID Last-Name Date-of-Birth – Normal forms 21621 Smith 6/24/69 17852 Brown 32904 Carson •Database Management Systems Based on the Relational Model: Job-Category Management Hardware Software 8/14/72 10/29/64 : : •Query model: – Relational algebra – cartesian product, selection, projection, union, set-difference – Relational calculus – – – – – – – System R – IBM research project (1974) Ingres – University of California Berkeley (early 1970’s) Oracle – Rational Software, now Oracle Corporation (1974) SQL/DS – IBM’s first commercial RDBMS (1981) Informix – Relational Database Systems, now IBM (1981) DB2 – IBM (1984) Sybase SQL Server – Sybase, now SAP (1988) •A primary theme: – Physical data independence -49- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Structure Query Language (SQL) 10/26/2015 -50- ©2015 http://cs.gsu.edu/~mhan7 SQL and the Relational Model •A text search of E.F. Codd’s early papers for “SQL” (or SEQUEL) reveals: SQL is a language for querying relational databases. History: Developed at IBM San Jose Research Laboratory, early 1970’s, for System R Credited to Donald D. Chamberlin and Raymond F. Boyce Based on relational algebra and tuple calculus Originally called SEQUEL Language Elements: Clauses, expressions, predicates, queries, statements, transactions, operators, nesting etc. select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '[DATE]‘ and o_orderdate < date '[DATE]' + interval '3' month and exists (select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate) group by o_orderpriority order by o_orderpriority; 10/26/2015 -51- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -52- ©2015 http://cs.gsu.edu/~mhan7 52 13 10/26/2015 Relational Query Languages What is Wrong With RDBMS? • Nothing. One size fits all? Not really. • Impedance mismatch. – Object Relational Mapping doesn't work quite well. • Rigid schema design. • Harder to scale. • Replication. • Joins across multiple nodes? Hard. • How does RDMS handle data growth? Hard. • Need for a DBA. • Many programmers are already familiar with it. • Transactions and ACID make development easy. • Lots of tools to use. •Other Relational Query Languages: – Datalog – QUEL – Query By Example (QBE) – SQL variations – shell scripts, with relational extensions 10/26/2015 -53- ACID Semantics • • • • 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 -54- ©2015 http://cs.gsu.edu/~mhan7 Enter CAP Theorem Atomicity: All or nothing. Consistency: Consistent state of data and transactions. Isolation: Transactions are isolated from each other. Durability: When the transaction is committed, state will be durable. Any data store can achieve Atomicity, Isolation and Durability but do you always need consistency? No. • Also known as Brewer’s Theorem by Prof. Eric Brewer, published in 2000 at University of Berkeley. • “Of three properties of a shared data system: data consistency, system availability and tolerance to network partitions, only two can be achieved at any given moment.” • Proven by Nancy Lynch et al. MIT labs. • http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf By giving up ACID properties, one can achieve higher performance and scalability. 10/26/2015 -55- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -56- ©2015 http://cs.gsu.edu/~mhan7 14 10/26/2015 CAP Semantics A Simple Proof • Consistency: Clients should read the same data. There are many levels of consistency. – Strict Consistency – RDBMS. – Tunable Consistency – Cassandra. – Eventual Consistency – Amazon Dynamo. • Availability: Data to be available. • Partial Tolerance: Data to be partitioned across network segments due to network failures. Consistent and available No partition. App Data Data A -57- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 A Simple Proof B -58- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 A Simple Proof Consistent and partitioned Not available, waiting… Available and partitioned Not consistent, we get back old data. App Data App New Data Old Data Wait for new data A 10/26/2015 B -59- A ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 B -60- ©2015 http://cs.gsu.edu/~mhan7 15 10/26/2015 BASE, an ACID Alternative A Clash of cultures Almost the opposite of ACID. • Basically available: Nodes in the a distributed environment can go down, but the whole system shouldn’t be affected. • Soft State (scalable): The state of the system and data changes over time. • Eventual Consistency: Given enough time, data will be consistent across the distributed system. -61- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Distributed Transactions ©2015 http://cs.gsu.edu/~mhan7 • Solves Partitioning Problem. • Consistent Hashing, Memcahced. – Starbucks doesn’t use two phase commit by Gregor Hophe. • Possible failures – servers = [s1, s2, s3, s4, s5] – serverToSendData = servers[hash(data) % servers.length] • A New Hope – Continuum Approach. – Network errors. – Node errors. – Database errors. Coordinato r -62- 10/26/2015 Consistent Hashing • Two phase commit. Commit ACID: • Strong consistency. • Less availability. • Pessimistic concurrency. • Complex. BASE: • Availability is the most important thing. Willing to sacrifice for this (CAP). • Weaker consistency (Eventual). • Best effort. • Simple and fast. • Optimistic. Rollback Acknowledge Problems: Locking the entire cluster if one node is down Possible to implement timeouts. Possible to use Quorum. Quorum: in a distributed environment, if there is partition, then the nodes vote to commit or rollback. • Virtual Nodes in a cycle. • Hash both objects and caches. • Easy Replication. – Eventually Consistent. • What happens if nodes fail? • How do you add nodes? Complete operation Release locks 10/26/2015 -63- ©2015 http://cs.gsu.edu/~mhan7 http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforreliev ingHotSpotsontheworldwideweb.pdf 10/26/2015 -64©2015 http://cs.gsu.edu/~mhan7 16 10/26/2015 Concurrency models Vector Clocks • Optimistic concurrency. • Pessimistic concurrency. • MVCC. • Used for conflict detection of data. • Timestamp based resolution of conflicts is not enough. Time 1: Time 2: Replicated Time 3: Update Time 4: Update Time 5: -65- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Vector Clocks 10/26/2015 Replicated Conflict detection -66- ©2015 http://cs.gsu.edu/~mhan7 Read Repair Document.v.1([A, 1]) A Value = Data.v2 Update Document.v.2([A, 2]) Client GET (K, Q=2) A Value = Data.v2 Document.v.2([A, 2],[B,1]) B C Update K = Data.v2 Document.v.2([A, 2],[C,1]) Value = Data.v1 Conflicts are detected. 10/26/2015 -67- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -68- ©2015 http://cs.gsu.edu/~mhan7 17 10/26/2015 Gossip Protocol & Hinted Handoffs Data Models • Most preferred communication protocol in a distributed environment is Gossip Protocol. A • All the nodes talk to each other peer wise. • There is no global state. • No single point of coordinator. • If one node goes down and there is a Quorum load for that node is shared among others. • Self managing system. G • If a new node joins, load is also distributed. D H F 10/26/2015 Requests coming to F will be handled by the nodes who takes the load of F, lets say C with the hint that it took the requests which was for F, when F becomes available, F will get this Information from C. Self healing property. -69- Key/Value Pairs. Tuples (rows). Documents. Columns. Objects. Graphs. There are corresponding data stores. B C • • • • • • ©2015 http://cs.gsu.edu/~mhan7 Complexity 10/26/2015 -70- ©2015 http://cs.gsu.edu/~mhan7 Key-Value Stores • Memcached – Key value stores. • Membase – Memcached with persistence and improved consistent hashing. • AppFabric Cache – Multi region Cache. • Redis – Data structure server. • Riak – Based on Amazon’s Dynamo. • Project Voldemort – eventual consistent key value stores, auto scaling. 10/26/2015 -71- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -72- ©2015 http://cs.gsu.edu/~mhan7 18 10/26/2015 Memcached • • • • • • • • Membase • • • • Very easy to setup and use. Consistent hashing. Scales very well. In memory caching, no persistence. LRU eviction policy. O(1) to set/get/delete. Atomic operations set/get/delete. No iterators, or very difficult. 10/26/2015 -73- • • • • • • • -74- ©2015 http://cs.gsu.edu/~mhan7 Microsoft AppFabric • • • • • • Distributed Data structure server. Consistent hashing at client. Non-blocking I/O, single threaded. Values are binary safe strings: byte strings. String : Key/Value Pair, set/get. O(1) many string operations. Lists: lpush, lpop, rpush, rpop.you can use it as stack or queue. O(1). Publisher/Subscriber is available. • Set: Collection of Unique elements, add, pop, union, intersection etc. set operations. • Sorted Set: Unique elements sorted by scores. O(logn). Supports range operations. • Hashes: Multiple Key/Value pairs HMSET user 1 username foo password bar age 30 HGET user 1 age 10/26/2015 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Redis Easy to manage via web console. Monitoring and management via Web console . Consistency and Availability. Dynamic/Linear Scalability, add a node, hit join to cluster and rebalance. Low latency, high throughput. Compatible with current Memcached Clients. Data Durability, persistent to disk asynchronously. Rebalancing (Peer to peer replication). Fail over (Master/Slave). vBuckets are used for consistent hashing. O(1) to set/get/delete. -75- ©2015 http://cs.gsu.edu/~mhan7 • • • • • • • • Add a node to the cluster easily. Elastic scalability. Namespaces to organize different caches. LRU Eviction policy. Timeout/Time to live is default to 10 min. No persistence. O(1) to set/get/delete. Optimistic and pessimistic concurrency. Supports tagging. 10/26/2015 -76- ©2015 http://cs.gsu.edu/~mhan7 19 10/26/2015 Document Stores Mongodb • Data types: bool, int, double, string, object(bson), oid, array, null, date. • Database and collections are created automatically. • Lots of Language Drivers. • Capped collections are fixed size collections, buffers, very fast, FIFO, good for logs. No indexes. • Object id are generated by client, 12 bytes packed data. 4 byte time, 3 byte machine, 2 byte pid, 3 byte counter. • Possible to refer other documents in different collections but more efficient to embed documents. • Replication is very easy to setup. You can read from slaves. • Schema Free. • Usually JSON like interchange model. • Query Model: JavaScript or custom. • Aggregations: Map/Reduce. • Indexes are done via B-Trees. 10/26/2015 -77- ©2015 http://cs.gsu.edu/~mhan7 Mongodb -78- ©2015 http://cs.gsu.edu/~mhan7 Mongodb - Sharding • Connection pooling is done for you. Sweet. • Supports aggregation. – Map Reduce with JavaScript. • You have indexes, B-Trees. Ids are always indexed. • Updates are atomic. Low contention locks. • Querying mongo done with a document: – Lazy, returns a cursor. – Reduceable to SQL, select, insert, update limit, sort etc. • There is more: upsert (either inserts of updates) – Several operators: • $ne, $and, $or, $lt, $gt, $incr,$decr and so on. • Repository Pattern makes development very easy. 10/26/2015 10/26/2015 -79- ©2015 http://cs.gsu.edu/~mhan7 Config servers: Keeps mapping Mongos: Routing servers Mongod: master-slave replicas 10/26/2015 -80- ©2015 http://cs.gsu.edu/~mhan7 20 10/26/2015 Couchdb Objectivity • Availability and Partial Tolerance. • Views are used to query. Map/Reduce. • MVCC – Multiple Concurrent versions. No locks. – A little overhead with this approach due to garbage collection. – Conflict resolution. • Very simple, REST based. Schema Free. • Shared nothing, seamless peer based Bi-Directional replication. • Auto Compaction. Manual with Mongodb. • Uses B-Trees • Documents and indexes are kept in memory and flushed to disc periodically. • Documents have states, in case of a failure, recovery can continue from the state documents were left. • No built in auto-sharding, there are open source projects. • You can’t define your indexes. -81- 10/26/2015 • • • • • No need for ORM. Closer to OOP. Complex data modeling. Schema evolution. Scalable Collections: List, Set, Map. Object relations. – Bi-Directional relations • ACID properties. • Blazingly fast, uses paging. • Supports replication and clustering. 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Column Stores ©2015 http://cs.gsu.edu/~mhan7 Cassandra Row oriented Id username email Department 1 John [email protected] Sales 2 Mary [email protected] Marketing 3 Yoda [email protected] IT • • • • • • Column oriented Id Username email Department 1 John [email protected] Sales 2 Mary [email protected] Marketing 3 Yoda [email protected] IT 10/26/2015 -82- -83- ©2015 http://cs.gsu.edu/~mhan7 • • • • Tunable consistency. Decentralized. Writes are faster than reads. No Single point of failure. Incremental scalability. Uses consistent hashing (logical partitioning) when clustered. Hinted handoffs. Peer to peer routing(ring). Thrift API. Multi data center support. 10/26/2015 -84- ©2015 http://cs.gsu.edu/~mhan7 21 10/26/2015 Cassandra at Netflix Graph Stores • Based on Graph Theory. • Scale vertically, no clustering. • You can use graph algorithms easily. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 10/26/2015 -85- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Neo4J -86- ©2015 http://cs.gsu.edu/~mhan7 Which one to use? • Key-value stores: – Processing a constant stream of small reads and writes. Document databases: – Natural data modeling. Programmer friendly. Rapid development. Web friendly, CRUD. • RDMBS: – OLTP. SQL. Transactions. Relations. • OODBMS – Complex object models. • Data Structure Server: – Quirky stuff. • Columnar: – Handles size well. Massive write loads. High availability. Multiple-data centers. MapReduce • Graph: – Graph algorithms and relations. • Want more ideas ? http://highscalability.com/blog/2011/6/20/35-use-cases-for-choosing-your-nextnosql-database.html • Nodes, Relationship. • • Traversals. • HTTP/REST. • ACID. • Web Admin. • Not too much support for languages. • Has transactions. 10/26/2015 -87- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -88- ©2015 http://cs.gsu.edu/~mhan7 22 10/26/2015 The NoSQL RDBMS Operator/stream Paradigm Commonly referenced papers: “The Next Generation,” E. Schaffer and M. Wolf, UNIX Review, March, 1991, page 24. “The UNIX Shell as a Fourth Generation Language,” E. Schaffer and M. Wolf, Revolutionary Software. One of first uses of the phrase NoSQL is due to Carlo Strozzi, circa 1998. NoSQL: Regarding Database Management Systems: “…almost all are software prisons that you must get into and leave the power of UNIX behind.” A fast, portable, open-source RDBMS A derivative of the RDB database system (Walter Hobbs, RAND) “…large, complex programs which degrade total system performance, especially when they are run in a multi-user environment.” Not a full-function DBMS, per se, but a shell-level tool User interface – Unix shell “…put walls between the user and UNIX, and the power of UNIX is thrown away.” Based on the “operator/stream paradigm” In summary: Relational model => yes UNIX => big yes Big, COTS, relational DBMS => no SQL => no http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page 10/26/2015 -89- ©2015 http://cs.gsu.edu/~mhan7 The NoSQL RDBMS -90- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 NoSQL Today More recently: The term has taken on different meanings One common interpretation is “not only SQL” •Getting back to Strozzi’s NoSQL RDBMS: – Based on the relational model – Based on UNIX and shell scripts – Does not have an SQL interface Most modern NoSQL systems diverge from the relational model or standard RDBMS functionality: The data model: •In that sense, and interpreted literally, NoSQL means “no sql,” i.e., we are not using the SQL language. The query model: The implementation: relations tuples attributes domains normalization vs. documents graphs key/values relational algebra tuple calculus vs. graph traversal text search map/reduce rigid schemas vs. flexible schemas (schema-less) ACID compliance vs. BASE In that sense, NoSQL today is more commonly meant to be something like “non-relational” 10/26/2015 -91- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -92- ©2015 http://cs.gsu.edu/~mhan7 23 10/26/2015 NoSQL Today NoSQL Today (a partial, unrefined list) Motivation for recent NoSQL systems is also quite varied: “…there are significant advantages to building our own storage solution at Google,” Chang et. al., 2006 Scalability, performance, availability, flexibility Speculation - $$$, control MySQL vs. MongoDB: • http://www.youtube.com/watch?v=b2F-DItXtZs How “big” is the NoSQL movement? Is this another grand conspiracy by the government and, you know, that guy…. -93- ©2015 http://cs.gsu.edu/~mhan7 NoSQL Today • http://www.vertabelo.com/blog/vertabelo-news/jdd-2013-what-we-found-out-about-databases It is easy to find diagrams that look like this: http://db-engines.com/en/ranking_categories It is easy to find diagrams that look like this: • http://www.odbms.org/2014/11/gartner-2014-magic-quadrant-operational-database-managementsystems-2/ 10/26/2015 Hypertable BigTable QD Technology Vertica Qbase–MetaCarta OpenNeptune Accumulo Stratosphere flare SmartFocus KDI Alterian Cloudera C-Store HPCC Mongo DB Amazon SimpleDB SciDB CouchDB Clusterpoint ServerTerrastore Djondb SchemaFreeDB Jackrabbit OrientDB Perservere CoudKit RaptorDB ThruDB RavenDB DynamoDB Azure Table Storage Couchbase Server Riak LevelDB Chordless GenieDB Scalaris Tokyo Kyoto Cabinet Tyrant Faircom C-Tree Berkeley DB Voldemort Dynomite KAI MemcacheDB Tarantool/Box Maxtable Pincaster RaptorDB TIBCO Active Spaces Hibari BangDB SDB JasDB Scalien HamsterDB STSdb allegro-C nessDBHyperDex Mnesia LightCloud OpenLDAP/MDB/Lightning Scality Redis KaTree TomP2P Kumofs TreapDB NMDB luxio actord Keyspace schema-free RAMCloud SubRecord Mo8onDb Dovetaildb JDBM Neo4 InfiniteGraph DEX BrightstarDB Sones InfoGrid HyperGraphDB GraphBase Trinity AllegroGraph Bigdata Meronymy OpenLink Virtuoso VertexDB FlockDB Execom IOG Java Univ Netwrk/Graph Framework OpenRDF/Sesame Filament OWLim iGraph Jena SPARQL ArangoDB AlchemyDB Soft NoSQL Systems Db4o Versant Objectivity Starcounter ZODB Magma NEO siaqodb Sterling Morantex EyeDB FramerD NetworkX PicoList Ninja Database Pro StupidDB Hazelcast KiokuDB Perl solution OrientDb Durus GigaSpaces Infinispan Queplix GridGain Galaxy SpaceBase JoafipCoherence eXtremeScale MarkLogic Server EMC Documentum xDB eXist Sedna BaseX Qizx Berkeley DB XML Xindice Tamino Intersystems Cache GT.M EGTM ESENT MultiValue Globals U2 OpenInsight Reality OpenQM eXtremeDB RDM Embedded ISIS Family Prevayler Yserial Vmware vFabric GemFire KirbyBase Tokutek Recutils FileDB Armadillo illuminate Correlation Database FluidDB Fleet DB Twisted Storage Rindo Sherpa tin Dryad SkyNet Disco MUMPS Adabas Oracle Big Data Appliance 10/26/2015 jBASE Lotus/Domino Btrieve XAP In-Memory Grid eXtreme Scale MckoiDDB Mckoi SQL Database Innostore No-List -94- KDI Perst FleetDB IODB©2015 http://cs.gsu.edu/~mhan7 Primary NoSQL Categories It is easy to find diagrams that look like this: • Cassandra Cloudata HSS Database Will they eventually eliminate the need for relational databases? 10/26/2015 Hbase -95- ©2015 http://cs.gsu.edu/~mhan7 •General Categories of NoSQL Systems: – Key/value store – (wide) Column store – Graph store – Document store •Compared to the relational model: – Query models are not as developed. – Distinction between abstraction & implementation is not as clear. 10/26/2015 -96- ©2015 http://cs.gsu.edu/~mhan7 24 10/26/2015 Key/Value Store Wide Column Store •“Dynamo: Amazon’s Highly Available Key-value Store,” DeCandia, G., et al., SOSP’07, 21st ACM •Symposium on Operating Systems Principles. •“Bigtable: A Distributed Storage System for Structured Data,” Chang, F., et al., OSDI’06: Seventh Symposium on Operating System Design and implementation, 2006. •The basic data model: – Database is a collection of key/value pairs – The key for each pair is unique •The basic data model: – Database is a collection of key/value pairs – Key consists of 3 parts – a row key, a column key, and a time-stamp (i.e., the version) – Flexible schema - the set of columns is not fixed, and may differ from row-to-row No requirement for normalization (and consequently dependency preservation or lossless join) •Primary operations: – insert(key,value) – delete(key) – update(key,value) – lookup(key) •One last column detail: – Column key consists of two parts – a column family, and a qualifier •Additional operations: – variations on the above, e.g., reverse lookup – iterators Warning #1! -97- 10/26/2015 ©2015 http://cs.gsu.edu/~mhan7 Wide Column Store -98- 10/26/2015 Wide Column Store Column families Personal data Row key Personal data ID First Name Last Name ©2015 http://cs.gsu.edu/~mhan7 Professional data ID First Name Last Name Date of Birth Job Category Salary Date of Hire ID First Name Middle Name Last Name Job Category Employer Hourly Rate ID First Name ID Last Name Employer Professional data Date of Birth Job Category Salary Date of Hire Employer Last Name Job Category Job Category Salary Employer Date of Hire Salary Column qualifiers Group Employer Seniority Insurance ID Bldg # Office # Emergenc y Contact Medical data One “table” 10/26/2015 -99- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -100- ©2015 http://cs.gsu.edu/~mhan7 25 10/26/2015 Wide Column Store Graph Store •Neo4j - “The Neo Database – A Technology Introduction,” 2006. Row key t1 t0 ID First Name Last Name Date of Birth Job Category Personal data Salary Date of Hire Employer •The basic data model: – Directed graphs – Nodes & edges, with properties, i.e., “labels” Professional data One “row” One “row” in a wide-column NoSQL database table = Many rows in several relations/tables in a relational database 10/26/2015 -101- ©2015 http://cs.gsu.edu/~mhan7 Document Store -102- ©2015 http://cs.gsu.edu/~mhan7 ACID vs. BASE MongoDB - “How a Database Can Make Your Organization Faster, Better, Leaner,” February 2015. The basic data model: The general notion of a document – words, phrases, sentences, paragraphs, sections, subsections, footnotes, etc. Flexible schema – subcomponent structure may be nested, and vary from document-to-document. Metadata – title, author, date, embedded tags, etc. Key/identifier. One implementation detail: Formats vary greatly – PDF, XML, JSON, BSON, plain text, various binary, scanned image. 10/26/2015 10/26/2015 -103- ©2015 http://cs.gsu.edu/~mhan7 •Database systems traditionally support ACID requirements: – Atomicity, Consistency, Isolation, Durability •In a distributed web applications the focus shifts to: – Consistency, Availability, Partition tolerance •CAP theorem - At most two of the above can be enforced at any given time. – Conjecture – Eric Brewer, ACM Symposium on the Principles of Distributed Computing, 2000. – Proved – Seth Gilbert & Nancy Lynch, ACM SIGACT News, 2002. •Reducing consistency, at least temporarily, maintains the other two. 10/26/2015 -104- ©2015 http://cs.gsu.edu/~mhan7 26 10/26/2015 Questions ACID vs. BASE Thus, distributed NoSQL systems are typically said to support some form of BASE: Basic Availability Soft state Eventual consistency* Thank You! “We’d really like everything to be structured, consistent and harmonious,…, but what we are faced with is a little bit of punk-style anarchy. And actually, whilst it might scare our grandmothers, it’s OK...” -Julian Browne Email: [email protected] 10/26/2015 -105- ©2015 http://cs.gsu.edu/~mhan7 10/26/2015 -106- ©2015 http://cs.gsu.edu/~mhan7 27