Relational Databases vs Non-Relational Databases vs Hadoop
Presented by James Serra
Moderated by Yusuf Kothari

Thank You
• microsoft.com: Empower users with new insights through familiar tools while balancing the need for IT to monitor and manage user-created content. Deliver access to all data types across structured and unstructured sources.
• hortonworks.com: Hortonworks develops, distributes and supports the only 100% open source distribution of Apache Hadoop explicitly architected, built and tested for enterprise-grade deployments. It is the only Hadoop-based platform available on both Linux and Windows.
• aws.amazon.com: Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale Microsoft SQL Server databases in the cloud.
• red-gate.com: Redgate makes ingeniously simple tools for Microsoft technology professionals working with SQL Server, .NET, Visual Studio, Azure, TFS. Trusted by 91% of the Fortune 100.

JOIN PASS
PASS is a not-for-profit organization which offers year-round learning opportunities to data professionals. Membership is free; join today at www.sqlpass.org
• Access to online training and content
• Enjoy discounted event rates
• Join Local Chapters and Virtual Chapters
• Get advance notice of member exclusives

BIO
James is a big data and data warehousing solution architect at Microsoft. He is a thought leader in the use and application of Big Data and advanced analytics, including solutions involving hybrid technologies of relational and non-relational data, Hadoop, MPP, IoT, Data Lake, and private and public cloud. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 30 years of IT experience. James is a popular blogger (JamesSerra.com) and speaker, having presented at dozens of PASS events including the PASS Business Analytics conference and the PASS Summit, as well as the Enterprise Data World conference.
He is the author of the book “Reporting with Microsoft SQL Server 2012”. He received a Bachelor of Science degree in Computer Engineering from the University of Nevada-Las Vegas.
twitter.com/JamesSerra
linkedin.com/in/JamesSerra

Agenda
• Definition and differences
• ACID vs BASE
• Four categories of NoSQL
• Use cases
• CAP theorem
• On-prem vs cloud
• Product categories
• Polyglot persistence
• Architecture samples

Goal
My goal is to give you a high-level overview of all the technologies so you know where to start, and to put you on the right path to be a hero!

Relational and non-relational defined
Relational databases
• Also called relational database management systems (RDBMS) or SQL databases
• Most popular are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2
• Mostly used in large enterprise scenarios (the exception is MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack)
• Analytical RDBMS (OLAP, MPP) solutions are Analytics Platform System, Teradata, Netezza
Non-relational databases
• Also called NoSQL databases
• Most popular are MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j
• Usually grouped into four categories: key-value stores, wide-column stores, document stores, and graph stores
Hadoop
• Made up of Hadoop Distributed File System (HDFS), YARN, and MapReduce

Origins
Using an RDBMS, I need to index a few thousand documents. No problem. I can use full-text search. Now I want to index a few million web pages. Problem. Enter Hadoop.
Using an RDBMS, my internal company app needs to handle a few thousand transactions a day. No problem. I can handle that with a nice-size server. Now I have a web site where users can enter millions of transactions a day. Problem. Enter NoSQL.
But keep in mind most enterprise data remains a great fit for an RDBMS (89% market share, per Gartner).
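
To make the four NoSQL categories listed above a little more concrete, here is a minimal sketch that models one small data set in each shape using plain Python structures. No real database is involved, and the record contents (the `dog_13` row, the `owner` and `visits` fields, the `OWNS` edge) are illustrative assumptions; real products add persistence, indexing, and distribution on top of these shapes.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"dog_12": json.dumps({"name": "Stella", "mood": "Happy"})}

# Wide-column store: primary key -> (column key -> value); rows may be sparse.
wide_column = {
    "dog_12": {"name": "Stella", "mood": "Happy"},
    "dog_13": {"name": "Rex"},  # this row simply has no "mood" column
}

# Document store: a hierarchical, tree-like object (JSON-style) with nesting.
document = {
    "_id": "dog_12",
    "name": "Stella",
    "owner": {"name": "Ian"},
    "visits": [{"date": "2011-03-02", "reason": "checkup"}],
}

# Graph store: nodes plus labeled edges between them.
nodes = {"ian": {"type": "person"}, "dog_12": {"type": "dog"}}
edges = [("ian", "OWNS", "dog_12")]


def neighbors(node, label):
    """Return the nodes reachable from `node` over edges with this label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


print(json.loads(key_value["dog_12"])["name"])  # value must be decoded by the caller
print(wide_column["dog_13"].get("mood"))        # missing column, no schema error
print(document["owner"]["name"])                # nested field access
print(neighbors("ian", "OWNS"))                 # edge traversal
```

The point of the comparison is the access pattern: a key-value store only knows the key, a wide-column store can address individual columns, a document store understands nested fields, and a graph store is organized around traversing relationships.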
Main differences (Relational)
Pros
• Works with structured data
• Supports strict ACID transactional consistency
• Supports joins
• Built-in data integrity
• Large ecosystem
• Relationships via constraints
• Limitless indexing
• Strong SQL
• OLTP and OLAP
Cons
• Does not scale out horizontally (concurrency and data size), only vertically, unless you use sharding
• Data is normalized, meaning lots of joins, affecting speed
• Difficulty in working with semi-structured data
• Schema-on-write
• Cost

Main differences (Non-relational or NoSQL)
Pros
• Works with semi-structured data (JSON, XML)
• Scales out (horizontal scaling: parallel query performance, replication)
• High concurrency, high volume random reads and writes
• Massive data stores
• Schema-free, schema-on-read
• Supports documents with different fields
• High availability
• Cost
• Simplicity of design: no “impedance mismatch”
• Finer control over availability
• Speed, due to not having to join tables
Cons
• Weaker or eventual consistency (BASE) instead of ACID
• Limited support for joins
• Data is denormalized
• Does not have built-in data integrity (must do in code)
• No relationship enforcement
• Limited indexing
• Weak SQL
• Limited transaction support
• Slow mass updates
• Uses 10-50x more space (replication, denormalized, documents)
• Difficulty tracking schema changes over time

Main differences (Hadoop)
Pros
• Not a type of database, but rather an open-source software ecosystem that allows for massively parallel computing
• No inherent structure (no conversion to relational or JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access
• Great for large, distributed data processing tasks where time isn’t a constraint (e.g. end-of-day reports, scanning months of historical data)
• Tradeoff: In order to make deep connections between many data points, the technology sacrifices speed
• Some NoSQL databases such as HBase are built on top of HDFS
Cons
• File system, not a database
• Not good for millions of users, random access, fast individual record lookups or updates (OLTP)
• Not so great for real-time analytics
• Lacks: indexing, metadata layer, query optimizer, memory management
• Same cons as non-relational: no ACID support, no built-in data integrity, limited indexing, weak SQL, etc.
• Security limitations

ACID (RDBMS) vs BASE (NoSQL)
ACID
• Atomicity: All data and commands in a transaction succeed, or all fail and roll back
• Consistency: All committed data must be consistent with all data rules including constraints, triggers, cascades, atomicity, isolation, and durability
• Isolation: Other operations cannot access data that has been modified during a transaction that has not yet completed
• Durability: Once a transaction is committed, data will survive system failures, and can be reliably recovered after an unwanted deletion
BASE
• Basically Available: Guaranteed availability
• Soft-state: The state of the system may change, even without a query (because of node updates)
• Eventually Consistent: The system will become consistent over time

ACID | BASE
Strong consistency | Weak consistency (stale data OK)
Isolation | Last write wins
Transaction | Programmer managed
Available/Consistent | Available/Partition tolerant
Robust database, simpler code | Simpler database, harder code

Data is stored in tables. Tables contain some number of columns, each of a type. A schema describes the columns each table can have. Every table’s data is stored in one or more rows. Each row contains a value for every column in that table. Rows aren’t kept in any particular order.
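
The atomicity and consistency properties defined above, together with the typed, schema-first table just described, can be seen in a small relational example. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the `accounts` table, its columns, the account names, and the `transfer` helper are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-on-write: column types and integrity rules are declared up front.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This update violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


print(transfer(conn, "alice", "bob", 30), balances(conn))   # succeeds
print(transfer(conn, "alice", "bob", 500), balances(conn))  # fails, nothing changes
```

In the second call the credit to `bob` is applied first and then undone when the debit breaks the balance rule, which is the "all succeed, or all fail and roll back" behavior. In a store without transactions, the application code would have to detect and repair that half-finished state itself.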
Relational stores
Thanks to: Harri Kauhanen, http://www.slideshare.net/harrikauhanen/nosql-3376398

Key-value stores
Key-value stores offer very high speed via the least complicated data model: anything can be stored as a value, as long as each value is associated with a key or name.
Example: key “dog_12”: value_name “Stella”, value_mood “Happy”, etc.

Wide-column stores
Wide-column stores are fast and can be nearly as simple as key-value stores. They include a primary key, an optional secondary key, and anything stored as a value. Keys and values can be sparse or numerous.

Document stores
Document stores contain data objects that are inherently hierarchical, tree-like structures (most notably JSON or XML). Not Word documents!

Graph store
[Figure: example graph. Labels recoverable from the diagram: Name: Ian; Name: Alan; Title: Mythical Bridges; Forgotten Bridges; “Purchased” edges dated 03-02-2011, 05-07-2011, and 09-09-2011]

Use cases for NoSQL categories
• Key-value stores [Redis]: For cache, queues, data that fits in memory, rapidly changing data, storing blob data. Examples: shopping cart, session data, leaderboards, stock prices. Fastest performance
• Wide-column stores [Cassandra]: Real-time querying of random (non-sequential) data, huge number of writes, sensors. Examples: web analytics, time series analytics, real-time data analysis, banking industry. Internet scale
• Document stores [MongoDB]: Flexible schemas, dynamic queries, defined indexes, good performance on big DB. Examples: order data, customer data, log data, product catalog, user-generated content (chat sessions, tweets, blog posts, ratings, comments). Fastest development
• Graph databases [Neo4j]: Graph-style data, social network, master data management, network and IT operations. Examples: social relations, real-time recommendations, fraud detection, identity and access management, graph-based search, web browsing, portfolio analytics, gene sequencing, class curriculum

[Figure: Velocity. Footnote: * Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)]

Focus of different data models
…you may not have the data volume for NoSQL (yet), but there are other reasons to use NoSQL (semi-structured data, schemaless, high availability, etc.)

Relational NewSQL stores are designed for web-scale applications, but still require up-front schemas, joins, and table management that can be labor intensive. They blend RDBMS with NoSQL: they provide the same scalable performance as NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system.

Use case for different database technologies
• Traditional OLTP business systems (e.g. ERP, CRM, in-house app): relational database
• Data warehouses (OLAP): relational database (SMP or MPP) or Hadoop (the deciding factor is end-user speed)
• Web and mobile global OLTP applications: non-relational database
• Data lake: Hadoop HDFS with querying (Drill), refining (Hive), machine learning (Mahout)
• Relational and scalable OLTP: NewSQL

CAP Theorem
It is impossible for any shared data system to guarantee simultaneously all of the following three properties:
• Consistency: Once data is written, all future requests will contain the data. “Is the data I’m looking at now the same if I look at it somewhere else?”
• Availability: The database is always available and responsive. “What happens if my database goes down?”
• Partitioning: If part of the database is unavailable, other parts are unaffected. “What if my data is on a different node?”
Relational: CA
Non-relational: AP (Cassandra, CouchDB, Riak); CP (HBase, DocumentDB, MongoDB, Redis)
NoSQL can’t be both consistent and available.
If there are two nodes (A and B) and B goes down, then if node A keeps taking requests, it is available but not consistent with node B. If node A stops taking requests, it remains consistent with node B but it is not available. An RDBMS is consistent and available because it only has one node/partition (so no partition tolerance).

Microsoft data platform solutions
(Each entry lists Product, Category, Description, and More Info.)

SQL Server 2016
  Category: RDBMS
  Description: Earned top spot in Gartner’s Operational Database magic quadrant. JSON support
  More info: https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/

SQL Database
  Category: RDBMS/DBaaS
  Description: Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support (preview)
  More info: https://azure.microsoft.com/en-us/services/sql-database/

SQL Data Warehouse
  Category: MPP RDBMS/DBaaS
  Description: Cloud-based service that handles big data. Provision and scale quickly. Can pause service to reduce cost
  More info: https://azure.microsoft.com/en-us/services/sql-data-warehouse/

Analytics Platform System (APS)
  Category: MPP RDBMS
  Description: Big data analytics appliance for high performance and seamless integration of all your data
  More info: https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/

Azure Data Lake
  Category: Hadoop storage
  Description: Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics
  More info: https://azure.microsoft.com/en-us/solutions/data-lake/

DocumentDB
  Category: PaaS NoSQL: Document Store
  Description: Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax
  More info: https://azure.microsoft.com/en-us/services/documentdb/

HDInsight
  Category: PaaS Hadoop compute
  Description: A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy
  More info: https://azure.microsoft.com/en-us/services/hdinsight/

Azure Table Storage
  Category: PaaS NoSQL: Key-value Store
  Description: Store large amounts of semi-structured data in the cloud
  More info: https://azure.microsoft.com/en-us/services/storage/tables/
PolyBase
  Description: Query relational and non-relational data with T-SQL

DocumentDB consistency options
• Strong, which is the slowest of the four, but is guaranteed to always return correct data
• Bounded staleness, which ensures that an application will see changes in the order in which they were made. This option does allow an application to see out-of-date data, but only within a specified window, e.g., 500 milliseconds
• Session, which ensures that an application always sees its own writes correctly, but allows access to potentially out-of-date or out-of-order data written by other applications
• Eventual, which provides the fastest access, but also has the highest chance of returning out-of-date data

On-prem vs Cloud
• On-prem: SQL Server, APS, MongoDB, Oracle, Cassandra, Neo4j
• IaaS Cloud: SQL Server in Azure VM, Oracle in Azure VM
• DBaaS/PaaS Cloud: SQL Database, SQL Data Warehouse, DocumentDB, Redshift, RDS, MongoLab

Product Categories
[Figure: product category diagrams, including an Azure version. Product names recoverable from the image: APS, SQL DW, Redis, DocumentDB, Couchbase, PostgreSQL, SQL Database, SQLite, OrientDB]

db-engines.com/en/ranking
Method of calculation:
• Number of mentions of the system on websites
• General interest in the system
• Frequency of technical discussions about the system
• Number of job offers in which the system is mentioned
• Number of profiles in professional networks in which the system is mentioned
• Relevance in social networks
db-engines.com/en/ranking_definition
db-engines.com/en/ranking_categories
NoSQL = 14%

Polyglot Persistence
• Sometimes a relational store is the right choice, sometimes a NoSQL store is the right choice
• Sometimes you need more than one store: using the right tool for the right job

Summary
Choose NoSQL when…
• You are bringing in new data with a lot of volume and/or variety
• Your data is non-relational/semi-structured
• Your team will be trained in these new technologies (NoSQL)
• You have enough information to correctly select the type and product of NoSQL for your situation
• You can relax transactional consistency when scalability or performance is more important
• You can service a large number of user requests vs rigorously enforcing business rules

Relational databases are created for strong consistency, but at the cost of speed and scale. NoSQL slightly sacrifices consistency across nodes for both speed and scalability.
NoSQL and Hadoop are viable technologies for a subset of specialized needs and use cases.
Lines are getting blurred: do your homework!

Bottom line!
• RDBMS for enterprise OLTP and ACID compliance, or databases under 1TB
• NoSQL for scaled OLTP and JSON documents
• Hadoop for big data analytics (OLAP)

Resources
• Relational database vs Non-relational databases: http://bit.ly/1HXn2Rt
• Types of NoSQL databases: http://bit.ly/1HXn8Zl
• What is Polyglot Persistence? http://bit.ly/1HXnhMm
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
• Hadoop and Microsoft: http://bit.ly/20Cg2hA