Relational Databases vs Non-Relational
Databases vs Hadoop
Presented by James Serra
Moderated by Yusuf Kothari
JOIN PASS
PASS is a not-for-profit organization which offers year-round learning opportunities to data professionals.
Membership is free, join today at www.sqlpass.org
• Access to online training and content
• Enjoy discounted event rates
• Join Local Chapters and Virtual Chapters
• Get advance notice of member exclusives
BIO
James is a big data and data warehousing solution architect at
Microsoft. He is a thought leader in the use and application of
Big Data and advanced analytics, including solutions involving
hybrid technologies of relational and non-relational data,
Hadoop, MPP, IoT, Data Lake, and private and public cloud.
Previously he was an independent consultant working as a Data
Warehouse/Business Intelligence architect and developer. He is
a prior SQL Server MVP with over 30 years of IT
experience. James is a popular blogger (JamesSerra.com) and
speaker, having presented at dozens of PASS events including
the PASS Business Analytics conference and the PASS Summit,
as well as the Enterprise Data World conference. He is the
author of the book “Reporting with Microsoft SQL Server
2012”. He received a Bachelor of Science degree in Computer
Engineering from the University of Nevada-Las Vegas.
twitter.com/JamesSerra
linkedin.com/in/JamesSerra
Agenda
• Definition and differences
• ACID vs BASE
• Four categories of NoSQL
• Use cases
• CAP theorem
• On-prem vs cloud
• Product categories
• Polyglot persistence
• Architecture samples
Goal
My goal is to give you a high-level overview of all the technologies, so you know where to start, and to put you on
the right path to be a hero!
Relational and non-relational defined
Relational databases
• Also called relational database management systems (RDBMS) or SQL databases
• Most popular are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2
• Mostly used in large enterprise scenarios (exception is MySQL, which is mostly used to store data for
web applications, typically as part of the popular LAMP stack)
• Analytical RDBMS (OLAP, MPP) solutions are Analytics Platform System, Teradata, Netezza
Non-relational databases
• Also called NoSQL databases
• Most popular are MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j
• Usually grouped into four categories: Key-value stores, Wide-column stores, Document stores and
Graph stores
Hadoop: Made up of Hadoop Distributed File System (HDFS), YARN and MapReduce
Origins
Using RDBMS, I need to index a few thousand documents.
No problem. I can use full-text search.
Now I want to index a few million web pages.
Problem. Enter Hadoop.
Using RDBMS, my internal company app needs to handle a few thousand transactions a day.
No problem. I can handle that with a nice size server.
Now I have a web site where users can enter millions of transactions a day.
Problem. Enter NoSQL.
But keep in mind most enterprise data remains a great fit for an RDBMS (89% market share – Gartner).
Main differences (Relational)
Pros
• Works with structured data
• Supports strict ACID transactional consistency
• Supports joins
• Built-in data integrity
• Large eco-system
• Relationships via constraints
• Limitless indexing
• Strong SQL
• OLTP and OLAP
Main differences (Relational)
Cons
• Does not scale out horizontally (concurrency and data size), only vertically, unless you use sharding
• Data is normalized, meaning lots of joins, affecting speed
• Difficulty in working with semi-structured data
• Schema-on-write
• Cost
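The sharding caveat in the cons above can be sketched as hash-based routing: each row key is hashed to pick one of several databases. This is a minimal sketch, not any product's implementation; the shard names, the `shard_for` function, and the customer keys are hypothetical, and real systems often use consistent hashing or range-based schemes instead.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database server.
SHARDS = ["shard_0", "shard_1", "shard_2"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Route a row to a shard by hashing its key (stable across processes)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

if __name__ == "__main__":
    for customer_id in ["cust-1001", "cust-1002", "cust-1003"]:
        print(customer_id, "->", shard_for(customer_id))
```

Once rows live on different shards, joins and transactions that span shards become the application's job, which is part of why sharding an RDBMS is treated as an exception rather than the default.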
Main differences (Non-relational or NoSQL)
Pros
• Works with semi-structured data (JSON, XML)
• Scales out (horizontal scaling – parallel query performance, replication)
• High concurrency, high volume random reads and writes
• Massive data stores
• Schema-free, schema-on-read
• Supports documents with different fields
• High availability
• Cost
• Simplicity of design: no “impedance mismatch”
• Finer control over availability
• Speed, due to not having to join tables
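To make "schema-on-read" and "documents with different fields" concrete, here is a small sketch contrasting the two approaches. It assumes SQLite (via Python's built-in `sqlite3`) as a stand-in relational store and plain JSON strings as a stand-in document store; the `customer` table and the `twitter` field are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the table definition is fixed before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ian')")
try:
    # A column the schema does not know about is rejected at write time.
    conn.execute("INSERT INTO customer (id, name, twitter) VALUES (2, 'Alan', '@alan')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Schema-on-read: documents are stored as-is; fields are interpreted when read.
raw_docs = [
    '{"id": 1, "name": "Ian"}',
    '{"id": 2, "name": "Alan", "twitter": "@alan"}',
]
docs = [json.loads(d) for d in raw_docs]
# The reader decides how to handle a field that some documents lack.
handles = [d.get("twitter", "(none)") for d in docs]
print(rejected, handles)
```

The flexibility has a price: the default for a missing field now lives in application code rather than in the database.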
Main differences (Non-relational or NoSQL)
Cons
• Weaker or eventual consistency (BASE) instead of ACID
• Limited support for joins
• Data is denormalized
• Does not have built-in data integrity (must do in code)
• No relationship enforcement
• Limited indexing
• Weak SQL
• Limited transaction support
• Slow mass updates
• Uses 10-50x more space (replication, denormalized, documents)
• Difficulty tracking schema changes over time
Main differences (Hadoop)
Pros
• Not a type of database, but rather an open-source software ecosystem that allows for massively
parallel computing
• No inherent structure (no conversion to relational or JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access
• Great for large, distributed data processing tasks where time isn’t a constraint (e.g. end-of-day
reports, scanning months of historical data)
• Tradeoff: In order to make deep connections between many data points, the technology
sacrifices speed
• Some NoSQL databases such as HBase are built on top of HDFS
Main differences (Hadoop)
Cons
• File system, not a database
• Not good for millions of users, random access, fast individual record lookups or updates (OLTP)
• Not so great for real-time analytics
• Lacks: indexing, metadata layer, query optimizer, memory management
• Same cons as non-relational: no ACID support, no built-in data integrity, limited indexing, weak SQL, etc.
• Security limitations
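The batch-oriented MapReduce model mentioned above can be imitated in a few lines of plain Python. This is only an in-process sketch of the map, shuffle, and reduce steps, not the Hadoop API; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data big files", "data lake"]))
```

In a real cluster the map calls run in parallel on the nodes that hold each file block, which is why the model suits full scans and not individual record lookups.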
ACID (RDBMS) vs BASE (NoSQL)
ACID
• ATOMICITY: All data and commands in a transaction succeed, or all fail and roll back
• CONSISTENCY: All committed data must be consistent with all data rules including constraints, triggers, cascades, atomicity, isolation, and durability
• ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed
• DURABILITY: Once a transaction is committed, data will survive system failures, and can be reliably recovered after an unwanted deletion
BASE
• Basically Available: Guaranteed Availability
• Soft-state: The state of the system may change, even without a query (because of node updates)
• Eventually Consistent: The system will become consistent over time
ACID | BASE
Strong Consistency | Weak Consistency – stale data OK
Isolation | Last Write Wins
Transaction | Programmer Managed
Available/Consistent | Available/Partition Tolerant
Robust Database/Simpler Code | Simpler Database, Harder Code
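Atomicity, the "A" in ACID, is easiest to see with a failed transfer. The sketch below uses Python's built-in `sqlite3` as a stand-in RDBMS; the `account` table, the `transfer` helper, and the balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # Violates the CHECK constraint when src has too little money.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))
```

Calling `transfer(conn, 'A', 'B', 500)` fails on the second update, and the credit already applied to B is rolled back with it. In a BASE system this all-or-nothing behavior is typically what ends up "programmer managed".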
Relational stores
• Data stored in tables.
• Tables contain some number of columns, each of a type.
• A schema describes the columns each table can have.
• Every table’s data is stored in one or more rows.
• Each row contains a value for every column in that table.
• Rows aren’t kept in any particular order.
Thanks to: Harri Kauhanan, http://www.slideshare.net/harrikauhanen/nosql-3376398
Key-value stores
Key-value stores offer very high speed via the least complicated data model: anything can be stored as a value, as long as each value is associated with a key or name.
[Figure: a Key mapped to a Value]
Key “dog_12”: value_name “Stella”, value_mood “Happy”, etc
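The "dog_12" example can be sketched with a toy in-memory store. This is not any product's API; the `KeyValueStore` class is hypothetical, and serializing the value as JSON is just one choice, since the store itself treats values as opaque bytes.

```python
import json

class KeyValueStore:
    """Toy in-memory key-value store: the store never inspects the value."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
# The application, not the store, decides how the value is encoded.
store.put("dog_12", json.dumps({"name": "Stella", "mood": "Happy"}).encode("utf-8"))
dog = json.loads(store.get("dog_12"))
print(dog["name"], dog["mood"])
```

Because lookups go only through the key, there is no way to ask for "all happy dogs" without scanning every value. That is the trade made for speed.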
Wide-column stores
Wide-column stores are fast and can be nearly as simple as key-value stores. They include a primary key, an optional secondary key, and anything stored as a value.
[Figure: a primary key and secondary key mapped to values; keys and values can be sparse or numerous]
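One way to picture the primary key / secondary key layout is a dictionary of dictionaries, where each row holds only the cells that were actually written. The sketch below is a toy model, not Cassandra's or HBase's API; the `WideColumnTable` class, the sensor IDs, and the timestamps are hypothetical.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: primary key -> {secondary key -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, primary_key, secondary_key, value):
        self._rows[primary_key][secondary_key] = value

    def get(self, primary_key, secondary_key, default=None):
        return self._rows.get(primary_key, {}).get(secondary_key, default)

    def row(self, primary_key):
        """Return one row's cells ordered by secondary key."""
        return sorted(self._rows.get(primary_key, {}).items())

readings = WideColumnTable()
readings.put("sensor-1", "2016-01-01T00:00", 21.5)
readings.put("sensor-1", "2016-01-01T00:05", 21.7)
readings.put("sensor-2", "2016-01-01T00:00", 19.0)  # sparse: no 00:05 cell
print(readings.row("sensor-1"))
```

Rows can differ in how many cells they hold, which is what "sparse or numerous" refers to.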
Document stores
Document stores contain data objects that are inherently hierarchical, tree-like structures (most notably JSON or XML). Not Word documents!
Graph store
[Figure: graph with nodes "Name: Ian", "Name: Alan", "Title: Mythical Bridges", and "Title: Forgotten Bridges", connected by "Purchased" edges dated 03-02-2011, 05-07-2011, and 09-09-2011]
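A graph store answers questions by following edges. The sketch below reuses the names, titles, and dates from the graph store slide, but which person purchased which title on which date is an assumption made for illustration, as are the `titles_bought_by` and `also_bought` helpers.

```python
# Edges as (person, relationship, title, date).
# ASSUMPTION: the pairing of people, titles, and dates is invented for this sketch.
edges = [
    ("Ian", "PURCHASED", "Mythical Bridges", "03-02-2011"),
    ("Ian", "PURCHASED", "Forgotten Bridges", "05-07-2011"),
    ("Alan", "PURCHASED", "Forgotten Bridges", "09-09-2011"),
]

def titles_bought_by(person):
    """Follow PURCHASED edges outward from one person node."""
    return {title for p, rel, title, _ in edges if p == person and rel == "PURCHASED"}

def also_bought(person):
    """Titles bought by people who share at least one purchase with `person`."""
    mine = titles_bought_by(person)
    peers = {p for p, rel, title, _ in edges if title in mine and p != person}
    recommendations = set()
    for peer in peers:
        recommendations |= titles_bought_by(peer) - mine
    return recommendations

print(also_bought("Alan"))
```

This two-hop walk (person to title to other person to their titles) is the shape of the "real-time recommendations" use case. In a relational store it would typically be a self-join on a purchases table.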
Use cases for NoSQL categories
• Key-value stores: [Redis] For cache, queues, fit in memory, rapidly changing data, store blob data. Examples: shopping cart, session data, leaderboards, stock prices. Fastest performance
• Wide-column stores: [Cassandra] Real-time querying of random (non-sequential) data, huge number of writes, sensors. Examples: Web analytics, time series analytics, real-time data analysis, banking industry. Internet scale
• Document stores: [MongoDB] Flexible schemas, dynamic queries, defined indexes, good performance on big DB. Examples: order data, customer data, log data, product catalog, user generated content (chat sessions, tweets, blog posts, ratings, comments). Fastest development
• Graph databases: [Neo4j] Graph-style data, social network, master data management, network and IT operations. Examples: social relations, real-time recommendations, fraud detection, identity and access management, graph-based search, web browsing, portfolio analytics, gene sequencing, class curriculum
[Figure: chart with axis label "Velocity"]
* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
Focus of different data models
…you may not have the data volume for NoSQL (yet), but there are other reasons to use
NoSQL (semi-structured data, schemaless, high availability, etc)
Relational NewSQL stores are designed for web-scale
applications, but still require up-front schemas, joins, and
table management that can be labor intensive.
Blend RDBMS with NoSQL: provide the same scalable
performance of NoSQL systems for OLTP read-write
workloads while still maintaining the ACID guarantees of
a traditional relational database system.
Use case for different database technologies
• Traditional OLTP business systems (e.g. ERP, CRM, in-house app): relational database
• Data warehouses (OLAP): relational database (SMP or MPP) or Hadoop (factor is end-user speed)
• Web and mobile global OLTP applications: non-relational database
• Data lake: Hadoop HDFS and querying (Drill), refining (Hive), machine learning (Mahout)
• Relational and scalable OLTP: NewSQL
CAP Theorem
Impossible for any shared data system to guarantee simultaneously all of the following three properties:
• Consistency: Once data is written, all future requests will contain the data. “Is the data I’m looking at now the same if I look at it somewhere else?”
• Availability: The database is always available and responsive. “What happens if my database goes down?”
• Partitioning: If part of the database is unavailable, other parts are unaffected. “What if my data is on a different node?”
Relational: CA
Non-relational: AP (Cassandra, CouchDB, Riak); CP (HBase, DocumentDB, MongoDB, Redis)
A distributed NoSQL store can’t be both consistent and available while its nodes are cut off from each other. Take two nodes, A and B, where B goes down: if node A keeps taking requests, it is available but not consistent with node B. If node A stops taking requests, it remains consistent with node B but is not available. An RDBMS is consistent and available because it only has one node/partition (so no partition tolerance).
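The two-node argument can be acted out with a toy simulation. This is a sketch under simplifying assumptions (writes always arrive at node A, replication is synchronous when the link is up); the `Node` and `Cluster` classes and the `mode` flag are hypothetical and do not model any specific product.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    """Two replicas; `mode` is 'AP' (stay available) or 'CP' (stay consistent)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Node("A"), Node("B")
        self.partitioned = False  # True when A cannot reach B

    def write(self, key, value):
        """Client writes through node A."""
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the request rather than let A and B diverge.
                raise RuntimeError("unavailable: cannot replicate to B")
            self.a.data[key] = value  # AP: accept the write, B is now stale
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def consistent(self):
        return self.a.data == self.b.data

ap = Cluster("AP")
ap.write("x", 1)
ap.partitioned = True
ap.write("x", 2)  # accepted, so A and B now disagree
print("AP consistent after partition:", ap.consistent())
```

Running the same sequence with `Cluster("CP")` raises on the second write: the data stays consistent, but the request is not served.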
Microsoft data platform solutions
Product | Category | Description | More Info
SQL Server 2016 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support (preview) | https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Analytics Platform System (APS) | MPP RDBMS | Big data analytics appliance for high performance and seamless integration of all your data | https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
Azure Data Lake | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/solutions/data-lake/
DocumentDB | PaaS NoSQL: Document Store | Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax | https://azure.microsoft.com/en-us/services/documentdb/
HDInsight | PaaS Hadoop compute | A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
Azure Table Storage | PaaS NoSQL: Key-value Store | Store large amount of semi-structured data in the cloud | https://azure.microsoft.com/en-us/services/storage/tables/
PolyBase: Query relational and non-relational data with T-SQL
DocumentDB consistency options
• Strong, which is the slowest of the four, but is guaranteed to always return correct data
• Bounded staleness, which ensures that an application will see changes in the order in which they were made. This option does allow an application to see out-of-date data, but only within a specified window, e.g., 500 milliseconds
• Session, which ensures that an application always sees its own writes correctly, but allows access to potentially out-of-date or out-of-order data written by other applications
• Eventual, which provides the fastest access, but also has the highest chance of returning out-of-date data
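The difference between strong, session, and eventual reads can be sketched with a primary copy and a lagging secondary. This is a toy model and not how DocumentDB is implemented; the `Store` and `Replica` classes, the version counter used as a session token, and the manual `replicate` step are all assumptions for illustration.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.data = {}

class Store:
    def __init__(self):
        self.primary, self.secondary = Replica(), Replica()

    def write(self, key, value):
        """Apply a write on the primary and return its version (the session token)."""
        self.primary.version += 1
        self.primary.data[key] = value
        return self.primary.version

    def replicate(self):
        """Bring the secondary up to date (asynchronous in a real system)."""
        self.secondary.data = dict(self.primary.data)
        self.secondary.version = self.primary.version

    def read_strong(self, key):
        return self.primary.data.get(key)

    def read_eventual(self, key):
        # Fast, but may return stale or missing data.
        return self.secondary.data.get(key)

    def read_session(self, key, session_token):
        # Use the secondary only if it has caught up to this session's own writes.
        if self.secondary.version >= session_token:
            return self.secondary.data.get(key)
        return self.primary.data.get(key)

store = Store()
token = store.write("cart", ["book"])
print(store.read_eventual("cart"), store.read_session("cart", token))
```

Before `replicate()` runs, the eventual read misses the new cart while the session read still sees it, which is the "always sees its own writes" guarantee described above.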
On-prem vs Cloud
• On-prem: SQL Server, APS, MongoDB, Oracle, Cassandra, Neo4J
• IaaS Cloud: SQL Server in Azure VM, Oracle in Azure VM
• DBaaS/PaaS Cloud: SQL Database, SQL Data Warehouse, DocumentDB, Redshift, RDS, MongoLab
Product Categories
[Figure: product logos grouped by category; surviving text labels: APS, SQL DW, Redis, DocumentDB, Couchbase, PostgreSQL, SQL Database, SQLite, OrientDB]
Azure Product Categories
db-engines.com/en/ranking
Method of calculation:
• Number of mentions of the system on websites
• General interest in the system
• Frequency of technical discussions about the system
• Number of job offers, in which the system is mentioned
• Number of profiles in professional networks, in which the system is mentioned
• Relevance in social networks
db-engines.com/en/ranking_definition
db-engines.com/en/ranking_categories
NoSQL = 14%
Polyglot Persistence
• Sometimes a relational store is the right choice, sometimes a NoSQL store is the right choice
• Sometimes you need more than one store: Using the right tool for the right job
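As a small illustration of using more than one store in one application, the sketch below keeps volatile shopping-cart data in a key-value structure and writes the final order to a relational table. A plain dict stands in for a key-value store and SQLite for the RDBMS; the `orders` table and the `add_to_cart` and `checkout` helpers are hypothetical.

```python
import sqlite3

# Relational store: order records that need transactional guarantees.
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
)

# Stand-in for a key-value store holding fast-changing cart data per session.
cart_store = {}

def add_to_cart(session_id, item, price):
    cart_store.setdefault(session_id, []).append((item, price))

def checkout(session_id, customer):
    """Move a cart out of the key-value store into a relational order row."""
    items = cart_store.pop(session_id, [])
    total = sum(price for _, price in items)
    with orders_db:  # commit the order atomically
        cur = orders_db.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
    return cur.lastrowid

add_to_cart("s1", "book", 12.5)
add_to_cart("s1", "pen", 2.5)
order_id = checkout("s1", "Ian")
print(order_id)
```

Each store does the job it is suited to: the cart can be lost or rebuilt cheaply, while the order is the record the business has to keep consistent.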
Summary
Choose NoSQL when…
• You are bringing in new data with a lot of volume and/or variety
• Your data is non-relational/semi-structured
• Your team will be trained in these new technologies (NoSQL)
• You have enough information to correctly select the type and product of NoSQL for your situation
• You can relax transactional consistency when scalability or performance is more important
• You can service a large number of user requests vs rigorously enforcing business rules
Relational databases are created for strong consistency, but at the cost of speed and scale. NoSQL slightly sacrifices
consistency across nodes for both speed and scalability.
NoSQL and Hadoop are viable technologies for a subset of specialized needs and use cases.
Lines are getting blurred – do your homework!
Bottom line!
• RDBMS for enterprise OLTP and ACID compliance, or databases under 1TB
• NoSQL for scaled OLTP and JSON documents
• Hadoop for big data analytics (OLAP)
Resources
• Relational database vs Non-relational databases: http://bit.ly/1HXn2Rt
• Types of NoSQL databases: http://bit.ly/1HXn8Zl
• What is Polyglot Persistence? http://bit.ly/1HXnhMm
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
• Hadoop and Microsoft: http://bit.ly/20Cg2hA