Shared Storage for Shared Nothing
MAKING THE CASE FOR SAN AND NAS IN BIG DATA ANALYTICS
John Webster, Senior Partner, Evaluator Group
Agenda
• Two Ways to Say Big Data:
– Big Data Storage
– Big Data Analytics
• MapReduce (i.e. Apache Hadoop and knock-offs) and the
Shared Nothing Architecture
• Scalable Database From Open Source and The Traditional
Data Warehouse Vendors
• Stream Computing and Complex Event Processing
• The Big Data Appliance
• Shared Big Data Storage For Big Data Analytics
The Storage Way to Say Big Data
Defined by architectural platform, Big Data Storage is:
Scale-out NAS
Single NameSpace, Global NameSpace File System
NAS gateway to SAN and Scale-out SAN
Object-based Storage
Defined by application, Big Data Storage is:
Storage for applications that handle large data sets and require high performance
Examples: Media & Entertainment, Oil & Gas Exploration, Life Sciences, etc.
The Analytics Way to Say Big Data
Big Data Analytics is:
A term for business intelligence (BI) processes that are different
from traditional Data Warehousing
The ability to tap unstructured data as a source for BI processes
Information delivered to users in real or near-real time (but not an
absolute requirement)
Convergence of multiple data sources
Latency introduced by storage, including networked storage, is
often assiduously avoided
Shared Storage for the Traditional Data Warehouse
[Diagram labels: sources (OLTP, Files / XML data, Log Files, Operational); Extract, Transform, Load (ETL); Data Warehouse; Archive; outputs (Schedules, Ad hoc Queries, Reports, Dashboards, Notifications)]
Storage for Big Data Analytics and Shared Nothing Architectures
Architectural model often called “Shared Nothing”
[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n connected through a network fabric; each node has its own DAS]
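A minimal sketch of the shared nothing model described above, assuming a simple hash-based placement rule. The names Node, owner, load, and total are hypothetical and chosen only for illustration; real systems use their own placement and query logic. Each node keeps its records in its own DAS (a plain dict here), and the control step only gathers partial results.

```python
import hashlib


class Node:
    """One shared nothing node: its own compute plus its own direct-attached storage."""

    def __init__(self, name):
        self.name = name
        self.das = {}  # stands in for local disks; no other node reads it directly

    def put(self, key, value):
        self.das[key] = value

    def local_sum(self):
        # Work is done where the data lives.
        return sum(self.das.values())


def owner(key, nodes):
    """Pick the node that owns a key (assumed rule: hash of the key modulo node count)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def load(records, nodes):
    """Place every record on exactly one node."""
    for key, value in records.items():
        owner(key, nodes).put(key, value)


def total(nodes):
    """Control step: ask each node for a partial result, then combine them."""
    return sum(node.local_sum() for node in nodes)
```

Usage: create a few Node objects, call load with a dict of numeric records, then call total. Only the small partial sums cross the network in this model, not the raw data.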
MapReduce and Apache Hadoop
• Apache Hadoop—Open Source project inspired by
Google’s MapReduce framework and the need for an
alternative to traditional data warehousing
• Cloudera is the commercial face of Apache Hadoop
• However, there are derivatives (Facebook and Yahoo)…
• … and “Enterprise” knock-offs (MapR/EMC Greenplum, Yahoo Hortonworks, IBM BigInsights)
The MapReduce “Shared Nothing” Framework
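For readers new to the framework, here is a minimal in-process sketch of the map, shuffle, and reduce steps, using word counting as the example. This is plain Python and not the Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical. In a real cluster the map and reduce calls would run on separate nodes against their local data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, shuffle by word, then reduce each group."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Usage: word_count(["big data", "big storage"]) counts "big" twice and the other two words once each.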
Scalable Database
• The NoSQL community is another open source way to do Big Data Analytics
– A vibrant and growing community
– Examples: MongoDB (as in “humongous”),
Terrastore
• The traditional DW vendors are responding
with:
– In-memory DB
– In-memory Hadoop
– The discovery of Flash-based SSD
Stream Processing for Real Time Analytics
Big Data Analytics delivering information in real or near-real time
StreamSQL says process first, then store
Examples: StreamBase, IBM InfoSphere Streams, Ingres VectorWise
Real time processing applications using StreamSQL today:
Equity Trading, Telecom Infrastructure Monitoring, Intelligence, eCommerce
Complex Event Processing (CEP) is a platform for real time analytics using stream processing
Source: StreamBase
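The "process first, then store" idea can be sketched in a few lines. This is plain Python, not StreamSQL syntax, and the names SlidingAverage, process_then_store, window_size, and threshold are hypothetical. Each incoming tick is analyzed against a sliding window while it is still in memory, and only afterwards appended to the store (a list standing in for a database).

```python
from collections import deque


class SlidingAverage:
    """Average over the most recent window_size values."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # old values fall off automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


def process_then_store(ticks, window_size, threshold, store):
    """ticks: iterable of (symbol, price) pairs; store: any list-like sink.

    Returns alerts for prices that exceed the sliding average by more than
    the given fraction (threshold=0.2 means 20 percent).
    """
    averages = {}
    alerts = []
    for symbol, price in ticks:
        if symbol not in averages:
            averages[symbol] = SlidingAverage(window_size)
        average = averages[symbol].update(price)
        # Analysis happens first, on the in-flight event.
        if price > average * (1 + threshold):
            alerts.append((symbol, price, average))
        # Persistence happens second.
        store.append((symbol, price))
    return alerts
```

The design choice shown here is the ordering: the alert is available before the write completes, which is why storage latency matters less to this class of application than to a store-then-query warehouse.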
The Big Data Appliance
• Big Data Analytics in a Rack
– Pre-integrates server, networking, and storage
gear
– Simplified management and implementation to
speed information delivery to users
• Aimed at the Enterprise Buyer
• All the big name vendors either have an
appliance here now or will have one
• Primary storage is DAS
Big Data Storage for Big Data Analytics
• Shared Storage as Secondary Storage for Big
Data Analytics
– Data Protection, Database of Record, Archive
– Examples: NetApp and ParAccel, EMC Data
Domain/VMAX and Greenplum
• Shared Storage as Primary Storage for Big Data
Analytics
– Examples: Calpont, Gluster, IBM GPFS, MapReduce
nodes in Virtual Machines
Shared Secondary Storage for Shared Nothing
[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n; mirrored data flows from the nodes to a backend NAS/SAN]
Node-based data mirrored to backend SAN or NAS
Reduces latency for queries that span nodes
Enhances system availability and data protection
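A minimal sketch of the mirroring idea on this slide, assuming synchronous mirroring of every write (the slide does not say whether products mirror synchronously or asynchronously). SharedBackend and MirroredNode are hypothetical names; dicts stand in for DAS and for the SAN or NAS volume.

```python
class SharedBackend:
    """Stand-in for a backend SAN or NAS volume that every node can reach."""

    def __init__(self):
        self.mirror = {}  # node name -> copy of that node's data


class MirroredNode:
    """A shared nothing node whose local writes are also copied to shared storage."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend
        self.das = {}  # primary copy on direct-attached storage
        backend.mirror.setdefault(name, {})

    def write(self, key, value):
        self.das[key] = value
        # Assumption: the mirror is updated in the same step as the local write.
        self.backend.mirror[self.name][key] = value

    def fail(self):
        """Simulate losing the node's local disks."""
        self.das = {}

    def recover(self):
        """Rebuild local storage from the mirrored copy on the backend."""
        self.das = dict(self.backend.mirror[self.name])
```

Usage: write a few keys, call fail, then recover; the node's data comes back from the backend copy, which is the availability and data protection benefit the slide lists.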
Shared Primary Storage for Shared Nothing
[Diagram: NODE 1, NODE 2, NODE 3 ... NODE n attached to scale-out NAS holding files and objects]
Eliminates centralized metadata server
Variable block sizes
Reduces latency for queries that span nodes
Supports off-site replication for DR
Can You Run Hadoop in a Virtualized Server?
[Diagram: CONTROL and NODE 1, NODE 2 ... NODE n running as VMs on top of shared SAN/NAS]
Consolidates servers
Centralized management of server/storage resources
Consolidated data protection and DR
But what about system latency?
Summary
• Big Data Analytics will produce data sets that dwarf
anything seen today.
• While Apache Hadoop and derivatives have garnered the Big Data Analytics spotlight, there are multiple alternatives.
• Proponents of shared storage for Big Data Analytics and
Shared Nothing architectures will face skeptical
practitioners.
• However, shared storage in Big Data Analytics is just
emerging and new developments are coming quickly.
Thank You!
John Webster, Senior Partner, Evaluator Group