Shared Storage for Shared Nothing
MAKING THE CASE FOR SAN AND NAS IN BIG DATA ANALYTICS
John Webster, Senior Partner, Evaluator Group

Agenda
• Two Ways to Say Big Data:
  – Big Data Storage
  – Big Data Analytics
• MapReduce (i.e. Apache Hadoop and knock-offs) and the Shared Nothing Architecture
• Scalable Database From Open Source and The Traditional Data Warehouse Vendors
• Stream Computing and Complex Event Processing
• The Big Data Appliance
• Shared Big Data Storage For Big Data Analytics

The Storage Way to Say Big Data
• Defined by architectural platform, Big Data Storage is:
  – Scale-out NAS
  – Single NameSpace, Global NameSpace File System
  – NAS gateway to SAN and Scale-out SAN
  – Object-based Storage
• Defined by application, Big Data Storage is:
  – Storage for applications that handle large data sets and require performance
  – Examples: Media & Entertainment, Oil & Gas Exploration, Life Sciences, etc.

The Analytics Way to Say Big Data
• Big Data Analytics is:
  – A term for business intelligence (BI) processes that are different from traditional Data Warehousing
  – The ability to tap unstructured data as a source for BI processes
  – Information delivered to users in real or near-real time (but not an absolute requirement)
  – Convergence of multiple data sources
• Latency introduced by storage, including networked storage, is often assiduously avoided

Shared Storage for the Traditional Data Warehouse
[Diagram labels: OLTP; Files / XML data; Log Files; Operational; Extract, Transform, Load (ETL); Archive; Data Warehouse; Schedules; Ad hoc Queries; Reports; Dashboards; Notifications]

Storage for Big Data Analytics and Shared Nothing Architectures
[Diagram: a control node and nodes 1 through n connected by a network fabric, each node with its own DAS]
• Architectural model often called “Shared Nothing”

MapReduce and Apache Hadoop
• Apache Hadoop: Open Source project inspired by Google’s MapReduce framework and the need for an alternative to traditional data warehousing
• Cloudera is the commercial face of Apache Hadoop
• However, there are derivatives (Facebook and Yahoo)…
• … and “Enterprise” knock-offs (MapR/EMC Greenplum, Yahoo Hortonworks, IBM BigInsights)

The MapReduce “Shared Nothing” Framework

Scalable Database
• The NoSQL community is another open source way to do Big Data Analytics
  – A vibrant and growing community
  – Examples: MongoDB (as in “humongous”), Terrastore
• The traditional DW vendors are responding with:
  – In-memory DB
  – In-memory Hadoop
  – The discovery of Flash-based SSD

Stream Processing for Real Time Analytics
• Big Data Analytics delivering information in real or near-real time
• StreamSQL says process first, then store
• Examples: StreamBase, IBM InfoStreams, Ingres VectorWise
• Real time processing applications using StreamSQL today: Equity Trading, Telecom Infrastructure Monitoring, Intelligence, eCommerce
• Complex Event Processing (CEP) is a platform for real time analytics using stream processing
Source: StreamBase

The Big Data Appliance
• Big Data Analytics in a Rack
  – Pre-integrates server, networking, and storage gear
  – Simplified management and implementation to speed information delivery to users
• Aimed at the Enterprise Buyer
• All the big name vendors either have an appliance here now or will have one
• Primary storage is DAS

Big Data Storage for Big Data Analytics
• Shared Storage as Secondary Storage for Big Data Analytics
  – Data Protection, Database of Record, Archive
  – Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum
• Shared Storage as Primary Storage for Big Data Analytics
  – Examples: Calpont, Gluster, IBM GPFS, MapReduce nodes in Virtual Machines

Shared Secondary Storage for Shared Nothing
[Diagram: a control node and nodes 1 through n, with mirrored data on a back-end NAS/SAN]
• Node-based data mirrored to backend SAN or NAS
• Reduces latency for queries that span nodes
• Enhances system availability and data protection

Shared Primary Storage for Shared Nothing
[Diagram: a control node and nodes 1 through n on scale-out NAS holding files and objects]
• Eliminates centralized metadata server
• Variable block sizes
• Reduces latency for queries that span nodes
• Supports off-site replication for DR

Can You Run Hadoop in a Virtualized Server?
[Diagram: a control node and nodes 1 through n running as VMs on SAN/NAS]
• Consolidates servers
• Centralized management of server/storage resources
• Consolidated data protection and DR
• But what about system latency?

Summary
• Big Data Analytics will produce data sets that dwarf anything seen today.
• While Apache Hadoop and derivatives have garnered the Big Data Analytics spotlight, there are multiple alternatives.
• Proponents of shared storage for Big Data Analytics and Shared Nothing architectures will face skeptical practitioners.
• However, shared storage in Big Data Analytics is just emerging and new developments are coming quickly.

Thank You!
John Webster, Senior Partner, Evaluator Group
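
Backup: the deck names the MapReduce "Shared Nothing" framework without showing its mechanics. The following is a minimal single-process sketch of the idea, not Hadoop's API and not taken from the presentation. It assumes each inner list stands in for one node's local DAS data, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(local_lines):
    """Run on one node against that node's local data only (no shared storage)."""
    for line in local_lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(mapped_per_node, num_reducers):
    """Route every (key, value) pair to one reducer, chosen by a stable hash of the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_per_node:
        for key, value in pairs:
            # Sum of the key's bytes: deterministic across runs, unlike built-in hash() on str.
            idx = sum(key.encode("utf-8")) % num_reducers
            buckets[idx][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Combine all values collected for each key in one reducer's bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(node_partitions, num_reducers=2):
    """Simulate map -> shuffle -> reduce over per-node partitions in a single process."""
    mapped = [map_phase(lines) for lines in node_partitions]
    buckets = shuffle(mapped, num_reducers)
    result = {}
    for bucket in buckets:
        # Each key lives in exactly one bucket, so update() never overwrites a count.
        result.update(reduce_phase(bucket))
    return result


if __name__ == "__main__":
    # Hypothetical data: two "nodes", each holding its own lines.
    partitions = [
        ["big data storage", "big data analytics"],
        ["shared nothing", "shared storage"],
    ]
    print(word_count(partitions))
```

In this sketch only the shuffle step moves data between nodes; the map step reads nothing but local data, which is the property that makes storage latency and data placement central to the shared-storage question raised in the deck.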