The Vertica Database – simply fast
Extreme analysis of structured data
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

The Challenges of Big Data: Volume, Variety, Velocity, Complexity
• Worldwide information volume is growing at a minimum rate of 59% annually
• 70% of currently deployed data warehouses will not scale sufficiently to meet new information volume and complexity demands by 2014
• 70% to 85% of data is "complex mixed data types"
• Only 2% of corporations can deliver the right information, at the right time, to support enterprise outcomes all of the time
Source: Gartner, Coleman Parkes, October 2011

Michael Stonebraker – No Limits, No Compromises
Conceived by legendary database guru Michael Stonebraker, the HP Vertica Analytics Platform was purpose-built from the first line of code for Big Data analytics. Why? Because it was clear that data warehouses and "business-as-usual" practices were limiting technologies, causing businesses to make painful compromises. Vertica was designed with speed, scalability, simplicity, and openness at its core, and architected to handle analytical workloads via a distributed, compressed, columnar architecture.
• Main architect of the INGRES and POSTGRES relational DBMSs
• Founder of various venture-capital-backed startups: Ingres Corporation, Illustra Information Technologies (acquired by Informix Corporation), StreamBase Systems, and Vertica

Big Data "5 Vs" – Break the Walls of Current Technologies
Volume, Variety, Velocity, Value – a real-time data analytics platform purpose-built for Big Data = VERTICA
• 50x–1,000x faster performance at 30% the cost, proven by hundreds of customers and OEMs
Database and Analytics Architecture
Sources (CRM, orders, ERP, finance, unstructured data) feed a data-integration layer (select, extract, transform, integrate and load) into the data management platform – transactional/OLTP systems, the enterprise data warehouse, and data marts – which drives transactions and manages and stores information. On top sit the business reporting and analytic applications that generate insight: reports, apps, OLAP, executive dashboards, and information delivery/visualization.

Big Data Opportunities Across Industries and Use Cases
Big data use cases are business-driven and cut across a wide range of industries and functions:
• Finance: fraud detection, anti-money laundering, risk management
• Government: law enforcement, counter-terrorism, traffic-flow optimization
• Telecom: broadcast monitoring, churn prevention, advertising optimization
• Manufacturing: supply-chain optimization, defect tracking, RFID correlation, warranty management
• Energy: weather forecasting, natural-resource exploration
• Healthcare: drug development, scientific research, evidence-based medicine
Horizontal use cases: churn mitigation, social media analytics, logistics optimization, cross- and up-sell, pricing optimization, clickstream analysis, loyalty & promotion analysis, customer behavior analysis, influencer analysis, revenue assurance, IT infrastructure analysis, web application optimization.
Sources: IDC, 2012, "Worldwide Big Data Technology and Services Forecast: 2011–2015"; Gartner, 2012, "Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016".

Telecommunications
• 7 of the top 10 global telecommunications firms run their business on HP Vertica
• Revenue & service assurance and fraud detection
• Sensor & device management and performance monitoring
• Subscriber insights and targeted marketing and advertising

"HP Vertica opened doors to analyses that otherwise were too time-intensive or impossible.
A larger team of business managers now have faster, easier access to more information. That knowledge is invaluable in an aggressively competitive market like ours." – Brian Harvell, Executive Director, Comcast Network Operations

Comcast
Comcast is the largest cable communications company (CSP) in the United States, serving tens of millions of consumer and enterprise customers.
Key Stats
• Scaling to hundreds of TB
• 5x DL380 cluster
Importance of Data
• Comcast's network has millions of components, and there are billions of metrics that could indicate a potential service interruption or other problem (network quality, customer experience management, churn reduction)
• Inserts 46,000 new rows of SNMP data per second, 24x7 (5.5 TB/year)
Challenge
• Load 50K+ samples per second
• Query response times of 1–2 seconds
• Annual detail views, not just weekly
• Deliver at least 10:1 data compression
• Scale to accommodate 40+ terabytes (TB) of data using standard hardware
Business Benefits – performance exceeded the minimum requirements
• Compression improved from the required 10:1 to 10.7:1
• Queries processed in under a second
• Room for growing requirements: a sustained data-insertion rate of more than 130K rows/sec
Competitive Landscape
• Open source: not scalable
• Faster than other column-store DB technologies
• High level of support (working on delivering higher compression)

Facebook
Leading social website focused on connecting the world; largest database in the world; drives revenue through targeted online marketing and revenue from data.
Key Stats
• Hundreds of PB
• 500 TB/day
Importance of Data
"At Facebook, we move incredibly fast. It's important for us to be able to handle massive amounts of data in a respectful way without compromising speed, which is why HP Vertica is such a perfect fit."
– Tim Campos, CIO, Facebook
Challenge
• 6 queries took 1 day in Hadoop; Exadata could not scale up; Teradata was too expensive
• Increase revenue from information through a massive volume and variety of queries, profiling people for the right advertising campaigns
Business Benefits
• Those 6 queries now run in 1 minute, not days – queries complete in minutes
• Growth had been fuelled by advertising income, which leapt 66 per cent year-on-year
• Facebook did not have any mobile advertising revenue 18 months ago
Competitive Landscape
• Exadata could not scale; Teradata was too expensive; "Hadoop not enough"

Zynga
World's leading social game provider, growing rapidly across web and mobile, with 3rd-party games on the Zynga Platform.
Key Stats
• ~60 billion rows/day
• ~10 TB of semi-structured data daily
• ~1.5 PB of source data
• Largest cluster: 230 2U nodes
• ~260m MAUs, ~60m average DAUs worldwide
Importance of Data / Challenge
• Churn rate of 50% per month
• The first thing the Zynga team did was evaluate graph engines (dedicated software for graph analysis); however, none of the solutions they evaluated would operate at the necessary scale or performance
• Viral coefficient: users likely to cause their friends to sign up
• Revenue per user
Business Benefits
• 1. Capture events → 2. What happened → 3. Why did it happen → 4. Create advantage
• Make every aspect of the game more profitable by significantly improving the player experience

2,500+ customers

MPP-Columnar DBMS With a Unique Combination of Innovations
• Standard SQL interface – leverages existing BI, ETL, Hadoop/MapReduce, and OLTP investments
• Column orientation – no disk I/O bottleneck; simultaneously load and query
• MPP (Massively Parallel Processing) – native DB-aware clustering on low-cost x86 Linux nodes
• High availability – built-in redundancy that also speeds up queries
• Advanced encoding – minimize I/O using 14+ algorithms
• Automatic database design – automatic setup, optimization, and DB management

Column Store – Column-Based Disk I/O
I have a table with every test score for every US student for the last twenty years.
How can I provide sub-second query response times?

select avg( Score ) from example
where Class = 'Junior' and Gender = 'F' and Grade = 'A'

Column Store – Reads 3 Columns
A column store answers this query by scanning only the columns the query names – the three predicate columns (Class, Gender, Grade) plus Score – while a row store must read every column of every row.
[Figure: column-store I/O touching just the needed columns, contrasted with row-store I/O scanning entire rows of the sample table.]

Column Store – Sort and Encode for Speed
[Figure: the original fact table (Student_ID, Name, Gender, Class, Score, Grade) – billions of rows – shown first in arrival order, then with the predicate columns (Gender, Class, Grade) moved to the front.]
[Figure: the same 16-row sample sorted on Gender, then Class, then Grade. Correlated values are effectively "indexed" by the preceding column values: within Gender = 'F' the Class values are sorted, and within each Class the Grade values are sorted.]
[Figure: with the sorted layout, the example query needs only a few small I/Os – one per column. The first I/O reads the Gender column (an offset locates the 'F' range); the second narrows Class to 'Junior'; the third narrows Grade to 'A'; the fourth reads just the matching Score values.]
Example query: select avg( Score ) from example where Class = 'Junior' and Gender = 'F' and Grade = 'A'
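The column-at-a-time I/O pattern above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Vertica's implementation: the table is stored as one array per column (as if each were a separate file on disk), so the query touches only the columns it names.

```python
# Toy column store: each column is a separate list. Only the columns a
# query names are "read"; the other columns (Name, Student_ID) are never touched.
table = {
    "Gender": ["F", "F", "M", "F", "F", "M"],
    "Class":  ["Junior", "Junior", "Senior", "Junior", "Soph", "Junior"],
    "Grade":  ["A", "A", "A", "C", "A", "A"],
    "Score":  [90, 100, 76, 76, 95, 91],
}

def avg_score(table):
    # select avg(Score) from example
    # where Class = 'Junior' and Gender = 'F' and Grade = 'A'
    rows = range(len(table["Score"]))
    matches = [i for i in rows
               if table["Class"][i] == "Junior"
               and table["Gender"][i] == "F"
               and table["Grade"][i] == "A"]
    scores = [table["Score"][i] for i in matches]
    return sum(scores) / len(scores)

print(avg_score(table))  # averages only the matching Score values
```

A row store would have to materialize every row (all six columns) before filtering; here the filter reads three narrow columns and the aggregate reads a fourth.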
Advanced Compression
Vertica replaces slower disk I/O with fast CPU cycles through aggressive compression.
• Uses properties of the data, such as sort order and cardinality
• Operates across large numbers of rows
• Data can be operated upon without decoding it first
• Implements late materialization: values are decoded intelligently, but as late as possible
• No hidden costs

Encoding Mechanisms
[Figure: a sample table (Transaction Date, Customer ID, Trade price) shown raw and in encoded form.]
• Few values, sorted → RLE (run-length encoding)
• Many integer values, maybe sorted → DeltaVal (delta encoding)
• Many values, sorted → GCD, and many others (14+ algorithms in total)

Just-In-Time Decoding
• Disk: encoding + compression
• Buffer pool: decompress only
• Engine: operates on encoded blocks
• Network: encoded blocks + optional LZO
• Results are decoded just in time

Massively Parallel Processing (MPP)
A shared-nothing, grid-based database architecture that scales using industry-standard hardware.
• Designed to scale outwards
• Automatic replication, failover, and recovery
• Add nodes online to optimise capacity and performance
[Figure: a client network and private data network in front of peer nodes, each with 2 × 6- or 8-core CPUs, 96+ GB of RAM, and 4–10 TB of storage.]
• No specialized nodes – all nodes are peers
• Query or load through any node
• Continuous, real-time load and query

Native High Availability
• RAID-like function within the database
• Projections are distributed amongst nodes for redundancy
• No need for manual log-based recovery
• Vertica continues to load and query even when a node is down
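The projection redundancy behind this high-availability scheme can be sketched as follows. This is a toy model, not Vertica's actual segmentation or placement algorithm: each segment is written to a "home" node and to a buddy copy on the next node in the ring, so any single node can fail without losing data (k-safety = 1).

```python
# Toy sketch of buddy-projection placement (an illustration only).
def place_segments(segments, n_nodes):
    """Return {node: set(segments)} with each segment stored twice."""
    nodes = {n: set() for n in range(n_nodes)}
    for i, seg in enumerate(segments):
        home = i % n_nodes
        buddy = (home + 1) % n_nodes      # buddy copy on the next node in the ring
        nodes[home].add(seg)
        nodes[buddy].add(seg)
    return nodes

def readable_segments(nodes, down):
    """Segments still readable when node `down` is offline."""
    up = {n: segs for n, segs in nodes.items() if n != down}
    return set().union(*up.values())

cluster = place_segments(["A1", "B1", "C1", "A2", "B2", "C2"], 3)
# With any one node down, every segment is still available from a buddy:
assert readable_segments(cluster, down=0) == {"A1", "B1", "C1", "A2", "B2", "C2"}
```

Because every segment exists on two peers, a surviving node can serve reads for a failed neighbour while the failed node is rebuilt from those copies – no manual log-based recovery is needed.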
Missing data is recovered from other nodes within the cluster.
[Figure: Vertica 3-node cluster – node 1 holds segments A1/B1/C1, node 2 holds A2/B2/C2, node 3 holds A3/B3/C3, and each node also stores buddy copies of another node's segments.]

Automatic Design & Administration
The Vertica Database Designer recommends an optimised database design for the best performance for user query needs, minimising DBA effort on essential physical database design. The Database Designer runs and deploys while the database is online, without impacting existing processing.
The DBA provides:
• The logical schema (CREATE TABLE statements)
• A "sample set" of typical queries, and sample data
• The fault-tolerance level (k-safety)
The Database Designer generates a physical schema and compression settings that make the queries in the sample set run fast, fit within trickle-load requirements, and ensure all SQL queries can be answered.
[Figure: example projections such as (A B C | A) and (B A C | B A).]

Standard SQL Interface
Vertica supports ANSI SQL-99 plus analytics, to minimise the integration effort with existing BI and ETL tools.
• Simple integration via SQL, ODBC, JDBC, and ADO.NET database connectors
• Bulk and trickle loads; ETL, replication, and data-quality tools
• Vertica's Hadoop Connector
• Analytics and reporting tools

Real-time Analytics
Real-time analytics on large volumes of data is a reality for the Vertica database. A hybrid storage structure enables concurrent load and query via an asynchronous "Tuple Mover" process.
• Write Optimized Store (WOS): memory-based, unsorted, uncompressed; handles low-latency trickle loads and small quick inserts; holds the current epoch
• Read Optimized Store (ROS): on disk; sorted, compressed, and segmented; large data loads go direct to the ROS
• The Tuple Mover asynchronously moves data from the WOS to the ROS; the current epoch advances monotonically (user defined)
• Historical queries run against closed epochs without locks; inserts, deletes, updates, and up-to-date queries see the latest epoch

SQL Analytics+ – Built for Big Data
Features
• Time series gap filling and interpolation
• Event window functions and sessionization
• Social graphing
• Pattern matching
• Event series joins
• Statistical functions
• Geospatial functions
Benefits
• High performance (keep data close to the CPU)
• Low cost (industry-standard building blocks)
• Ease of use (automated + available)
Use Cases
• Tickstore data cleanups
• CDR/VOD data analysis
• Clickstream sessionization
• Data aggregation and compression
• Monte Carlo simulation
• Graph algorithms
• Sensor data and process-control time series
• SmartGrid

User Defined Extensions in R
What is R?
• An open-source language for statistical computing
• A wide range of packages available for advanced data mining and statistical analysis
Advantages of UDx in R
• Vertica automatically parallelizes the execution of user-defined R code across the cluster
• Optimized data transfer between Vertica and R

Combining the Power of Vertica and Hadoop
Vertica: designed for performance, interactive analytics, and a rich SQL ecosystem.
Hadoop: designed for fault tolerance, batch analytics, and a rich programming model.
Both are purpose-built, scalable analytics platforms; they integrate via MapReduce, HDFS, and HCatalog.
Read: http://www.vertica.com/2011/09/21/counting-triangles/

Vertica – The New Database Technology: Main Features
Vertica enables:
• Columnar store – 50x–1,000x faster performance
• Advanced compression – 1:10 compression rate
• Massively Parallel Processing (MPP) – linear scalability from TBs to PBs
• Automatic Database Designer – one DBA can handle PBs of data
• Native high availability – no single point of failure, up to 49% resiliency
• Standard SQL interface – simple integration with existing ETL and BI solutions

Thanks
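As a closing illustration of the first feature under SQL Analytics+ – time-series gap filling and interpolation – here is a minimal Python sketch of the idea. Vertica exposes this capability in SQL rather than in application code; the function name, the uniform output grid, and the linear-interpolation choice here are illustrative assumptions only.

```python
# Toy time-series gap fill: project a sparse, sorted series of
# (time, value) samples onto a regular grid, linearly interpolating
# between the known samples (illustrative, not Vertica's implementation).
def gap_fill(samples, step):
    out = []
    i = 0
    t = samples[0][0]
    end = samples[-1][0]
    while t <= end:
        # advance to the last sample at or before t
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        if i + 1 == len(samples):
            out.append((t, samples[i][1]))        # at/after the last sample
        else:
            (t0, v0), (t1, v1) = samples[i], samples[i + 1]
            frac = (t - t0) / (t1 - t0)
            out.append((t, v0 + frac * (v1 - v0)))  # linear interpolation
        t += step
    return out

# Two real samples, four ticks apart, filled onto a 1-tick grid:
print(gap_fill([(0, 10.0), (4, 18.0)], 1))
```

The same shape of computation – pick an output tick spacing, then interpolate or carry values into the gaps – is what makes irregular sensor, tick, or CDR data joinable and aggregable on a common time axis.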