Vertica Real-Time
Analytics Engine
The Vertica Real-Time Analytics Engine

• Leverages BI, ETL, Hadoop/MapReduce and OLTP investments
• Built-in redundancy that also speeds up queries
• No disk I/O bottleneck: simultaneously load & query
• Automatic setup, optimization, and DB management
• Native DB-aware clustering on low-cost x86 Linux nodes
• Up to 90% space reduction using 10+ algorithms
• 50x – 1000x faster than traditional RDBMS
• Scales from TB to PB with industry-standard hardware
• Simple integration with existing ETL and BI solutions
• SQL-99+ compliant
• Ultimate deployment flexibility
• Extended advanced analytics
• 24/7 Load & Query
Column Orientation
Vertica intelligently organizes data on disk for each column
• Only reads the columns involved in the query from disk instead of every row and column
• Reads and writes in very large block sizes
• All operations in the query engine built for columnar execution
Ideal for load-/read-intensive workloads with dramatic reduction of disk I/O
SELECT avg(price)
FROM tickstore
WHERE symbol = 'AAPL'
  AND date = '5/06/09'
[Figure: the same tickstore data laid out two ways. Column Store - Reads 3 columns. Row Store - Reads all columns. Sample values shown: symbols AAPL, AAPL, BBY, BBY; prices 143.74, 143.75, 37.03, 37.13; dates 5/05/09, 5/06/09, 5/05/09, 5/06/09; the remaining columns hold exchange codes such as NYSE and NQDS.]
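The difference between the two layouts can be sketched in a few lines of Python. This is only a toy illustration of the idea, not Vertica's storage engine: the table contents reuse the sample values from the slide, the pairing of exchange codes with rows is made up, and the function names are hypothetical.

```python
# Toy tickstore: (symbol, exchange, price, date). The exchange values are illustrative.
COLUMNS = ("symbol", "exchange", "price", "date")
ROWS = [
    ("AAPL", "NYSE", 143.74, "5/05/09"),
    ("AAPL", "NYSE", 143.75, "5/06/09"),
    ("BBY", "NYSE", 37.03, "5/05/09"),
    ("BBY", "NYSE", 37.13, "5/06/09"),
]


def to_column_store(rows):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def avg_price_row_store(rows, symbol, date):
    """Row layout: every column of every row is fetched, then filtered."""
    values_read = 0
    prices = []
    for row in rows:
        values_read += len(row)
        if row[0] == symbol and row[3] == date:
            prices.append(row[2])
    return sum(prices) / len(prices), values_read


def avg_price_column_store(store, symbol, date):
    """Column layout: only the three columns named in the query are fetched."""
    needed = ("symbol", "date", "price")
    values_read = sum(len(store[name]) for name in needed)
    prices = [
        p
        for s, d, p in zip(store["symbol"], store["date"], store["price"])
        if s == symbol and d == date
    ]
    return sum(prices) / len(prices), values_read
```

Both functions return the same average for the query above; in this toy table the row layout touches 16 values and the column layout 12, and the gap widens as the table gains columns the query never mentions.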
Advanced Compression
Vertica replaces slower disk I/O with fast CPU cycles through aggressive compression
• Encoding uses properties of the data such as sorting and cardinality
• Encoded data can be operated upon without decoding first
• Implements late materialization
• Data is decoded intelligently and as late as possible
Encoding Mechanism
Transaction Date (few values, sorted): RLE
  Raw: 5/05/2009, 5/05/2009, 5/05/2009, 5/05/2009, ...
  Encoded: 5/05/2009, 16
Customer ID (many values, integer, maybe sorted): DeltaVal
  Raw: 0000001, 0000001, 0000003, 0000003, 0000005, 0000011, 0000011, 0000020, 0000026, 0000050, 0000051, 0000052
  Encoded (as shown on the slide): 0000001, then 0, 2, 2, 4, 10, 10, 19, 25, 49
Many Others…
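The two encodings named on this slide can be sketched as follows. This is a minimal illustration and not Vertica's on-disk format: run-length encoding stores each value once with a repeat count, and the delta sketch stores each value as an offset from the smallest value in the block. All function names are hypothetical.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def rle_count_equal(runs, target):
    """Count matching rows directly on the encoded runs, without decoding."""
    return sum(length for value, length in runs if value == target)


def delta_encode(values):
    """Store the block minimum once, then each value as an offset from it."""
    base = min(values)
    return base, [value - base for value in values]


def delta_decode(base, offsets):
    """Rebuild the original integers from the base and the offsets."""
    return [base + offset for offset in offsets]
```

With the slide's data, sixteen identical transaction dates collapse to a single pair, and the customer IDs become a base of 1 plus small offsets that need fewer bits than the full IDs. `rle_count_equal` hints at why encoded data "can be operated upon without decoding first": a count over a sorted, run-length-encoded column only has to look at the runs.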
[Figure: bar chart comparing raw data with compressed data on a 0% to 100% scale, by data type. Values shown per type: Clickstream 10, Audit 10, Trading 5, SNMP 20, Network Logs 60, Marketing 20, Consumer 30, CDR 8.]
Automatic Design & Administration
Database Designer recommends a physical database design that provides the best performance for the user's workload
• Minimize the time that DBAs spend on physical database tuning
• Re-run incrementally to optimize for changing workloads over time
• Background process runs "on-the-fly"
DBA Provides
• Logical schema (Create table)
• "Sample set" of typical queries and sample data
• K-safety level
Database Designer Generates
Physical schema, compression to:
• Make queries in sample set run fast
• Fit within trickle load requirements
• Ensure all SQL queries can be answered
[Figure: a table with columns A, B, C stored as two projections, labeled (A B C | A) and (B A C | B A).]
Real-time Analytics
Real-time analytics on large volumes of data
Hybrid storage architecture
• Concurrent load / query enabled by asynchronous "tuple mover" process
Vertica achieves very low data latency (seconds) AND full context (store years of detailed history)
Load performance scales with cluster size – proven at 40+ TB per hour
Hybrid storage architecture
Write Optimized Store (WOS), fed by trickle loads
• Memory based
• Unsorted / Uncompressed
• Segmented
• Low latency / Small quick inserts
TUPLE MOVER: asynchronous data transfer from WOS to ROS
Read Optimized Store (ROS)
• On disk
• Sorted / Compressed
• Segmented
• Large data loaded direct
[Figure: columns A, B, C flow from the WOS through the tuple mover into the ROS projection (A B C | A).]
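A minimal sketch of the hybrid storage idea, assuming a much simpler model than the real engine: small inserts land in an unsorted in-memory buffer, large loads go straight to sorted containers, and a mover step (called by hand here, whereas the slide describes it as asynchronous) sorts the buffer into a new container. The class and method names are hypothetical.

```python
class HybridStore:
    """Toy model of a write-optimized buffer plus read-optimized containers."""

    def __init__(self):
        self.wos = []  # in memory, unsorted
        self.ros = []  # list of sorted, immutable containers

    def trickle_insert(self, row):
        """Low-latency insert: append to the in-memory buffer."""
        self.wos.append(row)

    def bulk_load(self, rows):
        """Large load: bypass the buffer and write a sorted container directly."""
        self.ros.append(tuple(sorted(rows)))

    def move_out(self):
        """Tuple-mover step: sort buffered rows into a new container."""
        if self.wos:
            self.ros.append(tuple(sorted(self.wos)))
            self.wos = []

    def scan(self):
        """Queries see both stores, so freshly inserted rows are visible at once."""
        for container in self.ros:
            yield from container
        yield from self.wos
```

The point of the split is that a query never has to wait for the mover: `scan` returns buffered rows together with rows already moved to sorted containers.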
Streaming Load from Apache Kafka
• Vertica loads continuously, consuming from Kafka
• Near real-time (seconds)
• High volume: 2TB/hr on 3 node cluster
• Exactly-once (fault tolerant)
• JSON, Avro data formats
• CLI for easy setup
• In-database monitoring
• Vertica can also produce query results to Kafka
[Figure: a CLI and a load scheduler; three Kafka boxes and three Vertica nodes, each with a Kafka plugin, connected by Load and Export paths.]
P2P Massively Parallel Processing (MPP)
Parallel design leverages data projections to enable distributed storage and workload
• "Active" redundancy
• Automatic replication, failover and recovery
Shared-nothing, grid-based architecture provides scalability on clusters of commodity servers
• Add nodes to achieve optimal capacity and performance
Nodes are Peers
• No specialized nodes
• Query/Load to any node
• Continuous / real-time load and query
[Figure: Node 1, Node 2 and Node 3, each with 2 x 12 cores, 128+ GB RAM and 20 TB, attached to a client network and a private data network.]
Concurrency & Workload Management
No leader node bottleneck!
• Query initiation work is evenly distributed between cluster nodes
• Concurrency scales as nodes are added to the cluster
Configurable resource manager
• Resource pools for different query workloads
• Limit or guarantee resource availability to targeted query workloads
• Set per-pool runtime priority, concurrency limits and runtime limits, and resource allocation guidelines
• Resource limits per pool, per user, or per session
Real-time administrative capabilities
• Adjust query priority in real-time
• Kill runaway queries
[Figure: resource pools labeled Tactical, General and Analytic, with Analytic User 1 and Analytic User 2.]
Native High Availability
RAID-like functionality within database
• Projections are organized so if a node fails a copy is available on one of the surviving nodes
• Automatically stores redundant data sets for query performance gains as well
Always-on queries & loads
• No need for manual log-based recovery
• System continues to load and query when nodes are down
• Recover missing data by querying other nodes
• Nodes recover individual tables
• No longer binary "node up or still recovering"
• Prioritize table recovery order
[Figure: projection segments A1 to A3, B1 to B3 and C1 to C3, each appearing twice across the cluster.]
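One way to picture "a copy is available on one of the surviving nodes" is a ring-style placement, sketched below. The ring offset is an assumption made for illustration; the slide does not spell out the exact layout, and the function names are hypothetical. Each segment is stored on its home node and on the next `k_safety` nodes around the ring.

```python
def place_segments(num_nodes, k_safety=1):
    """Map each node to the segments it stores: home copy plus k_safety extra copies."""
    placement = {node: [] for node in range(num_nodes)}
    for segment in range(num_nodes):
        for copy in range(k_safety + 1):
            placement[(segment + copy) % num_nodes].append(segment)
    return placement


def surviving_segments(placement, failed_nodes):
    """Return the set of segments still readable after the given nodes fail."""
    return {
        segment
        for node, segments in placement.items()
        if node not in failed_nodes
        for segment in segments
    }
```

With three nodes and one extra copy, losing any single node leaves every segment readable, while losing two adjacent nodes does not. The same redundant copies can also be given a different sort order, which is how the slide's "query performance gains" come about.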
Elastic Cluster Scale-Out
Simple process to add more servers
• Add nodes to increase performance or capacity
• Vertica automatically redistributes data in the background
No database downtime
• Database continues to support query requests while rebalance is in progress
High performance redistribution
• Elastic cluster and local segmentation enable fast cluster scaling
• E.g. one customer expanded their 11 TB database cluster from 16 nodes to 32 nodes in 65 minutes!
Terrace Routing
• Efficient resource usage for larger clusters
• Shuffle within rack, then cross rack
• Potential ½ memory, 20% performance boost
Flexible Backup / Restore
File-based backup / restore utility
Full or incremental backup
• Copy only files that have changed since previous backup run
Hot backup
• No lock contention with active database operations
Configurable backup options
• Configurable mappings from Vertica nodes to backup server(s)
• Optional encryption between database and backup locations
• Configurable number of restore points
Object-level backup / restore
• Configure backups per application / user / schema to meet individual SLAs
[Figure: three backup topologies over an IP network: N to One, N to N, N to M.]
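The incremental rule ("only files that have changed since previous backup run") can be sketched as a comparison of two manifests. This is a generic illustration, not the behavior of Vertica's backup utility: the manifest format (file name mapped to a content digest) and the function name are assumptions.

```python
def files_to_copy(current_manifest, previous_manifest):
    """Return files that are new or whose digest differs from the previous run."""
    return sorted(
        name
        for name, digest in current_manifest.items()
        if previous_manifest.get(name) != digest
    )
```

A full backup is the special case where the previous manifest is empty, so every file is selected. Because storage files in a columnar store are typically written once and then left alone, most files match between runs, which is what makes the incremental case cheap.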
“Excavator” Backup / Restore Improvements
Customers:
– Performance improvement up to several orders of magnitude
– Remove need for file system hard linking
– Object restore from full backup
– Can implement active/active DR
[Figure: bar chart of backup time in minutes (0 to 400; shorter bars are better) for full backup and minimal incremental backup, comparing 7.1.2 with the pre-release.]
[Figure: Replicate schema (store, sales, customers, products) from the Vertica primary DB in the New York data center to the Vertica secondary DB in the Chicago data center.]
Vertica Management Console
• Manage multiple clusters from a single web-based console
• Real-time view of database activity and cluster status
• Correlate system and database activity
[Figure: browser access to the Vertica Management Console, which manages Cluster 1, Cluster 2 and Cluster 3.]
Additional Features in “Excavator”
Directed Queries
– If you like your plan, you can keep it
– Save plan for a query, use it later
– Directly inform optimizer how to execute query
– Helpful during upgrade
– Can force inner/outer in hash joins
Automatic Eviction of Slow Nodes
– "Bad" or "Hung" node impairs the cluster
– Configurable heartbeat interval to evict
– Immediate take-over by standby node
Primary Key Enforcement
– Optionally validate keys on load
– Performant due to auto projection design
– Precursor to many optimizer features
– Also validates Unique constraints
Operational Improvements
– Explicit move query between resource pools
– Optimizer support for live-aggregate projections
– Copy partitions between tables
– Admintools SSH handling improvements
– Mechanism for preserving tables of dropped users
– CUBE support
– Explain plans in JSON format
Vertica Analytics
A Rich Analytics Platform
ANSI SQL
• Window functions
• Statistical
Benefits: standard functionality that performs at scale
HPE Vertica Extensions
• Pattern matching
• Event series joins
• Time series
Benefits: sessionization, conversion analysis, fraud detection
SDKs
• C++
• R
• Java
• Python
Benefits: Monte Carlo simulation, custom data mining, XML/JSON parsers and lots more
See online doc
• Analytic functions - http://tinyurl.com/v-analytic-functions
• SDK - http://tinyurl.com/v-analytic-SDK
SQL Analytics+ - Built for Big Data
Features
• Time series gap filling and interpolation
• Event window functions and sessionization
• Social Graphing
• Pattern matching
• Event series join
• Statistical functions
• Geospatial functions
Benefits
• High performance (Keep Data close to CPU)
• Low cost (Industry Standard building blocks)
• Ease of use (Automated + Available)
Use Cases
• Tickstore data cleanups
• CDR/VOD data analysis
• Clickstream sessionization
• Data aggregation and compression
• Monte Carlo simulation
• Graph algorithms
• Sensor Data
• Process Control Time Series
• SmartGrid
• …
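Two of the listed features, sessionization and gap filling with interpolation, are easy to show in miniature. The sketch below is plain Python written for illustration and does not use or reproduce Vertica's SQL functions. It assumes integer timestamps, input sorted by time, and (for gap filling) observations that fall on the sampling grid; the function names are hypothetical.

```python
def sessionize(timestamps, timeout):
    """Assign a session id per event; a gap longer than timeout starts a new session."""
    session_ids = []
    current = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous > timeout:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


def gap_fill_linear(points, step):
    """Emit one sample every step units, linearly interpolating between observations."""
    filled = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        t = t0
        while t < t1:
            filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
            t += step
    filled.append(points[-1])
    return filled
```

For clickstream data, `sessionize` with a 30-unit timeout splits events at 0, 10, 25, 100, 110 into two sessions. For tick or sensor data, `gap_fill_linear` turns irregular observations into an evenly spaced series that can be joined or aggregated with other series.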
Analytics Using HPE Vertica
Graph Analytics (HPE Vertica – R UDFs)
Network centrality metrics for Twitter profiles using the "igraph" package of R
• Betweenness centrality
• Closeness centrality
• Eigenvector centrality
• Clustering coefficient
Text Analytics (HPE Vertica – C++ UDFs)
Tweet text processing and mining using C++ functions
• @mention mining
• Retweet handle mining
• #tag mining
Statistical Modeling (HPE Vertica – R UDFs)
Statistical scorecard based modeling
• Metric normalization
• Outlier treatment
• Structural equation modeling
• Weighted scoring
Application Integration
Integration with 3rd Party ETL and BI Tools
Simple and seamless integration to existing BI and ETL tools
• Vertica supports ODBC, JDBC, ADO.NET, and most ETL, BI, and visualization products
Leverage existing investments and lower TCO
[Figure: ETL, replication and data quality tools feed bulk & trickle loads into the HPE Vertica database; analytics and reporting tools connect through SQL, ODBC and JDBC.]
Hadoop Integration
Combining the Power of Vertica and Hadoop
Vertica (HPE Vertica Optimized Storage)
• Designed for Performance
• Interactive Analytics
• A rich SQL ecosystem
Hadoop
• Designed for Fault-tolerance
• Batch Analytics
• A rich Programming Model
Both: Purpose-built, Scalable Analytics Platforms
Core HPE Vertica Engine
Known for:
– High performance Columnar RDBMS
– Scale Concurrency
– ANSI SQL Completeness + more
– Modular Distributed Scaling
– Predictable Query Execution
– Secure
– HA and Resource utilization
[Figure: engine stack, top to bottom: Partner Ecosystem; Full ANSI SQL; Optimized Plan; Distributed MPP Execution; ROS Columnar Format, Encoded/Compressed; EXT4, Built-in HA, Resource Management; HPE Big Data Reference HW Architectures, x86, Cloud.]
HPE Vertica for SQL on Hadoop
A new member of the Vertica product family
– Explore the data where it lives
– Query the data regardless of format or structure
– Runs on any flavor of Hadoop
– Enterprise grade reliability and manageability
– Simplify the ecosystem with a single query engine
[Figure: stack, top to bottom: Partner Ecosystem; Full ANSI SQL; Optimized Plan; Distributed MPP Execution; Hadoop Open Formats (ORCFile, Parquet, Avro, JSON), Encoded/Compressed; HDFS; HPE Big Data Reference HW Architectures, x86, Cloud.]
Pushing the performance envelope
Native ORC Reader
– Open source project in collaboration with Hadoop community
– Locality – query where data resides
– Column Pruning
– Predicate Pushdown
[Figure: stack, top to bottom: Partner Ecosystem; Full ANSI SQL; Optimized Plan; Distributed MPP Execution; ORC Reader; ORCFile, Encoded/Compressed; HDFS; HPE Big Data Reference HW Architectures, x86, Cloud.]
NET: Even faster than before!
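Column pruning and predicate pushdown can be sketched with a toy file made of stripes, each carrying per-column minimum and maximum statistics. This is an illustration of the two techniques only, not the ORC reader's code: the in-memory stripe layout, the sample data and the function name are all assumptions.

```python
# Each stripe holds column data plus (min, max) statistics for the price column.
STRIPES = [
    {
        "stats": {"price": (37.03, 37.13)},
        "columns": {"symbol": ["BBY", "BBY"], "exchange": ["NYSE", "NYSE"], "price": [37.03, 37.13]},
    },
    {
        "stats": {"price": (143.74, 143.75)},
        "columns": {"symbol": ["AAPL", "AAPL"], "exchange": ["NYSE", "NYSE"], "price": [143.74, 143.75]},
    },
]


def scan(stripes, wanted, column, lower_bound):
    """Return the wanted columns for rows where column > lower_bound."""
    rows = []
    stripes_read = 0
    for stripe in stripes:
        _, highest = stripe["stats"][column]
        if highest <= lower_bound:
            continue  # predicate pushdown: statistics rule out the whole stripe
        stripes_read += 1
        data = stripe["columns"]
        for i, value in enumerate(data[column]):
            if value > lower_bound:
                # column pruning: only the requested columns are materialized
                rows.append(tuple(data[name][i] for name in wanted))
    return rows, stripes_read
```

A query for symbols priced above 100 reads one of the two stripes and never touches the exchange column, which is where the savings on large files come from.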
Map/Reduce and HDFS Connectors
Load data from HDFS directly into Vertica using the HDFS Connector
Use the Map/Reduce Connector to stream data directly between Vertica and your M/R job
Hadoop / Vertica: Advanced Analytics
[Figure: a MapReduce / Pig job with Map tasks over DFS Blocks 1 to 3 and a Reduce task, connected to Vertica nodes.]
HPE Vertica for SQL on Hadoop features and benefits
Query data, no matter where it is located
– Install HPE Vertica directly on your Hadoop infrastructure
– Uses same architecture as HPE Vertica
– Ingest open source formats - AVRO, JSON, etc.
– Store in highly optimized ROS, ORC, or Parquet (future)
– Query pre-existing data lakes
– Experience full-functionality ANSI SQL 99
– Run 100% of TPC-DS queries
– Support for leading Hadoop distros (Hortonworks, MapR and Cloudera)
– Proven enterprise-grade scalability and reliability via innovative "no helper node" architecture that eliminates a key single point of failure
[Figure: analytical applications (R, Java, Python, SQL) on top of the HPE Vertica Core Engine. Ingest: AVRO, JSON, etc. Store: ROS. Query: ORC & Parquet.]
HPE Vertica + Hadoop: Joint Use Cases
Hadoop for ETL, Vertica for Analytics
• Log parsing / tagging / filtering
• Convert JSON into relational tuples
HDFS for Storage, Vertica + Hadoop for Analytics
• Real-time analytics on Vertica (needs speed)
• Long-running / exploratory analytics on Hadoop (needs fault tolerance)
• Load from HDFS directly to Vertica
• Vertica SQL access to HDFS
Vertica for Storage and Analytics, Hadoop as a multi-purpose tool
• Hadoop as a scheduler / load-balancer
• Hadoop to convert to formats for other tools (e.g. STATA)
• Hadoop for Backup via Sqoop
Flex Zone
HPE Vertica Flex Zone
Challenge:
Exploring varying and semi-structured data is time consuming and error prone
Solution:
HPE Vertica Flex Zone
Benefits:
• Cost-effective way to store and explore semi-structured data
• Skip creation and maintenance of time-consuming schemas
• One-step performance gains by seamlessly moving Flex Zone data into Vertica
[Figure: Flex Zone tables (store and explore) and columnar tables (daily analytics) inside Vertica Analytics, feeding visualization.]
HPE Vertica Flex Zone
A new approach to loading, managing, and exploring semi-structured data
Features
• Auto-schematization for simple, semi-structured data loading
• Flexible parsers for JSON, Avro, Regex and delimited data
• Faster SQL querying on semi-structured data
• One-step schema for blazing-fast performance
Benefits
• Avoid time-consuming schema development
• Optimize selected data easily for accelerated performance
• Rely on single system for structured & semi-structured data
• Respond quickly to changing business requests
• Simplify the ETL process
• Visualize semi-structured data using existing BI tools
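The idea of loading first and deciding on a schema later can be sketched as below. This is a conceptual illustration, not how Flex Zone is implemented: records are parsed from JSON lines without a declared schema, the set of keys is discovered from the data, and selected keys are then promoted to fixed columns. The function names and sample records are hypothetical.

```python
import json


def load_flex(lines):
    """Parse JSON lines as-is; no schema is declared up front."""
    return [json.loads(line) for line in lines if line.strip()]


def discover_columns(records):
    """Collect every key seen in the data, in first-seen order."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def materialize(records, columns):
    """Promote selected keys to fixed columns; missing values become None."""
    return [tuple(record.get(name) for name in columns) for record in records]
```

Given two click records where only one has a `page` field, `discover_columns` reports all three keys, and `materialize` produces regular tuples for just the keys worth optimizing, which mirrors the "optimize selected data" benefit listed above.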
HPE Vertica 7 – Turbocharging Hadoop
Semi-Structured Data: load, store and explore with Auto-Schematization™
Hadoop Data: load and explore from Pig, MapReduce, HDFS or HCatalog
Vertica SQL: query with full, standard SQL
Data Visualization: gain insight from all data using popular BI tools
Serve: operationalize data with one step
[Figure labels: HPE Vertica, Flex Zone, Enterprise Hadoop Storage.]
And Much More!
Feature Highlights
Improved Manageability
Enhanced Core Platform
and Security
– HPE Vertica Flex Zone
– Expanded Workload Manageability
– Kerberized HDFS Connector
– HCatalog integration
– Intelligent Installation and Load Balancing
– Kerberized Client Drivers
– Data Collector Retention
– Integrated Spread Installation and Configuration
– Key / Value interface
– Parallelized Loader
– Large data types: Long Strings
– Rejection Data as a Table
– EC2 AMI Instances
– Database Designer API
– Improved Control and Performance of ODBC Drivers
– Secure Password Protection
– JDBC 4.0 Support
– Java SDK
– Improved Networking Infrastructure
– Faster Query Performance with USDF Predicates
– Optimized Merge
Vertica Technology - Summary
Next-generation platform core
• Columnar storage and execution
• MPP architecture scales to petabytes
Simple and flexible deployment, enterprise-class manageability
• Deploy in x86 hardware environments, scale to petabytes
• Robust backup, high availability, disaster recovery, and security solutions
• Service a range of workloads and users in one analytics hub
Advanced in-database analytics
• Native SQL extensions for common analytics use cases
• UDx framework for high-performance custom analytics
Seamless ecosystem integration
• Continue to leverage existing BI, ETL, and Hadoop investments
• Build custom plug-ins to load natively from varied data sources
HPE Vertica Express and Premium edition details
Express ($15k/TB): Analytics at Scale
Premium ($25k/TB): No Limits, No Compromises

Feature                                                                   Express   Premium
MPP Architecture                                                          ✓         ✓
High Availability                                                         ✓         ✓
Role-Based Security                                                       ✓         ✓
User Function Creation (UDx)                                              ✓         ✓
Standard SQL (ANSI 99)                                                    ✓         ✓
Flex Tables                                                               ✓         ✓
Workload Analyzer, DB Designer, Mgmt Console                              ✓         ✓
Elastic Cluster                                                           ✓         ✓
Advanced SQL Analytics (Time Series, SQL Windowing, Gap Filling, more)*   ✓         ✓
Fault Groups                                                                        ✓
Key Value Interface                                                                 ✓
Sentiment, Geospatial, R Extensions                                                 ✓
Column Security                                                                     ✓
Live Aggregate & Pre-Join Projections                                               ✓
Thank you