The Vertica Database - simply fast
Extreme Analysis of Structured Data
© Copyright 2012 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice.
The challenges of Big Data
Volume, Variety, Velocity, Complexity
• Worldwide information volume is growing at a minimum rate of 59% annually.
• A large share of currently deployed data warehouses will not scale sufficiently to meet new information volume and complexity demands by 2014.
• 70% to 85% of data is "complex mixed data types."
• Only 2% of corporations can deliver the right information, at the right time, to support enterprise outcomes all of the time.
* Source: Gartner, Coleman Parkes, October 2011
Michael Stonebraker
No Limits, No Compromises
Conceived by legendary database guru Michael Stonebraker, the HP Vertica Analytics Platform was purpose-built from the first line of code for Big Data analytics.
Why? Because it was clear that data warehouses and "business-as-usual" practices were limiting technologies, forcing businesses to make painful compromises.
Vertica was designed with speed, scalability, simplicity, and openness at its core, and architected to handle analytical workloads via a distributed, compressed, columnar architecture.
• Main architect of the INGRES relational DBMS and of POSTGRES
• Founder of several venture-capital-backed startups: Ingres Corporation, Illustra Information Technologies (acquired by Informix Corporation), StreamBase Systems, and Vertica
Big Data «5Vs»
Break the walls of current technologies: Volume, Variety, Velocity, Value... = VERTICA, a real-time data analytics platform purpose-built for Big Data.
50x-1,000x faster performance at 30% of the cost, proven by hundreds of customers & OEMs.
Database and Analytics Architecture
[Architecture diagram] Sources (CRM, Order, ERP, Finance, unstructured data) feed the data integration layer (select, extract, transform, integrate & load), which populates the Data Management Platform: transactional/OLTP systems, the Enterprise Data Warehouse, and Data Marts. This platform drives transactions and manages and stores information. On top of it, business reporting and analytic applications (visualization) generate insight: reports, apps, OLAP, executive dashboards, and analytics (reporting, dashboards, OLAP, information delivery).
Big Data opportunities across industries and use cases
Big Data use cases are business-driven and cut across a wide range of industries and functions.
Industry examples:
• Finance and Government: fraud detection, law enforcement, anti-money laundering, counter-terrorism, risk management, traffic flow optimization
• Telecom, Manufacturing, and Energy: broadcast monitoring, supply chain optimization, weather forecasting, churn prevention, defect tracking, advertising optimization, RFID correlation, natural resource exploration, warranty management
• Healthcare: drug development, scientific research, evidence-based medicine
Horizontal use cases:
• Churn mitigation
• Social media analytics
• Logistics optimization
• Cross-sell and up-sell
• Pricing optimization
• Clickstream analysis
• Loyalty & promotion analysis
• Customer behavior analysis
• Influencer analysis
• Revenue assurance
• IT infrastructure analysis
• Web application optimization
Sources: IDC (2012), "Worldwide Big Data Technology and Services Forecast: 2011-2015"; Gartner (2012), "Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016"
Telecommunications
• 7 of the top 10 global telecommunications firms run their business on HP Vertica
• Revenue & service assurance and fraud detection
• Sensor & device management and performance monitoring
• Subscriber insights and targeted marketing and advertising
"HP Vertica opened doors to analyses that otherwise were too time-intensive or impossible. A larger team of business managers now have faster, easier access to more information. That knowledge is invaluable in an aggressively competitive market like ours."
- Brian Harvell, Executive Director, Comcast Network Operations
COMCAST
Key Stats: scaling to hundreds of TB; 5x DL380 cluster; inserts 46,000 new rows of SNMP data per second, 24x7 (5.5 TB/year)
Comcast is the largest cable communications company (CSP) in the United States, serving tens of millions of consumer and enterprise customers.
Importance of data: Comcast's network has millions of components, and there are billions of metrics that could indicate a potential service interruption or other problem. Network quality drives customer experience management (CEM), and better CEM reduces churn.
Challenge:
• Load 50K+ samples per second
• Query response times of 1-2 seconds
• Annual detail views, not just weekly
• Deliver at least 10:1 data compression
• Scale to accommodate 40+ terabytes (TB) of data using standard hardware
Business benefits: the minimum requirements were exceeded:
• Compression improved from the required 10:1 to 10.7:1
• Queries processed in under a second
• Room for growing requirements: a sustained data insertion rate of more than 130K rows/sec
Competitive landscape: open-source alternatives were not scalable; Vertica proved faster than other column-store DB technologies and came with a high level of support (working on delivering even higher compression).
Facebook
Key Stats: hundreds of PB; 500 TB of new data per day
Facebook is the leading social website focused on connecting the world, with the largest database in the world, driving revenue from data through targeted online marketing.
Importance of data: "At Facebook, we move incredibly fast. It's important for us to be able to handle massive amounts of data in a respectful way without compromising speed, which is why HP Vertica is such a perfect fit."
- Tim Campos, CIO, Facebook
Challenge: increase revenue from information, through a massive volume and variety of queries, by profiling people with the right advertising campaigns. Six benchmark queries took a day to run in Hadoop; Exadata could not scale up; Teradata was too expensive.
Business benefits:
• The same 6 queries now run in 1 minute, not days
• Growth had been fuelled by advertising income, which leapt 66 per cent year-on-year
• Facebook did not have any mobile advertising revenue 18 months ago
Competitive landscape: Exadata could not scale, Teradata was too expensive, and Hadoop alone was not enough.
Zynga
World's leading social game provider, growing rapidly with web and mobile third-party games on the Zynga Platform.
Key Stats: ~60 billion rows/day; ~10 TB of semi-structured data daily; ~1.5 PB of source data; largest cluster of 230 2U nodes; ~260M monthly active users (MAUs) and ~60M average daily active users (DAUs) worldwide
Importance of data: key metrics include the viral coefficient (users likely to cause their friends to sign up) and revenue per user. The analytics loop: 1. Capture events, 2. What happened, 3. Why did it happen, 4. Create advantage.
Challenge: a churn rate of 50% per month. The first thing the Zynga team did was evaluate graph engines (dedicated software for graph analysis); however, none of the solutions they evaluated would operate at the necessary scale or performance.
Business benefits: make every aspect of the game more profitable by improving the player experience significantly.
2,500+ customers
MPP-Columnar DBMS With a Unique Combination of Innovations
• Column orientation
• Advanced encoding: minimize I/O using 14+ algorithms
• MPP (massively parallel processing): native DB-aware clustering on low-cost x86 Linux nodes
• High availability: built-in redundancy that also speeds up queries
• Automatic database design: automatic setup, optimization, and DB management
• Standard SQL interface: leverages existing BI, ETL, Hadoop/MapReduce, and OLTP investments
• No disk I/O bottleneck: load and query simultaneously
Column Store – Column-Based Disk I/O
I have a table with every test score for every US student for the last twenty years. How can I provide sub-second query response times?

    select avg( Score )
    from example
    where Class = 'Junior'
      and Gender = 'F'
      and Grade = 'A'

[Figure: side-by-side comparison on a small sample of rows. The column store reads only the 3 columns referenced in the predicates, while the row store must read all columns of every row.]
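As a concrete sketch, the "example" table implied by these slides might be declared as follows; the column names come from the slides, while the data types and the DDL itself are assumptions added for illustration.

    -- Hypothetical DDL for the "example" table used throughout these slides.
    CREATE TABLE example (
        Student_ID INTEGER,
        Name       VARCHAR(64),
        Gender     CHAR(1),
        Class      VARCHAR(16),
        Score      INTEGER,
        Grade      CHAR(1)
    );

    -- The query from the slide: a column store only touches the
    -- Class, Gender, Grade, and Score columns on disk.
    SELECT AVG(Score)
    FROM example
    WHERE Class = 'Junior'
      AND Gender = 'F'
      AND Grade = 'A';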
Column Store – Sort and Encode for Speed
Original fact table… billions of rows:

    Student_ID  Name                 Gender  Class      Score  Grade
    1256678     Cappiello, Emilia    F       Sophomore  62     D
    1254038     Dalal, Alana         F       Senior     92     A
    1278858     Orner, Katy          F       Junior     76     C
    1230807     Frigo, Avis          M       Senior     64     D
    1210466     Stober, Saundra      F       Junior     90     A
    1249290     Borba, Milagros      F       Freshman   96     A
    1244262     Sosnowski, Hillary   F       Junior     68     D
    1252490     Nibert, Emilia       F       Sophomore  59     F
    1267170     Popovic, Tanisha     F       Freshman   95     A
    1248100     Schreckengost, Max   M       Senior     76     C
    1243483     Porcelli, Darren     M       Junior     67     D
    1230382     Sinko, Erik          M       Freshman   91     A
    1240224     Tarvin, Julio        M       Sophomore  85     B
    1222781     Lessig, Elnora       F       Junior     63     D
    1231806     Thon, Max            M       Sophomore  82     B
    1246648     Trembley, Allyson    F       Junior     100    A

Step 1: store the columns used in predicates first (Gender, Class, Grade, Score, Name, Student_ID); correlated values are "indexed" by the preceding column values.

Step 2: sort the rows by Gender, Class, and Grade. Each predicate column now consists of a few long runs of repeated values, and the correlated values remain "indexed" by the preceding column values.

Step 3: run the example query: select avg( Score ) from example where Class = 'Junior' and Gender = 'F' and Grade = 'A'. Only four I/Os are needed: the 1st I/O reads the entire (now tiny) Gender column; the 2nd I/O reads the Class column starting at the offset of the Gender = 'F' run; the 3rd I/O reads the Grade column at the offset of the Class = 'Junior' rows; the 4th I/O reads just the Score values at the qualifying positions.
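A hedged sketch of how this sorted, predicate-first layout could be expressed as a Vertica projection. The projection name, segmentation clause, and encoding choices below are illustrative assumptions, not taken from the slides.

    -- Illustrative projection: predicate columns first, rows sorted by them,
    -- low-cardinality columns run-length encoded.
    CREATE PROJECTION example_by_gender_class_grade (
        Gender  ENCODING RLE,
        Class   ENCODING RLE,
        Grade   ENCODING RLE,
        Score,
        Name,
        Student_ID
    ) AS
    SELECT Gender, Class, Grade, Score, Name, Student_ID
    FROM example
    ORDER BY Gender, Class, Grade
    SEGMENTED BY HASH(Student_ID) ALL NODES KSAFE 1;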
Advanced Compression
Vertica replaces slower disk I/O with fast CPU cycles through aggressive compression:
• It uses properties of the data such as sort order and cardinality
• It operates across large numbers of rows
• Data can be operated upon without decoding it first
• Late materialization: data is decoded intelligently, but as late as possible
• No hidden costs

Encoding Mechanism
[Figure: a sample table with Transaction Date (repeated 5/05/2009 values), Customer ID (small ascending integers), and Trade price (values around 100.25-101.25) columns, each matched to an encoding.]
• Few values, sorted → RLE
• Many values, integer, maybe sorted → DeltaVal
• Many values, sorted → GCD
• Many others… (14+ algorithms in total)

Just-In-Time Decoding
• Disk: encoding + compression
• Network: encoded blocks + optional LZO compression
• Buffer pool: decompress only
• Engine: works directly on encoded blocks; results are decoded just in time
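A hedged sketch of how column encodings like these can be requested explicitly in Vertica DDL. The table and column names are invented for illustration; in practice the Database Designer normally chooses encodings automatically.

    -- Illustrative only: explicitly chosen encodings on a hypothetical trades table.
    CREATE TABLE trades (
        transaction_date DATE          ENCODING RLE,       -- few values, sorted
        customer_id      INTEGER       ENCODING DELTAVAL,  -- many integer values
        trade_price      NUMERIC(8,2)                      -- let Vertica choose (AUTO)
    );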
Massively Parallel Processing (MPP)
A shared-nothing, grid-based DB architecture delivers scalability on industry-standard hardware:
• Designed to scale outwards
• Automatic replication, failover, and recovery
• Nodes can be added ONLINE to optimise capacity and performance
All nodes are peers, connected by a client network and a private data network: there are no specialized nodes, queries and loads can be sent to any node, and continuous/real-time load and query are supported. A typical node has 2 x 6- or 8-core CPUs, 96+ GB of RAM, and 4-10 TB of storage.
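Because every node is a peer, cluster membership and node state can be inspected from any node via the catalog; a minimal sketch, assuming the standard Vertica system table v_catalog.nodes:

    -- Quick health check: every node should report state 'UP'.
    SELECT node_name, node_state
    FROM v_catalog.nodes
    ORDER BY node_name;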
Native High Availability
• RAID-like functionality within the database
• Projections are distributed amongst the nodes for redundancy
• No need for manual log-based recovery
• Vertica continues to load and query while a node is down
• Missing data is recovered from the other nodes within the cluster
[Figure: a Vertica 3-node cluster in which each node stores its own segments (A1/B1/C1, A2/B2/C2, A3/B3/C3) plus redundant copies of another node's segments.]
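A hedged sketch of how K-safety, the fault-tolerance level behind this redundancy, is declared and then checked. MARK_DESIGN_KSAFE is the standard Vertica function; treat the verification query as illustrative.

    -- Declare that the physical design tolerates the loss of one node (K=1).
    SELECT MARK_DESIGN_KSAFE(1);

    -- Verify the designed vs. current fault tolerance of the cluster.
    SELECT designed_fault_tolerance, current_fault_tolerance
    FROM v_monitor.system;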
Automatic Design & Administration
The Vertica Database Designer recommends an optimised physical database design that delivers the best performance for the user's query needs:
• It minimizes the DBA effort required for essential physical database design
• It runs and deploys ONLINE, without impacting existing processing
The DBA provides:
• The logical schema (CREATE TABLE statements)
• A "sample set" of typical queries and sample data
• The fault-tolerance level (k-safety)
The Database Designer generates:
• A physical schema and compression choices that make the queries in the sample set run fast, fit within trickle-load requirements, and ensure that all SQL queries can be answered
[Figure: two projections over columns A, B, C with different column and sort orders, e.g. (A B C | A) and (B A C | B A).]
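In later Vertica releases the Database Designer can also be driven programmatically through SQL functions; a hedged sketch of such a session, where the design name, table, and file paths are invented and the exact function names may differ by version:

    -- Illustrative Database Designer session (names are hypothetical).
    SELECT DESIGNER_CREATE_DESIGN('scores_design');
    SELECT DESIGNER_ADD_DESIGN_TABLES('scores_design', 'public.example');
    SELECT DESIGNER_ADD_DESIGN_QUERIES('scores_design', '/home/dbadmin/typical_queries.sql');
    SELECT DESIGNER_SET_DESIGN_KSAFETY('scores_design', 1);
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY(
        'scores_design',
        '/home/dbadmin/scores_design_projections.sql',
        '/home/dbadmin/scores_design_deployment.sql');
    SELECT DESIGNER_DROP_DESIGN('scores_design');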
Standard SQL Interface
Vertica supports ANSI SQL-99 plus analytics to minimise the integration effort with existing BI and ETL tools:
• ANSI SQL-99 + analytics
• Simple integration: bulk & trickle loads via SQL, ODBC, and JDBC
• Vertica's Hadoop Connector
• Database connectors for JDBC, ODBC, and ADO.NET
• Works with ETL, replication, and data-quality tools as well as analytics and reporting tools
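Because the interface is plain SQL, existing BI-style queries, including SQL-99 analytic functions, run unchanged; a small sketch against the example table assumed in the earlier slides:

    -- Standard SQL-99 analytics: rank students within each class by score.
    SELECT Class,
           Name,
           Score,
           RANK() OVER (PARTITION BY Class ORDER BY Score DESC) AS class_rank
    FROM example
    ORDER BY Class, class_rank;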
Real-time Analytics
Real-time analytics on large volumes of data is a reality for the Vertica database, thanks to a hybrid storage structure: concurrent load and query are enabled by an asynchronous "Tuple Mover" process.
• Write Optimized Store (WOS): memory-based, unsorted/uncompressed, segmented, with low latency for small, quick inserts. Trickle loads, inserts, deletes, updates, and up-to-date queries work against the latest epoch here.
• Read Optimized Store (ROS): on disk, sorted/compressed, segmented. Large data loads go direct to the ROS, and historical queries against closed epochs run without locks.
• Tuple Mover: asynchronously transfers data from the WOS to the ROS. The current epoch is advanced monotonically (user defined).
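A hedged sketch of the two load paths in SQL: a small trickle insert lands in the WOS by default, while a bulk COPY with the DIRECT option bypasses it and writes straight to the ROS. The file path and inserted row are illustrative.

    -- Trickle load: a small insert goes to the memory-based WOS
    -- and is immediately visible to up-to-date queries.
    INSERT INTO example VALUES (1300001, 'Doe, Jane', 'F', 'Junior', 88, 'B');

    -- Bulk load: COPY ... DIRECT writes straight into the on-disk ROS,
    -- skipping the WOS for large batches (path is hypothetical).
    COPY example FROM '/data/scores_2012.csv' DELIMITER ',' DIRECT;

    COMMIT;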
SQL Analytics+ - Built for Big Data
Features
• Time series gap filling and interpolation
• Event window functions and sessionization
• Social graphing
• Pattern matching
• Event series joins
• Statistical functions
• Geospatial functions
Benefits
• High performance (keep data close to the CPU)
• Low cost (industry-standard building blocks)
• Ease of use (automated + available)
Use Cases
• Tickstore data cleanup
• CDR/VOD data analysis
• Clickstream sessionization
• Data aggregation and compression
• Monte Carlo simulation
• Graph algorithms
• Sensor data
• Process-control time series
• SmartGrid
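A hedged sketch of the time-series gap-filling and interpolation feature, using Vertica's TIMESERIES clause; the ticks table and its columns are invented for illustration.

    -- Fill gaps in irregular tick data onto a 1-second grid and
    -- linearly interpolate the bid price (table/columns are hypothetical).
    SELECT symbol,
           slice_time,
           TS_FIRST_VALUE(bid, 'LINEAR') AS interpolated_bid
    FROM ticks
    TIMESERIES slice_time AS '1 second'
    OVER (PARTITION BY symbol ORDER BY trade_ts);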
User Defined Extensions in R
What is R?
• An open-source language for statistical computing
• A wide range of packages is available for advanced data mining and statistical analysis
Advantages of a UDx in R
• Vertica automatically parallelizes the execution of user-defined R code across the Vertica cluster
• Optimized data transfer between Vertica and R
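A hedged sketch of how an R UDx is registered and invoked from SQL once the R language pack is installed; the library path, factory name, and function are invented for illustration.

    -- Register an R source file as a library and expose one of its
    -- factory functions as a SQL-callable function (names are hypothetical).
    CREATE OR REPLACE LIBRARY r_stats AS '/home/dbadmin/udx/score_stats.R' LANGUAGE 'R';

    CREATE OR REPLACE TRANSFORM FUNCTION score_kmeans
        AS LANGUAGE 'R' NAME 'scoreKmeansFactory' LIBRARY r_stats;

    -- Vertica parallelizes the R code across partitions of the input.
    SELECT score_kmeans(Score) OVER (PARTITION BY Class) FROM example;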
Combining the Power of Vertica and Hadoop
Both are purpose-built, scalable analytics platforms:
• Vertica: designed for performance, interactive analytics, and a rich SQL ecosystem
• Hadoop: designed for fault tolerance, batch analytics, and a rich programming model
Integration via MapReduce, HDFS, and HCatalog.
Read: http://www.vertica.com/2011/09/21/counting-triangles/
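A hedged sketch of pulling HDFS-resident files into Vertica with the Hadoop/HDFS connector. The SOURCE Hdfs() form comes from the connector documentation of that era, but the namenode URL, user, target table, and combined COPY options should be treated as illustrative.

    -- Load log files stored in HDFS into a Vertica table via the
    -- HDFS connector (namenode host, path, and table are hypothetical).
    COPY web_logs
    SOURCE Hdfs(url = 'http://namenode.example.com:50070/webhdfs/v1/logs/2012/*',
                username = 'hadoop')
    DELIMITER '|' DIRECT;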
Vertica - The new Database technology
Vertica's main features, and what each one enables:
• Columnar store → 50x-1,000x faster performance
• Advanced compression → 1:10 compression rate
• Massively Parallel Processing (MPP) → linear scalability from TBs to PBs
• Automatic Database Designer → one DBA can handle PBs of data
• Native high availability → no single point of failure; up to 49% resiliency
• Standard SQL interface → simple integration with existing ETL and BI solutions
Thanks