Large Databases:
Do’s & Don’ts
•Angelo Sironi
•Executive IT Architect – IBM
•[email protected]
Padoa, 11 June 2008
© 2008 IBM Italia S.p.A.
Managing large databases
or rather...
when volumes put the theories under strain...
What to do, and what not to do, to
ensure performance and scalability
for the applications and architectures
that must handle terabytes of data

Contents
• Asilomar, 1998-2008
• XLDB
• Large Databases: How Large?
• Concerns
• Do’s & Don’ts
• A look into the future …
• Concluding remarks

Asilomar, August 1998
• Database system research agenda for the next decade
  - Plug & Play Database Management Systems
  - Federate Millions of Database Systems
  - Rethink Traditional Database System Architecture
  - Smart Data: Unify Process and Data in Database Systems
  - Integrate Structured and Semi-structured Data
• The Information Utility: make it easy for everyone to store, organize, access, and analyze the majority of human information online.

XLDB, 2007 - SciDBMS, 2008 - CW, 2008
• 1st Workshop on Extremely Large Databases (SLAC, Oct. 2007)
  - All of the industry representatives had more than 10 petabytes of data, and their largest individual systems were all at least 1 petabyte in size.
• SciDBMS Meeting 2008 (Asilomar, Apr. 2008)
  - SciDBMS must be able to scale to databases of hundreds of petabytes, with individual tables measured in trillions of rows
• CW - May 22, 2008
  - “Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest” (Computerworld)
  - Largest table: “multiple trillions of rows”
  - “The 2 PB database requires fewer than 1,000 PC servers”

What’s an XLDB for us?
• How do we measure it?
  - Total size of all DBs?
  - Size of the largest DB?
  - Size of the largest table / file?
• Larger than one petabyte?
  - NO!
  - So what?
• Or is it an XLDB when it concerns us?
• But … why are we concerned?

Concerns related to VLDB / XLDB
• Sailing unknown oceans …
  - Technology
  - Experience
  - Skills
  - Methods
• Past negative experience
  - Fear it might happen again

Concerns related to VLDB / XLDB
• Feasibility / cost
• Sizing
• Engineering
• Performance
• Throughput
• Availability
• Maintenance
• Administration
• …
These concerns apply to DW systems as well as to OLTP ones

Do’s & Don’ts (1/3)
• Define / identify non-functional requirements
  - Be rigorous with definitions
  - Identify the most critical requirements
  - Don’t confuse business requirements with technical implementations
  - Size the environment
• Don’t follow opinions
  - Base your decisions on proven facts
• Don’t be dogmatic: always understand the impact of your decisions
  - “Everything must be in Boyce-Codd Normal Form” … or …
  - “We must always follow Kimball’s Dimensional Model using Surrogate Keys”

Do’s & Don’ts (2/3)
• Don’t consider non-functional requirements as an afterthought (1)
  - Set validation points
  - Compare with sizing critical specs
  - Prototype the unknown
  - Measure
• Don’t rely on the first solution that comes to mind
  - On the most critical issues, be ready with alternatives (lateral thinking may help)
  - Prototype, prototype, prototype …
  - Measure, compare and contrast
(1) Afterthought = an addition to something already completed

Do’s & Don’ts (3/3)
• Consider parallelizing everything
  - But don’t forget Amdahl’s Law …
  - … and hot spots …
• Automate everything
  - Reduce complexity
  - Reduce operating time
• Document
  - Accurately
  - With all relevant … details

Current Data Warehouse Challenges
• Total cost of ownership
  - Large amount of data, large storage cost
  - Huge hardware management cost
• Complex BI queries
  - Planning for reporting queries: indexes/MQTs must be built in advance
  - Unexpected ad-hoc query performance: hinders interactive data analysis

Current Data Warehouse Research Goals
• Dramatic TCO reduction
  - Extreme compression
  - Ride the wave of commodity hardware
• Constant ad-hoc query response time
  - Exploit enormous parallelism (multi-core)
  - Exploit large memories (in-memory database)
  - Scan-based query processing

Blink – a Data Warehouse Accelerator Prototype
• Research prototype underway at the IBM Almaden and Boeblingen labs
  - Achieve consistent response times for ad hoc queries
  - Exploit modern hardware
    - Parallelism in multicore commodity processors
    - Large memory (in-memory DB)
  - DBA relief (no database tuning: no indexes, MQTs, etc.)
• Goal
  - Run ad hoc BI queries with consistent response times
  - Target: query 1 billion tuples in 1 second for $10K worth of 2007 hardware
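
The target can be turned into a required scan rate with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, the 8-core figure is an assumption chosen for the example, and the work is assumed to divide perfectly evenly across cores.

```python
def scan_rate_per_core(tuples: int, seconds: float, cores: int) -> float:
    """Tuples each core must scan per second, assuming a perfectly even split."""
    return tuples / seconds / cores


def tuples_per_second_per_dollar(tuples: int, seconds: float, dollars: float) -> float:
    """Scan throughput bought by each dollar of hardware."""
    return tuples / seconds / dollars


# Target from the slide: 1 billion tuples in 1 second on $10K of hardware.
# cores=8 is an assumption for illustration only.
per_core = scan_rate_per_core(1_000_000_000, 1.0, cores=8)
per_dollar = tuples_per_second_per_dollar(1_000_000_000, 1.0, 10_000)

print(f"{per_core:,.0f} tuples/s per core")      # 125,000,000 tuples/s per core
print(f"{per_dollar:,.0f} tuples/s per dollar")  # 100,000 tuples/s per dollar
```

At rates like these each tuple gets only a few nanoseconds of CPU time, which is consistent with the slide's emphasis on compression, in-memory data and scan-based processing.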

Blink – a Data Warehouse Accelerator Prototype
• Today: Blink Query Engine
  - Most queries in about 3 secs
  - $4000 box, 2x4 cores, 16 GB
• Google-like experience for BI queries!
• Near-optimal compression of relational data
  - Exploits data skew, column correlations and lack of ordering
  - Between 8x and 40x compression
• Querying a compressed row-store
  - Directly perform projections and selections on compressed data
  - Efficient hash-based aggregation
  - Constant query response time: 3 sec per billion tuples (today)
[Chart: x-axis “Size of query results”, 10 to 10,000]
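
To illustrate what "selections on compressed data" and "hash-based aggregation" can mean, here is a minimal sketch. It uses plain dictionary encoding with frequency-ordered codes as a stand-in; it is not Blink's actual entropy-coding scheme, and all names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(values):
    """Give the most frequent values the smallest codes (a simple way to exploit skew)."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], str(v)))
    return {value: code for code, value in enumerate(ordered)}


def encode(values, dictionary):
    """Replace each value by its integer code."""
    return [dictionary[v] for v in values]


def sum_by_group_where(group_codes, filter_codes, measures, wanted_code):
    """SUM(measure) GROUP BY group WHERE filter = wanted, evaluated on codes only."""
    totals = defaultdict(int)  # hash table keyed by the group code
    for g, f, m in zip(group_codes, filter_codes, measures):
        if f == wanted_code:   # the selection compares codes; nothing is decoded
            totals[g] += m
    return totals


# Hypothetical sample data.
regions = ["EU", "EU", "US", "EU", "ASIA", "US", "EU"]
status = ["ok", "ok", "ok", "err", "ok", "err", "ok"]
amounts = [10, 20, 5, 7, 3, 8, 30]

region_dict = build_dictionary(regions)
status_dict = build_dictionary(status)
region_codes = encode(regions, region_dict)
status_codes = encode(status, status_dict)

totals = sum_by_group_where(region_codes, status_codes, amounts, status_dict["ok"])

# Decode only the (small) result, not the scanned data.
code_to_region = {code: value for value, code in region_dict.items()}
result = {code_to_region[code]: total for code, total in totals.items()}
print(result)  # {'EU': 60, 'US': 5, 'ASIA': 3}
```

The predicate constant is encoded once and the scan then works on codes throughout; only the final groups are decoded.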

DW/BI Topology with Data Warehouse Accelerator
[Diagram: OLTP applications work on operational data held in a general-purpose RDBMS (e.g. DB2); ETL feeds the data warehouse, also a general-purpose RDBMS (e.g. DB2), which serves the BI applications; a DW Accelerator is attached to the data warehouse.]

IBM solidDB Acquisition: Key Capabilities
• In-memory, relational database
  - solidDB optimizes data for in-memory access, not only for disk
  - Applications can use it through standard ODBC, JDBC and SQL interfaces
• Instant failover
  - solidDB keeps two copies of the data synchronized at all times
  - In case of system failure, applications can recover access to solidDB in well under 1 second, without loss of data
• Embeddable
  - solidDB can be deployed in a client/server configuration, or as a library linked into the application process
• Front-end cache for RDBMS
  - June 2008: available as a front-end cache for IBM DB2 and IDS

Concluding Remarks
Technology advances don’t stop!
However, establishing and enforcing methods based on best practices
will always be the best protection against failures,
whatever the technology may be!

References
[1] Z. Czech et al., An optimal algorithm for generating minimal perfect hash functions, IPL, 43(5), 1992
[2] V. Raman et al., Constant-time Query Processing, IEEE International Conference on Data Engineering, 2008
[3] V. Raman and G. Swart, How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations, VLDB ‘06, September 12-15, 2006
[4] E. Lai, Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest, Computerworld, May 22, 2008
[5] T. Westmann et al., The Implementation and Performance of Compressed Databases, SIGMOD Record, 29(3), 2000

Bi-temporal Model and Surrogate Keys: A Referential Integrity Issue
Late arrival of dimension updates

CDRs transformed and loaded on 21.08.2001

Calling Phone Nr  Valid Date  Valid Time  Trx Date    Called Phone Nr  Cost  Phone ID
02-444-111        19.08.2001  20:23:06    21.08.2001  06-132-4326      120   100
02-444-111        20.08.2001  09:10:45    21.08.2001  0331-239-325     320   100

Phone Dimension state on 21.08.2001

Phone ID  Phone Nr.   Init Valid Date  End Valid Date  Init Trx Date  End Trx Date  Cust Key
100       02-444-111  01.01.2000       31.12.9999      01.01.2000     31.12.9999    C3

Phone Dimension state after 23.08.2001

Phone ID  Phone Nr.   Init Valid Date  End Valid Date  Init Trx Date  End Trx Date  Cust Key
100       02-444-111  01.01.2000       31.12.9999      01.01.2000     22.08.2001    C3
101       02-444-111  01.01.2000       19.08.2001      23.08.2001     31.12.9999    C3
102       02-444-111  20.08.2001       31.12.9999      23.08.2001     31.12.9999    C6

Where are Referential Integrity issues coming from?
(CDR and Phone Dimension tables as on the previous slide)

    SELECT P.*
    FROM   CDR C, PHONE P
    WHERE  C.CALLING_PHONE_NR = P.PHONE_NR
    AND    INIT_TRX_DATE <= '22.08.2001'
    AND    END_TRX_DATE  >= '22.08.2001'
    AND    VALID_DATE    >= INIT_VALID_DATE
    AND    VALID_DATE    <= END_VALID_DATE

• With transaction date '22.08.2001': returns PHONE_ID = 100 for BOTH phone calls
• With transaction date '23.08.2001': returns PHONE_ID = 101 for the FIRST phone call and PHONE_ID = 102 for the SECOND phone call

The CDRs loaded on 21.08.2001 still carry Phone ID 100, so once the late update arrives the stored surrogate key no longer points to the dimension row that is valid for the second call.
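
The effect can be reproduced with a few lines of Python and an in-memory SQLite database. This is a sketch, not the original system: table and column names follow the slide, the dates are rewritten in ISO format (an assumption made so that string comparison orders them correctly), and the PHONE rows are the state after the late update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cdr (calling_phone_nr TEXT, valid_date TEXT, valid_time TEXT,
                  trx_date TEXT, called_phone_nr TEXT, cost INTEGER, phone_id INTEGER);
CREATE TABLE phone (phone_id INTEGER, phone_nr TEXT,
                    init_valid_date TEXT, end_valid_date TEXT,
                    init_trx_date TEXT, end_trx_date TEXT, cust_key TEXT);
INSERT INTO cdr VALUES
  ('02-444-111', '2001-08-19', '20:23:06', '2001-08-21', '06-132-4326', 120, 100),
  ('02-444-111', '2001-08-20', '09:10:45', '2001-08-21', '0331-239-325', 320, 100);
INSERT INTO phone VALUES
  (100, '02-444-111', '2000-01-01', '9999-12-31', '2000-01-01', '2001-08-22', 'C3'),
  (101, '02-444-111', '2000-01-01', '2001-08-19', '2001-08-23', '9999-12-31', 'C3'),
  (102, '02-444-111', '2001-08-20', '9999-12-31', '2001-08-23', '9999-12-31', 'C6');
""")

# Same predicates as the slide's query, with the transaction date as a parameter.
QUERY = """
SELECT c.valid_date, p.phone_id
FROM cdr c JOIN phone p ON c.calling_phone_nr = p.phone_nr
WHERE p.init_trx_date <= :as_of AND p.end_trx_date >= :as_of
  AND c.valid_date >= p.init_valid_date AND c.valid_date <= p.end_valid_date
ORDER BY c.valid_date
"""


def phone_ids_as_of(as_of):
    """Dimension row matched by each call, as known at transaction date `as_of`."""
    return [row[1] for row in conn.execute(QUERY, {"as_of": as_of})]


stored = [row[0] for row in conn.execute("SELECT phone_id FROM cdr ORDER BY valid_date")]

print(phone_ids_as_of("2001-08-22"))  # [100, 100]
print(phone_ids_as_of("2001-08-23"))  # [101, 102]
print(stored)                         # [100, 100]
```

The surrogate keys stored in the fact rows agree with the dimension only as of 22.08.2001; resolving the key by natural key plus both time axes gives a different answer once the late update is in.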
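
Amdahl’s Law (a simplified view)
[Figure: elapsed-time bars for steps A to E, run serially and with the parallel steps overlapped; the difference between the two is the time saving]

$$S_p = \frac{1}{\alpha + \frac{1 - \alpha}{p}}, \qquad S_p \to p \ \text{when} \ \alpha \to 0$$

where $\alpha$ is the fraction of the elapsed time that remains serial and $p$ is the number of parallel streams.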
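
A short numeric sketch of the formula S_p = 1 / (α + (1 - α) / p) shows how quickly a small serial fraction caps the speedup. The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Speedup with p parallel streams when `serial_fraction` of the work stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


# Illustrative values only: serial fractions of 0%, 5% and 25%.
for alpha in (0.0, 0.05, 0.25):
    row = ", ".join(f"p={p}: {amdahl_speedup(alpha, p):.2f}" for p in (4, 16, 64))
    print(f"alpha={alpha:.2f} -> {row}")
```

With a serial fraction of 5% the speedup can never exceed 1 / 0.05 = 20, however many streams are added; only when the serial fraction approaches zero does the speedup approach p.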

“Load” Process (Everything Parallel & Automated)
[Figure: several streams of input records processed in parallel. Older partitions (Current Month-6 to Current Month-1) go through Sort and Insert; the Current Month partition, with Current Day-2 and Current Day-1, goes through Unload, Sort and Load.]
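
The idea of independent streams, each sorting and loading its own partition, can be sketched as follows. This is an assumption-laden illustration, not the process shown on the slide: partition names, record layout and function names are hypothetical, and the "load" is just an in-memory result.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_and_load(partition_name, records):
    """Sort one partition's input records by key and return the 'loaded' partition."""
    return partition_name, sorted(records, key=lambda record: record[0])


def parallel_load(inputs, max_workers=4):
    """Run one sort-and-load stream per partition and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(sort_and_load, name, records)
                   for name, records in inputs.items()]
        return dict(future.result() for future in futures)


# Hypothetical input: (key, payload) records already split by target partition.
inputs = {
    "month-1": [(2, "b"), (1, "a")],
    "day-2": [(9, "z"), (3, "c"), (5, "e")],
    "day-1": [(4, "d")],
}

loaded = parallel_load(inputs)
print(loaded["day-2"])  # [(3, 'c'), (5, 'e'), (9, 'z')]
```

Because each stream touches only its own partition, the streams share no hot spot. In CPython, threads will not speed up CPU-bound sorting; a real loader would more likely run separate processes or the database's own load utilities per partition.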