CHEP 2007
Building a Scalable Event Level Metadata Service for ATLAS
Performance and Scalability Tests for the Relational TAG Database
Jack Cranshaw, Luc Goossens, David Malon, Helen McGlone, Florbela Viegas

Introduction
• The ATLAS TAG Database
• A challenging environment
• The 1TB tests
• Learning
• Performance results
• What's next

The ATLAS TAG Database
• Defined in the ATLAS Computing Model
• File and relational database system
• Supports seamless discovery, identification, selection and retrieval of the events of interest to an analysis, and no others
• TAGs are written at AOD production
• Implemented using the LCG POOL Collection infrastructure

What is a TAG?
• 1 kB of summary information per event
• Attributes selected to support selection of events and navigation within the system:
  – Event ID and global event properties
  – Trigger information
  – Quality information
  – Temporal information
  – Physics objects (high level)

A challenging environment…..
• ATLAS data rate: 200 Hz, i.e. 200 new event TAGs per second during data taking; assuming 50k active seconds per day (≈58% efficiency for each active day), that is 10^7 events/day
• ATLAS data volume – terabytes!

  Year        % of year data taking   Data volume
  2008        40                      1.42 TB
  2009        60                      3.65 TB
  additional  60                      6.09 TB

A challenging user!
• The ATLAS physicist…..
  – Fast, efficient, accurate queries
  – Reliable navigation to event data
  – Seamless integration with analysis
• QUERY -> TAG DB -> RETRIEVE -> ANALYSIS

Performance and Scalability Tests for the Relational TAG Database
• Large scale, realistic tests to uncover the challenges brought by scale
• Optimise and measure performance:
  – Management
  – Partitioning
  – Indexing
  – Optimizer hints
  – Multiple clients
  – Query patterns

Why 1TB tests?
• We expect that important performance and management phase transitions will be crossed as we scale to billions of events
• Queries may be unselective and select across a range of attributes – we want to address this challenge at large scale
• Memory to disk

The 1TB TAG Database
• Created one million dummy events based on real TAG attributes
• Realistic and varied data types and value distributions
• Multiplication and replication, one million to one billion, realistic and with unique IDs

The 1TB Table

  Name                                                                Datatype      Size  Scale  Nulls?
  ID                                                                  NUMBER        36    0      Yes
  RUNNR, EVENTNR, GOLDEN1, GOLDEN2, AODFILEFK, ESDFILEFK              NUMBER        12    0      Yes
  BOOL500CHAR01, BOOL100CHAR01, BOOL10CHAR01, BOOL1CHAR01             CHAR          1     -      Yes
  BOOL500NUM01, BOOL100NUM01, BOOL10NUM01, BOOL1NUM01                 NUMBER        1     0      Yes
  ENUMUNI100VC01, ENUMUNI10VC01                                       VARCHAR2      10    -      Yes
  ENUMEXP1000NUM01, ENUMEXP100NUM01, ENUMEXP10NUM01,
  ENUMUNI1000NUM01, ENUMUNI100NUM01, ENUMUNI10NUM01, ENUMUNI5NUM01    NUMBER        5     0      Yes
  NOR100BF01, NOR10BF01, NOR1BF01                                     BINARY_FLOAT  4     -      Yes
  NOR100NUM01, NOR10NUM01, NOR1NUM01                                  NUMBER        12    5      Yes
  UNINUM01, UNI10KNUM01, UNI1KNUM01, UNI100NUM01, UNI10NUM01          NUMBER        12    0      Yes

• Index?: a mix of B*Tree and Bitmap indexes across these attributes (listed as one B*Tree, 20 Bitmap, 7 B*Tree and 4 Bitmap)
• We replicate the 1M rows to produce a 1TB table
• Approximately 1 kB row size (the expected TAG size)
• The distribution of values tries to mimic the types of columns and value distributions expected for TAG metadata
• This set of attributes is repeated in the same sequence up to set 8, but only this set is indexed
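For illustration only, a minimal Oracle DDL sketch of a cut-down version of this table. The table and index names, the choice of columns and the assignment of index types are hypothetical; the real test table carries all 35 attributes per set plus the replicated, unindexed sets.

  -- Hypothetical cut-down TAG test table: a few of the columns listed above
  CREATE TABLE tag_1tb_sketch (
    id             NUMBER(36),     -- unique row/event identifier
    runnr          NUMBER(12),     -- run number
    eventnr        NUMBER(12),     -- event number
    bool1char01    CHAR(1),        -- boolean-like flag stored as CHAR
    enumuni10vc01  VARCHAR2(10),   -- enumerated value, uniform over 10 choices
    nor1num01      NUMBER(12,5),   -- normally distributed numeric attribute
    uni10num01     NUMBER(12)      -- uniformly distributed numeric attribute
  );

  -- Illustrative mix of index types: bitmap on a low-cardinality flag,
  -- B*Tree on a higher-cardinality numeric attribute
  CREATE BITMAP INDEX tag_bm_bool1 ON tag_1tb_sketch (bool1char01);
  CREATE INDEX tag_bt_nor1 ON tag_1tb_sketch (nor1num01);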
1TB test system
• ITRAC04.cern.ch: 2 CPUs, 4 GB total memory
• ITRAC05.cern.ch: 2 CPUs, 4 GB total memory
• INT8R1 instance: 2 GB Oracle cache
• INT8R2 instance: 2 GB Oracle cache
• ASM shared storage (INT8R.CERN.CH), ~2 TB user space available

Partitioning the data
• Partitions are an Oracle tool for:
  – Query performance
  – Data manageability
• Multidimensional:
  – Horizontal
  – Vertical
• Choice of partition key:
  – RUN
  – STREAM

Table Architecture
• Divide all data horizontally into ten STREAM tables (STREAM A, STREAM B, ..., STREAM N)
• Partition all STREAM tables horizontally by RUN – each table has 100 partitions, 40 runs per partition, 1 million rows per partition
• This is our nth partitioning schema…..

Indexing
• In our test table we have B*Tree-indexed, bitmap-indexed and non-indexed attributes
• Tested extensively to understand the performance and management implications of all three

We learned
• B*Tree- and bitmap-indexed attributes perform differently
• Oracle has distinct query plans which are optimal for each
• Non-indexed attributes are a huge restriction on query performance
• So…..
  – Index everything (a challenge)
  – Pre-process to separate B*Trees and bitmaps, and implement a different query plan for each

Optimizing the Optimizer
• Oracle has an Optimizer designed to select optimal query execution plans
• Performance is sensitive to query plans
• The Optimizer chooses a plan based on cost estimation:
  – Complex selection
  – Uses statistical information about the data
• The Optimizer is not perfect:
  – Investigate optimal query plans
  – Help the Optimizer make the right choice
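The slides note that the Optimizer uses statistical information about the data. One common way to help it make the right choice is to keep those statistics up to date; the sketch below is only an illustration of that idea, with hypothetical schema and table names, and the slides do not say how statistics were actually managed in these tests.

  -- Refresh optimizer statistics so cost estimates reflect the current data
  -- (hypothetical owner/table names; cascade => TRUE also gathers index statistics)
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname => 'TAGTEST',
      tabname => 'TAG_STREAM_A',
      cascade => TRUE
    );
  END;
  /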
Oracle Hints
• Many hints were tested…..
  – PARALLEL: needed for parallelization of a full table scan or a partition range scan
  – PARALLEL_INDEX: needed for parallelization of index access
  – INDEX_JOIN: for hash joins with B*Tree indexes
  – INDEX_COMBINE: for bitmap indexes
  – OPT_PARAM(index_join_enabled, false): can enable and disable session parameters for a single SQL statement

Queries for testing
• "Count the events with at least two electrons and missing ET > 10 GeV that are 'good for physics'" – SUMMARY
• "Give me all the events with at least two electrons and missing ET > 10 GeV that are 'good for physics'" – CONTENT
• Selection based on both index types
• Use INDEX_JOIN for B*Trees and INDEX_COMBINE for bitmaps, then INTERSECT (a sketch of this pattern follows the Results slide below)
• Flush the buffer cache between queries, so no cache advantage
• Increase the number of partitions involved as the number of rows returned increases, with a consistent % of rows from each partition

Summary queries
[Plots: time to count events (seconds) vs rows counted/returned, for 1% and 10% of data per partition, both index types]
• Time increases with the number of partitions
• Linear increase – can predict the time
• 10x the data is not 10x the time – can predict the time within some bounds
• Times are of the order of seconds

Summary queries – an extreme
[Plot: time to count events (minutes) vs rows returned, all partitions, increasing data returned, both index types]
• See the value of partition elimination

Content queries
[Plots: time to select (seconds) vs rows returned, both index types, 1-10 partitions, for 1% and 10% of data per partition]
• Linear increase in time with the number of partitions
• The time overhead is in the number of partitions accessed, not the data returned from within them
• Times are of the order of seconds
• Does this linear relation extrapolate indefinitely…..?

Content query – an extreme
[Plot: time to select (minutes) vs rows returned, both index types, all partitions, increasing data per partition]
• If the linear relation were constant (~proportional to the number of partitions in the query), a query over all 100 partitions would take ~20 mins – not in practice….why?
• Threshold case where sorts move from all in memory to disk – a higher performance overhead; the same query plan, but with use of disk
• Still faster than a full table scan

Results
[Plot: count vs select, 10% of data, 1-10 partitions; time (s) vs rows returned, one curve for count and one for select]
• The "extreme case" (optimised) takes upwards of 30 mins for selects
• Encourage counts before selects
• Encourage use of the partition key
• Can extrapolate to predict the time a query is likely to take
• Pre-processing is crucial
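As referenced in the "Queries for testing" slide, a minimal sketch of the two-index-type pattern: drive the B*Tree part with INDEX_JOIN, the bitmap part with INDEX_COMBINE, and INTERSECT the two row sets. The table and attribute names and the cuts are hypothetical stand-ins for the "two electrons and missing ET > 10 GeV, good for physics" selection.

  -- Summary-style query: count events passing a cut on a B*Tree-indexed
  -- attribute and cuts on bitmap-indexed attributes (hypothetical names/cuts)
  SELECT COUNT(*)
  FROM (
        SELECT /*+ INDEX_JOIN(t) */ id
          FROM tag_stream_a t
         WHERE nor1num01 > 10.0              -- B*Tree-indexed attribute
        INTERSECT
        SELECT /*+ INDEX_COMBINE(t) */ id
          FROM tag_stream_a t
         WHERE bool1char01 = 'Y'             -- bitmap-indexed attribute
           AND enumuni10vc01 = 'GOOD'        -- another bitmap-indexed attribute
       );

For the CONTENT variant, the outer COUNT(*) would be replaced by a select of the attributes needed for navigation to the event data.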
Multiple clients – stress tests
• Assess the performance of the system in a multi-client environment
• Simulate a realistic query environment by creating expected typical queries:
  – Counts and retrieves
  – Vary attributes and filters
• Create a sample job of 9 optimised queries, first with 60 seconds and then with 90 seconds of pause between each
• Sessions run on one node of the INT8R cluster; the node has 2 GB memory and 2 CPUs
[Plots: job frequency 60 s and 90 s – number of concurrent jobs, job time (minutes) and average host CPU versus time of day]
• Saturation at 60 seconds; queries are supported at 90 seconds
• 1 job every 90 seconds generates ~9000 queries per day
• The T0 production database has 6 nodes; each T1 has 2 nodes (3 Oracle licences)
• New hardware in April 2008; will test when the new hardware becomes available
• Need to limit concurrent processes

ATLAS Relational TAG Database experience
• The Streams Test TAG database:
  – Smaller scale
  – Introduce users to TAGs
  – Gather query pattern information
• http://atlas.web.cern.ch/Atlas/GROUPS/OPERATIONS/dataBases/TAGS/tag_browser.php
• > 5000 queries captured
• The captured query patterns come with a "pinch of salt"….. but positive and useful feedback nonetheless
[Plot: counts vs retrievals for the Streams test TAG database]

What's next?
• Assess performance on the new hardware
• A new 'real' TAG Database for CSC data
• Continue gathering data about likely query patterns
• Assess file vs. relational database, to guide users as to when each is appropriate
• Trigger implementation in TAGs

Summary
• The 1TB tests optimised and measured the performance of a realistic-scale TAG database
• Understand the challenges and develop a system that can offer good performance
• An ongoing challenge

Thank you!

Scenarios
• Complete set of data (or the most used part thereof) fits in memory/cache:
  – memory is O(1 GB)
  – query time is limited by CPU and memory speed
• Complete set of indices (or the most used part thereof) fits in memory/cache:
  – fast identification of data
  – slow retrieval from disk: contiguous parts at O(100 MB/s); random parts at O(1000 IO/s) = O(1 MB/s); contiguous IO degrades with parallelism
• Neither indices nor data fit in memory:
  – index reading from disk – usually contiguous
• Final and/or intermediate results do not fit in memory:
  – e.g. for sorting, intersection, joins, ...
  – need to use disk even for these operations

Query plans
• Summary query: HASH JOIN, BITMAP CONVERSION TO ROWIDS, BITMAP INDEX RANGE SCAN, INDEX RANGE SCAN, PARALLEL QUERY
• Content query: INDEX (FULL SCAN) OF 'ICMG1_1_ID', BITMAP INDEX (RANGE SCAN), BITMAP MERGE, BITMAP AND, BITMAP CONVERSION (TO ROWIDS), HASH JOIN, SORT (UNIQUE), PARALLEL QUERY
• Each query scans 1 GB of data, which in logical units is around 1 hour and 30 minutes of data taking at 200 Hz (200 events/s x ~1 kB/event ≈ 200 kB/s, so 1 GB ≈ 5000 s)
• The average time of a job running alone is 10 minutes
• The node has 2 GB memory and 2 CPUs
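Finally, a sketch of how the PARALLEL hint from the "Oracle Hints" slide could be applied when a query has to scan partitions rather than use indexes. The table and column names, the run-range cut and the degree of parallelism are illustrative only.

  -- Parallel count over a run range; with range partitioning by run number,
  -- the predicate lets Oracle prune the scan to the matching partition(s)
  -- (hypothetical names; parallel degree 2 chosen arbitrarily)
  SELECT /*+ PARALLEL(t, 2) */ COUNT(*)
    FROM tag_stream_a t
   WHERE runnr BETWEEN 1 AND 40;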