Oracle9i: The Data Warehouse Database
OracleWorld 2002 #31348
Ian Abramson
IAS Inc.
[email protected]

Agenda
• The database
  – Partitioning
  – External Tables
  – Parallel Execution
  – Materialized Views
  – The Optimizer
  – Bit-map join indexes
• The SQL
  – CUBE and ROLLUP
  – Rolling Windows
  – Multi-table Inserts
• OLAP Overview
• Data Mining Overview

Oracle9i for Data Warehousing
• Oracle 7.3: Hash Join; Bitmap Indexes; Parallel-Aware Optimizer; Partition Views; Instance Affinity: Function Shipping; Parallel Union All; Asynchronous Read-Ahead; Parallel Index Scans; Histograms; Anti-Join
• Oracle 8.0: Partitioned Tables and Indexes; Partition Pruning; Parallel Insert, Update, Delete; Parallel Bitmap Star Query; Parallel ANALYZE; Parallel Constraint Enabling; Server Managed Backup/Recovery; Point-in-Time Recovery
• Oracle8i: Hash and Composite Partitioning; Resource Manager; Progress Monitor; Adaptive Parallel Query; Server-based Analytic Functions; Materialized Views; Transportable Tablespaces; Direct Loader API; Functional Indexes; Partition-wise Joins; Security Enhancements
• Oracle9i: List Partitioning; Bitmap Join Index; Dynamic Aggregation Buffersize; Materialized Intermediate Results; Grouping Sets; Concatenated Grouping Sets; Aggregate Pruning; New Analytic Functions; Self-Tuning Execution Memory; System Managed Undo; Dynamic Resizing of Buffer Pool; ETL Infrastructure; and much more ...

Oracle9.2i Complete e-Business Intelligence Infrastructure

Oracle9.2i Database
• Relational
• OLAP
• Data Mining
• ETL
• Metadata

Oracle9.2i Application Server
Runs All Your Business Intelligence Applications
• Portal
• Query & Reporting
• BI Components
• Real-time Personalization ("Hello! We have recommendations for you.")
• Metadata
Oracle9i R2 DW Architecture
• Oracle9iDB: Data Warehousing, ETL, OLAP, Data Mining
• Oracle9iAS: Portal, Query & Reporting, BI Components, Real-Time Personalization ("Hello! We have recommendations for you.")
• Metadata

The Old Way: Everything is a Different Product
• Data Sources
• Data Integration Engine
• Data Warehouse Engine
• OLAP Engine
• Mining Engine

The New Way: Oracle9i
• Oracle9i: Data Warehousing, ETL, OLAP, Data Mining
• All aspects of architecture are integrated

The Oracle9i DW Database
"Computers are useless. They only give you answers."
Picasso

Partitioning Advantages
• Separates data into separate pieces
• Partition key defined at creation
• Partition pruning
• Partition-wise joins
• May partition
  – Tables
  – Indexes

Traditional vs. Partitions
[Diagram: partitioned-table approach vs. single-table approach]

More Advantages
• Partitions are separate physical entities
  – physical attributes (PCTFREE, PCTUSED, INITRANS, MAXTRANS) may vary for different partitions of the same table or index
• Different partitions can live in different tablespaces
  – minimizes the impact of data corruption
  – independent backup and recovery of each partition
  – balances the I/O load

Partition Options
• Range
• Hash
• Composite
• List
• Range-List

Range Partitioning
• Range partitioning maps rows to partitions based on ranges of column values

    CREATE TABLE Sales_by_department
      (Department NUMBER,
       SalesId    NUMBER,
       Amount     NUMBER)
    PARTITION BY RANGE (Department)
      (PARTITION single_digits   VALUES LESS THAN (10)       TABLESPACE sd_low,
       PARTITION double_digits   VALUES LESS THAN (100)      TABLESPACE dd_middle,
       PARTITION multiple_digits VALUES LESS THAN (MAXVALUE) TABLESPACE md_high);

Hash Partitioning
• Maps rows to partitions based on a hash value of the partitioning key (Oracle determines the hash internally)

    CREATE TABLE Sales_by_department
      (Department NUMBER,
       SalesId    NUMBER,
       Amount     NUMBER)
    PARTITION BY HASH (SalesId)
      (PARTITION hash_name1 TABLESPACE hash_name1_tbls,
       PARTITION hash_name2 TABLESPACE hash_name2_tbls,
       PARTITION hash_name3 TABLESPACE hash_name3_tbls);

Composite Partitioning
• Partitions data using the range method, and within each partition sub-partitions it using the hash method

    CREATE TABLE Sales_by_department
      (Department NUMBER,
       SalesId    NUMBER,
       Amount     NUMBER)
    PARTITION BY RANGE (Department)
    SUBPARTITION BY HASH (SalesId)
    SUBPARTITIONS 2 STORE IN (sub1_tbls, sub2_tbls)
      (PARTITION single_digits   VALUES LESS THAN (10)       TABLESPACE sd_tbls,
       PARTITION double_digits   VALUES LESS THAN (100)      TABLESPACE dd_tbls,
       PARTITION multiple_digits VALUES LESS THAN (MAXVALUE) TABLESPACE md_tbls);

List Partitions

    CREATE TABLE sales_list
      (salesman_id   NUMBER(5),
       salesman_name VARCHAR2(30),
       sales_state   VARCHAR2(20),
       sales_amount  NUMBER(10),
       sales_date    DATE)
    PARTITION BY LIST (sales_state)
      (PARTITION sales_west    VALUES ('California', 'Hawaii'),
       PARTITION sales_east    VALUES ('New York', 'Virginia', 'Florida'),
       PARTITION sales_central VALUES ('Texas', 'Illinois'),
       PARTITION sales_other   VALUES (DEFAULT));

Range-List Partitions

    CREATE TABLE quarterly_regional_sales
      (deptno     NUMBER,
       item_no    VARCHAR2(20),
       txn_date   DATE,
       txn_amount NUMBER,
       state      VARCHAR2(2))
    PARTITION BY RANGE (txn_date)
    SUBPARTITION BY LIST (state)
      (PARTITION q1_2002 VALUES LESS THAN (TO_DATE('1-APR-2002','DD-MON-YYYY'))
        (SUBPARTITION q1_2002_northwest VALUES ('OR', 'WA'),
         SUBPARTITION q1_2002_southwest VALUES ('AZ', 'UT', 'NM'),
         SUBPARTITION q1_2002_northeast VALUES ('NY', 'VM', 'NJ'),
         SUBPARTITION q1_2002_southeast VALUES ('FL', 'GA'),
         SUBPARTITION q1_2002_northcentral VALUES ('SD', 'WI'),
         SUBPARTITION q1_2002_southcentral VALUES ('NM', 'TX')),
       PARTITION q2_2002 VALUES LESS THAN (TO_DATE('1-JUL-2002','DD-MON-YYYY'))

Partition Maintenance Functions
• Add
• Drop
• Exchange
• Move
• Split and Merge
• Truncate
• Coalesce (Hash only)

Partition Exchange
• Allows you to create data in a separate table and then replace a partition with the table
• To validate or not to validate is the question
• Actually exchanges table and partition
• Nice for archiving

    SQL> ALTER TABLE sales_transactions
           EXCHANGE PARTITION sales_feb_2000
           WITH TABLE load_sales
           INCLUDING INDEXES
           WITHOUT VALIDATION;

    Table altered.

Partition Notes
• Separate partitions - separate tablespaces
• Beware of MAXVALUE in a rolling window
• Use naming conventions
  – range partitions --> table_name_YYYY_MM_DD
                         table_name_tbls_YYYY_MM_DD
• No global indexes for hash partitions

External Tables
• Data resides on the operating system
• Table definition resides in the database (SYSTEM tablespace)
• Need to define:
  – Directory for files
  – Table definition
• Read-only
• No indexes
• Problems exhibit themselves at SELECT time

Oracle9i: ETL Scenario
• Oracle8i: multiple staging tables and SQL statements
  – Step 1: Load flat files into staging table
  – Step 2: Transform data using function TRANSFORM (into a second staging table)
  – Step 3: Insert and update into target table
• Oracle9i: single SQL statement
• Oracle9i: parallel pipelining of data

External Tables the SQL
• Create directory:

    CREATE DIRECTORY data_dir AS 'd:\wkdir';
    CREATE DIRECTORY log_dir AS 'c:\TEMP';

External Tables the SQL (2)

    CREATE TABLE products_delta
      (PROD_ID              NUMBER(6),
       PROD_NAME            VARCHAR2(50),
       PROD_DESC            VARCHAR2(4000),
       PROD_SUBCATEGORY     VARCHAR2(50),
       PROD_SUBCAT_DESC     VARCHAR2(2002),
       PROD_CATEGORY        VARCHAR2(50),
       PROD_CAT_DESC        VARCHAR2(2002),
       PROD_WEIGHT_CLASS    NUMBER(2),
       PROD_UNIT_OF_MEASURE VARCHAR2(20),
       PROD_PACK_SIZE       VARCHAR2(30),
       SUPPLIER_ID          NUMBER(6),
       PROD_STATUS          VARCHAR2(20),
       PROD_LIST_PRICE      NUMBER(8,2),
       PROD_MIN_PRICE       NUMBER(8,2))
    ORGANIZATION EXTERNAL
      (TYPE oracle_loader
       DEFAULT DIRECTORY data_dir
       ACCESS PARAMETERS
         (RECORDS DELIMITED BY NEWLINE
          CHARACTERSET US7ASCII
          BADFILE log_dir:'prod_delta.bad_xt'
          LOGFILE log_dir:'prod_delta.log_xt'
          FIELDS TERMINATED BY "|" LDRTRIM)
       LOCATION ('prodDelta.dat'))
    REJECT LIMIT UNLIMITED
    NOPARALLEL;

Parallel Execution
• Do more work at the same time
• Best for:
  – Large table scans
  – Creation of large indexes
  – Partition table scans
  – Bulk inserts, updates and deletes
  – Aggregations and summarizations
• System characteristics
  – SMP, MPP, Clusters
  – Sufficient I/O bandwidth
  – Under-utilized CPU
  – Sufficient memory

Getting Parallel to Work
• In the init.ora:
    parallel_automatic_tuning=TRUE
    parallel_max_servers=n
      (2 * DOP * # of concurrent users)
    parallel_min_servers=n
      (0 or ??, your choice)
    large_pool_size & shared_pool_size
      mem in bytes = (3 x size x users x groups x connections)
• SIZE = PARALLEL_EXECUTION_MESSAGE_SIZE
• USERS = the number of concurrent parallel execution users that you expect to have running with the optimal DOP
• GROUPS = the number of query server process groups used for each query
  – A simple SQL statement requires only one group. However, if your queries involve subqueries that will be processed in parallel, Oracle uses an additional group of query server processes.
• CONNECTIONS = (DOP^2 + 2 x DOP)

Materialized Views
• Provide summary tables
• A "physical" view
• Performance gains are significant
• Similar to snapshots
  – Refresh may be FAST (needs a log on the master)
  – COMPLETE
  – FORCE
• Oracle packages provide help and guidance

    DBMS_MVIEW.EXPLAIN_MVIEW ('SH.CAL_MONTH_SALES_MV');

Materialized Views
• Query rewrite: enable it in the init.ora

    QUERY_REWRITE_ENABLED = TRUE
    QUERY_REWRITE_INTEGRITY = TRUSTED

• May enable at session level as well

Creating a Materialized View

    CREATE MATERIALIZED VIEW cust_sales_mv
      PCTFREE 0
      STORAGE (INITIAL 8k NEXT 8k PCTINCREASE 0)
      BUILD IMMEDIATE
      REFRESH FORCE
      ENABLE QUERY REWRITE
    AS
    SELECT c.cust_id, SUM(amount_sold) AS dollar_sales
    FROM   sales s, customers c
    WHERE  s.cust_id = c.cust_id
    GROUP BY c.cust_id;

Collecting the Details
• Use the DBMS_MVIEW supplied package
• Oracle has many other packages to help you with your materialized views
• Today you get one!
    EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW('SH.CAL_MONTH_SALES_MV');

Verifying a Materialized View

    SELECT capability_name, possible,
           SUBSTR(related_text,1,8) AS rel_text,
           SUBSTR(msgtxt,1,60) AS msgtxt
    FROM   MV_CAPABILITIES_TABLE
    ORDER BY seq;

    CAPABILITY_NAME               P REL_TEXT MSGTXT
    ----------------------------- - -------- ------
    PCT                           N
    REFRESH_COMPLETE              Y
    REFRESH_FAST                  N
    REWRITE                       Y
    PCT_TABLE                     N SALES    no partition key or PMARKER in select list
    PCT_TABLE                     N TIMES    relation is not a partitioned table
    REFRESH_FAST_AFTER_INSERT     N SH.TIMES mv log must have new values
    REFRESH_FAST_AFTER_INSERT     N SH.TIMES mv log must have ROWID
    REFRESH_FAST_AFTER_INSERT     N SH.TIMES mv log does not have all necessary columns
    REFRESH_FAST_AFTER_INSERT     N SH.SALES mv log must have new values
    REFRESH_FAST_AFTER_INSERT     N SH.SALES mv log must have ROWID
    REFRESH_FAST_AFTER_INSERT     N SH.SALES mv log does not have all necessary columns
    REFRESH_FAST_AFTER_ONETAB_DML N DOLLARS  SUM(expr) without COUNT(expr)
    REFRESH_FAST_AFTER_ONETAB_DML N          see the reason why REFRESH_FAST_AFTER_INSERT is disabled
    REFRESH_FAST_AFTER_ONETAB_DML N          COUNT(*) is not present in the select list
    REFRESH_FAST_AFTER_ONETAB_DML N          SUM(expr) without COUNT(expr)
    REFRESH_FAST_AFTER_ANY_DML    N          see the reason why REFRESH_FAST_AFTER_ONETAB_DML is disabled
    REFRESH_FAST_AFTER_ANY_DML    N SH.TIMES mv log must have sequence
    REFRESH_FAST_AFTER_ANY_DML    N SH.SALES mv log must have sequence
    REFRESH_PCT                   N          PCT is not possible on any of the detail tables in the materialized view
    REWRITE_FULL_TEXT_MATCH       Y
    REWRITE_PARTIAL_TEXT_MATCH    Y
    REWRITE_GENERAL               Y
    REWRITE_PCT                   N          PCT is not possible on any detail tables

    (PCT = Partition Change Tracking)

Dimensions
• Categorize data
• Provide hierarchy guidance
• Allow roll-up and roll-down
• Needed for query rewrite
• Needed for materialized views
• OEM has a Dimension Wizard
[Diagram: hierarchy Category > Sub-Category > Product]

Dimensions the SQL

    CREATE DIMENSION products_dim
      LEVEL product     IS (products.prod_id)
      LEVEL subcategory IS (products.prod_subcategory)
      LEVEL category    IS (products.prod_category)
      HIERARCHY prod_rollup (
        product CHILD OF
        subcategory CHILD OF
        category)
      ATTRIBUTE product DETERMINES
        (products.prod_name, products.prod_desc, prod_weight_class,
         prod_unit_of_measure, prod_pack_size, prod_status,
         prod_list_price, prod_min_price)
      ATTRIBUTE subcategory DETERMINES (prod_subcategory, prod_subcat_desc)
      ATTRIBUTE category DETERMINES (prod_category, prod_cat_desc);

The Optimizer is a Star
• Star transformation must be enabled

    STAR_TRANSFORMATION_ENABLED=TRUE

• Requires bitmap indexes or bitmap join indexes on the foreign key columns

    CREATE BITMAP INDEX sales_c_state_bjix
    ON sales(customers.cust_state_province)
    FROM sales, customers
    WHERE sales.cust_id = customers.cust_id
    LOCAL NOLOGGING COMPUTE STATISTICS;

• Cost-based optimizer must be used
  – Optimizer looks for a small set of dimension rows to satisfy the query, even if the fact table has a large number of rows

The Star SQL

    SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc,
           SUM(s.amount_sold) sales_amount
    FROM   sales s, times t, customers c, channels ch
    WHERE  s.time_id = t.time_id
    AND    s.cust_id = c.cust_id
    AND    s.channel_id = ch.channel_id
    AND    c.cust_state_province = 'CA'
    AND    ch.channel_desc IN ('Internet','Catalog')
    AND    t.calendar_quarter_desc IN ('2001-Q1','2001-Q2')
    GROUP BY ch.channel_class, c.cust_city, t.calendar_quarter_desc;

How the Optimizer Sees It

    SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc,
           SUM(s.amount_sold) sales_amount
    FROM   sales
    WHERE  time_id IN (SELECT time_id FROM times
                       WHERE calendar_quarter_desc IN ('2001-Q1','2001-Q2'))
    AND    cust_id IN (SELECT cust_id FROM customers
                       WHERE cust_state_province = 'CA')
    AND    channel_id IN (SELECT channel_id FROM channels
                          WHERE channel_desc IN ('Internet','Catalog'));

Star Transform Restrictions
• The star transform is not used when:
  – A hint tells the optimizer not to use a bitmap index
  – The query contains bind variables
  – There are not enough bitmap indexes
  – The fact table is a remote table
• The optimizer may not choose the star when:
  – Tables have a single access path
  – Tables are too small to be useful
  – Database is in read-only mode

Bitmap Join Indexes
[Diagram: Sales table joined to Customer table]

    CREATE BITMAP INDEX cust_sales_bji
    ON Sales(Customer.state)
    FROM Sales, Customer
    WHERE Sales.cust_id = Customer.cust_id;

• Index key is Customer.State
• Indexed table is Sales

Resumable Transactions
• Suspend transactions
• Ability to resume these transactions
• Only errors currently handled:
  – SQL statements that run out of TEMP space
  – DML/Export/CREATE TABLE as …
  – Space limits
  – Out-of-space transaction
  – Exceed space quota
• Full support of locally managed tablespaces
• Great for large DW queries

Resumable the SQL
• Enable resumable transactions:

    ALTER SESSION ENABLE RESUMABLE TIMEOUT 1200;

• The transaction:

    CREATE TABLE sales_prod_dept
      (prod_category, prod_subcategory, cust_id, time_id,
       channel_id, promo_id, quantity_sold, amount_sold)
    NOLOGGING TABLESPACE transfer
    PARTITION BY LIST (prod_category)
      (PARTITION boys_sales  VALUES ('Boys'),
       PARTITION girls_sales VALUES ('Girls'),
       PARTITION men_sales   VALUES ('Men'),
       PARTITION women_sales VALUES ('Women'))
    AS
    SELECT p.prod_category, p.prod_subcategory, s.cust_id, s.time_id,
           s.channel_id, s.promo_id,
           -- select order matches the column list above (quantity, then amount)
           SUM(s.quantity_sold) quantity_sold,
           SUM(s.amount_sold) amount_sold
    FROM   sales s, products p, times t
    WHERE  p.prod_id = s.prod_id
    AND    s.time_id = t.time_id
    AND    t.fiscal_year = 2002
    GROUP BY prod_category, prod_subcategory, cust_id, s.time_id,
             channel_id, promo_id;

The Proof it Stopped

    SELECT name, status, error_msg FROM dba_resumable;

The SQL
The SQL
• Oracle provides many analytical functions
  – Cross tabular reports
  – Ranking functions
  – Percentile functions
  – Regression analysis
  – Moving windows
  – Ratios within a report
  – CASE statements

CUBE, ROLLUP and GROUP functions
• Efficient data access
• Creates matrix of totals
• Crosstabs are computed for you
• CUBE provides all totals
• GROUPING flags which rows are totals

ROLLUP the SQL

    SELECT channel_desc, calendar_month_desc, country_id,
           TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
    FROM   sales, customers, times, channels
    WHERE  sales.time_id = times.time_id
    AND    sales.cust_id = customers.cust_id
    AND    sales.channel_id = channels.channel_id
    AND    channels.channel_desc IN ('Direct Sales', 'Internet')
    AND    times.calendar_month_desc IN ('2002-09', '2002-10')
    AND    country_id IN ('CA', 'US')
    GROUP BY ROLLUP (channel_desc, calendar_month_desc, country_id);

ROLLUP Results

    CHANNEL_DESC  CALENDAR CO         SALES$
    ------------- -------- -- --------------
    Direct Sales  2002-09  CA      1,378,126
    Direct Sales  2002-09  US      2,835,557
    Direct Sales  2002-09          4,213,683   by channel and month
    Direct Sales  2002-10  CA      1,388,051
    Direct Sales  2002-10  US      2,908,706
    Direct Sales  2002-10          4,296,757   by channel and month
    Direct Sales                   8,510,440   by channel
    Internet      2002-09  CA        911,739
    Internet      2002-09  US      1,732,240
    Internet      2002-09          2,643,979   by channel and month
    Internet      2002-10  CA        876,571
    Internet      2002-10  US      1,893,753
    Internet      2002-10          2,770,324   by channel and month
    Internet                       5,414,303   by channel
                                  13,924,743   for everything

CUBE the SQL

    SELECT channel_desc, calendar_month_desc, country_id,
           TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
    FROM   sales, customers, times, channels
    WHERE  sales.time_id = times.time_id
    AND    sales.cust_id = customers.cust_id
    AND    sales.channel_id = channels.channel_id
    AND    channels.channel_desc IN ('Direct Sales', 'Internet')
    AND    times.calendar_month_desc IN ('2002-09', '2002-10')
    AND    country_id IN ('CA', 'US')
    GROUP BY CUBE (channel_desc, calendar_month_desc, country_id);

CUBE the Results

    CHANNEL_DESC  CALENDAR CO         SALES$
    ------------- -------- -- --------------
    Direct Sales  2002-09  CA      1,378,126
    Direct Sales  2002-09  US      2,835,557
    Direct Sales  2002-09          4,213,683   by channel and month
    Direct Sales  2002-10  CA      1,388,051
    Direct Sales  2002-10  US      2,908,706
    Direct Sales  2002-10          4,296,757   by channel and month
    Direct Sales           CA      2,766,177   by channel and country
    Direct Sales           US      5,744,263   by channel and country
    Direct Sales                   8,510,440   by channel
    Internet      2002-09  CA        911,739
    Internet      2002-09  US      1,732,240
    Internet      2002-09          2,643,979   by channel and month
    Internet      2002-10  CA        876,571
    Internet      2002-10  US      1,893,753
    Internet      2002-10          2,770,324   by channel and month
    Internet               CA      1,788,310   by channel and country
    Internet               US      3,625,993   by channel and country
    Internet                       5,414,303   by channel
                  2002-09  CA      2,289,865   by month and country
                  2002-09  US      4,567,797   by month and country
                  2002-09          6,857,662   by month
                  2002-10  CA      2,264,622   by month and country
                  2002-10  US      4,802,459   by month and country
                  2002-10          7,067,081   by month
                           CA      4,554,487   by country
                           US      9,370,256   by country
                                  13,924,743   everything

The GROUPING function

    SELECT DECODE(GROUPING(channel_desc), 1, 'All Channels', channel_desc) AS Channel,
           DECODE(GROUPING(country_id), 1, 'All Countries', country_id) AS Country,
           TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
    FROM   sales, customers, times, channels
    WHERE  sales.time_id = times.time_id
    AND    sales.cust_id = customers.cust_id
    AND    sales.channel_id = channels.channel_id
    AND    channels.channel_desc IN ('Direct Sales', 'Internet')
    AND    times.calendar_month_desc = '2002-09'
    AND    country_id IN ('CA', 'US')
    GROUP BY CUBE (channel_desc, country_id);

GROUPING Results

    CHANNEL       COUNTRY               SALES$
    ------------- ------------- --------------
    Direct Sales  CA                 1,378,126
    Direct Sales  US                 2,835,557
    Direct Sales  All Countries      4,213,683
    Internet      CA                   911,739
    Internet      US                 1,732,240
    Internet      All Countries      2,643,979
    All Channels  CA                 2,289,865
    All Channels  US                 4,567,797
    All Channels  All Countries      6,857,662

Rolling Windows
• Allow analysis
  – Cumulative aggregates
  – Moving aggregations
  – Centred aggregations
  – Logical offsets

Rolling Windows the SQL

    SELECT cust_id, t.time_id,
           TO_CHAR(SUM(amount_sold), '9,999,999,999') AS SALES,
           TO_CHAR(AVG(SUM(amount_sold))
                   OVER (PARTITION BY s.cust_id ORDER BY t.time_id
                         RANGE BETWEEN INTERVAL '1' DAY PRECEDING
                                   AND INTERVAL '1' DAY FOLLOWING),
                   '9,999,999,999') AS CENTERED_3_DAY_AVG
    FROM   sales s, times t
    WHERE  s.time_id = t.time_id
    AND    t.calendar_week_number IN (51)
    AND    calendar_year = 2001
    AND    cust_id IN (6380, 6510)
    GROUP BY cust_id, t.time_id
    ORDER BY cust_id, t.time_id;

The Rolling Window Result

    CUST_ID   TIME_ID       SALES CENTERED_3_DAY
    --------- --------- --------- --------------
    6380      20-DEC-01     2,240          1,136
    6380      21-DEC-01        32            873
    6380      22-DEC-01       348            148
    6380      23-DEC-01        64            302
    6380      24-DEC-01       493            212
    6380      25-DEC-01        80            423
    6380      26-DEC-01       696            388
    6510      20-DEC-01       196            106
    6510      21-DEC-01        16            155
    6510      22-DEC-01       252            143
    6510      23-DEC-01       160            305
    6510      24-DEC-01       504            240
    6510      25-DEC-01        56            415
    6510      26-DEC-01       684            370

Ranking Data

    SELECT channel_desc, calendar_month_desc,
           TO_CHAR(TRUNC(SUM(amount_sold),-6), '9,999,999,999') SALES$,
           RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-6) DESC) AS RANK,
           DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-6) DESC) AS DENSE_RANK
    FROM   sales, products, customers, times, channels
    WHERE  sales.prod_id = products.prod_id
    AND    sales.cust_id = customers.cust_id
    AND    sales.time_id = times.time_id
    AND    sales.channel_id = channels.channel_id
    AND    times.calendar_month_desc IN ('2002-09', '2002-10')
    AND    channels.channel_desc <> 'Tele Sales'
    GROUP BY channel_desc, calendar_month_desc;

Rank Results

    CHANNEL_DESC CALENDAR         SALES$      RANK DENSE_RANK
    ------------ -------- -------------- --------- ----------
    Direct Sales 2002-10      10,000,000         1          1
    Direct Sales 2002-09       9,000,000         2          2
    Internet     2002-09       6,000,000         3          3
    Internet     2002-10       6,000,000         3          3
    Catalog      2002-09       3,000,000         5          4
    Catalog      2002-10       3,000,000         5          4
    Partners     2002-09       2,000,000         7          5
    Partners     2002-10       2,000,000         7          5

RATIO_TO_REPORT function
• Computes the ratio of a value to a sum of values
• Deals with NULLs accurately
• Data can be partitioned by values

    select product_key,
           sum(sales_amount) Sales,
           sum(sum(sales_amount)) over () Tot_Sales,
           ratio_to_report(sum(sales_amount)) over () report_ratio
    from   monthly_sales
    group by product_key;

RATIO_TO_REPORT Result

    Product_key      sales  Tot_sales Report_ratio
    ----------- ---------- ---------- ------------
    A123               210       1104         0.19
    B9837              112       1104         0.10
    C8743               90       1104         0.08
    C9662              472       1104         0.43
    R4300              100       1104         0.09
    T0843              120       1104         0.11

    (the ratios add up to 100%)

Multi-table Inserts
• Conditional insert of data
• Need to have source data in a table
• Insert options
  – ALL (all conditions that are true)
  – FIRST (first condition that is true)

Multi-table Insert SQL

    INSERT ALL
      INTO sales VALUES (product_id, customer_id, weekly_start_date,   'P', 501, q_sun, sales_sun)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+1, 'P', 501, q_mon, sales_mon)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+2, 'P', 501, q_tue, sales_tue)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+3, 'P', 501, q_wed, sales_wed)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+4, 'P', 501, q_thu, sales_thu)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+5, 'P', 501, q_fri, sales_fri)
      INTO sales VALUES (product_id, customer_id, weekly_start_date+6, 'P', 501, q_sat, sales_sat)
    SELECT * FROM sales_input_table;

Conditional Multi-Table Insert
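The ALL and FIRST options listed on the Multi-table Inserts slide differ only in how many WHEN branches a single source row may satisfy. The following is a minimal Python sketch of that routing rule, not Oracle code. The thresholds and table names in SLIDE_BRANCHES are taken from the conditional insert on this slide; the names route, SLIDE_BRANCHES and OVERLAPPING_BRANCHES, and the overlapping conditions themselves, are hypothetical and exist only for illustration.

```python
# Illustrative model of conditional multi-table INSERT routing.
# It does not connect to a database; it only mimics the branch logic.

def route(order_total, branches, mode="ALL"):
    """Return the list of target tables a row would be inserted into.

    branches: list of (predicate, table_name) pairs, evaluated in order.
    mode:     "ALL" inserts into every table whose predicate is true;
              "FIRST" stops after the first true predicate.
    """
    targets = []
    for predicate, table in branches:
        if predicate(order_total):
            targets.append(table)
            if mode == "FIRST":
                break
    return targets


# Branches mirroring the WHEN clauses of the slide's statement.
SLIDE_BRANCHES = [
    (lambda t: t < 1_000_000, "small_orders"),
    (lambda t: 1_000_000 < t < 2_000_000, "medium_orders"),
    (lambda t: t > 2_000_000, "large_orders"),
]

# Hypothetical overlapping conditions, to make ALL and FIRST differ.
OVERLAPPING_BRANCHES = [
    (lambda t: t > 1_000_000, "medium_orders"),
    (lambda t: t > 2_000_000, "large_orders"),
]

print(route(2_500_000, OVERLAPPING_BRANCHES, "ALL"))    # ['medium_orders', 'large_orders']
print(route(2_500_000, OVERLAPPING_BRANCHES, "FIRST"))  # ['medium_orders']
print(route(1_000_000, SLIDE_BRANCHES, "ALL"))          # []
```

The last call shows a side effect of the strict comparisons in the slide's conditions: a row whose order_total is exactly 1000000 or 2000000 satisfies no branch and is inserted nowhere. Using <= or >= on one side of each boundary would close that gap.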
    INSERT ALL
      WHEN order_total < 1000000 THEN
        INTO small_orders
      WHEN order_total > 1000000 AND order_total < 2000000 THEN
        INTO medium_orders
      WHEN order_total > 2000000 THEN
        INTO large_orders
    SELECT order_id, order_total, sales_rep_id, customer_id
    FROM   orders;

The Upsert
• Perform an insert or update
• Touch data once
• Data is always SELECTed from source

Upsert the SQL

    MERGE INTO products t
    USING products_delta s
    ON (t.prod_id = s.prod_id)
    WHEN MATCHED THEN
      UPDATE SET t.prod_list_price = s.prod_list_price,
                 t.prod_min_price  = s.prod_min_price
    WHEN NOT MATCHED THEN
      INSERT (prod_id, prod_name, prod_desc, prod_subcategory,
              prod_subcat_desc, prod_category, prod_cat_desc,
              prod_status, prod_list_price, prod_min_price)
      VALUES (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory,
              s.prod_subcat_desc, s.prod_category, s.prod_cat_desc,
              s.prod_status, s.prod_list_price, s.prod_min_price);

The Table Function
• Reduces need for staging of large data sets
• Pipe data into a select statement
• Parallel execution supported
• Also known as pipelining

Table Functions: The SQL
• Create TYPE definition for targets

    CREATE TYPE city_populations_row AS OBJECT
      (city_name   VARCHAR2(9),
       census_year NUMBER,
       population  NUMBER);

    CREATE TYPE city_populations_table AS TABLE OF city_populations_row;

• Create cursor package

    CREATE OR REPLACE PACKAGE census_package AS
      TYPE pop_cursor_type IS REF CURSOR
        RETURN city_populations_ext%ROWTYPE;
      FUNCTION census_transform (indata IN pop_cursor_type)
        RETURN city_populations_table
        PARALLEL_ENABLE (PARTITION indata BY ANY)
        PIPELINED;
    END;

Select Using the Table Function

    ALTER SESSION ENABLE PARALLEL DML;

    INSERT /*+ APPEND PARALLEL (t,4) */ INTO city_populations t
    SELECT * FROM TABLE (census_package.census_transform
      (CURSOR(SELECT city_name, pop_1990, pop_2000 FROM city_populations_ext)));

The OLAP Engine
"Computers are composed of nothing more than logic gates stretched out to the horizon in a vast numerical irrigation system."
Stan Augarten

OLAP Definition
• The FASMI:
  F: Fast
  A: Analytical
  M: Multi Dimensional
  S: Shared
  I: Information About the Data

Codd's Rules and Features of OLAP
• Basic Features
  F1: Multidimensional Views
  F2: Intuitive Data Manipulation
  F3: Accessible
  F4: Batch and Interpretive Extraction
  F5: OLAP Analysis Model
  F6: Client Server Access/Web Access
  F7: Transparency
  F8: Multi-User Support

Codd's Rules and Features of OLAP
• Special Features
  F9: Treatment of Non-Normalized Data
  F10: Storing OLAP Results (not w/ Source)
  F11: Standardization of Missing Values
  F12: Possibility of Ignoring Missing Values
• Reporting Features
  F13: Flexible Reporting
  F14: Uniform Reporting Performance
  F15: Adjustment to Type of Models

Codd's Rules and Features of OLAP
• Dimension Control
  F16: Generic Dimensionality
  F17: Unlimited dimensions and aggregations
  F18: Unrestricted Cross-dimension operations

Platform for Business Intelligence: OLAP
[Diagram: Oracle9i platform with Data Warehousing, ETL, OLAP, Data Mining]
• Oracle OLAP
  – Analysis-ready Oracle database
  – Support for complex, multidimensional queries
  – Development platform for Internet-ready analytical applications
  – Java OLAP API
  – Business Intelligence Beans and JDeveloper

OLAP Application Platform
• Oracle9i Application Server and Dev Suite: BI Beans, JDeveloper
  – Rapid application development
  – Analysis ready
• Oracle OLAP: Java OLAP API
  – Predictive analysis functions
• Oracle9i Database
  – Scalable data store
  – Integrated metadata
  – Summary management
  – SQL analytic functions

Key Concepts
• OLAP in the RDBMS
  – Updated version of Oracle Express
  – Single RDBMS-MDDS process
  – Single data storage
  – Single security model
  – Single metadata repository
  – Single set of management tools
  – SQL based metadata APIs
  – SQL access and OLAP API access to relational tables and analytic workspaces

OLAP in Oracle9i Release 2
[Architecture diagram: a Java OLAP API application (via the OLAP API), a generic SQL application (via SQL) and a 'direct' OLAP application (via PL/SQL) reach the OLAP process; SQL reaches the analytic workspace via a table function; relational tables and the analytic workspace both live in the Oracle9i Database. Source: Vlamis Software]

Data Mining
"The beginning of knowledge is the discovery of something we do not understand."
Frank Herbert

Data Mining Overview
• Data mining is a decision support process in which we search for patterns of information in data
• Two types of traditional analysis:
  – Confirmatory
  – Exploratory

The Data Mining Process
[Diagram labels: Data Mining; Discovery; Predictive Modeling; Forensic Analysis; Conditional Logic; Affiliations & Associations; Trends and Validation; Outcome Predictive; Forecasting; Deviation Detection; Link Analysis]

Data Mining Techniques
[Diagram labels: Data Mining Approaches; Data Retained; Distilled Data; Nearest Neighbor; Case-Based Reasoning; Logical; Rules; Decision Trees; Cross Tabulation; Belief Nets; Agents; Equational; Statistics; Neural Nets]

Data Retention Techniques
• New data compared to existing information
• Proximity comparison between new and old
[Diagram: New Record > Compare with Entire Database > Results: Top K Neighbors]

Pattern Distillations
• You are looking for patterns in the data
• What patterns? How should they be represented?
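The data-retention approach from the Data Retention Techniques slide above (keep the data, compare each new record against the entire database, return the top K neighbors) can be sketched in a few lines of Python. This is an illustrative sketch only, not Oracle9i Data Mining code: the function name top_k_neighbors, the choice of Euclidean distance, and the sample records are all assumptions made for the example.

```python
import math

def top_k_neighbors(new_record, database, k):
    """Compare new_record with every stored record; return the k closest.

    database: list of (label, features) pairs, where features is a tuple
              of numbers with the same length as new_record.
    Returns a list of (distance, label) pairs, nearest first.
    """
    scored = [(math.dist(new_record, features), label)
              for label, features in database]
    return sorted(scored)[:k]


# Hypothetical retained data: (outcome, (age, salary in thousands)).
database = [
    ("accept",  (25, 40)),
    ("accept",  (30, 50)),
    ("decline", (60, 20)),
    ("decline", (55, 25)),
]

neighbors = top_k_neighbors((28, 45), database, k=2)
print([label for _, label in neighbors])  # ['accept', 'accept']
```

Because every stored record is scanned for each new record, the cost grows with the size of the retained data, which is the trade-off against the distilled-pattern techniques discussed next.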
[Figure: three X-Y plots labelled Regression Line, Logical Representation, Universal Approximation]

Decision Trees
[Diagram: a new record is routed down a tree by country and region (US: NV, NY with NYC / Other, FL; CAN: BC, ON with Toronto / Other, QC) to leaf values High, High, Average, Low]

Neural Nets
• Works like the brain
• Decisions are learned
• Inputs trigger neurons, which lead to decisions
[Diagram: Inputs (Age, Loc, Sal, Emp) > Neurons > Decision (Accept / Decline)]

Oracle9i Data Mining
• Data mining completely embedded in the Oracle9i database
  – Simplified data-mining process
  – Eliminates need for data movement and redundant data
• Java-based API
  – Supports application integration
• Will comply with emerging standard API (JDM)

Oracle9i Data Mining
• Key capabilities:
  – Multiple algorithms
    • Transactional Naïve Bayes
    • Predictive Association rules
    • Decision Trees via Adaptive Bayesian Network (Oracle9i, Release 2)
    • Clustering (Oracle9i, Release 2)
  – Executed within the database
  – Multiple prediction types
    • Probability of specific outcome
    • Most probable outcome

Thanks!
Questions and Comments
Presentation #31348
Ian Abramson
Toronto, Ontario
416-407-2448
[email protected]