The Oracle9i Multi-Terabyte Data Warehouse

Jeff Parker
Manager, Data Warehouse Development
Amazon.com

The Challenges
• Rapidly evolving business
• Growing data volumes
• Do more with less

The Challenges
• Rapidly evolving business
  – New international markets
  – Continual innovation of features on Amazon
    • Buy it used
    • Magazine subscriptions
  – Marketplace partnerships: Toys R Us, Target
• Growing data volumes
• Do more with less

The Challenges
• Rapidly evolving business
• Growing data volumes
  – 2X growth yearly over the past 5 years
  – Currently 10 terabytes of raw data
• Do more with less

Data Growth
[Chart: terabytes of data by year, 1999-2003]

The Challenges
• Rapidly evolving business
• Growing data volumes
• Do more with less
  – Innovative use of technology and resources
  – Throwing money and people at the problem is not an option
  – Leverage the existing investment in Oracle

Addressing the Issues
• Rapidly evolving business
  – Denormalize only for performance reasons
  – Create a solution that allows new datasets to be brought into the DW rapidly, but without high maintenance costs
• Growing data volumes
• Do more with less

Addressing the Issues
• Rapidly evolving business
• Growing data volumes
  – Dual-database approach to ETL
    • A staging database for efficient transformation of large datasets; SQL and hash joins allow transforms to scale in a non-linear fashion
    • A second database optimized for analytics
  – Oracle as an API
    • Simplifies the ETL architecture
    • Better scalability than traditional ETL tools
• Do more with less

Addressing the Issues
• Rapidly evolving business
• Growing data volumes
• Do more with less
  – One DW schema supports all countries
  – Cut costs by eliminating unneeded software
  – Data-driven Load functionality

The ETL Process
• Extract data from source
• The Load process
• Dimensional Transforms

The ETL Process
• Extract data from source
  – Can create one or more files to be loaded
  – Must produce metadata upon which the Load process can depend
• The Load process
• Dimensional Transforms

Extract-Produced Metadata
• Describes each field in database type terms
• Changes as the dataset changes
• Can reference multiple files
• Very reliable
• No additional overhead

XML-Based Metadata

  <DATA CHARSET="UTF8" DELIMITER="\t" ROWS="1325987">
    <COLUMNS>
      <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
      <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
      <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
      <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
    </COLUMNS>
    <FILES>
      <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
      <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
    </FILES>
  </DATA>

The ETL Process
• Extract data from source
• The Load process
  – Makes extensive use of External Tables
  – MERGE and bulk insert
  – Contains integrated DBA tasks
  – Every load is tracked in an operational database
• Dimensional Transforms

The Load Process
• External Tables
  – Provide access to files on the operating system
  – A building block in a broader ETL process
• MERGE & bulk insert
• Integrated DBA tasks

The External Table
• Created using metadata from the Extract process
• Data is read-only
• No indexes
• Use DBMS_STATS to set the number of rows (see the sketch below)

[Diagram: data files feeding the external table DATA_SETS (dataset_id NUMBER, dataset_name VARCHAR2(80), creation_date DATE, created_by VARCHAR2(8))]
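Because an external table carries no real optimizer statistics, the row count from the extract metadata can be supplied by hand. The following is a minimal sketch (not from the deck) using the standard DBMS_STATS.SET_TABLE_STATS call; the owning schema (USER) and the table name, which reuses the example that follows, are illustrative:

  -- Set the optimizer row count for the external table using the
  -- ROWS attribute carried in the extract's XML metadata.
  BEGIN
    DBMS_STATS.SET_TABLE_STATS(
        ownname => USER                  -- assumes the loading schema owns the table
      , tabname => 'XT_DATASETS_77909'   -- external table from the example below
      , numrows => 1325987               -- ROWS="1325987" from the metadata above
    );
  END;
  /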
Example External Table

1. Copy the data to the database server. The data must reside in a file system location specified by the DBAs:

     CREATE DIRECTORY dat_dir AS '/stage/flat';

2. Create the external table using the DDL from the extract (file names in LOCATION are relative to the directory object):

     CREATE TABLE XT_datasets_77909 (
       dataset_id    NUMBER
     , dataset_name  VARCHAR2(80)
     , creation_date DATE
     , created_by    VARCHAR2(8)
     )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_LOADER
       DEFAULT DIRECTORY dat_dir
       ACCESS PARAMETERS (
         RECORDS DELIMITED BY NEWLINE CHARACTERSET UTF8
         FIELDS TERMINATED BY '\t'
       )
       LOCATION ('datasets_20020923_US.txt')
     );

The External Table
• No pre-staging of data
• Ability to describe a flat file to Oracle
• Handles horizontally partitioned files
• Good error messaging

The Load Process
• External Tables
• MERGE
  – Can be run in parallel
  – Combined with external tables, provides a powerful set of ETL tools
• Integrated DBA tasks

MERGE
• Allows for an update or insert in a single statement
  – Does the key value already exist?
    • Yes: update the row
    • No: insert the row
• The MERGE statement is auto-generated
• Row-level column transforms are supported

MERGE Example

  MERGE INTO datasets ds
  USING (
    SELECT xt.dataset_id
         , xt.dataset_name
         , xt.creation_date
         , NVL(xt.created_by, 'nobody') AS created_by
    FROM   XT_datasets_77909 xt
  ) src
  ON ( src.dataset_id = ds.dataset_id )
  WHEN MATCHED THEN UPDATE SET
         ds.dataset_name  = src.dataset_name
       , ds.creation_date = src.creation_date
       , ds.created_by    = src.created_by
       , ds.last_updated  = SYSDATE
  WHEN NOT MATCHED THEN INSERT
       ( dataset_id, dataset_name, creation_date, created_by, last_updated )
       VALUES
       ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, SYSDATE );

MERGE
• Issues we faced
  – Duplicate records in the dataset
  – NESTED LOOPS plans when joining from the external table
  – Parallelism is not enabled by default (parallel DML must be enabled at the session level)
  – Bulk-load partition determination

The Load Process
• External Tables
• MERGE
• Integrated DBA tasks
  – Reduces the workload required of the DBA team
  – Streamlines the load process
  – Eliminates human error

Integrated DBA Tasks
• Provided and managed by the DBA team
• The ETL team does not need special knowledge of table layout

Integrated DBA Tasks
• Truncate Partition (a possible implementation sketch appears below)

  The developer makes the call:

    truncate_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 );

  The DBA utility translates this and executes:

    alter table TABLE-NAME drop partition dbi20020930_101;

Integrated DBA Tasks
• Analyze Partition

  The developer makes the call:

    analyze_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 );

  The DBA utility translates this and executes:

    dbms_stats.gather_table_stats( ownname, tabname, partname, cascade, estimate_percent, granularity );

Integrated DBA Tasks
• Return Partition Name

  The developer makes the call:

    get_partition_name( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 );

  The DBA utility translates this and returns the appropriate partition name. This is very useful when bulk loading tables, as sketched below.
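As a hedged illustration of that bulk-load use case (not shown in the deck), the returned partition name can be plugged into a direct-path insert so the load touches only that partition. The DATASETS table and the partition name reuse the deck's earlier examples and assume DATASETS is partitioned accordingly:

  -- Direct-path (bulk) insert into the partition whose name was
  -- returned by get_partition_name.
  INSERT /*+ APPEND */ INTO datasets PARTITION (dbi20020930_101)
         ( dataset_id, dataset_name, creation_date, created_by, last_updated )
  SELECT dataset_id, dataset_name, creation_date, created_by, SYSDATE
  FROM   XT_datasets_77909;
  COMMIT;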
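For context, here is a minimal sketch of how a wrapper such as truncate_partition might be implemented with dynamic SQL. This is an assumption for illustration, not Amazon's actual utility; a get_partition_name function is assumed to exist as described above:

  -- Hypothetical implementation sketch of the truncate_partition utility.
  CREATE OR REPLACE PROCEDURE truncate_partition(
      p_table_name IN VARCHAR2
    , p_key1       IN VARCHAR2
    , p_key2       IN VARCHAR2 DEFAULT NULL
    , p_key3       IN VARCHAR2 DEFAULT NULL
  ) AS
    l_partition VARCHAR2(30);
  BEGIN
    -- Translate the partition keys into the physical partition name
    l_partition := get_partition_name(p_table_name, p_key1, p_key2, p_key3);
    -- The deck shows the utility dropping the translated partition
    EXECUTE IMMEDIATE 'alter table ' || p_table_name
                   || ' drop partition ' || l_partition;
  END truncate_partition;
  /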
Integrated DBA Tasks
• Partitioning utilities
  – Help to streamline the process
  – Reduce the workload of the DBA team
  – Help to eliminate the problem of double loads for snapshot tables and partitions

The Load Process
• External Tables
  – Provide access to flat files outside the database
• MERGE
  – A parallel "upsert" simplifies ETL
  – Row-level transforms can be performed in SQL
• Integrated DBA tasks
  – Reduce the workload required of the DBA team
  – Streamline the load process
  – Eliminate human error
• Loads are repeatable processes

Summary
• Reduction in time to integrate new subject areas
• Oracle parallelism scales well
• Eliminated unneeded software

Summary
• Oracle has delivered on the DW promise
  – Oracle External Tables combined with MERGE are a viable alternative to other ETL tools
  – The tools are ready today

Questions & Answers

Reminder: please complete the OracleWorld session survey. Thank you.