Census Bureau DRIS
Date: 01/23/2007
Oracle conversion example: Initial approaches for UCM source data

Index
- Data modeling
- Current datafile
- Current dataload
- Data overview
- Two approaches
- First approach: data distribution, advantages, disadvantages
- Second approach: basic modeling, advantages, advance work, care needed
- Our recommendation
- Tasks
- Next steps
- Questions

Data modeling
- Conversion of data from the legacy (Fortran) format to an RDBMS (Oracle)
- Hardware/software: Sun E6900, Solaris 5.10, 12 CPUs, 96 GB RAM
- Database: Oracle 10g
- Modeling tools: Oracle Designer / Erwin

Current datafile
(Flow diagram; recoverable labels: Geo Base Data feeds a legacy process that produces one big Census datafile; data modeling, data updates, and reports; an Oracle DB loaded via PL/SQL, shell, C, and an ETL tool; additional data feeds.)

Current dataload
- UCM data arrives in Fortran fixed format
- One big file with 180 M records; the record length is 1543 bytes
- Most of the fields are VARCHAR2, and many fields are blank or carry no data
- Performance in Oracle is inadequate without a schema redesign that leverages RDBMS capabilities (see the load sketch later in this deck, before the Recommendation slide)

Data overview (approximate)
State                  Records   Data volume
State of NY            20 M      31 G
State of CA            34 M      52 G
State of TX            25 M      38 G
District of Columbia   500 K     750 M
Delaware               1 M       1.5 G
Connecticut            1 M       1.5 G
(The record counts multiplied by the 1543-byte record length line up with the data volumes.)

Two approaches
- First approach: break the datafile up on the basis of the data itself, e.g. at RO level (12 pieces) or at state level (54-56 pieces, including DC, Puerto Rico, etc.)
- Second approach: break the datafile into multiple tables with changed field definitions, using a relational model

First approach
Break the datafile on the basis of the data.
(Diagram: the current datafile is split into per-state tables: Table_CA, Table_NY, Table_XX, Table_YY, ..., Table_54.)

Data distribution
- Uneven data distribution
- Big tables will be 30+ GB; small tables will be well under 1 GB

Advantages
- This kind of segmenting/partitioning makes state-level queries faster than they are today
- If the data is instead separated by RO, it is spread more evenly across fewer tables (about 12 instead of 54-56)

Disadvantages
- Too many tables
- Many fields are empty yet still declared VARCHAR2(100)
- No normalization technique is used, so existing queries need extensive changes
- Queries on the small tables will run fast, but the big tables carry a lot of overhead
- The operational tables remain the same in number
- Running queries becomes too complicated; joining the main and operational tables may confuse users

Second approach
Break the datafile into a few relational tables with changed field definitions.
(Diagram: the current datafile is split into Table1, Table2, Table3, and Table4, each joined on MAFID.)

Basic modeling
- Logical and physical database design
- Relations are defined on a primary key; in this case that is MAFID, which is unique
- VARCHAR2(100) fields can be converted to smaller fields based on actual field lengths
- Every source field is mapped to at least one field in the new tables
- Data is inserted into multiple efficient tables based on the updated data model, following relational design principles (see the DDL sketch before the Recommendation slide)

Advantages
- Faster queries, updates, deletes, and additions
- Less maintenance
- The same approach can be used for transactional/operational data

Advance work
- Identify each and every field of the UCM data
- Check/define the length of each field
- Map every field to the new schema; can some fields be merged?
- Identify and remove duplicate data elements in the model
- Define the tables and relationships and create the new schema
- Break up the data and load it into these tables

Care needed
- The current datafile will be broken into multiple datafiles for data processing
- Load the datafiles into tables one by one
- Test and demonstrate the completeness of the new model
- Craft a comparison to prove that the new schema properly includes all of the source Census data (a sketch appears as a backup slide at the end)
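Before the recommendation, two hedged sketches. First, the load step: a minimal sketch of exposing the fixed-width UCM extract to Oracle 10g as an external table. Only the 1543-byte record length comes from this deck; the file name, directory object, field names, and field positions are hypothetical placeholders pending the field-identification work above.

```sql
-- Hypothetical external table over the fixed-width UCM extract.
-- Only the 1543-byte record length is taken from this deck; everything
-- else (names, positions, lengths) is a placeholder.
CREATE TABLE ucm_ext (
  mafid    VARCHAR2(9),
  state_cd VARCHAR2(2),
  addr_txt VARCHAR2(40)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ucm_dir   -- e.g. CREATE DIRECTORY ucm_dir AS '/data/ucm';
  ACCESS PARAMETERS (
    RECORDS FIXED 1543
    FIELDS (
      mafid    POSITION(1:9)   CHAR(9),
      state_cd POSITION(10:11) CHAR(2),
      addr_txt POSITION(12:51) CHAR(40)
    )
  )
  LOCATION ('ucm_extract.dat')
)
REJECT LIMIT UNLIMITED;
```

An external table lets each extract file be split into the new tables with plain INSERT ... SELECT statements, one file at a time, which matches the load-one-by-one caution above.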
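Second, a minimal DDL sketch of what the MAFID-keyed tables of the second approach could look like. Only MAFID and the move away from blanket VARCHAR2(100) columns come from this deck; every table name, column name, and length here is invented for illustration.

```sql
-- Hypothetical parent table: one row per MAFID, the unique key named
-- on the Basic Modeling slide. Columns are right-sized placeholders.
CREATE TABLE address_core (
  mafid     VARCHAR2(9)  CONSTRAINT pk_address_core PRIMARY KEY,
  state_cd  VARCHAR2(2)  NOT NULL,
  county_cd VARCHAR2(3)
);

-- Hypothetical child table for the sparse, mostly-blank fields, so the
-- core table stays narrow instead of carrying VARCHAR2(100) everywhere.
CREATE TABLE address_attr (
  mafid    VARCHAR2(9)  NOT NULL
           CONSTRAINT fk_attr_core REFERENCES address_core (mafid),
  attr_cd  VARCHAR2(10) NOT NULL,
  attr_val VARCHAR2(40)
);
```

Splitting the sparse attributes into a child table is where the faster queries, updates, and deletes claimed on the Advantages slide would come from.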
Our recommendation
** Second Approach ** Why?
- Data distribution will be uniform
- Less unwanted data: sparse fields move to separate tables, which reduces overhead on queries and updates
- Existing queries can be reused with little modification
- Ongoing data maintenance is more efficient in an RDBMS
- Additional data such as RPS can easily be uploaded using the same queries

Tasks
- Design the database using a data modeling tool (Oracle Designer, Erwin, etc.)
- Create test data from the original datafile
- Load the test data into database tables
- Create test scripts to check data consistency (a sketch appears as a backup slide at the end)
- Check the indexes against the required queries
- Test the old data against the new data

Continued…
- Break the data into small files
- Load the full data into the tables
- Unit-test the data for consistency
- Run queries against the database and fine-tune it if needed
- Use the same approach for transactional data such as RPS data

Next steps…
- Continued collaboration with the Census team to improve domain understanding for new team members
- Access to Oracle database tools on the team's workstations
- Access to an operational Oracle instance to begin development of the approach

Questions?
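Backup: a minimal sketch of the data-consistency test scripts named on the Tasks slide, reusing the hypothetical ucm_ext and address_core tables from the earlier sketches.

```sql
-- Row counts in the source extract and the rebuilt model should agree.
SELECT (SELECT COUNT(*) FROM ucm_ext)      AS src_rows,
       (SELECT COUNT(*) FROM address_core) AS new_rows
FROM dual;

-- Any MAFID present in the source but missing from the new schema
-- is a load defect.
SELECT mafid FROM ucm_ext
MINUS
SELECT mafid FROM address_core;
```

The MINUS query returns every MAFID that exists in the source copy but not in the new schema, which is exactly the source-versus-new comparison called for on the Care needed slide.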