Data modelling patterns used for integration of operational data stores
A D|ONE insight brought to you by Catalin Ehrmann

Abstract
Modern enterprise software development and data processing approaches have followed separate paths over the last decade. Recently, the two communities have realized the benefit of sharing expertise across domains. In this paper, we explain how a clever mix of data warehouse (DWH) and OLTP (Online Transaction Processing) patterns creates a robust operational system, and we discuss the advantages and disadvantages of this approach.

Introduction
There is no shortage of design patterns for data processing or software development, but each pattern has its own set of trade-offs. Developers and database administrators often have to weigh the pros and cons of several options. Before choosing a pattern, it is important to understand the business requirements and the data model. What is the tolerance for failure of an operation? What are the legal requirements? Is the data used globally or only locally? What kind of analysis will be done on the data? We also have to consider connected systems and the hardware our database and software must run on. However, if we can design a pattern with only a few minor trade-offs, we can meet more cross-organization business requirements without slowing down the business or increasing error rates.

DWH Patterns
Before we can design an improved enterprise DWH pattern, we must understand two basic patterns important to DWH design and implementation: Slowly Changing Dimensions (SCD) and Change Data Capture (CDC).

SCD Type 1
SCD1 updates data by overwriting existing values (see Figures 1 and 2). It is widely used because it is easy to implement and use. This is a good approach for error fixes, but compliance laws could be violated since all historical values are lost.
This approach should not be used when a value is being updated because the underlying information has changed, for example when an organization moves to a new location.

D|ONE Insight | Data modelling patterns used for integration of operational data stores | March 2015 | Catalin Ehrmann | D|ONE | www.d1solutions.com

Figure 1: SCD1 Sample Code

    -- update a single record in the Vendors table
    UPDATE dbo.Vendors
    SET    CITY = 'BERLIN'
    WHERE  UID = 1234

Figure 2: SCD1 Example

VENDOR TABLE:

BEFORE SCD1:
    UID   TAX_ID    VENDOR     CITY    COUNTRY
    1234  46857832  ACME, INC  MUNICH  GERMANY

AFTER SCD1:
    UID   TAX_ID    VENDOR     CITY    COUNTRY
    1234  46857832  ACME, INC  BERLIN  GERMANY

Additionally, analysis cubes and precomputed aggregates must be rebuilt any time a data point is changed using SCD1. If there are distributed copies of the data, the change must be applied to the copies as well, and calculations must be rebuilt on each copy. As compliance requirements grow, SCD1 will likely be used less over time; organizations will be forced to choose another method to remain in good standing with compliance enforcement agencies. SCD Type 2 is a bit more complex than SCD1, but has some important advantages.

SCD Type 2
In SCD2, the current record is expired and a new row is added in its place, for example using SQL Server's MERGE functionality (see Figure 3). SCD2 is more difficult to implement than SCD1, but it has the advantage of preserving historical data, which makes it an excellent choice when the law requires history preservation. The disadvantage of SCD2 is that database storage and performance can quickly become a concern, since a new row is added for every update.
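The expire-and-insert step behind Figure 3 can also be written as two plain statements instead of a MERGE. A minimal T-SQL sketch against the Vendors table shown in the figures; the literal dates and the two-statement form are illustrative, not the case-study implementation:

```sql
BEGIN TRANSACTION;

-- expire the current record for this vendor
UPDATE dbo.Vendors
SET    VALID_TO = '2009-04-15',
       CURR_FL  = 'N'
WHERE  UID = 1234
  AND  CURR_FL = 'Y';

-- insert the new current record with open-ended validity
INSERT INTO dbo.Vendors
       (UID, TAX_ID, VENDOR, CITY, COUNTRY, VALID_FROM, VALID_TO, CURR_FL)
VALUES (1234, 46857832, 'ACME, INC', 'BERLIN', 'GERMANY',
        '2009-04-15', '9999-12-31', 'Y');

COMMIT TRANSACTION;
```

Wrapping both statements in one transaction matters: if the insert fails after the update, the table would otherwise be left with no current record for that vendor.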
Figure 3: SCD2 Example

VENDOR TABLE:

BEFORE SCD2:
    UID   TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
    1234  46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  31-12-9999  Y

AFTER SCD2:
    UID   TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
    1234  46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  15-04-2009  N
    1234  46857832  ACME, INC  BERLIN  GERMANY  15-04-2009  31-12-9999  Y

When implementing SCD2, it is important to include metadata columns so users can determine which record is current and which are historical (see Figure 3). Administrators should also make end users aware of the metadata columns and their meaning. A current flag is not absolutely necessary, but it does make querying for current or historical records easier. It is sometimes useful to include a reason flag or description noting why the data was updated, to distinguish error fixes from information changes. Administrators should also keep in mind that updates to downstream systems may not be made properly when a natural key is updated and no surrogate key is present. It is recommended that surrogate keys always be present in data updated using SCD2 (see Figure 4).

Figure 4: Changing a Natural Key (Tax ID) With No Surrogate Key (UID) Present Is Not Recommended

VENDOR TABLE:

BEFORE SCD2:
    TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
    46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  31-12-9999  Y

AFTER SCD2:
    TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
    46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  15-04-2009  N
    56857833  ACME, INC  BERLIN  GERMANY  15-04-2009  31-12-9999  Y

Change Data Capture
CDC is a method of extracting data for ETL (extract, transform and load) processing. Rather than performing a full refresh, CDC isolates only the changed data for extraction.
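As a minimal illustration of isolating changed rows rather than re-extracting everything, an audit-column extract boils down to a query of this shape. The LAST_MODIFIED column and the @last_extract variable are illustrative assumptions, not part of the case study:

```sql
-- @last_extract would be persisted by the ETL job between runs
DECLARE @last_extract DATETIME = '2015-03-01';

-- pull only rows touched since the previous extract;
-- assumes LAST_MODIFIED is reliably maintained (e.g. by a trigger)
SELECT UID, TAX_ID, VENDOR, CITY, COUNTRY, LAST_MODIFIED
FROM   dbo.Vendors
WHERE  LAST_MODIFIED > @last_extract;
```

Whether such a query is trustworthy depends entirely on how the audit column is maintained, which is exactly the trade-off discussed next.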
CDC captures all inserts, updates and deletes from all systems that interface with the database, including frontend applications and database processes such as triggers. If the required metadata is captured, CDC can also satisfy compliance regulations.

Changes can be detected in four different ways: via audit columns, database log scraping, timed extracts and a full database difference comparison. See Figure 5 for a comparison of the CDC methods.

Audit columns can be very easy to use, but they can also be unreliable. If frontend applications modify data, or if there are null values in the data, audit columns should not be used to detect changes. If the administrator is certain that only database triggers are used to update metadata, audit columns may be a good option.

Database log scraping should be used as a last resort. While this method is slightly more reliable than using audit columns to find changes, it can be very error-prone and tedious to build a system that takes a snapshot of a log, extracts useful information from it, and then acts on that information accordingly. Furthermore, log files tend to be the first thing to be erased when database performance and storage volumes are suffering, resulting in missed change captures.

Figure 5: Comparison of CDC Methods

AUDIT COLUMNS
    Implementation & speed: fast, easy implementation
    Accuracy: highly accurate if database triggers are used to modify metadata

DATABASE LOG SCRAPING
    Implementation & speed: tedious and time-consuming
    Accuracy: highly prone to error due to the nature of log file scraping; an alternative method is needed if the DBA empties log files to preserve database performance

TIMED EXTRACTS
    Implementation & speed: fast, but manual cleanup is often required and can be time-consuming
    Accuracy: very unreliable; mid-job failures or job skips can cause large amounts of data to be missed

FULL DIFFERENCE COMPARISON
    Implementation & speed: somewhat easy to implement, but highly resource intensive
    Accuracy: highly accurate

Timed extracts are notoriously unreliable, but novice DBAs often mistakenly choose this technique. Here, an extract of the data captured within a particular timeframe is taken at a specific time. If the process fails before completing all steps, duplicate rows can be introduced into the table, and a failed or stopped process will cause entire sets of data to be missed. In either case, an administrator faces the tedious task of cleaning up duplicate rows and identifying which rows should be included in the next CDC run and which should be excluded.

A full database difference comparison is the only method that is guaranteed to find all changes. Unfortunately, a full diff compare can be very resource intensive, since snapshots are compared record by record. To improve performance, a checksum can be used to quickly determine whether a record has changed. This method is a good choice for environments where reliability and accuracy are the primary concerns.

The OLTP Pattern
OLTP (Online Transaction Processing) is a very popular method for processing data. Data input is gathered, then the information is processed and the data is updated accordingly, all in real time. Most frontend applications that allow users to interface with the database use OLTP. If you have ever used a credit card to pay for something, you have used OLTP (see Figure 6). You swiped your card (input), the credit card machine or website sent the data to your card company (information gathering), and your card was charged according to your purchase (data update). OLTP is an all-or-nothing process.
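In SQL Server, this all-or-nothing behaviour is expressed as an explicit transaction. A minimal T-SQL sketch of a card charge; the Accounts and Charges tables and the hard-coded values are illustrative, not from the case study:

```sql
DECLARE @account_id INT           = 1001,   -- illustrative values
        @amount     DECIMAL(10,2) = 25.00;

BEGIN TRY
    BEGIN TRANSACTION;

    -- verify and reserve funds in one statement
    UPDATE dbo.Accounts
    SET    BALANCE = BALANCE - @amount
    WHERE  ACCOUNT_ID = @account_id
      AND  BALANCE >= @amount;

    -- no row updated means insufficient funds: abort the whole operation
    IF @@ROWCOUNT = 0
        THROW 50001, 'Insufficient funds', 1;

    -- record the charge
    INSERT INTO dbo.Charges (ACCOUNT_ID, AMOUNT, CHARGED_AT)
    VALUES (@account_id, @amount, SYSDATETIME());

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- any failure rolls back every step, so no partial charge survives
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;
```

Either both the balance update and the charge record are committed, or neither is, which is precisely the guarantee described above.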
If any step in the process fails, the entire operation must fail. If you swiped your card and funds were verified, but the system failed to actually charge you for your purchase, the vendor would not get the money; therefore the whole operation must fail.

Figure 6: Sample OLTP Process

OLTP's design makes it ideal for real-time, mission-critical operations. If a process has zero tolerance for error, OLTP is the pattern of choice. Additionally, OLTP supports concurrent operations: you and other customers can all make purchases from the same vendor at the same time without having to wait for another transaction to finish. This is another reason OLTP is a good choice for frontend applications. When implementing an OLTP system, it is important to keep queries highly selective and highly optimized. Query operation times must be kept to a minimum, or users will tire of waiting and abandon the task at hand. To improve performance, the data used by an OLTP system should be highly normalized, and transactions should be distributed across multiple machines or networks if anticipated traffic requires additional processing power and memory. Traditionally, OLTP updates have followed the SCD1 pattern, and history is erased when an update is made. In the next section, we learn how to preserve historical data and use SCD2 in an OLTP environment.

Business Case: Using SCD and CDC in an OLTP Environment
Our client, a newly formed company, required an infrastructure setup that would support multi-system integrations between its customer and partner systems. They were using a Microsoft stack, so SQL Server 2012 was the database of choice, while SQL Server Integration Services (SSIS), Analysis Services (SSAS) and Reporting Services (SSRS) were chosen as supporting applications.
Code was written in T-SQL and C# and managed using Team Foundation Server (TFS).

The Process
A process was designed that would reduce the impact on performance while making historical and current data easily available to customers and partners (see Figure 7). First, the customer sends three files to our client via a secure inbox, containing the deletes, updates and inserts to their database. That data is then imported into an operational data store (ODS). If the customer has not yet configured partner system credentials and integration parameters, they can log in to a customer portal to do so.

Figure 7: SCD1 & SCD2 Mix in OLTP Environment

After the data is imported into the ODS, the unmodified data from both the partner systems and the ODS is loaded into a prestaging environment. The data is then enriched with SCD2 metadata elements, including valid-from and valid-to dates and a current record flag. The enriched data is imported into a persistent storage staging environment. Any changes to the data are then made in the core database using SCD1 in SSIS. When changes are completed in the core database, CDC is used to detect those changes, which are then sent to the other connected databases, excluding the database the change originated from. No deletes are made in the ODS; rather, the record is marked as inactive.

Advantages
The biggest benefit of this mixed approach is the preservation of historical data without rapidly expanding the size of the production database. As a result, we can comply with the law and keep the production database stable and responsive. Because history is kept intact, analysis can be performed on the staging database; in fact, an analysis component is planned for the project discussed above.
With analysis computations taking place on the staging database, we are able to preserve resources on the production database for OLTP operations. If any analysis is performed on the production database, queries will perform better since historical records do not need to be filtered from current records. Additionally, users are clearer about the data they are querying and do not need to decipher what metadata columns such as the current flag and valid dates are used for and how they might affect a query. Lastly, downstream systems and the DWH will integrate changes more smoothly, since each record will have its own primary key that never changes. Database triggers will also perform reliably.

Disadvantages
Compared to most patterns, there are few disadvantages to this mixed approach. The main concern is that, as with any process, multiple steps introduce more potential points of failure. Because there are more steps, troubleshooting failures and errors will be more time-consuming than with a single-pattern approach. If any analysis computations are performed on production data, they will need to be rebuilt any time there is an update. However, using the staging environment for any complex analytical functions negates this issue.

Conclusion
In an OLTP environment, reliability and speed are paramount. Combining the SCD1 and SCD2 approaches allows us to benefit from the advantages of both patterns without suffering many of the disadvantages. When implemented well, this mixed approach can be an ideal solution for the legal, IT, marketing and analytics departments without sacrificing customer and user workflows.