UNIT II DATA WAREHOUSING
Data warehouse – characteristics and view - OLTP and OLAP - Design and development of data warehouse, Metadata models, Extract/Transform/Load (ETL) design
Data Warehousing – Overview
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on any data, such as a product, a supplier, or any consumer data; the executive will have no data available to analyze, because the previous data has been overwritten by subsequent transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining. Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
A data warehouse is a database that is kept separate from the organization's operational database. There is no frequent updating done in a data warehouse. It possesses consolidated historical data, which helps the organization analyze its business. A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems. A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases for the following reasons:
An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.
Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
Data Warehouse Features
The key features of a data warehouse are discussed below:
Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added to it.
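The time-variant and non-volatile features above can be sketched as a toy append-only store; the entity keys, dates, and attribute names below are invented for illustration.

```python
# A toy non-volatile, time-variant store: each load appends a snapshot keyed
# by (entity, as_of) and never overwrites earlier history. Names are invented.
warehouse = {}

def load_snapshot(entity, as_of, attributes):
    key = (entity, as_of)
    if key in warehouse:
        raise ValueError("warehouse rows are never updated in place")
    warehouse[key] = attributes

# An OLTP system would UPDATE the customer's city in place; the warehouse
# keeps both versions, so history stays available for analysis.
load_snapshot("customer:42", "2020-01-01", {"city": "Chennai"})
load_snapshot("customer:42", "2021-01-01", {"city": "Mumbai"})

history = [row["city"] for key, row in sorted(warehouse.items())]
print(history)   # ['Chennai', 'Mumbai']
```

Because earlier snapshots are never erased, an analyst can still ask what the data looked like at any earlier point in time.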
A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note: A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is physically stored separately from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse Applications
Information processing, analytical processing, and data mining are the three types of data warehouse applications, discussed below:
Information Processing - A data warehouse allows processing of the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.
Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
Data Warehouse (OLAP) versus Operational Database (OLTP):
1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is used to analyze the business; OLTP is used to run the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema; OLTP is based on the Entity Relationship Model.
6. OLAP is subject oriented; OLTP is application oriented.
7. OLAP contains historical data; OLTP contains current data.
8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
9. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
11. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
12. An OLAP database size ranges from 100 GB to 100 TB; an OLTP database size ranges from 100 MB to 100 GB.
13. OLAP systems are highly flexible; OLTP systems provide high performance.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains:
Tuning Production Strategies - Product strategies can be well tuned by repositioning the products and managing product portfolios by comparing sales quarterly or yearly.
Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.
Operations Analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyze business operations.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches:
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases.
This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
Process of Query-Driven Approach
When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved. These queries are then mapped and sent to the local query processors. The results from the heterogeneous sites are integrated into a global answer set.
Disadvantages
The query-driven approach needs complex integration and filtering processes.
This approach is very inefficient.
It is very expensive for frequent queries.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages:
It provides high performance.
The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.
Query processing does not require an interface to process data at local sources.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities:
Data Extraction - Involves gathering data from multiple heterogeneous sources.
Data Cleaning - Involves finding and correcting errors in the data.
Data Transformation - Involves converting the data from legacy format to warehouse format.
Data Loading - Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
Refreshing - Involves updating from the data sources to the warehouse.
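These functions can be sketched as a minimal Python pipeline; the source records and field names are invented, and real ETL tools are far more elaborate.

```python
def extract(sources):
    # Data Extraction: gather raw records from multiple (here, in-memory) sources.
    for source in sources:
        yield from source

def clean(record):
    # Data Cleaning: correct obvious errors and drop unusable rows.
    name = record.get("name", "").strip()
    if not name or record.get("amount") is None:
        return None
    return {"name": name, "amount": record["amount"]}

def transform(record):
    # Data Transformation: convert the legacy format to the warehouse format.
    return {"customer": record["name"].title(), "amount_usd": float(record["amount"])}

def load(records, warehouse):
    # Data Loading: append into the target store (a list stands in for the warehouse).
    for record in records:
        warehouse.append(record)

crm_export = [{"name": " alice ", "amount": "10"}]
erp_export = [{"name": "BOB", "amount": "7.5"}, {"name": "", "amount": "3"}]

warehouse = []
cleaned = (r for r in (clean(raw) for raw in extract([crm_export, erp_export])) if r)
load((transform(r) for r in cleaned), warehouse)
print(warehouse)   # [{'customer': 'Alice', 'amount_usd': 10.0}, {'customer': 'Bob', 'amount_usd': 7.5}]
```

Note that the row with an empty name is rejected during cleaning, so only valid rows reach the warehouse.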
Note: Data cleaning and data transformation are important steps in improving the quality of the data and of the data mining results.
Data Warehousing - Terminologies
Metadata
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:
Metadata is a road map to the data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Metadata Repository
A metadata repository is an integral part of a data warehouse system. It contains the following metadata:
Business metadata - Contains the data ownership information, business definitions, and changing policies.
Operational metadata - Includes the currency of data and data lineage. Currency of data refers to the data being active, archived, or purged. Lineage of data means the history of the migrated data and the transformations applied to it.
Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.
Algorithms for summarization - Includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves its records.
Illustration of a Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location.
These dimensions allow us to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of the sales data for a company with respect to the time, item, and location dimensions. In this 2-D table, we have records with respect to time and item only; the sales for New Delhi are shown with respect to the time and item dimensions, according to the type of items sold. If we want to view the sales data with one more dimension, say the location dimension, then the 3-D view is useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table below. This 3-D table can be represented as a 3-D data cube, as shown in the following figure.
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
The life cycle of data marts may become complex in the long run if their planning and design are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
The following figure shows a graphical representation of data marts.
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse.
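Returning to the data cube illustration, the 2-D view obtained by fixing the location dimension can be sketched with a sparse cube keyed by (time, item, location); all items and figures below are invented.

```python
# A 3-D cube stored sparsely as {(time, item, location): units sold}.
cube = {
    ("Q1", "Mobile", "New Delhi"): 1200,
    ("Q1", "Mobile", "Chennai"):    800,
    ("Q2", "Modem",  "New Delhi"):  450,
    ("Q2", "Mobile", "Chennai"):    600,
}

def view_2d(cube, location):
    """Fix the location dimension to obtain a 2-D (time x item) view."""
    return {(t, i): v for (t, i, loc), v in cube.items() if loc == location}

print(view_2d(cube, "New Delhi"))   # {('Q1', 'Mobile'): 1200, ('Q2', 'Modem'): 450}
```

Fixing one dimension in this way is exactly the "slice" that turns the 3-D cube into the 2-D table discussed above.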
It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
OLTP vs OLAP
One of the most important questions regarding information systems is the difference between OLAP and OLTP. We built this article to explain these ideas further and to solidify your knowledge of them. To fully understand and compare these two types of systems, you have to know what they are and how they work individually. So, first we prepared a lot of information about OLAP and OLTP, concluding the resource with a comparative analysis between them. Let's jump right into the learning process! In the next chapters, we'll describe each topic in a complete, yet simple way. Before going any further on these topics, we show you a simple infographic comparing the two approaches:
What is OLAP?
Online analytical processing is a computer technology term referring to systems focused on analysing data in a specific database. These systems are characterized by their analytical capabilities, addressing multidimensional or one-dimensional data and processing all the information. The standard applications of OLAP are business intelligence, data writing and reporting, through data mining processes.
OLAP operations and databases
At the database level, these systems' operation is defined by a low level of transactions, dealing with archived and historical information. This data is seldom updated, making the SELECT database operation the key feature of the system.
Therefore, these databases are based on READ operations, aggregating all the available information. Databases that work as data warehouses apply this methodology, optimizing the reading and aggregation operations of their multidimensional data model, thus providing great support for the data analysis and reporting operations that are critical in these kinds of databases.
Data cube
The main component of these systems is the OLAP cube. A cube combines data warehouse structures such as facts and dimensions. Those are organized as schemas: the star schema, the snowflake schema, and the fact constellation. The merging of all the cubes creates a multidimensional data warehouse.
System types
There are many types of OLAP systems, depending on their structural characteristics. The most common ones are MOLAP, ROLAP, and HOLAP. The most important real-world applications of these systems are business management and reporting, financial reporting, marketing, research, and other data-related tasks. These processes are growing fast these days, making them absolutely critical in a world that is becoming dependent on data. In the next paragraph we provide a real-world example of what we described before.
Real World Example: In a hospital, 20 years of very complete patient information is stored. Someone in the administration wants a detailed report of the most common diseases, the success rate of treatments, inpatient days, and a lot of other relevant data. For this, we apply OLAP operations to our data warehouse of historical information, and through complex queries we get these results. They can then be reported to the administration for further analysis.
What is OLTP?
Online Transaction Processing is an information system type that prioritizes transaction processing, dealing with operational data. These computer systems are identified by the large number of transactions they support, making them the best fit for online applications.
The main applications of this method are all kinds of transactional systems: database-backed, commercial, and hospital applications, and so on. Simply put, these systems gather input information and store it in a database, at large scale. Most of today's applications are based on this interaction methodology, with implementations as centralized or decentralized systems.
OLTP database and operations
At the database level, these transactional systems base their operation on multi-access, fast, and effective queries to the database. The most used operations are INSERT, UPDATE, and DELETE, since they directly modify the data, providing new information with each new transaction. So, in these systems, data is frequently updated, requiring effective support for write operations. One special characteristic of these databases is the normalization of their data. This is because data normalization provides a faster and more effective way to perform database writes. The main concerns are the atomicity of the transactions and ensuring that concurrent accesses neither damage the data nor degrade the system's performance.
Other systems
OLTP is not only about databases, but also other types of interaction mechanisms. All client-server architectures are based on these processes, taking advantage of their fast transaction and concurrency models. Decentralized systems are also online transaction processing, as all broker programs and web services are transaction oriented.
Real World Example: A banking transaction system is a classic example. There are many users executing operations on their accounts, and the system must guarantee the completeness of the actions. In this case there are several concurrent transactions at the same time, with data coherence and efficient operations as the main goals.
Comparing OLTP vs OLAP
OLTP, also known as Online Transaction Processing, and OLAP, which stands for Online Analytical Processing, are two distinct kinds of information systems technologies.
Both are related to information databases, which provide the means and support for these two types of functioning. Each of the methods creates a different branch of data management systems, with its own ideas and processes, but they complement each other. To analyse and compare them, we've built this resource! Basically, OLAP and OLTP are very different approaches to the use of databases, but not only that. On the one hand, online analytical processing is more focused on data analysis and reporting; on the other hand, online transaction processing targets a transaction-optimized system with a lot of data changes. For someone learning about data science and related IT methods, it is important to know the difference between these two approaches to information. This is the base idea behind systems like business intelligence, data mining, data warehousing, data modelling, ETL processes, and big data. Regarding the previous descriptions of the systems, we can compare them across a lot of distinct categories. The review is detailed in the next table, followed by a further discussion of each compared item that could evoke some doubts, to ensure you understood.
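The contrast can be sketched with Python's built-in sqlite3 module and an invented sales table: OLTP work is many small transactional writes, while OLAP work is a single read-only query that aggregates over the accumulated history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")

# OLTP style: many small writes, each wrapped in its own transaction
# (the connection context manager commits on success).
with conn:
    conn.execute("INSERT INTO sales VALUES ('Mobile', 2)")
with conn:
    conn.execute("INSERT INTO sales VALUES ('Mobile', 1)")
with conn:
    conn.execute("INSERT INTO sales VALUES ('Modem', 4)")

# OLAP style: one read-only query that aggregates across all history.
rows = conn.execute(
    "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(rows)   # [('Mobile', 3), ('Modem', 4)]
conn.close()
```

In a real deployment the writes would hit the operational database and the aggregate query would run against the warehouse, but the division of labour is the same.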
Design methods
Bottom-up design
In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. These data marts can then be integrated to create a comprehensive data warehouse.
The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.[15]
Top-down design
The top-down approach is designed using a normalized enterprise data model. "Atomic" data, that is, data at the greatest level of detail, is stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.[16]
Hybrid design
Data warehouses (DW) often resemble the hub-and-spoke architecture. Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning, generating large amounts of data. To consolidate these various data models and facilitate the extract-transform-load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual DW. To reduce data redundancy, larger systems often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW.
The DW database in a hybrid solution is kept in third normal form to eliminate data redundancy. A normal relational database, however, is not efficient for business intelligence reports, where dimensional modelling is prevalent. Small data marts can draw data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required. The DW provides a single source of information from which the data marts can read, providing a wide range of business information. The hybrid architecture allows a DW to be replaced with a master data management solution where operational, not static, information could reside.
The Data Vault modeling components follow the hub-and-spoke architecture.
This modeling style is a hybrid design, consisting of the best practices from both third normal form and the star schema. The Data Vault model is not a true third normal form and breaks some of its rules, but it is a top-down architecture with a bottom-up design. The Data Vault model is geared to be strictly a data warehouse. It is not geared to be end-user accessible; when built, it still requires the use of a data mart or star-schema-based release area for business purposes.
What Is A Metadata Model?
This article is about creating metadata models for digital asset management. In simple terms, a metadata model is how you represent the metadata stored about your digital assets. It is like the blueprint or DNA that will be used each time a DAM user catalogues an asset.
Why Do You Need A Metadata Model?
Metadata models define the essential characteristics of your assets in a way that is unique to you and your organisation. They describe a series of key entities or classifications. As well as through cataloguing, metadata models can get populated by other activity on a DAM system, for example, a workflow to request approval to use an asset. Any activity on a DAM system where users or processes interact with assets takes place within the framework of the metadata model. You will find it touches nearly every element of a DAM implementation, which is why it is important to give it sufficient consideration when planning for digital asset management.
What Goes Into A Metadata Model?
There are many different ways to describe metadata models, but provided all the required information is captured, the simpler they are the better. A list of the key items of data you need to store, such as the one shown in the previous section, is the starting point, but you will probably want to expand it to define how users will enter metadata. For example, will it be from a fixed list (e.g. a controlled vocabulary), or perhaps free text, maybe numbers or dates?
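The entry-style decisions just described (controlled vocabulary, free text, numbers) can be sketched as a small schema with a validation routine; the field names and vocabulary below are invented for illustration.

```python
# A sketch of a metadata model: each field declares how users may fill it in.
# Field names and the controlled vocabulary are invented examples.
metadata_model = {
    "title":      {"kind": "free_text"},
    "asset_type": {"kind": "controlled", "vocabulary": {"photo", "video", "logo"}},
    "year":       {"kind": "number"},
}

def validate(record, model):
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field, rule in model.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing: {field}")
        elif rule["kind"] == "controlled" and value not in rule["vocabulary"]:
            errors.append(f"{field}: {value!r} not in controlled vocabulary")
        elif rule["kind"] == "number" and not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number")
    return errors

print(validate({"title": "Spring campaign", "asset_type": "photo", "year": 2021},
               metadata_model))   # []
print(validate({"title": "Logo pack", "asset_type": "gif"}, metadata_model))
```

The second record fails twice: "gif" is outside the controlled vocabulary, and the year is missing. A DAM system enforces exactly this kind of rule each time a user catalogues an asset.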
If allowing users to choose from pre-determined selections, will you allow them one option or many? You can record these decisions in a spreadsheet or build simple prototypes using the built-in capabilities of the system. Screen mock-ups of what the interface will look like are another technique.
An issue you can run into when discussing metadata models with colleagues is that they may concentrate too much on the content or 'ingredients' that might go into the fields. For example, if your DAM system will hold marketing materials about your firm's products, they might reel off lists of product brand names or model numbers. These are important, and keeping records of them is a good idea, but when devising metadata models you are more interested in the range of potential classifications: the breadth rather than the depth, if you want to think about it in spatial terms. Information architects and other DAM experts might refer to this as the 'schema', and that description should give you a clue that this is about overall design decisions and metadata strategy rather than specific values.
One area where analyzing the range of data that might need to be held in a metadata model is important is in assessing the quantity of different values that might need to be held in a given field. This will help determine what kind of interface controls are best suited to it. For example, if every entry is totally different, a free text field would be a good idea. For a small number of mutually exclusive options, radio buttons are more suitable. On other occasions, you might use a hierarchical taxonomy which links to a faceted search. The number of items used can make some interface choices more or less appropriate than others.
WHAT ARE METADATA?
The levels of metadata can be illustrated with a banking example:

This Book (Metametadata) – elements of metadata (a metadata model):
  Entity objects: "Entity Class", "Attribute", "Role"
  Table objects: "Table", "Column"
  Program objects: "Program module", "Language"
Data Management (Metadata) – data about a database (a data model):
  Entity classes: "Branch", "Employee", "Customer"; Attributes: "Name", "Birthdate", "Employee.Address", "Employee.Name"; Role: "Each branch must be managed by exactly one Employee"
  Table: "CHECKING_ACCOUNT"; Columns: "Account_number", "Monthly_charge"
  Program module: ATM Controller; Language: Java
IT Operations (Instance Data) – data about real-world things (a database):
  Customer Name: "Julia Roberts"; Customer Birthdate: "10/28/67"; Branch Address: "111 Wall Street"; Branch Manager: "Sam Sneed"
  CHECKING_ACCOUNT.Account_number = "09743569"; CHECKING_ACCOUNT.Monthly_charge = "$4.50"
  ATM Controller: Java code
Real-world things:
  Julia Roberts; Wall Street branch
  Checking account #09743569
  ATM Withdrawal

Extract / Transform / Load (ETL) design
Data extraction – extracts data from homogeneous or heterogeneous data sources
Data transformation – transforms the data into the proper format or structure for the purposes of querying and analysis
Data loading – loads the data into the final target (a database; more specifically, an operational data store, data mart, or data warehouse)

Since data extraction takes time, it is common to execute the three phases in parallel. While the data is being extracted, a transformation process executes, processing the already-received data and preparing it for loading. As soon as some data is ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases. ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware.
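The parallel execution of the three phases can be sketched as a small producer/consumer pipeline. This is an illustrative Python sketch using in-memory queues and threads; the source rows and the single transformation step are made-up placeholders:

```python
import queue
import threading

SENTINEL = object()  # marks the end of a stream

def extract(rows, out_q):
    """Extraction: push source rows into the pipeline as they arrive."""
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    """Transformation runs while extraction is still producing."""
    while (row := in_q.get()) is not SENTINEL:
        out_q.put({**row, "name": row["name"].strip().upper()})
    out_q.put(SENTINEL)

def load(in_q, target):
    """Loading starts as soon as the first transformed row is ready."""
    while (row := in_q.get()) is not SENTINEL:
        target.append(row)

source = [{"id": 1, "name": " alice "}, {"id": 2, "name": "bob"}]
q1, q2, warehouse = queue.Queue(), queue.Queue(), []

threads = [
    threading.Thread(target=extract, args=(source, q1)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2, warehouse)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(warehouse)  # [{'id': 1, 'name': 'ALICE'}, {'id': 2, 'name': 'BOB'}]
```

Because each phase consumes from a queue as soon as data appears, loading the first rows does not wait for extraction of the last rows, which is exactly the overlap described above.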
The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.

Extract
The first part of an ETL process involves extracting the data from the source system(s). In many cases this represents the most important aspect of ETL, since extracting data correctly sets the stage for the success of subsequent processes.

ETL Architecture Pattern
Most data-warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data-source formats include relational databases, XML, and flat files, but may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen-scraping. Streaming the extracted data from the source and loading it on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the extraction phase aims to convert the data into a single format appropriate for transformation processing. An intrinsic part of the extraction involves data validation to confirm whether the data pulled from the sources has the correct/expected values in a given domain (such as a pattern/default or list of values). If the data fails the validation rules, it is rejected entirely or in part. The rejected data is ideally reported back to the source system for further analysis, to identify and rectify the incorrect records. In some cases the extraction process itself may have to apply a data-validation rule in order to accept the data and pass it on to the next phase.
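The domain validation described for the extraction phase can be sketched as follows. The rules shown (a pattern for an account identifier and a list of allowed status values) are hypothetical examples of the "pattern/default or list of values" checks mentioned above:

```python
import re

# Hypothetical domain rules for extracted customer records
ID_PATTERN = re.compile(r"^\d{8}$")           # e.g. account numbers are 8 digits
ALLOWED_STATUS = {"active", "closed", "frozen"}  # list-of-values rule

def validate_record(rec):
    """Return a list of rule violations; an empty list means the record is accepted."""
    errors = []
    if not ID_PATTERN.match(rec.get("account_id", "")):
        errors.append("account_id fails pattern")
    if rec.get("status") not in ALLOWED_STATUS:
        errors.append("status not in allowed list")
    return errors

extracted = [
    {"account_id": "09743569", "status": "active"},
    {"account_id": "ABC", "status": "open"},
]

accepted, rejected = [], []
for rec in extracted:
    errors = validate_record(rec)
    (accepted if not errors else rejected).append((rec, errors))

print(len(accepted), len(rejected))  # 1 1
# Rejected records would ideally be reported back to the source system:
for rec, errors in rejected:
    print(rec["account_id"], "->", errors)
```

Keeping the violation messages alongside each rejected record is what makes the report back to the source system actionable.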
Transform
In the data transformation stage, a series of rules or functions are applied to the extracted data in order to prepare it for loading into the end target. Some data does not require any transformation at all; such data is known as "direct move" or "pass through" data. An important function of transformation is the cleaning of data, which aims to pass only "proper" data to the target. The challenge when different systems interact lies in the relevant systems' interfacing and communicating. Character sets that are available in one system may not be available in others. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the server or data warehouse:

Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (aka "attributes") roll_no, age, and salary, then the selection may take only roll_no and salary. Or, the selection mechanism may ignore all those records where salary is not present (salary = null).
Translating coded values (e.g., if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
Encoding free-form values (e.g., mapping "Male" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Sorting or ordering the data based on a list of columns to improve search performance
Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
Disaggregating repeating columns
Looking up and validating the relevant data from tables or referential files
Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, e.g., when a code translation parses an unknown code in the extracted data.

Load
The load phase loads the data into the end target, which may be a simple delimited flat file or a data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals, for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one-year window is made in a historical manner. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the data warehouse.
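Several of the transformation types listed earlier (selecting only certain columns, ignoring records where salary is null, translating coded values, and deriving sale_amount = qty * unit_price) can be sketched in a few lines of Python. The source rows are made up purely for illustration:

```python
# Hypothetical source rows with columns roll_no, age, salary, gender, qty, unit_price
rows = [
    {"roll_no": 1, "age": 30, "salary": 50000, "gender": "1", "qty": 3, "unit_price": 9.5},
    {"roll_no": 2, "age": 25, "salary": None,  "gender": "2", "qty": 1, "unit_price": 4.0},
]

GENDER_CODES = {"1": "M", "2": "F"}  # translating coded values

def transform(row):
    return {
        "roll_no": row["roll_no"],                      # selecting only certain columns
        "salary": row["salary"],
        "gender": GENDER_CODES[row["gender"]],
        "sale_amount": row["qty"] * row["unit_price"],  # deriving a new calculated value
    }

# Ignore records where salary is not present (salary = null)
loaded = [transform(r) for r in rows if r["salary"] is not None]
print(loaded)
# [{'roll_no': 1, 'salary': 50000, 'gender': 'M', 'sale_amount': 28.5}]
```

In a real pipeline an unknown gender code would raise a KeyError, which is one concrete case of the "code translation parses an unknown code" exception mentioned above.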
As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), which also contributes to the overall data-quality performance of the ETL process. For example, a financial institution might have information on a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all of these data elements and consolidate them into a uniform presentation, such as for storing in a database or data warehouse. Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use. An example would be an Expense and Cost Recovery System (ECRS) such as used by accountancies, consultancies, and legal firms. The data usually ends up in the time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel dept.) or equipment usage reports to Facilities Management.

Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
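The cycle steps above can be strung together as a simple orchestration sketch. Each step here is a deliberately simplified placeholder; a real system would substitute actual source connectors, rule engines, staging tables, and schedulers:

```python
def run_etl_cycle(sources, target):
    """One pass through a condensed version of the ETL cycle steps above."""
    reference = {"1": "M", "2": "F"}                     # build reference data
    extracted = [row for src in sources for row in src]  # extract (from sources)
    valid = [r for r in extracted
             if r.get("gender") in reference]            # validate
    transformed = [{**r, "gender": reference[r["gender"]]}
                   for r in valid]                       # transform (apply rules)
    staging = list(transformed)                          # stage (staging area)
    audit = {"extracted": len(extracted),                # audit report: counts for
             "rejected": len(extracted) - len(valid)}    # compliance/diagnosis
    target.extend(staging)                               # publish (to target tables)
    return audit

warehouse = []
sources = [[{"id": 1, "gender": "1"}], [{"id": 2, "gender": "x"}]]
audit = run_etl_cycle(sources, warehouse)
print(audit)      # {'extracted': 2, 'rejected': 1}
print(warehouse)  # [{'id': 1, 'gender': 'M'}]
```

Even in this toy form, the audit counts illustrate why step 7 matters: when a cycle fails or rejects data, the counts per step are what let operators diagnose and repair the run.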