ANJUMAN-I-ISLAM'S ALLANA INSTITUTE OF MANAGEMENT STUDIES
Management Information System

Data Mining and Data Warehousing

Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.

Data warehousing is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.

Data Warehousing & Data Mining

Submitted by:
Name             Roll No
Mohsin Sayed     43
Shafat Ali       44
Arshad Shaikh    46
Maqsud Shaikh    47
Saif Shaikh      48

Academic Year: 2012-2013
Under the guidance of Prof. Awesh Bhornya
Date of Submission: 16th February 2013

Anjuman-I-Islam's Allana Institute of Management Studies
Badruddin Tyabji Marg, Off. 92, Dr. D. N. Road, Opp. CST, Mumbai - 400 001

CERTIFICATE

This is to certify that students from the 'A' division of Anjuman-I-Islam's Allana Institute of Management Studies (AIAIMS), pursuing the first year of the MMS programme, have completed the dissertation project on "Data Warehousing and Data Mining" in the Academic Year 2013-2014.

Date: ______________
Place: ______________

Dr. Lukman Patel              Prof. Awesh Bhornya
Director - AIAIMS             Project Guide

ACKNOWLEDGEMENT

A project cannot be said to be the work of an individual. A project is a combination of the views, ideas, suggestions and contributions of many people. We are extremely thankful to our project guide, Prof. Awesh Bhornya, for giving us valuable guidance, helping us throughout this project, and giving us his special attention. We wish to thank all the people who helped and assisted us whenever we needed it, by giving their precious time and valuable suggestions. We also wish to thank all the respondents who gave some of their valuable time to fill up the questionnaires, without which the project study would not have been a success.

Index

Data Warehousing
    History
    What is Data Warehousing
    Subject Oriented
    Integrated
    Non-volatile
    Time Variant
    Benefits of a Data Warehouse
    Key Developments in Early Years of Data Warehousing
    Dimensional V/S Normalized
    Data Warehouses V/S Operational Systems
    Operational Systems V/S Data Warehousing Systems
    Evolution in Organization Use
    Data Warehouse Architecture
    Data Warehouse Architecture Components
    Types of Data Warehouse Architectures
Data Mining
    Overview
    The Foundations of Data Mining
    The Scope of Data Mining
    Databases Can Be Larger in Depth and Breadth
    How Data Mining Works
    Architecture of Data Mining
    Components of Data Mining
    Integration of a Data Mining System with a Database or Data Warehouse System
Conclusion
Bibliography

Data Warehousing History

The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data.
The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically replicated, in part, for each environment. Moreover, the operational systems were frequently re-examined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

What is a Data Warehouse?

In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.

The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales and so on). The data may pass through an operational data store for additional operations before they are used in the DW for reporting.

The typical ETL-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data. (A small illustrative sketch of this layering appears at the end of this section.)

A data warehouse constructed from an integrated data source system does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered to be a part of a distributed operational data store layer. Data federation methods or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated, since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.
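As a minimal, purely illustrative sketch of the staging, integration and access layering described above, the following Python/pandas snippet walks one raw extract through the three layers; the table names, column names and sample rows are assumptions made for the example, not part of any standard.

import pandas as pd

# Staging layer: raw extracts from two disparate source systems.
staging_sales = pd.DataFrame([
    {"cust": "C01", "prod": "P10", "amt": "120.50", "dt": "2012-08-03"},
    {"cust": "C02", "prod": "P11", "amt": "75.00",  "dt": "2012-08-04"},
])
staging_customers = pd.DataFrame([
    {"cust_id": "C01", "cust_name": "Asha"},
    {"cust_id": "C02", "cust_name": "Ravi"},
])

# Integration layer: conform names and types (an ODS-style cleaned copy).
ods = staging_sales.rename(columns={"cust": "customer_id", "prod": "product_id"})
ods["sales_amount"] = ods["amt"].astype(float)
ods["order_date"] = pd.to_datetime(ods["dt"])
ods = ods.drop(columns=["amt", "dt"])

# Warehouse layer: a fact table plus a customer dimension (a simple star schema).
dim_customer = staging_customers.rename(columns={"cust_id": "customer_id"})
fact_sales = ods  # numeric measures plus a foreign key to the dimension

# Access layer: an analytic query joining the fact table to its dimension.
report = fact_sales.merge(dim_customer, on="customer_id")
print(report.groupby("cust_name")["sales_amount"].sum())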
Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse.

This definition of the data warehouse focuses on data storage. The main source data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

Data warehousing provides a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. The Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is drawn from personal experience as a business intelligence professional, both as a client and as a vendor.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented
Integrated
Non-volatile
Time Variant

Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Non-volatile: Non-volatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
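The non-volatile and time-variant characteristics can be pictured with a small, illustrative Python sketch (not taken from any particular product): loads only append dated snapshots, so history is preserved even when the operational source overwrites it. The record fields and dates are invented.

from datetime import date

warehouse_customers = []  # non-volatile: rows are appended, never updated in place

def load_snapshot(source_rows, load_date):
    """Append a dated copy of the operational rows (time variant)."""
    for row in source_rows:
        warehouse_customers.append({**row, "load_date": load_date})

# Month 1: the operational system says the customer lives in Mumbai.
load_snapshot([{"customer_id": "C01", "city": "Mumbai"}], date(2012, 8, 1))
# Month 2: the operational system overwrites the city; the warehouse keeps both versions.
load_snapshot([{"customer_id": "C01", "city": "Pune"}], date(2012, 9, 1))

for row in warehouse_customers:
    print(row)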
Benefits of a Data Warehouse

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

• Maintain data history, even if the source transaction systems do not.
• Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
• Improve data quality, by providing consistent codes and descriptions, and by flagging or even fixing bad data.
• Present the organization's information consistently.
• Provide a single common data model for all data of interest, regardless of the data's source.
• Restructure the data so that it makes sense to the business users.
• Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
• Add value to operational business applications, notably customer relationship management (CRM) systems.

Generic data warehouse environment

The environment for data warehouses and marts includes the following:

• Source systems that provide data to the warehouse or mart;
• Data integration technology and processes that are needed to prepare the data for use;
• Different architectures for storing data in an organization's data warehouse or data marts;
• Different tools and applications for the variety of users;
• Metadata, data quality, and governance processes that must be in place to ensure that the warehouse or mart meets its purposes.

In regard to the source systems listed above, Rainer states, "A common source for the data in data warehouses is the company's operational databases, which can be relational databases." Regarding data integration, Rainer states, "It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse." Rainer also discusses storing data in an organization's data warehouse or data marts: "There are a variety of possible architectures to store decision-support data." Metadata are data about data: "IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures."

Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers. A "data warehouse" is a repository of historical data that are organized by subject to support decision makers in the organization. Once data are stored in a data mart or warehouse, they can be accessed.

Key developments in the early years of data warehousing were:

1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
1970s - ACNielsen and IRI provide dimensional data marts for retail sales.
1970s - Bill Inmon begins to define and discuss the term Data Warehouse.
1975 - Sperry Univac introduces MAPPER (Maintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary enterprise data warehousing platforms).
1983 - Teradata introduces a database management system specifically designed for decision support.
1983 - At Sperry Corporation, Martyn Richard Jones defines the Sperry Information Center approach, which, while not a true data warehouse in the Inmon sense, did contain many of the characteristics of data warehouse structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales.
1984 - Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
1988 - Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".
1990 - Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
1991 - Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
1992 - Bill Inmon publishes the book Building the Data Warehouse.
1995 - The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
2000 - Daniel Linstedt releases the Data Vault, enabling real-time, auditable data warehouses.

Dimensional V/S Normalized Approach for Storage of Data

There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach. The dimensional approach, whose supporters are referred to as "Kimballites", follows Ralph Kimball's view that the data warehouse should be modelled using a dimensional model/star schema. The normalized approach, also called the 3NF model, whose supporters are referred to as "Inmonites", follows Bill Inmon's view that the data warehouse should be modelled using an E-R model/normalized model.

In a dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order. (A small illustrative sketch of this partitioning appears below.)

A key advantage of the dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. Dimensional structures are easy for business users to understand, because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization's business processes and operational system, whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).

The main disadvantages of the dimensional approach are:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
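A rough sketch, in Python with SQLite, of the dimensional partitioning described above; the schema, keys and values are invented for illustration and are not a prescribed design.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE fact_sales (                 -- facts: numeric and additive measures
    product_key   INTEGER REFERENCES dim_product(product_key),
    customer_key  INTEGER REFERENCES dim_customer(customer_key),
    date_key      INTEGER REFERENCES dim_date(date_key),
    units_ordered INTEGER,
    price_paid    REAL
);
INSERT INTO dim_product  VALUES (1, 'Widget');
INSERT INTO dim_customer VALUES (1, 'Acme Traders');
INSERT INTO dim_date     VALUES (1, '2012-08-03');
INSERT INTO fact_sales   VALUES (1, 1, 1, 10, 250.0);
""")

# A typical dimensional query: join the fact table to a dimension and aggregate.
for row in con.execute("""
    SELECT p.product_name, SUM(f.units_ordered), SUM(f.price_paid)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.product_name"""):
    print(row)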
In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises, the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into a separate physical table when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage is that, because of the number of tables involved, it can be difficult for users both to:
1. Join data from different sources into meaningful information, and then
2. Access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

It should be noted that both normalized and dimensional models can be represented in entity-relationship diagrams, as both contain joined relational tables. The difference between the two models is the degree of normalization. These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).

In Information-Driven Business (Wiley 2010), Robert Hillard proposes an approach to comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models), but this extra information comes at the cost of usability. The technique measures information quantity in terms of information entropy and usability in terms of the Small Worlds data transformation measure.

Advantages of Data Warehousing

• Potentially high return on investment
• Competitive advantage
• Increased productivity of corporate decision makers

Data warehouses versus operational systems

The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as online analytical processing (OLAP) systems. A major reason for keeping the database and the data warehouse separate is to help promote the high performance of both systems.

Comparison between OLTP and OLAP

Feature          OLTP                                OLAP
Characteristic   Operational processing              Informational processing
Orientation      Transaction                         Analysis
Function         Day-to-day operations               Long-term informational requirements
Design           Application oriented                Subject oriented
Access           Read and write                      Mostly read
Data accessed    Tens of records                     Millions of records
View             Detailed                            Summarized
Priority         High performance and availability   High flexibility
User             Clerk, DBA                          Knowledge worker
Size             100 MB to GB                        100 GB to TB

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through the use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.
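The contrast between the two workloads can be sketched in a few lines of Python; the records and fields are invented, and the "queries" are deliberately trivial.

orders = [
    {"order_id": 1, "region": "West", "amount": 120.0, "status": "OPEN"},
    {"order_id": 2, "region": "East", "amount": 75.0,  "status": "OPEN"},
    {"order_id": 3, "region": "West", "amount": 210.0, "status": "SHIPPED"},
]

# OLTP-style operation: locate one transaction by key and write it back
# (read/write access to detailed, current data).
for order in orders:
    if order["order_id"] == 2:
        order["status"] = "SHIPPED"

# OLAP-style query: scan the whole history and summarise it
# (mostly read access, aggregated view).
revenue_by_region = {}
for order in orders:
    revenue_by_region[order["region"]] = (
        revenue_by_region.get(order["region"], 0.0) + order["amount"])
print(revenue_by_region)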
Operational Systems v/s Data Warehousing Systems

Operational systems                                  Data warehousing systems
Hold current data                                    Hold historic data
Data is dynamic                                      Data is largely static
Read/write accesses                                  Read-only accesses
Repetitive processing                                Ad hoc, complex queries
Transaction driven                                   Analysis driven
Application oriented                                 Subject oriented
Used by clerical staff for day-to-day operations     Used by top managers for analysis
Normalized data model (ER model)                     Denormalized data model (dimensional model)
Must be optimized for writes and small queries       Must be optimized for queries involving a large portion of the warehouse

Evolution in organization use

These terms refer to the level of sophistication of a data warehouse:

Offline operational data warehouse: Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.

Offline data warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data are stored in a data structure designed to facilitate reporting.

On-time data warehouse: Online integrated data warehousing represents the real-time stage of data warehousing: data in the warehouse is updated for every transaction performed on the source data.

Integrated data warehouse: These data warehouses assemble data from different areas of the business, so users can look up the information they need across other systems.

Sample applications

Some of the applications of data warehousing include:

• Agriculture
• Biological data analysis
• Call record analysis
• Churn prediction for telecom subscribers, credit card users, etc.
• Decision support
• Financial forecasting
• Insurance fraud analysis
• Logistics and inventory management
• Trend analysis

Problems with Data Warehousing

• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• High maintenance
• Long-duration projects
• Complexity of integration

Data Warehouse Architecture

A typical data warehousing architecture is illustrated below.

[Figure: a typical data warehousing architecture]

Data Warehouse Components & Architecture

The data in a data warehouse comes from the operational systems of the organization as well as from other external sources. These are collectively referred to as source systems. The data extracted from source systems is stored in an area called the data staging area, where the data is cleaned, transformed, combined and de-duplicated to prepare it for use in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data loaded from the data staging area is organized and stored for direct querying by end users, report writers and other applications.

The three different kinds of systems that are required for a data warehouse are:
1. Source systems
2. Data staging area
3. Presentation servers

The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server's ETL tool is called Data Transformation Services (DTS).
Each component and the tasks performed by it are explained below:

OPERATIONAL DATA
The sources of data for the data warehouse are:
o Data from mainframe systems in the traditional network and hierarchical formats.
o Data from relational DBMSs such as Oracle and Informix.
o In addition to these internal data, operational data also includes external data obtained from commercial databases and from databases associated with suppliers and customers.

LOAD MANAGER
The load manager performs all the operations associated with extracting and loading data into the data warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse. The size and complexity of this component will vary between data warehouses, and it may be constructed using a combination of vendor data loading tools and custom-built programs.

WAREHOUSE MANAGER
The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:
o Analysis of data to ensure consistency
o Transformation and merging of the source data from temporary storage into the data warehouse tables
o Creation of indexes and views on the base tables
o Denormalization
o Generation of aggregations
o Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.

QUERY MANAGER
The query manager performs all operations associated with the management of user queries. This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom-built programs. The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.

DETAILED DATA
This area of the warehouse stores all the detailed data in the database schema. In most cases the detailed data is not stored online but is aggregated to the next level of detail. However, the detailed data is added regularly to the warehouse to supplement the aggregated data.

LIGHTLY AND HIGHLY SUMMARIZED DATA
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it will be subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.

ARCHIVE AND BACKUP DATA
This area of the warehouse stores detailed and summarized data for the purposes of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.

META DATA
The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:
o The extraction and loading process: metadata is used to map data sources to a common view of information within the warehouse.
o The warehouse management process: metadata is used to automate the production of summary tables.
o The query management process: metadata is used to direct a query to the most appropriate data source.
The structure of the metadata will differ in each process, because the purpose is different. More about metadata will be discussed in the later lecture notes. (A small illustrative sketch of warehouse metadata follows below.)
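Purely as an illustration of the three uses of metadata listed above, the following Python sketch stores a few assumed metadata fields and uses them to direct a query; the field names and the helper function are hypothetical, not a standard metadata schema.

metadata = [
    {
        "warehouse_column": "sales.customer_id",
        "source_mapping":   "CRM.CUST_MASTER.CUSTNO",    # extraction and loading use
        "refresh_schedule": "daily 02:00",
        "summary_tables":   ["sales_by_region_month"],    # warehouse management use
        "preferred_source": "sales_by_region_month",      # query management use
    },
]

def best_source(column, needs_detail):
    """Direct a query to the most appropriate data source (detail vs summary)."""
    entry = next(m for m in metadata if m["warehouse_column"] == column)
    return column.split(".")[0] if needs_detail else entry["preferred_source"]

print(best_source("sales.customer_id", needs_detail=False))  # -> sales_by_region_month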
END-USER ACCESS TOOLS
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:
o Reporting and query tools
o Application development tools
o Executive information system tools
o Online analytical processing tools
o Data mining tools

THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS
In this section we discuss the four major processes of the data warehouse: extract (take data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure it is of sufficient quality to be used for decision making) and load (put the cleansed data into the data warehouse). The four processes from extraction through loading are often referred to collectively as data staging.

EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:
o Which files and tables are to be accessed in the source database?
o Which fields are to be extracted from them? (This is often done internally by an SQL SELECT statement.)
o What are those fields to be called in the resulting database?
o What is the target machine and database format of the output?
o On what schedule should the extraction process be repeated?

TRANSFORM
The operational databases may have been developed according to any set of priorities, and these keep changing with the requirements. Therefore those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistency. One of the most common transformation issues is attribute naming inconsistency: it is common for a given data element to be referred to by different names in different databases. Employee Name may be EMP_NAME in one database and ENAME in another. One set of data names is therefore picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:
o Characters must be converted from ASCII to EBCDIC or vice versa.
o Mixed-case text may be converted to all uppercase for consistency.
o Numerical data must be converted into a common format.
o Data formats have to be standardized.
o Measurements may have to be converted.
o Coded data (Male/Female, M/F) must be converted into a common format.
All these transformation activities are automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool. (A small illustrative sketch of these conversions follows below.)
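A small, assumed example of these conversions in Python; the field names (EMP_NAME, ENAME), the mappings and the sample record are illustrative only.

NAME_MAP = {"EMP_NAME": "EMPLOYEE_NAME", "ENAME": "EMPLOYEE_NAME"}
GENDER_MAP = {"M": "MALE", "F": "FEMALE"}

def transform(record):
    """Return a cleaned copy of one source record in the warehouse format."""
    out = {}
    for field, value in record.items():
        field = NAME_MAP.get(field.upper(), field.upper())  # consistent attribute names
        if isinstance(value, str):
            value = value.strip().upper()          # mixed-case text converted to uppercase
        if field == "GENDER":
            value = GENDER_MAP.get(value, value)   # coded data into a common format
        if field == "SALARY":
            value = float(value)                   # numeric data into a common format
        out[field] = value
    return out

print(transform({"ENAME": "riya shah", "gender": "f", "salary": "52000"}))
# {'EMPLOYEE_NAME': 'RIYA SHAH', 'GENDER': 'FEMALE', 'SALARY': 52000.0}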
CLEANSING
Information quality is the key consideration in determining the value of information. The developer of the data warehouse is not usually in a position to change the quality of its underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing. Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at one source, and inconsistent data and conflicting data when two or more sources are involved. There are several algorithms for cleaning the data, which will be discussed in the coming lecture notes.

LOADING
Loading often implies the physical movement of the data from the computer(s) storing the source database(s) to the computer that will store the data warehouse database, assuming they are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is the API from Oracle that provides the features to perform the ETL tasks on an Oracle data warehouse.

Types of Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

• Data Warehouse Architecture (Basic)
• Data Warehouse Architecture (with a Staging Area)
• Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

[Figure 1-2: Architecture of a Data Warehouse]

In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view. (A rough sketch of this idea appears at the end of this section.)

Data Warehouse Architecture (with a Staging Area)

With the basic architecture of Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

[Figure 1-3: Architecture of a Data Warehouse with a Staging Area]

Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

[Figure 1-4: Architecture of a Data Warehouse with a Staging Area and Data Marts]

Data Warehousing Systems

A data warehousing system can perform advanced analyses of operational data without impacting the operational systems. OLTP is very fast and efficient at recording business transactions, but not so good at providing answers to high-level strategic questions.
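The following rough Python sketch illustrates the summary idea mentioned above: a long aggregation is pre-computed once when the warehouse is loaded, so a typical query such as "August sales" becomes a simple lookup. The data and table shapes are invented, and this is only an analogy for what a materialized view provides, not Oracle's implementation.

sales = [
    {"month": "2012-08", "product": "P10", "amount": 120.0},
    {"month": "2012-08", "product": "P11", "amount": 75.0},
    {"month": "2012-09", "product": "P10", "amount": 210.0},
]

# Pre-compute the monthly summary at load time (the expensive part).
monthly_sales = {}
for row in sales:
    monthly_sales[row["month"]] = monthly_sales.get(row["month"], 0.0) + row["amount"]

# At query time, "August sales" is answered from the summary, not from the detail rows.
print(monthly_sales["2012-08"])  # 195.0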
Component Systems

Legacy Systems
Any information system currently in use that was built using previous technology generations. Most legacy systems are operational in nature, largely because the automation of transaction-oriented business processes had long been the priority of IT projects.

Source Systems
Any system from which data is taken for a data warehouse. A source system is often called a legacy system in a mainframe environment.

Operational Data Stores (ODS)
An ODS is a collection of integrated databases designed to support the monitoring of operations. Unlike the databases of OLTP applications (which are function oriented), the ODS contains subject-oriented, volatile, current, enterprise-wide detailed information. It serves as a system of record that provides comprehensive views of data in operational sources. Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is always current and is constantly updated. The ODS is an ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time. In short, the ODS is an integrated collection of clean data destined for the data warehouse.

Data Warehouse Design: An Introduction to Dimensional Modeling

Data warehouses are not easy to build. Their design requires a way of thinking that is just the opposite of the manner in which traditional computer systems are developed. Their construction requires radical restructuring of vast amounts of data, often of dubious or inconsistent quality, drawn from numerous heterogeneous sources. Their implementation strains the limits of today's IT. Not surprisingly, a large number of data warehouse projects fail.

Successful data warehouses are built for just one reason: to answer business questions. The type of questions to be addressed will vary, but the intention is always the same. Projects that deliver new and relevant information succeed. Projects that do not, fail. To deliver answers to businesspeople, one must understand their questions. Data warehouse design fuses business knowledge and technology know-how. The design of the data warehouse will mean the difference between success and failure.

The design of the data warehouse requires a deep understanding of the business, yet the task of design is undertaken by IT professionals, not business decision makers. Is it reasonable to expect the project to succeed? The answer is yes. The key is learning to apply technology toward business objectives. Most computer systems are designed to capture data; data warehouses are designed for getting data out. This fundamental difference suggests that the data warehouse should be designed according to a different set of principles.

Dimensional modeling is the name of a logical design technique often used for data warehouses. It is different from entity-relationship modeling. ER modeling is very useful for transaction capture in OLTP systems. Dimensional modeling is the only viable technique for delivering data to the end users in a data warehouse.

Comparison between ER and Dimensional Modeling

The characteristics of the ER model are well understood; its ability to support operational processes is its underlying characteristic.
The conventional ER models are constituted to:
• Remove redundancy in the data model,
• Facilitate retrieval of individual records having certain critical identifiers, and
• Therefore, optimize online transaction processing (OLTP) performance.

In contrast, the dimensional model is designed to support the reporting and analytical needs of a data warehouse system.

Why is ER not suitable for data warehouses?

• End users cannot understand or remember an ER model, and they cannot navigate one. There is no graphical user interface (GUI) that takes a general ER diagram and makes it usable by end users.
• ER modeling is not optimized for complex, ad hoc queries; it is optimized for repetitive, narrow queries.
• Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data, because it leads to highly normalized relational tables.

Introduction to Dimensional Modeling Concepts

The objective of dimensional modeling is to represent a set of business measurements in a standard framework that is easily understandable by end users. A dimensional model contains the same information as an ER model but packages the data in a symmetric format whose design goals are:
• User understandability
• Query performance
• Resilience to change

The main components of a dimensional model are fact tables and dimension tables. A fact table is the primary table in each dimensional model and is meant to contain measurements of the business. The most useful facts are numeric and additive. Every fact table represents a many-to-many relationship, and every fact table contains a set of two or more foreign keys that join to their respective dimension tables.

A fact depends on many factors. For example, sale amount, a fact, depends on product, location and time. These factors are known as dimensions. Dimensions are the factors on which a given fact depends. The sale amount fact can also be thought of as a function of three variables:

sales amount = f(product, location, time)

Likewise, in a sales fact table we may include other facts such as sales units and cost. Dimension tables are companion tables to a fact table in a star schema. Each dimension table is defined by its primary key, which serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables contain textual information.

To understand the concepts of facts, dimensions, and the star schema, consider the following scenario: imagine standing in the marketplace, watching the products being sold, and writing down the quantity sold and the sales amount each day for each product in each store. Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store). The information gathered can be stored in a fact table whose columns are the dimensions Date, Product, and Store and the measures Sale Unit, Sale Amount, and Cost.

The facts are Sale Unit, Sale Amount, and Cost (note that all are numeric and additive), which depend on the dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.
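A tiny Python sketch of the "fact as a function of its dimensions" idea above, with invented keys and values: each measurement sits at one intersection of Date, Product and Store, and the descriptive details live in companion dimension lookups.

dim_product = {"P10": {"name": "Tea", "category": "Beverages"}}
dim_store   = {"S01": {"city": "Mumbai"}}

# fact_sales[(date, product, store)] -> (sale_unit, sale_amount, cost)
fact_sales = {
    ("2012-08-03", "P10", "S01"): (12, 360.0, 300.0),
    ("2012-08-04", "P10", "S01"): (8, 240.0, 200.0),
}

def sales_amount(date, product, store):
    """sales amount = f(product, location, time): read one fact at one intersection."""
    return fact_sales[(date, product, store)][1]

print(dim_product["P10"]["name"], "in", dim_store["S01"]["city"], "->",
      sales_amount("2012-08-03", "P10", "S01"))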
Data Mining Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly. The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.
Evolutionary step: Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: Computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary step: Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, MicroStrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (emerging today)
  Business question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

Table 1. Steps in the Evolution of Data Mining

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, and quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high-performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:

More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints.
Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High-performance data mining allows users to explore the full depth of a database without preselecting a subset of variables.

More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years." Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."

The most commonly used techniques in data mining are:

• Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbour method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique. (A small sketch follows below.)
• Rule induction: the extraction of useful if-then rules from data based on statistical significance.

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.
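As a minimal illustration of the nearest neighbour method listed above, the following Python sketch classifies a new record from the classes of its k most similar historical records; the records, fields and distance measure are assumptions made for the example.

from collections import Counter

historical = [  # (monthly_income, age) -> responded to the mailing?
    ((20000, 25), "no"), ((52000, 40), "yes"),
    ((61000, 45), "yes"), ((24000, 30), "no"),
]

def classify(record, k=3):
    """Return the majority class among the k nearest historical records."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(historical, key=lambda item: dist(item[0], record))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((58000, 42)))  # -> 'yes'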
How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know, or what is going to happen next? The technique that is used to perform these feats in data mining is called modelling. Modelling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, that there are certain characteristics to the ocean currents, and that certain routes were likely taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built, it can then be used in similar situations where you don't know the answer.

For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use the business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history and so on. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

                                                         Customers    Prospects
General information (e.g. demographic data)              Known        Known
Proprietary information (e.g. customer transactions)     Known        Target

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right-hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance.

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand, new customers can be selectively targeted (a small sketch of this step follows below). Test marketing is an excellent source of data for this kind of modelling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predicting what is going to happen in the future.

                                                         Yesterday    Today    Tomorrow
Static information and current plans
(e.g. demographic data, marketing plans)                  Known        Known    Known
Dynamic information (e.g. customer transactions)          Known        Known    Target

Table 3 - Data Mining for Predictions
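Applying the simple prospecting model quoted above to prospect records might look like the following Python sketch; the rule threshold is taken from the example, while the prospect records and field names are invented.

def likely_heavy_user(prospect):
    """Model learned from customers: income above $60,000/year -> likely > $80/month."""
    return prospect["annual_income"] > 60000

prospects = [
    {"name": "A. Khan", "annual_income": 72000},
    {"name": "B. Rao",  "annual_income": 41000},
]

targets = [p["name"] for p in prospects if likely_heavy_user(p)]
print(targets)  # ['A. Khan'] -- these prospects are selectively targeted for the mailing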
If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already know the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.

Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as with flexible, interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

[Figure 1 - Integrated Data Mining Architecture]

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, Red Brick, and so on) and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
Components of Data Mining

Integration of a Data Mining System with a Database or Data Warehouse System

When a data mining (DM) system works with a database (DB) or data warehouse (DW) system, the possible integration schemes include:

No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system.

Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. (A small illustrative sketch appears at the end of this section.)

Semi-tight coupling: Semi-tight coupling means that, besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data mining functions) can be provided in the DB/DW system.

Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system.

Some issues we encounter in data mining

Mining methodology and user interaction issues:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation (the interestingness problem)

The performance of a data mining system is measured on the following issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
• Issues relating to the diversity of database types
• Handling of relational and complex types of data
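A rough Python/SQLite sketch of the loose coupling scheme described above: data is fetched from a repository managed by the DB/DW system, mined outside it, and the results are stored back in a designated table. The table names and the trivial "mining" step are assumptions made for illustration.

import sqlite3

dw = sqlite3.connect(":memory:")           # stands in for the DB/DW system
dw.execute("CREATE TABLE sales (region TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)",
               [("West", 120.0), ("West", 210.0), ("East", 75.0)])

# 1. Fetch data from the repository managed by the DB/DW system.
rows = dw.execute("SELECT region, amount FROM sales").fetchall()

# 2. Perform the mining outside the warehouse (here, a deliberately trivial summary).
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# 3. Store the mining results in a designated place in the warehouse.
dw.execute("CREATE TABLE mining_results (region TEXT, total_sales REAL)")
dw.executemany("INSERT INTO mining_results VALUES (?, ?)", totals.items())
print(dw.execute("SELECT * FROM mining_results").fetchall())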
Conclusion

Data warehousing and data mining are two important components of business intelligence. Data warehousing is necessary to analyze the business needs (Analysis), integrate data from several sources (Integration), and model the data in an appropriate manner (Data Modeling) so as to present the business information in the form of dashboards and reports (Reporting).

Bibliography

• Google
• Wikipedia
• Slideshare.com
• Authorstream.com
• Yahoo.com
• Google Images