Data Warehousing to Data Mining

Abstract: This presentation explores the concepts and techniques of data warehousing and data mining, a promising and flourishing frontier in database systems and new database applications. Data mining is also popularly referred to as knowledge discovery in databases.

1.0 Introduction

A data warehouse provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Many organizations have found that data warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses.

What exactly is a data warehouse? A widely used definition describes it as a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making. The four key terms are explained below.

1.1 Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

1.2 Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

1.3 Time-variant: Data are stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

1.4 Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.
Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms.

A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and ad hoc queries, analytical reporting, and decision making.

2.0 Differences between operational database systems and data warehouses

The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems. The major distinguishing features of OLTP and OLAP are summarized as follows.

2.1 Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

2.2 Data contents: An OLTP system manages current data that are typically too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.
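Summarization at different levels of granularity can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the summarize helper are hypothetical and are not taken from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical detail records: (year, quarter, item, dollars_sold).
SALES = [
    (2023, "Q1", "printer", 120.0),
    (2023, "Q1", "laptop", 900.0),
    (2023, "Q2", "printer", 80.0),
    (2024, "Q1", "laptop", 1100.0),
]
FIELDS = ("year", "quarter", "item")


def summarize(rows, group_by):
    """Sum dollars_sold over the rows, grouped by the named fields."""
    positions = [FIELDS.index(name) for name in group_by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column holds the measure
    return dict(totals)


# The same detail data viewed at a finer and at a coarser granularity.
by_quarter = summarize(SALES, ("year", "quarter"))
by_year = summarize(SALES, ("year",))
```

Grouping by fewer fields gives a coarser summary of the same detail data, which is the kind of aggregation an OLAP system provides as a built-in facility.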
These features make the data easier to use in informed decision making.

2.3 Database design: An OLTP system usually adopts an entity-relationship data model and an application-oriented database design. An OLAP system adopts either a star or a snowflake model.

2.4 View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. Because of their huge volume, OLAP data are stored on multiple storage media.

2.5 Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. Accesses to OLAP systems, by contrast, are mostly read-only operations, although many may be complex queries.

3.0 The process of data warehouse design

The warehouse design process consists of the following steps:

3.1 Choose a business process to model: for example, orders, invoices, shipments, inventory, account administration, sales, and the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.

3.2 Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process.
For example, the grain may be individual transactions or individual daily snapshots.

3.3 Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.

3.4 Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities such as dollars sold.

4.0 Data warehouse models: enterprise warehouse, data mart, and virtual warehouse

4.1 Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers. It typically contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes or beyond.

4.2 Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. Data marts are usually implemented on low-cost departmental servers that are UNIX or Windows based. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

4.3 Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
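The idea of a virtual warehouse can be sketched with Python's built-in sqlite3 module. The orders table, the view, and the snapshot table below are hypothetical names chosen for illustration. SQLite has no materialized views, so an ordinary table created from a query stands in for a materialized summary here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT,
        item     TEXT,
        amount   REAL
    );
    INSERT INTO orders (customer, item, amount) VALUES
        ('acme', 'printer', 120.0),
        ('acme', 'laptop', 900.0),
        ('zenith', 'laptop', 1100.0);

    -- Virtual warehouse: a summary view, recomputed on every query.
    CREATE VIEW sales_by_item AS
        SELECT item, SUM(amount) AS dollars_sold
        FROM orders GROUP BY item;

    -- Stand-in for a materialized summary: a snapshot table that is
    -- computed once and not refreshed automatically.
    CREATE TABLE sales_by_customer AS
        SELECT customer, SUM(amount) AS dollars_sold
        FROM orders GROUP BY customer;
""")


def query(sql):
    """Run a read-only query and return all rows."""
    return conn.execute(sql).fetchall()


view_rows = query(
    "SELECT item, dollars_sold FROM sales_by_item ORDER BY item")
snapshot_rows = query(
    "SELECT customer, dollars_sold FROM sales_by_customer ORDER BY customer")
```

Every query against sales_by_item runs the aggregation on the operational table again, which is why a virtual warehouse needs spare capacity on the operational servers. The snapshot table answers without touching orders but goes stale until it is rebuilt.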
5.0 Types of OLAP servers: ROLAP versus MOLAP versus HOLAP

5.1 Relational OLAP (ROLAP) servers: These are intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services.

5.2 Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Note that with multidimensional data stores, storage utilization may be low if the data set is sparse. Many MOLAP servers adopt a two-level storage representation to handle sparse and dense data sets: the dense subcubes are identified and stored as array structures, while the sparse subcubes employ compression technology for efficient storage utilization.

5.3 Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
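As a rough illustration of this combination, the sketch below keeps detail rows in a relational store (SQLite, through Python's sqlite3 module) and precomputed totals in an in-memory cube keyed by dimension values. The table, the cube layout, and the function names are all hypothetical, and a real HOLAP server is considerably more elaborate.

```python
import sqlite3

# Relational side (ROLAP-like): detail rows live in a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2023, "printer", 120.0), (2023, "laptop", 900.0),
     (2023, "printer", 80.0), (2024, "laptop", 1100.0)],
)

# Multidimensional side (MOLAP-like): precomputed totals per (year, item)
# cell. A dict stores only the non-empty cells, which suits a sparse cube.
cube = {
    (year, item): total
    for year, item, total in conn.execute(
        "SELECT year, item, SUM(amount) FROM sales GROUP BY year, item"
    )
}


def total_sold(year, item):
    """Answer a summary query from the cube without reading detail rows."""
    return cube.get((year, item), 0.0)


def detail_rows(year, item):
    """Drill down to the detail amounts held in the relational store."""
    cursor = conn.execute(
        "SELECT amount FROM sales WHERE year = ? AND item = ? ORDER BY amount",
        (year, item),
    )
    return [amount for (amount,) in cursor]
```

Summary questions such as total_sold(2023, "printer") are answered by a single lookup in the cube, while detail_rows(2023, "printer") goes back to the relational table only when the individual records are needed.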
For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.

From online analytical processing to online analytical mining

Among the many different paradigms and architectures of data mining systems, online analytical mining (OLAM), which integrates online analytical processing with data mining and mining knowledge in multidimensional databases, is particularly important for the following reasons: the high quality of data in data warehouses; the available information processing infrastructure surrounding data warehouses; OLAP-based exploratory data analysis; and online selection of data mining functions.

What is data mining?

Data mining refers to extracting or “mining” knowledge from large amounts of data. Note that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Data mining would therefore have been more appropriately named knowledge mining from data. Many other terms carry a similar meaning, such as knowledge mining from databases, knowledge extraction, pattern analysis, data archaeology, and data dredging.

Knowledge discovery consists of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge presentation techniques are used to present the mined knowledge to the user)

1.0 Architecture of a typical data mining system

The architecture of a typical data mining system may have the following major components.

1.1 Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

1.2 Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

1.3 Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.

1.4 Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

1.5 Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.

1.6 Graphical user interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

2.0 Classification of data mining systems

Data mining systems can be categorized according to various criteria, as follows.

2.1 Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria, each of which may require its own data mining technique.

2.2 Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities such as characterization, discrimination, association, classification, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple or integrated data mining functionalities.

2.3 Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed.

2.4 Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For example, there could be data mining systems tailored specifically for finance, telecommunications, DNA, stock markets, and so on.

Conclusion: A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP.
Data mining systems are mainly used for data analysis and decision making.

Bibliography:
1) Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
2) Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education.