A Paper Presentation on
Information Repository with Knowledge Discovery

Abstract

Organisations today suffer from a malaise of data overflow. Developments in transaction processing technology have given rise to a situation where the amount and rate of data capture are very high, but the processing of this data into information that can be used for decision making is not developing at the same pace. Data warehousing and data mining (of both data and text) provide a technology that enables decision-makers in the corporate sector and government to process this huge amount of data in a reasonable amount of time and to extract intelligence and knowledge in near real time.

A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from non-homogeneous sources as they are generated and are processed using process managers (load, warehouse, and query). This makes it easier and more efficient to run queries over data that originally came from different sources. It also enables people to take informed decisions.

Various technologies for extracting new insight from the data warehouse have emerged, which we classify loosely as "data mining techniques". Data mining systems improve an organization's effectiveness, efficiency and value by increasing the usefulness of the knowledge the organization possesses.

Our paper focuses on the need for information repositories and the discovery of knowledge, and then gives an overview of the much-hyped fields of data warehousing and data mining.

CONTENTS

Introduction
What is Data-Warehousing?
Warehousing Functions
Architecture Of Data Warehouse
What is Data Mining?
Data Mining as a part of Knowledge Discovery
Goals of Data Mining and Knowledge Discovery
Bibliography

Introduction

This paper presents an overview of how data warehouses serve as a data source for data mining. Data warehousing is one of the most important strategic initiatives in the information systems field. Since the early 1990s, data warehouses have been at the forefront of information technology applications as a way for organizations to use digital information effectively for business planning and decision making. Hence, an understanding of data warehouse system architecture is or will be important in our roles and responsibilities in information management.

Data warehousing has emerged as an increasingly popular and powerful information technology concept for turning these huge islands of data into meaningful information for better business decisions.

Most fundamentally, a data warehouse is created to provide a dedicated source of data to support decision-making applications. Rather than having data scattered across a variety of systems, a data warehouse integrates the data into a single repository. It is for this reason that a data warehouse provides "a single version of the truth." All users and applications access the same data. Because users access better data, their ability to analyze data and make decisions improves.

Most simply, a data warehouse is a collection of data created to support decision making. Users and applications access the warehouse for the data that they need. A warehouse provides a data infrastructure. It eliminates one reason for the failure of many decision support applications: the lack of quality data.

A data warehouse has the following four characteristics.

Subject-oriented: All relevant data about a subject is gathered and stored as a single set in a useful format.

Integrated: Data are collected from multiple systems and integrated around subjects.
The data are stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures, and physical attributes, even when the underlying operational systems store the data differently.

Non-volatile: Users cannot change or update the data; the data warehouse is read-only. This ensures that all users are working with the same data. The warehouse is updated, but through IT-controlled load processes rather than by users.

Time-variant: A warehouse maintains historical data (i.e., it includes time as a variable). Unlike transactional systems, where only recent data, such as for the last day, week, or month, are maintained, a warehouse may store years of data. Historical data is needed to detect deviations, trends, and long-term relationships.

Whereas a data warehouse is a repository of data, data warehousing is the entire process. As shown in Figure 1, data warehousing encompasses a broad range of activities, all the way from extracting data from source systems to using the data for decision-making purposes. Specifically, it includes data extraction, transformation, and loading, as well as access to the data by end users and applications.

Conceptually, a data warehouse environment has three parts:

Information Sources always include the core operational systems, which form the backbone of day-to-day activities. It is these systems that have traditionally provided management information to support decision making.

Decision Support Tools are used to analyze the information stored in the warehouse, typically to identify trends and new business opportunities.

The Data Warehouse itself is the bridge between the operational systems and the decision support tools. It holds a copy of much of the operational system data in a logical structure that is more conducive to analysis.
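The extraction, transformation, and loading steps and the "integrated", "non-volatile" and "time-variant" characteristics can be illustrated with a minimal Python sketch. It assumes two hypothetical source systems (a billing system and a CRM system) whose field names, gender codes, amounts, and load date are all invented for illustration; a real warehouse load would be considerably more involved.

```python
from datetime import date

# Hypothetical extracts from two operational systems that record the same
# kind of fact with different names, codes, and units (all values invented).
billing_rows = [{"cust": "C1", "sex": "M", "amount_cents": 1250}]
crm_rows = [{"customer_id": "C2", "gender": "female", "amount": "8.40"}]

# One agreed encoding for the warehouse, whatever the source system uses.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def transform_billing(row):
    """Map a billing record onto the warehouse's common conventions."""
    return {
        "customer_id": row["cust"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "amount": row["amount_cents"] / 100,
    }


def transform_crm(row):
    """Map a CRM record onto the same conventions."""
    return {
        "customer_id": row["customer_id"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "amount": float(row["amount"]),
    }


def load(warehouse, rows, load_date):
    """Return a new warehouse with the rows appended and stamped with the
    load date; rows that were loaded earlier are left untouched."""
    stamped = tuple({**row, "load_date": load_date} for row in rows)
    return warehouse + stamped


warehouse = ()
cleaned = [transform_billing(r) for r in billing_rows]
cleaned += [transform_crm(r) for r in crm_rows]
warehouse = load(warehouse, cleaned, date(2024, 1, 31))

for row in warehouse:
    print(row)
```

Keeping the load date on every row is what lets later analysis treat time as a variable, and routing all changes through `load` mirrors the idea that the warehouse is updated only by controlled load processes.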
The data warehouse, which is refreshed in scheduled bursts from operational systems and from relevant external data sources, provides a single, consistent view of corporate data while leaving operational systems unaffected.

Data Warehouse Functions

The main function of a data warehouse is to deliver enterprise-wide data in the format that is most useful to end users, regardless of their location. Data warehousing is used for:
1. Increasing the speed and flexibility of analysis.
2. Providing a foundation for enterprise-wide integration and access.
3. Improving or re-inventing business processes.
4. Gaining a clear understanding of customer behavior.

Data Warehouse Goals: The fundamental goal is to give users appropriate access to a homogenized and comprehensive view of the organization. A data warehouse also supports forecasting, planning and decision-making processes. Additional goals are to achieve information consistency and to provide security and adaptability.

ARCHITECTURE OF DATA WAREHOUSE:

Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts:

Operational Database / External Database Layer: Operational systems process data to support critical operational needs. To do that, operational databases have historically been created to provide an efficient processing structure for a relatively small number of well-defined business transactions.

Information Access Layer: This is the layer that the end user deals with directly. In particular, it represents the tools that the end user normally uses day to day, e.g., Excel, Lotus 1-2-3, etc.

Data Access Layer: This layer allows the Information Access Layer to talk to the Operational Layer.

Data Directory (Meta-data) Layer: Meta-data is the data about data within the enterprise.
Record descriptions in a COBOL program, for example, are meta-data.

Process Management Layer: This layer schedules the various tasks that must be accomplished to build and maintain the data warehouse and the data directory.

Application Messaging Layer: This layer transports information around the enterprise computing network. Application messaging is also referred to as "middleware", but it can involve more than just networking protocols.

Data Warehouse (Physical) Layer: The (core) data warehouse is where the actual data used primarily for informational purposes resides. In a physical data warehouse, copies (in some cases many copies) of operational and/or external data are actually stored in a form that is easy to access.

Data Staging Layer: Data staging is also called copy management or replication management, but in fact it includes all of the processes necessary to select, edit, summarize, combine and load data warehouse and information access data from operational and/or external databases.

Classification of data warehouses

Data warehouses can be classified into three types:

Enterprise data warehouse: An enterprise data warehouse provides a central database for decision support throughout the enterprise.

Operational data store (ODS): This has a broad, enterprise-wide scope, but unlike a true enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.

Data mart: A data mart is a subset of a data warehouse that supports a particular region, business unit or business function.

Data Marts

A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own database. A data mart tends to contain data focused at the department level or on a specific business area. The data can exist at both the detail and summary levels.
The data mart can be populated with data taken directly from operational sources, similar to a data warehouse, or with data taken from the data warehouse itself. Because the volume of data in a data mart is less than that in a data warehouse, query processing is often faster.

Characteristics of a data mart include:
1) Quicker and simpler implementation.
2) Lower implementation cost.
3) Needs of a specific business unit or function met.
4) Protection of sensitive information stored elsewhere in the data warehouse.
5) Faster response times due to lower volumes of data.
6) Distribution of data marts to user organizations.
7) Built from the bottom upward.

What is Data Mining?

Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as "the science of extracting useful information from large data sets or databases". It applies search techniques from artificial intelligence to these problems, and it is the analysis of large data sets to discover patterns of interest. Many of the early data mining software packages were based on one algorithm.

Database mining or data mining (DM), formally termed Knowledge Discovery in Databases (KDD), is a process that aims to use existing data to discover new facts and to uncover new relationships previously unknown even to experts thoroughly familiar with the data. It is like extracting precious metals (say, gold) or gems, hence the term "mining": it is based on filtering and assaying a mountain of data "ore" in order to get "nuggets" of knowledge.

The data mining process is not a simple function. It often involves a variety of feedback loops, because while applying a particular technique the user may determine that the selected data is of poor quality or that the applied technique did not produce results of the expected quality. In such cases, the user has to repeat and refine earlier steps, possibly even restarting the entire process from the beginning.
Data mining is a capability consisting of the hardware, software, "warm ware" (skilled labor) and data needed to support the recognition of previously unknown but potentially useful relationships. It supports the transformation of data into information, knowledge and wisdom, a cycle that should take place in every organization. Companies are now using this capability to understand more about their customers, to design targeted sales and marketing campaigns, to predict what customers will buy and how frequently, and to spot trends in customer preferences that lead to new product development.

Data Mining as a Part of the Knowledge Discovery Process

Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases, the first four of which prepare the data:

Data selection: Data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.

Data cleansing: The cleansing process may then correct invalid zip codes or eliminate records with incorrect phone prefixes.

Enrichment: This phase typically enhances the data with additional sources of information.

Data transformation and encoding: This may be done to reduce the amount of data.

The remaining two phases are data mining itself and the reporting and display of the discovered information.

Goals of Data Mining and Knowledge Discovery

The goals of data mining fall into the following classes:

Prediction: Data mining can show how certain attributes within the data will behave in the future.

Identification: Data patterns can be used to identify the existence of an item, an event, or an activity.

Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.

Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.
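The data preparation phases of the knowledge discovery process described in this section (selection, cleansing, enrichment, and transformation) can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the sales records, the zip-to-income lookup, and every field name are hypothetical, and invalid zip codes are simply dropped rather than corrected.

```python
import re

# Hypothetical sales records; all field names and values are invented.
records = [
    {"item": "tea", "region": "north", "zip": "10001", "amount": 4.0},
    {"item": "tea", "region": "north", "zip": "1000X", "amount": 6.0},
    {"item": "tea", "region": "south", "zip": "73301", "amount": 5.0},
    {"item": "jam", "region": "north", "zip": "10002", "amount": 3.0},
]

# Hypothetical additional source of information used for enrichment.
zip_income_band = {"10001": "high", "10002": "medium"}

ZIP_PATTERN = re.compile(r"\d{5}")


def select(rows, region):
    """Data selection: keep only the rows from one region."""
    return [r for r in rows if r["region"] == region]


def cleanse(rows):
    """Data cleansing: drop rows whose zip code is not exactly five digits."""
    return [r for r in rows if ZIP_PATTERN.fullmatch(r["zip"])]


def enrich(rows, income_by_zip):
    """Enrichment: attach an income band looked up from the extra source."""
    return [
        {**r, "income_band": income_by_zip.get(r["zip"], "unknown")}
        for r in rows
    ]


def transform(rows):
    """Transformation: reduce the data to total sales per (item, income band)."""
    totals = {}
    for r in rows:
        key = (r["item"], r["income_band"])
        totals[key] = totals.get(key, 0.0) + r["amount"]
    return totals


prepared = transform(enrich(cleanse(select(records, "north")), zip_income_band))
print(prepared)
```

The output of `transform` is the reduced data set that a mining technique would then work on; the mining and reporting phases are deliberately left out of this sketch.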
Conclusion

Data warehousing provides the means to change raw data into information for making effective business decisions: the emphasis is on information, not data. The data warehouse is the hub for decision support data. A good data warehouse will provide the RIGHT data to the RIGHT people at the RIGHT time: RIGHT NOW! While the data warehouse organizes data for business analysis, the Internet has emerged as the standard for information sharing.

Data warehousing and data mining play an important role in storing data and in picking out the particular data a user needs. Through mining, it has become much easier for a user to get the information he or she wants. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Bibliography

Eckerson, W.W. (1988) "PostChasm Warehousing," Journal of Data Warehousing.
Watson, H.J. Recent Developments in Data Warehousing.
Han, Jiawei and Kamber, Micheline. Data Mining: Concepts and Techniques.

WEBSITES

www.datawarehousingonline.com
www.pcc.ac.uk.com
www.dsstechniques.com