Unit-1

1. What is an ODS? How does it differ from a data warehouse? Explain. [Dec 2014/Jan 2015] [8 marks]
2. What is data mining? Explain data mining and knowledge discovery. [June/July 2014] [10 marks]
3. What is an operational data store (ODS)? Explain with a neat diagram. [June/July 2014] [10 marks]
4. What is ETL? Explain the steps in ETL. [June/July 2015] [Dec 2013/Jan 2014] [10 marks]
5. What are the guidelines for implementing a data warehouse? [Dec 2014/Jan 2015] [8 marks] [Dec 2013/Jan 2014] [10 marks] [June/July 2015]

Solutions

1. What is an ODS? How does it differ from a data warehouse? Explain. [Dec 2014/Jan 2015] [8 marks]

An operational data store (ODS) is a type of database that is often used as an interim staging area for a data warehouse. It is a place where data from multiple source systems is stored. As a general rule, an ODS is maintained in near real time, and the data is usually held at the transaction level.

A data store that is not refreshed in near real time should probably not be called an ODS. What it should be called instead depends on its purpose. If it is used for direct reporting, the best term is probably enterprise data warehouse (EDW). If it feeds data to data marts for reporting, the term data warehouse is probably best. A database that stores all corporate data, not for reporting but as source data for other systems, can also reasonably be called a data warehouse, provided the feed to those systems is batch rather than real time; several data warehouses are used as data sources in this limited way.

A more ambitious goal is sometimes proposed: using such a store to minimize every system's need for point-to-point interfaces, so that each system connects only to the central store.
Minimizing interfaces in that way implies using the data store as an operational database. That would be a huge undertaking, since the new store would have to mimic the business logic and functionality of all the original systems; there is no generally accepted term for such a system, mainly because so few organizations attempt it.

2. What is data mining? Explain data mining and knowledge discovery. [June/July 2014] [10 marks]

Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data. The term is actually a misnomer: the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Data mining should therefore more appropriately have been named "knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets in a great deal of raw material (Figure 1.3). Thus the misnomer that carries both "data" and "mining" became the popular choice. Many other terms carry a similar or slightly different meaning, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data (KDD); alternatively, others view data mining as simply an essential step in the process of knowledge discovery.

The architecture of a typical data mining system has the following major components.

Knowledge base: This is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

3. What is an operational data store (ODS)? Explain with a neat diagram. [June/July 2014] [10 marks]

The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as on-line analytical processing (OLAP) systems.

The major distinguishing features between OLTP and OLAP are summarized as follows.

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions.
Such a system requires concurrency control and recovery mechanisms. Accesses to OLAP systems, in contrast, are mostly read-only operations (since most data warehouses store historical rather than up-to-date information), although many could be complex queries.

4. What is ETL? Explain the steps in ETL. [June/July 2015] [10 marks] [Dec 2013] [10 marks]

The data analysis task involves data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration.

Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer id in one database and custnumber in another refer to the same attribute? Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data (e.g., where the data codes for pay type may be "H" and "S" in one database and 1 and 2 in another). Hence, this step also relates to data cleaning, as described earlier.

Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For numerical attributes, we can evaluate the correlation between two attributes by computing the correlation coefficient.

5. What are the guidelines for implementing a data warehouse?
[Dec 2014/Jan 2015] [8 marks] [Dec 2013/Jan 2014] [10 marks] [June/July] [10 marks]

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that, with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to retain customers by learning more about their needs.

Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.

According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process." This short but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.