Unit-1
1. What is ODS? How does it differ from a data warehouse? Explain. [Dec-14/Jan 2015] [8 marks]
2. What is data mining? Explain data mining and knowledge discovery. [June/July 2014] [10 marks]
3. What is an operational data store (ODS)? Explain with a neat diagram. [June/July 2014] [10 marks]
4. What is ETL? Explain the steps in ETL. [June/July 2015] [Dec 2013/Jan 2014] [10 marks]
5. What are the guidelines for implementing the data warehouse? [Dec-14/Jan 2015] [8 marks], [Dec 2013/Jan 2014] [10 marks], [June/July 2015]
Solutions
1. What is ODS? How does it differ from a data warehouse? Explain. [Dec-14/Jan 2015] [8 marks]
An operational data store (ODS) is a type of database that is often used as an interim logical area
for a data warehouse. It is a place where data from multiple source systems is stored. As a general
rule, an ODS is maintained in near real time and the data is usually at the transaction
level. Assuming that you don't intend to do the 'real time' thing, you are probably wise, as you suggest,
not to call it an ODS. So, if it isn't an ODS, what is it? Well, I would argue that it depends on the purpose
you have in mind. If you intend to use it for direct reporting, then the best term is probably enterprise data
warehouse (EDW). If you are going to feed data from it to data marts for reporting, then the term "data
warehouse" is probably best.
However, you say "I want to describe a database where I will store all my Corporate Data (used in multiple
systems) for use, not on a reporting basis, but as source data to other systems." Well, that's OK. If the feed
from it to the other systems isn't real time but batch, then the term "data warehouse" is probably still OK
because I know of several data warehouses that are also used as data sources in a limited way. However,
you also say "My end thought is I will eventually be able to minimize any system's need for interfaces to
the one that will connect to this data store system." That sort of implies that you intend to use the data store
you are creating as an operational database. That would be a huge undertaking since the new data store
would have to mimic the business logic and functionality of all the original systems. So I have to admit that
I don't know of any general term that would apply; mainly because so few people would attempt this.
2. What is data mining? Explain data mining and knowledge discovery. [June/July 2014] [10 marks]
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of
data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is
referred to as gold mining rather than rock or sand mining. Thus, data mining should have been
more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.
“Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts
of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of
precious nuggets from a great deal of raw material (Figure 1.3). Thus, such a misnomer that carries
both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly
different meaning to data mining, such as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step
in the process of knowledge discovery. Among the major components of a typical data mining
system are the following.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may
also be included. Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern evaluation module: This component typically employs interestingness measures and interacts with
the data mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation
module may be integrated with the mining module, depending on the implementation of the data
mining method used. For efficient data mining, it is highly recommended to push the evaluation
of pattern interestingness as deep as possible into the mining process so as to confine the search to
only the interesting patterns.
User interface: This module communicates between users and the
data mining system, allowing the user to interact with the system by specifying a data mining query
or task, providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined patterns, and
visualize the patterns in different forms.
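As a rough illustration of how a pattern evaluation module might apply interestingness thresholds, the sketch below filters association rules by minimum support and confidence. The Rule type, the sample rules, and the threshold values are all hypothetical; they are not taken from the text or from any particular data mining system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A discovered association rule with its interestingness measures (hypothetical type)."""
    antecedent: str
    consequent: str
    support: float      # fraction of all transactions that contain both sides
    confidence: float   # fraction of antecedent transactions that also contain the consequent


def filter_interesting(rules, min_support, min_confidence):
    """Keep only the rules that meet both interestingness thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Made-up rules standing in for the output of a data mining engine.
discovered = [
    Rule("computer", "software", support=0.02, confidence=0.60),
    Rule("bread", "milk", support=0.30, confidence=0.45),
    Rule("printer", "paper", support=0.10, confidence=0.80),
]

interesting = filter_interesting(discovered, min_support=0.05, min_confidence=0.50)
for r in interesting:
    print(f"{r.antecedent} => {r.consequent} (support={r.support}, confidence={r.confidence})")
```

This sketch filters patterns after they have been discovered. Pushing the same thresholds into the mining step itself, as the text recommends, would prune uninteresting candidates earlier.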
3. What is an operational data store (ODS)? Explain with a neat diagram. [June/July 2014] [10 marks]
The major task of on-line operational database systems is to perform on-line transaction and query
processing. These systems are called on-line transaction processing (OLTP) systems. They cover
most of the day-to-day operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the
other hand, serve users or knowledge workers in the role of data analysis and decision making.
Such systems can organize and present data in various formats in order to accommodate the diverse
needs of the different users. These systems are known as on-line analytical processing (OLAP)
systems. The major distinguishing features between OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction
and query processing by clerks, clients, and information technology professionals. An OLAP
system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
Data contents: An OLTP system manages current data that, typically, are too detailed to be easily
used for decision making. An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages information at different
levels of granularity. These features make the data easier to use in informed decision making.
Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or snowflake
model and a subject-oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP system
often spans multiple versions of a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that originates from different
organizations, integrating information from many data stores. Because of their huge volume,
OLAP data are stored on multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
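The contrast in access patterns can be sketched with Python's built-in sqlite3 module. The table name, column names, and sample rows below are hypothetical, and a single in-memory database is used only to keep the example short; in practice the OLTP and OLAP workloads would normally run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, sale_year INTEGER, amount REAL)"
)

# OLTP-style access: a short, atomic transaction that records individual sales.
# Used as a context manager, the connection commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO sales (region, sale_year, amount) VALUES (?, ?, ?)",
        [
            ("north", 2013, 100.0),
            ("north", 2014, 150.0),
            ("south", 2013, 200.0),
            ("south", 2014, 50.0),
        ],
    )

# OLAP-style access: a read-only query that summarizes many detailed rows.
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals_by_region)
```

The insert touches a handful of rows and must be atomic, while the aggregate query reads the whole table to produce a summary at a coarser level of granularity.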
4. What is ETL? Explain the steps in ETL. [June/July 2015] [10 marks] [Dec 2013] [10 marks]
The data analysis task involves data integration, which combines data from multiple sources into
a coherent data store, as in data warehousing. These sources may include multiple databases, data
cubes, or flat files. There are a number of issues to consider during data integration. Schema
integration and object matching can be tricky. How can equivalent real-world entities from
multiple data sources be matched up? This is referred to as the entity identification problem. For
example, how can the data analyst or the computer be sure that customer_id in one database and
cust_number in another refer to the same attribute? Examples of metadata for each attribute include
the name, meaning, data type, and range of values permitted for the attribute, and null rules for
handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema
integration. The metadata may also be used to help transform the data (e.g., where data codes for
pay_type in one database may be “H” and “S”, and 1 and 2 in another). Hence, this step also relates
to data cleaning, as described earlier.
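A minimal sketch of metadata-driven schema integration follows. The source names (db_a, db_b), the metadata layout, the underscore spelling of the attribute names, and the assumption that code 1 corresponds to “H” and code 2 to “S” are all illustrative choices, not facts given in the text.

```python
# Hypothetical metadata describing how each source names its attributes
# and how it encodes pay type. The 1 -> "H" and 2 -> "S" pairing is assumed.
SOURCE_METADATA = {
    "db_a": {
        "columns": {"customer_id": "customer_id", "pay_type": "pay_type"},
        "pay_type_codes": {"H": "H", "S": "S"},
    },
    "db_b": {
        "columns": {"cust_number": "customer_id", "pay_type": "pay_type"},
        "pay_type_codes": {1: "H", 2: "S"},
    },
}


def integrate(record, source):
    """Rename attributes and recode pay type using the metadata for one source."""
    meta = SOURCE_METADATA[source]
    unified = {}
    for source_name, value in record.items():
        target_name = meta["columns"][source_name]
        if target_name == "pay_type":
            value = meta["pay_type_codes"][value]
        unified[target_name] = value
    return unified


row_a = integrate({"customer_id": 101, "pay_type": "H"}, "db_a")
row_b = integrate({"cust_number": 101, "pay_type": 1}, "db_b")
print(row_a)
print(row_b)
```

Once both records are expressed in the same schema and coding, they can be compared or merged directly, which is the point of resolving entity identification before loading the data.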
Redundancy is another important issue. An attribute (such as annual revenue, for instance) may
be redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in
attribute or dimension naming can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis. Given two attributes, such analysis can
measure how strongly one attribute implies the other, based on the available data. For numerical
attributes, we can evaluate the correlation between two attributes by computing the correlation
coefficient (also known as Pearson's product-moment coefficient).
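One common way to measure the correlation between two numerical attributes is the Pearson correlation coefficient. The sketch below computes it with the standard library only; the attribute names and values are made up, with annual revenue deliberately derived from monthly revenue so that the redundancy is visible.

```python
import math


def pearson_correlation(a, b):
    """Return the Pearson correlation coefficient of two equally long numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Sum of products of deviations from the means (numerator of the coefficient).
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Square roots of the summed squared deviations (denominator terms).
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if spread_a == 0 or spread_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cross / (spread_a * spread_b)


# Hypothetical data: annual revenue is derived from monthly revenue,
# so one of the two attributes is redundant.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson_correlation(monthly_revenue, annual_revenue)
print(round(r, 6))
```

A coefficient close to +1 or -1 suggests that one attribute can be largely predicted from the other, so one of them may be a candidate for removal; a value near 0 indicates no linear relationship.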
5. What are the guidelines for implementing the data warehouse? [Dec-14/Jan 2015] [8 marks]
[Dec 2013/Jan 2014] [10 marks], [June/July] [10 marks]
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms
have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that
with competition mounting in every industry, data warehousing is the latest must-have marketing
weapon: a way to retain customers by learning more about their needs.
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous
definition. Loosely speaking, a data warehouse refers to a database that is maintained separately
from an organization’s operational databases. Data warehouse systems allow for the integration of
a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis. According to William H. Inmon, a leading
architect in the construction of data warehouse systems, “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection of data in support of management’s decision
making process”. This short, but comprehensive definition presents the major features of a data
warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile,
distinguish data warehouses from other data repository systems, such as relational database
systems, transaction processing systems, and file systems.