DATA MINING AND DATA WAREHOUSING
1-
Discuss the typical OLAP operations with an example.
An OLAP cube is a set of data, organized in a way that facilitates non-predetermined queries
for aggregated information, or in other words, online analytical processing.[1] OLAP is one of the computer-based techniques for analyzing business data that are collectively called business intelligence.[2] OLAP
cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example, a
company might wish to analyze some financial data by product, by time-period, by city, by type of
revenue and cost, and by comparing actual data with a budget. These additional methods of analyzing
the data are known as dimensions.[3] Because there can be more than three dimensions in an OLAP
system, the term hypercube is sometimes used. The OLAP cube consists of numeric facts
called measures, which are categorized by dimensions. The cube metadata (structure) may be created
from a star schema or snowflake schema of tables in a relational database. Measures are derived from the
records in the fact table and dimensions are derived from the dimension tables. A financial analyst might
want to view or pivot the data in various ways, such as displaying all the cities down the page and all the
products across a page. This could be for a specified period, version and type of expenditure. Having
seen the data in this particular way, the analyst might then immediately wish to view it in another way. The
cube could effectively be re-oriented so that the data displayed now has periods across the page and
type of cost down the page. Because this re-orientation involves re-summarizing very large amounts of
data, this new view of the data has to be generated efficiently to avoid wasting the analyst's time, i.e.,
within seconds, rather than the hours a relational database and conventional report-writer might have
taken.[4] Each of the elements of a dimension could be summarized using a hierarchy.[5] The hierarchy is
a series of parent-child relationships, typically where a parent member represents the consolidation of the
members which are its children. Parent members can be further aggregated as the children of another
parent.[6] For example, May 2005 could be summarized into Second Quarter 2005, which in turn would be
summarized into the Year 2005. Similarly, the cities could be summarized into regions, countries and then
global regions; products could be summarized into larger categories; and cost headings could be grouped
into types of expenditure. Conversely, the analyst could start at a highly summarized level, such as the
total difference between the actual results and the budget, and drill down into the cube to discover which
locations, products and periods had produced this difference.
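A minimal sketch in plain Python of the operations described above (roll-up along the time hierarchy, pivot, and a slice on one dimension), assuming a tiny hypothetical sales fact table; drill-down is simply the reverse of the roll-up:

    from collections import defaultdict

    # Hypothetical fact rows (city, product, month, sales) -- illustration only.
    facts = [
        ("Delhi",  "Laptop", "May 2005",  120),
        ("Delhi",  "Laptop", "June 2005", 150),
        ("Mumbai", "Phone",  "May 2005",   90),
        ("Mumbai", "Laptop", "June 2005", 200),
    ]

    # Concept hierarchy for the time dimension (month -> quarter), as in the text.
    month_to_quarter = {"April 2005": "Second Quarter 2005",
                        "May 2005":   "Second Quarter 2005",
                        "June 2005":  "Second Quarter 2005"}

    def roll_up(rows):
        """Roll-up: climb the time hierarchy, summarizing months into quarters."""
        totals = defaultdict(int)
        for city, product, month, sales in rows:
            totals[(city, product, month_to_quarter[month])] += sales
        return dict(totals)

    def slice_by_city(rows, city):
        """Slice: fix one dimension value (a single city) and keep the others."""
        return [r for r in rows if r[0] == city]

    def pivot(rows):
        """Pivot: re-orient the view so products run down the page and months across."""
        table = defaultdict(dict)
        for city, product, month, sales in rows:
            table[product][month] = table[product].get(month, 0) + sales
        return dict(table)

    # Drill-down is the reverse of roll_up: return to the monthly detail rows.
    print(roll_up(facts))
    print(slice_by_city(facts, "Delhi"))
    print(pivot(facts))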
2-
Discuss how computations can be performed efficiently on data cubes.
Users of decision support systems often see data in the form of data cubes. The cube is used to
represent data along some measure of interest. Although called a "cube", it can be 2-dimensional,
3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database
and the cells in the data cube represent the measure of interest. For example, they could contain a
count for the number of times that attribute combination occurs in the database, or the minimum,
maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve
decision support information. Example: We have a database that contains transaction
information relating company sales of a part to a customer at a store location. The data cube
formed from this database is a 3-dimensional representation, with each cell (p, c, s) of the cube
representing a combination of values from part, customer and store-location. The contents of each cell
are the count of the number of times that specific combination of values occurs together in the
database; cells that appear blank in fact have a value of zero.
The cube can then be used to retrieve information within the database about, for example, which store
should be given a certain part to sell in order to make the greatest sales. The goal is to retrieve the
decision support information from the data cube in the most efficient way possible. Three possible
solutions are:
1. Pre-compute all cells in the cube (see the sketch after this list)
2. Pre-compute no cells
3. Pre-compute some of the cells
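As a minimal sketch of solution 1 (pre-computing every cell of the cube as a count), assuming a hypothetical list of (part, customer, store-location) transactions:

    from itertools import product

    # Hypothetical transactions: (part, customer, store-location) -- illustration only.
    transactions = [
        ("wheel", "ACME",   "Delhi"),
        ("wheel", "ACME",   "Delhi"),
        ("bolt",  "Globex", "Mumbai"),
        ("wheel", "Globex", "Delhi"),
    ]

    parts     = sorted({t[0] for t in transactions})
    customers = sorted({t[1] for t in transactions})
    stores    = sorted({t[2] for t in transactions})

    # Solution 1: pre-compute every cell (p, c, s) of the cube as a count.
    cube = {cell: 0 for cell in product(parts, customers, stores)}
    for t in transactions:
        cube[t] += 1

    # Every query is now a constant-time lookup; combinations never seen stay at zero.
    print(cube[("wheel", "ACME", "Delhi")])    # 2
    print(cube[("bolt",  "ACME", "Delhi")])    # 0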
If the whole cube is pre-computed, then queries run on the cube will be very fast. The disadvantage is that
the pre-computed cube requires a lot of memory. The size of a cube for n attributes A1,...,An with
cardinalities |A1|,...,|An| is the product |A1| * |A2| * ... * |An|. This size increases exponentially with the
number of attributes and linearly with the cardinalities of those attributes. To minimize memory requirements,
we can pre-compute none of the cells in the cube. The disadvantage here is that queries on the cube will run
more slowly because the cube will need to be rebuilt for each query. As a compromise between these two, we
can pre-compute only those cells in the cube which will most likely be used for decision support queries. The
trade-off between memory space and computing time is called the space-time trade-off, and it often exists
in data mining and computer science in general.
m-Dimensional Array:
A data cube built from m attributes can be stored as an m-dimensional array. Each element of the array
contains the measure value, such as count. The array itself can be represented as a 1-dimensional array.
For example, a 2-dimensional array of size x by y can be stored as a 1-dimensional array of size x*y, where
element (i,j) in the 2-D array is stored in location (y*i+j) in the 1-D array. The disadvantage of storing the
cube directly as an array is that most data cubes are sparse, so the array will contain many empty elements
(zero values).
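A small sketch of the row-major flattening described above, together with the sparse alternative that stores only non-empty cells (sizes and values are hypothetical):

    # Row-major flattening: element (i, j) of an x-by-y array lives at index y*i + j.
    x, y = 3, 4
    dense = [0] * (x * y)            # the 2-D cube stored as a 1-D array

    def store(i, j, value):
        dense[y * i + j] = value

    def fetch(i, j):
        return dense[y * i + j]

    store(2, 1, 7)
    assert fetch(2, 1) == 7

    # Because most cubes are sparse, a dictionary that keeps only non-empty cells
    # avoids storing the many zero-valued elements of the dense array.
    sparse = {(2, 1): 7}
    print(sparse.get((0, 0), 0))     # missing cells are implicitly zero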
3-
Write short notes on data warehouse metadata.
The term metadata is used for two similar but fundamentally different concepts (types). The usual
explanation is "data about data", but this does not apply to both concepts in the same way. In the first
instance, structural metadata means the specification of data structures. This cannot be about data
because the actual data content is unknown when the data structures are being designed. In this case the
correct description would be "data about the containers of data". Descriptive metadata, on the other hand,
is about individual instances of application data, the data content. In this case, a useful description
(resulting in a disambiguating neologism) would be "data about data content" or "content about content"
thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of
administrative metadata are all subtypes of metacontent. Metadata (metacontent) is traditionally found in
the card catalogs of libraries. As information has become increasingly digital, metadata is also used to
describe digital data using metadata standards specific to a particular discipline. By describing
the contents and context of data files, the quality of the original data/files is greatly increased. For
example, a webpage may include metadata specifying what language it's written in, what tools were used
to create it, and where to go for more on the subject, allowing browsers to automatically improve the
experience of users. Metadata is data. As such, metadata can be stored and managed in a database,
often called a metadata registry or metadata repository.[1] However, without context and a point of
reference, it can be impossible to identify metadata just by looking at it.[2] For example: by itself, a
database containing several numbers, all 13 digits long, could be the results of calculations or a list of
numbers to plug into an equation; without any other context, the numbers themselves can be perceived
as the data. But given the context that this database is a log of a book collection, those 13-digit
numbers may now be ISBNs - information that refers to the book, but is not itself the information within the
book. The term "metadata" was coined in 1968 by Philip Bagley, one of the pioneers of computerized
document retrieval.[3][4] Since then the fields of information management, information science, information
technology, librarianship and GIS have widely adopted the term. In these fields the word metadata is
defined as "data about data".[5] While this is the generally accepted definition, various disciplines have
adopted their own more specific explanations and uses of the term.
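A small sketch of the two kinds of metadata described above, using hypothetical names and values (the table, columns and book record are assumptions for illustration):

    # Structural metadata -- "data about the containers of data": the schema of a
    # hypothetical warehouse fact table (names and types are assumptions).
    structural_metadata = {
        "table": "sales_fact",
        "columns": [
            {"name": "product_key", "type": "INTEGER"},
            {"name": "time_key",    "type": "INTEGER"},
            {"name": "amount",      "type": "DECIMAL(10,2)"},
        ],
    }

    # Descriptive metadata -- "data about data content": a record describing one book,
    # where the 13-digit ISBN refers to the book but is not the content of the book.
    descriptive_metadata = {
        "isbn": "9780000000000",     # hypothetical ISBN, for illustration only
        "title": "An Example Book",
        "language": "en",
    }

    print(structural_metadata["columns"][0]["name"])   # product_key
    print(descriptive_metadata["isbn"])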
4-
Explain various methods of data cleaning in detail.
Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases,
the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then
replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other
similar data sets in the system. The inconsistencies detected or removed may have been originally
caused by user entry errors, by corruption in transmission or storage, or by different data
dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that
validation almost invariably means data is rejected from the system at entry and is performed at entry
time, rather than on batches of data. The main stages of the cleansing process are listed below, followed
by a small sketch of the full cycle.

Data auditing: The data is audited with the use of statistical methods to detect anomalies and
contradictions. This eventually gives an indication of the characteristics of the anomalies and their
locations.

Workflow specification: The detection and removal of anomalies is performed by a sequence of
operations on the data known as the workflow. It is specified after the process of auditing the data
and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow,
the causes of the anomalies and errors in the data have to be closely considered. For instance, if we
find that an anomaly is the result of typing errors at the data input stage, the layout of the keyboard can
help in suggesting possible solutions.

Workflow execution: In this stage, the workflow is executed after its specification is complete and its
correctness is verified. The implementation of the workflow should be efficient, even on large sets of
data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be
computationally expensive.

Post-processing and controlling: After executing the cleansing workflow, the results are inspected
to verify correctness. Data that could not be corrected during execution of the workflow is manually
corrected, if possible. The result is a new cycle in the data-cleansing process where the data is
audited again to allow the specification of an additional workflow to further cleanse the data by
automatic processing.
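A minimal sketch of the audit / specify / execute / control cycle described above, assuming a hypothetical set of customer records and correction rules:

    # Hypothetical raw records; the fields and rules below are assumptions for illustration.
    records = [
        {"id": 1, "age": 34,   "country": "IN"},
        {"id": 2, "age": -5,   "country": "India"},   # anomaly: impossible age
        {"id": 3, "age": None, "country": "IN"},      # anomaly: missing value
    ]

    def audit(rows):
        """Data auditing: flag records that violate simple domain checks."""
        return [r for r in rows if r["age"] is None or not (0 <= r["age"] <= 120)]

    # Workflow specification: an ordered list of correction rules chosen after auditing.
    def fix_country(r):
        r["country"] = {"India": "IN"}.get(r["country"], r["country"])
        return r

    def fix_age(r):
        if r["age"] is not None and not (0 <= r["age"] <= 120):
            r["age"] = None            # mark for manual correction rather than guess
        return r

    workflow = [fix_country, fix_age]

    # Workflow execution.
    for rule in workflow:
        records = [rule(r) for r in records]

    # Post-processing and controlling: re-audit; anything still flagged is corrected
    # manually or fed into another cleansing cycle.
    print(audit(records))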
5-
Give an account on data mining Query language.
6-
How is Attribute-Oriented Induction implemented? Explain in detail.
Data mining has become an important technique which has tremendous potential in many commercial
and industrial applications. Attribute-oriented induction is a powerful mining technique and has been
successfully implemented in the data mining system DBMiner (Han et al. Proc. 1996 Int'l Conf. on Data
Mining and Knowledge Discovery (KDD'96), Portland, Oregon, 1996). However, its induction capability
is limited by the unconditional concept generalization. In this paper, we extend the concept generalization
to a rule-based concept hierarchy, which greatly enhances its induction power. When the previously proposed
induction algorithm is applied to the more general rule-based case, a problem of induction anomaly
occurs which degrades its efficiency. We have developed an efficient algorithm to facilitate induction in
the rule-based case which can avoid the anomaly. Performance studies have shown that the algorithm is
superior to a previously proposed algorithm based on backtracking.
Keywords: data mining, knowledge discovery in databases, rule-based concept generalization, rule-based
concept hierarchy, attribute-oriented induction, inductive learning, learning and adaptive systems.
This paper proposes a novel star schema attribute induction as a new attribute induction paradigm and as
an improvement over current attribute-oriented induction. The novel star schema attribute induction is
compared with current attribute-oriented induction based on the characteristic rule and a non-rule-based
concept hierarchy, by implementing both approaches. In the novel star schema attribute induction, several
improvements have been made, such as eliminating the threshold number used as a maximum-tuple control
for the generalization result, removing ANY as the most general concept, replacing the role of the concept
hierarchy with a concept tree, simplifying the generalization strategy steps, and eliminating the
attribute-oriented induction algorithm. The novel star schema attribute induction is more powerful than
current attribute-oriented induction, since it produces a small number of final generalized tuples and there
is no ANY in the results.
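As a hedged illustration of the basic attribute-oriented induction step (the tuples, the one-level concept hierarchy and the attribute threshold below are assumptions for illustration, not taken from the papers above): climb the concept hierarchy for any attribute with too many distinct values, then merge identical generalized tuples and accumulate their counts.

    from collections import Counter

    # Hypothetical task-relevant tuples (city, age_group) and a one-level concept
    # hierarchy for the city attribute -- both are assumptions for illustration.
    tuples = [("Delhi", "young"), ("Mumbai", "young"), ("Pune", "young"), ("Delhi", "old")]
    city_to_region = {"Delhi": "India", "Mumbai": "India", "Pune": "India"}
    attribute_threshold = 2            # maximum distinct values allowed per attribute

    def attribute_oriented_induction(rows):
        # If the city attribute has too many distinct values, climb its concept hierarchy.
        if len({c for c, _ in rows}) > attribute_threshold:
            rows = [(city_to_region.get(c, c), a) for c, a in rows]
        # Merge identical generalized tuples and accumulate counts (the prime relation).
        return Counter(rows)

    print(attribute_oriented_induction(tuples))
    # Counter({('India', 'young'): 3, ('India', 'old'): 1})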
7-
Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example.
8-
Discuss the approaches for mining multi level association rules from the transactional
databases. Give relevant example.