Chapter 5. Concept Description: Characterization and Comparison
This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali and Ms. Rajnipriya Dhawan and licensed
under a Creative Commons Attribution 4.0 International License.
Data mining can be classified into two categories:
• Descriptive data mining describes the data set in a concise
and summarized manner and presents interesting general
properties of the data.
• Predictive data mining constructs a set of models, by
performing certain analysis on the available set of data, and
attempts to predict the behavior of new data sets.
A concept usually refers to a collection of data such as
frequent buyers, graduate students, and so on.
Concept description generates descriptions for the
characterization and comparison of the data. Characterization
provides a concise summarization of the given collection
of data, while class comparison provides descriptions
comparing two or more collections of data.
Concept description therefore involves both characterization
and comparison.
Differences between concept description in OLTP & OLAP
Data warehouses and OLAP tools are based on a
multidimensional data model which views data in the form of
a data cube, consisting of dimensions (attributes) and
measures (aggregate functions). However, the data types of
the dimensions and measures are restricted.
Many current OLAP systems confine dimensions to
nonnumeric data. Similarly, measures (such as count(), sum(),
average()) in current OLAP systems apply only to numeric
data.
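The restriction above can be illustrated with a small sketch: nonnumeric dimensions identify a cell, and count(), sum() and average() are computed over a numeric fact. The records, the field names and the aggregate helper are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical sales records: two nonnumeric dimensions and one numeric fact.
sales = [
    {"category": "TV", "region": "North", "amount": 400.0},
    {"category": "TV", "region": "South", "amount": 600.0},
    {"category": "Phone", "region": "North", "amount": 250.0},
    {"category": "TV", "region": "North", "amount": 500.0},
]

def aggregate(records, dimensions, fact):
    """Group records by the given dimensions and compute count, sum, average."""
    cells = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in records:
        # The dimension values together form the key of one cube cell.
        key = tuple(row[d] for d in dimensions)
        cells[key]["count"] += 1
        cells[key]["sum"] += row[fact]
    for cell in cells.values():
        cell["average"] = cell["sum"] / cell["count"]
    return dict(cells)

cube = aggregate(sales, ["category", "region"], "amount")
```

Each key of cube is a combination of dimension values, and each value holds the three measures for that cell. A nonnumeric fact could not be summed or averaged this way, which is the limitation described above.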
User control versus automation
On-line analytical processing in data warehouses is a purely
user-controlled process. The selection of dimensions and the
application of OLAP operations, such as drill-down, roll-up,
dicing, and slicing, are directed and
controlled by the users. Although the control in most OLAP
systems is quite user-friendly, users do require a good
understanding of the role of each dimension. Furthermore, in
order to find a satisfactory description of the data, users may
need to specify a long sequence of OLAP operations.
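As a rough illustration of the operations a user has to choose and sequence by hand, the sketch below applies roll-up, slice and dice to a tiny fact table. The rows, the column layout and the function names are assumptions made for this example, not part of any particular OLAP system.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, units_sold).
rows = [
    ("Delhi", "India", "Q1", 10),
    ("Mumbai", "India", "Q1", 20),
    ("Delhi", "India", "Q2", 15),
    ("Paris", "France", "Q1", 5),
]

def roll_up_to_country(data):
    """Roll-up: aggregate away the city level, keeping (country, quarter)."""
    totals = defaultdict(int)
    for _city, country, quarter, units in data:
        totals[(country, quarter)] += units
    return dict(totals)

def slice_quarter(data, quarter):
    """Slice: fix one dimension (quarter) to a single value."""
    return [r for r in data if r[2] == quarter]

def dice(data, countries, quarters):
    """Dice: select a sub-cube by restricting two dimensions."""
    return [r for r in data if r[1] in countries and r[2] in quarters]

by_country = roll_up_to_country(rows)
q1_only = slice_quarter(rows, "Q1")
india_q1 = dice(rows, {"India"}, {"Q1"})
```

Drill-down is the reverse of roll_up_to_country: going from the country totals back to the city-level rows. The user decides which of these calls to make and in what order, which is why a good understanding of each dimension is required.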
The item relation in a sales database may contain
attributes describing low level item information
such as item ID, name, brand, category, supplier,
place made, and price. It is useful to summarize a
large set of data and present it at a high conceptual
level.
Summarizing a large set of items relating to seasonal
sales provides a general description of such data,
which can be very helpful for sales and marketing.
Data generalization is a process which abstracts a large
set of task-relevant data in a database from a
relatively low conceptual level to higher conceptual
levels.
Two methods are:
(1) the data cube approach
(2) the attribute-oriented induction approach
• The data cube approach can be considered as a
data warehouse-based, pre-computation-oriented,
materialized view approach. It performs off-line
aggregation before an OLAP or data mining query
is submitted for processing.
• The attribute-oriented approach, at least in its
initial proposal, is a relational database query-oriented,
generalization-based, on-line data analysis technique.
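The pre-computation idea behind the data cube approach can be sketched as follows: every group-by (cuboid) is aggregated off-line, so that a later query is only a lookup. This is a minimal sketch under assumed data; the names DIMENSIONS, materialize_cube and query are hypothetical, and real systems usually materialize only part of the cube.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical base data; "amount" is the numeric measure.
DIMENSIONS = ("category", "region")
sales = [
    {"category": "TV", "region": "North", "amount": 400},
    {"category": "TV", "region": "South", "amount": 600},
    {"category": "Phone", "region": "North", "amount": 250},
]

def materialize_cube(records, dimensions, fact):
    """Pre-compute sum(fact) for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for row in records:
                cuboid[tuple(row[d] for d in dims)] += row[fact]
            cube[dims] = dict(cuboid)
    return cube

# Off-line step: done once, before any query arrives.
CUBE = materialize_cube(sales, DIMENSIONS, "amount")

def query(dims, values):
    """On-line step: a query becomes a dictionary lookup."""
    return CUBE[tuple(dims)].get(tuple(values), 0)
```

For example, query(["category"], ["TV"]) reads the pre-computed total for TVs without scanning sales again. The dimension names passed to query must follow the order used in DIMENSIONS. The attribute-oriented approach, by contrast, would start from the query and generalize the relevant tuples on-line.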
1. Attribute removal is based on the following rule:
If there is a large set of distinct values for an attribute of
the initial working relation, but either
(1) there is no generalization operator on the attribute (e.g.,
there is no concept hierarchy defined for the attribute), or
(2) its higher level concepts are expressed in terms of other
attributes, then the attribute should be removed from the
working relation.
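The removal rule can be read as a simple predicate: the attribute has many distinct values, and condition (1) or condition (2) holds. The sketch below encodes that reading; the function name, the parameters and the numeric cut-off for "large" are assumptions made for illustration.

```python
def should_remove(distinct_values, threshold, has_hierarchy, covered_by_other):
    """Return True when the attribute-removal rule applies.

    distinct_values  -- number of distinct values of the attribute
    threshold        -- assumed cut-off for "a large set" of values
    has_hierarchy    -- True if a generalization operator (concept hierarchy) exists
    covered_by_other -- True if higher-level concepts live in another attribute
    """
    is_large = distinct_values > threshold
    return is_large and (not has_hierarchy or covered_by_other)

# Hypothetical attributes of an item relation.
name_removed = should_remove(5000, 5, has_hierarchy=False, covered_by_other=False)
street_removed = should_remove(800, 5, has_hierarchy=True, covered_by_other=True)
category_removed = should_remove(40, 5, has_hierarchy=True, covered_by_other=False)
brand_removed = should_remove(3, 5, has_hierarchy=False, covered_by_other=False)
```

Here the name attribute is removed because it has no hierarchy, and the street attribute is removed because its higher-level concepts are carried by another attribute. The category attribute is kept as a candidate for generalization, and the brand attribute is kept because its set of values is not large.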
2. Attribute generalization is based on the following rule:
If there is a large set of distinct values for an attribute in the
initial working relation, and there exists a set of
generalization operators on the attribute, then a
generalization operator should be selected and applied to
the attribute.
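A generalization operator is often a concept hierarchy that maps each value to a higher-level concept. The sketch below applies one such step to a small working relation. The hierarchy and the tuples are hypothetical, and merging identical generalized tuples while keeping a count is an assumption of this sketch rather than something stated in the rule.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Pune": "India",
                   "Paris": "France", "Lyon": "France"}

# Hypothetical working relation: (city, category) tuples.
working_relation = [
    ("Delhi", "TV"), ("Mumbai", "TV"), ("Pune", "Phone"),
    ("Paris", "TV"), ("Lyon", "TV"),
]

def generalize(relation, index, hierarchy):
    """Replace attribute `index` by its parent concept and merge equal tuples.

    Returns a Counter mapping each generalized tuple to its count.
    """
    generalized = Counter()
    for row in relation:
        lifted = list(row)
        lifted[index] = hierarchy[row[index]]
        generalized[tuple(lifted)] += 1
    return generalized

result = generalize(working_relation, 0, CITY_TO_COUNTRY)
```

Five city-level tuples collapse into three country-level tuples, which is the reduction in distinct values that generalization is meant to achieve.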
Both rules, attribute removal and attribute generalization,
claim that if there is a large set of distinct values for an
attribute, further generalization should be applied.
This raises the question: how large is "a large set of distinct
values for an attribute" considered to be?
Two common approaches to control a generalization process
• The first technique, called attribute generalization
threshold control, either sets one generalization threshold
for all of the attributes, or sets one threshold for each
attribute. If the number of distinct values in an attribute is
greater than the attribute threshold, further attribute removal
or attribute generalization should be performed.
• The second technique, called generalized relation
threshold control, sets a threshold for the generalized
relation. If the number of (distinct) tuples in the generalized
relation is greater than the threshold, further generalization
should be performed. Otherwise, no further generalization
should be performed. Such a threshold may also be preset in
the data mining system (usually within a range of 10 to 30).
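The two controls can be sketched together: one loop generalizes an attribute until its distinct values fit the attribute threshold, and one check compares the number of distinct tuples with the relation threshold. The hierarchy, the data, the threshold values and all function names below are hypothetical. This is one possible reading of the two techniques, not a definitive implementation.

```python
# Hypothetical concept hierarchies, one dict per generalizable attribute index.
HIERARCHIES = {
    0: {"Delhi": "India", "Mumbai": "India", "Paris": "France",
        "India": "Asia", "France": "Europe"},
}

def distinct_count(relation, index):
    """Number of distinct values of attribute `index` in the relation."""
    return len({row[index] for row in relation})

def generalize_once(relation, index, hierarchy):
    """Climb one level for attribute `index`; values without a parent stay."""
    return [row[:index] + (hierarchy.get(row[index], row[index]),) + row[index + 1:]
            for row in relation]

def attribute_threshold_control(relation, attr_threshold):
    """Generalize each attribute until its distinct values fit the threshold."""
    for index, hierarchy in HIERARCHIES.items():
        while distinct_count(relation, index) > attr_threshold:
            lifted = generalize_once(relation, index, hierarchy)
            if lifted == relation:  # top of the hierarchy reached
                break
            relation = lifted
    return relation

def needs_more_generalization(relation, relation_threshold):
    """Generalized relation threshold control: compare distinct tuple count."""
    return len(set(relation)) > relation_threshold

rows = [("Delhi", "TV"), ("Mumbai", "TV"), ("Paris", "TV")]
after = attribute_threshold_control(rows, attr_threshold=2)
```

With an attribute threshold of 2, the city values are generalized one level to countries and the loop stops. Calling needs_more_generalization(after, 10) then reports that the generalized relation is already small enough.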