Chapter XV
Basic Principles of Data Mining
Karl-Ernst Erich Biebler
Ernst-Moritz-Arndt-University, Germany
Bernd Paul Jäger
Ernst-Moritz-Arndt-University, Germany
Michael Wodney
Ernst-Moritz-Arndt-University, Germany
Abstract
This chapter gives a summary of data types, mathematical structures, and associated methods of data
mining. Topological, order-theoretical, algebraic, and probability-theoretical mathematical structures
are introduced. The n-dimensional Euclidean space, the model most often used for data, is defined. It is briefly explained why the treatment of higher-dimensional random variables and related data is problematic.
Since topological concepts are less well known than statistical concepts, many examples of metrics are
given. Related classification concepts are defined and explained. Ways of assessing their quality are discussed. One example each is given for topological cluster analysis and for topological discriminant
analysis.
Introduction
Data mining is, up to a point, a self-guided data-evaluating process that is influenced by the accompanying activity of the user. In comparison to data
analysis, it describes a process of data evaluation that is defined in advance. Data mining mostly describes
explorative procedures. Hypotheses connected with the examined data
are sought. Nothing needs to be presupposed about
the methods by which the data were collected.
Inferential procedures pursue a different aim:
a given hypothesis is to be checked against
data. In that case, however, the collection of the data must be carried out according to certain principles.
Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
As a rule, if statistical procedures are used, it
must be possible to regard the data as samples.
More exact definitions of the concepts of
information and hypothesis are not considered here.
Contributions to the methods of data mining come
from different branches, for example computer
science, logic, learning theory, and artificial intelligence, as well as from application fields such as medical
informatics and financial analysis. Basic concepts of data mining are explained in the following. The concepts used
belong to different areas of mathematics. They are
defined and illustrated with examples.
One has to distinguish data of different types.
The mathematical methods of
data evaluation have to be designed accordingly. The mathematical structures are of basic importance. They
correspond to the respective data types. The
interpretation of results must refer to them.
If one can calculate the pairwise distances for
the objects of a data set, then so-called topological
methods of data mining can be designed.
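
As an illustration of what "calculating the pairwise distances" can mean in practice, the following minimal sketch builds a distance matrix with the Euclidean metric. The choice of metric, the function name pairwise_distances, and the sample objects are assumptions made for this example only; the chapter itself discusses many other metrics.

```python
import math

def pairwise_distances(points):
    """Return the symmetric matrix of Euclidean distances between all points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # Euclidean distance (Python 3.8+)
            dist[i][j] = dist[j][i] = d          # a metric is symmetric
    return dist

# Hypothetical objects, each described by two measurements.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distances(objects)
```

Any method that works only on such a matrix D, and not on the coordinates themselves, can in principle be applied to every data type for which a distance is available.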
Statistical methods of data mining are based
on observations of random variables. It is mostly presupposed that the data are a sample. If this is
not the case, statistical methods can be considered
only in exceptional cases. Deciding whether data
are a sample of a random
variable is not a trivial problem. Therefore, we also point to non-statistical
methods of data mining.
Methods of data mining are mathematical
procedures. Their variety is exceptionally broad.
We therefore confine ourselves to some classification methods and different ways of
treating them.
The reader is thus able, in principle, to recognize the connection between data type, observation
strategy, structure of the data, and the data-mining method. This is essential for any
interpretation of results.
Transformations of the original data can influence the results of data mining. It is therefore recommended always to refer to the original data.
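
A small sketch may make this concrete. It assumes a one-dimensional nearest-neighbor search as the data-mining step and a logarithmic transformation; both, like the helper name nearest and the sample values, are chosen only for illustration and are not taken from the chapter.

```python
import math

def nearest(query, candidates, transform=lambda v: v):
    """Return the candidate closest to query after applying transform to both."""
    return min(candidates, key=lambda c: abs(transform(c) - transform(query)))

values = [1.0, 8.0]  # hypothetical observations
query = 4.0

# On the original scale the distances are 3 and 4, so 1.0 is nearest.
raw_neighbor = nearest(query, values)

# On the log scale the distances are ln 4 and ln 2, so 8.0 is nearest.
log_neighbor = nearest(query, values, math.log)
```

The same data thus yield different neighbors depending on the scale, which is why results should be interpreted with reference to the original data.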
Data types
Data report observations made on objects.
These observations can be obtained as measurements, numbers, or verbal descriptions, for example.
Sometimes they concern one property, often
several properties. More complicated facts
concerning the objects, such as relations, can also be included. It is therefore necessary to distinguish data
types. The data types relevant for data analysis
are described in the following.
Data types are also known from programming
languages. These are not treated here.
A set X in the set-theoretical sense consists of elements x_i, X = {x_i, i ∈ I}. The index set I
may be finite or infinite. Accordingly, one
distinguishes finite and infinite sets. The sets
{x1, x1, x1, x2} and {x1, x2} are the same in the set-theoretical sense. This means that all elements of
a set are distinct.
Data sets are collections of elements of a set.
The data sets {x1, x1, x1, x2} and {x1, x2} have to
be distinguished. The same element of a set can
appear repeatedly in a data set.
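
The distinction between a set and a data set can be sketched in Python, assuming that the built-in set models the set-theoretical notion and that collections.Counter (a multiset) models a data set in which order is ignored. The labels "x1" and "x2" stand in for the elements above.

```python
from collections import Counter

observations = ["x1", "x1", "x1", "x2"]

# As a set, repeated elements collapse: {x1, x1, x1, x2} equals {x1, x2}.
as_set = set(observations)

# As a data set, the multiplicity of each element is kept.
as_data_set = Counter(observations)

same_as_sets = as_set == set(["x1", "x2"])
same_as_data_sets = as_data_set == Counter(["x1", "x2"])
```

Here same_as_sets is true while same_as_data_sets is false, because x1 occurs three times in the first data set and once in the second.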
String data are signs or character strings (e.g.
letters, words, abstract words). Numerical data
are numbers (e.g. 3, 324, 2.1482). Dates are not
regarded as numeric data. They form a type of
their own.
Categorical data are collections of elements of
a set X, e.g., {red, red, red, green, green} is collected from X = {red, green, blue}. Categorical
data can be string data or numerical data.
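
A minimal sketch of the example above, assuming the categories are represented as Python strings: it checks that every observation belongs to X and counts how often each category occurs. The variable names are illustrative only.

```python
from collections import Counter

X = {"red", "green", "blue"}                      # the underlying set
data = ["red", "red", "red", "green", "green"]    # the collected data

# Every observed value must be an element of X.
is_categorical_over_X = all(value in X for value in data)

# Frequencies of the observed categories.
frequencies = Counter(data)

# Elements of X that were never observed.
unobserved = X - set(data)
```

Note that an element of X, here "blue", need not occur in the data at all.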
Ordinal data are data that can be ordered. One
can order numbers by size. The words of
a language are string data and can be ordered as in
a dictionary.
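
Both orderings can be sketched with Python's sorted function. The word list is hypothetical, and the use of str.casefold as sort key is an assumption made so that upper- and lower-case letters are ordered as in a dictionary rather than by character code.

```python
numbers = [3, 324, 2.1482]
words = ["mining", "Data", "cluster"]

# Numbers are ordered by size.
numbers_by_size = sorted(numbers)

# Words are ordered alphabetically, ignoring case.
words_dictionary = sorted(words, key=str.casefold)
```

Real dictionaries follow language-specific collation rules, which this simple key does not capture.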
Metric data are collections of elements of
an interval X of real numbers, e.g., {2.001, 13.2,
1.008, 200.23} may have been collected from X
= [0; 225].