Chapter 2: Data Mining
Dr. Goutam Sarker,
Fellow: IE(I), Fellow: IETE(I),
Senior Member: IEEE(USA), Associate
Professor, CSE, NITD
4/30/2017 11:29 AM
Data Mining / CSE Department/
Dr. Goutam Sarker
What is Data Mining?

The term “data mining” refers to the finding of
relevant and useful information from
databases.
Definition 1
Data mining, or knowledge discovery in
databases, is the nontrivial extraction of
implicit, previously unknown and potentially
useful information from data.
This encompasses a number of technical
approaches, such as clustering, data
summarization, classification, pattern
recognition, etc.
Definition 2
Data mining is the search for the
relationships and global patterns that exist
in large databases but are hidden among
vast amounts of data.
Definition 3
Data mining is the process of discovering
meaningful new correlations, patterns and
trends by sifting through large amounts of
data stored in repositories, using pattern
recognition techniques as well as statistical
and mathematical techniques.
KDD vs. Data Mining
• Knowledge Discovery in Databases (KDD) was formalized in
1989 as a broad, high-level term for the overall process of
seeking knowledge from data.
• Data mining is only one of the many steps involved in
knowledge discovery in databases. The steps in the
knowledge discovery process include data selection, data
cleaning and preprocessing, data transformation and reduction,
data mining algorithm selection, and finally post-processing
and interpretation of the discovered knowledge. The KDD
process tends to be highly iterative and interactive.
Stages of KDD
1. Selection.
2. Preprocessing.
3. Transformation.
4. Data Mining.
5. Interpretation and Evaluation.
6. Data Visualization.
Stages of KDD (contd.)
1. Selection: This stage is concerned with selecting or segmenting the data that
are relevant to some criteria.
2. Preprocessing: Preprocessing is the data cleaning stage where
unnecessary information is removed.
3. Transformation: The data is not merely transferred across, but transformed
in order to be suitable for the task of data mining. In this stage, the data is
made usable and navigable.
4. Data Mining: This stage is concerned with the extraction of patterns from the
data.
5. Interpretation and Evaluation: The patterns obtained in the data mining
stage are converted into knowledge, which in turn is used to support decision
making.
6. Data Visualization: Data visualization makes it possible for the analyst to
gain a deeper, more intuitive understanding of the data.
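The first five stages can be sketched as a small pipeline. This is only an illustrative sketch: the records, the field names (region, age, bought) and the function names are hypothetical, and each function stands in for what would be a much larger step in practice.

```python
from collections import Counter

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"region": "east", "age": 23, "bought": "yes"},
    {"region": "east", "age": 31, "bought": "no"},
    {"region": "west", "age": 45, "bought": "yes"},
    {"region": "east", "age": None, "bought": "yes"},
    {"region": "east", "age": 27, "bought": "yes"},
    {"region": "east", "age": 52, "bought": "no"},
]

def select(records, region):
    """Selection: keep only the records relevant to the chosen criterion."""
    return [r for r in records if r["region"] == region]

def preprocess(records):
    """Preprocessing: drop records with missing values (data cleaning)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: recode age into a coarse band suitable for mining."""
    return [("young" if r["age"] < 30 else "older", r["bought"]) for r in records]

def mine(pairs):
    """Data mining: count how often each (age band, outcome) pattern occurs."""
    return Counter(pairs)

def interpret(patterns, min_count):
    """Interpretation: keep only the patterns frequent enough to act on."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = mine(transform(preprocess(select(RAW, "east"))))
knowledge = interpret(patterns, min_count=2)
print(knowledge)
```

Because KDD is iterative, the thresholds and the transformation would normally be revisited after inspecting the output.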
DBMS vs. DM
• A DBMS supports query languages, which are
useful for query-triggered data exploration, whereas data
mining supports automatic data exploration.
• If we know exactly what information we are seeking, a DBMS
query suffices; if we only vaguely know the possible
correlations or patterns, data mining techniques are useful.
• One of the tasks of data mining is hypothesis testing, wherein
we formulate a hypothesis and test it by sifting through the
database.
• The data mining application goes where the data naturally reside.
This avoids performance degradation and takes full advantage
of database technology.
Related Areas:
• Statistics
• Machine Learning
1. Supervised Learning.
2. Unsupervised Learning.
Artificial Intelligence (AI) vs. Data Mining
The task of automatically discovering
patterns in data has so far been mostly
the domain of Artificial Intelligence.
Two main aspects differentiate
DM from AI:
1. Data mining emphasizes the human
understandability of discovered patterns,
whereas in AI the discovered patterns are
meant to be used by the machine itself.
2. Data mining techniques are meant to be
scalable to huge stores of data such as the World
Wide Web (WWW). In contrast, traditional AI
approaches have mostly been researched using
small “toy” data sets that fit in main memory.
Data mining has borrowed a good deal from
AI, especially from the field of machine
learning, in which a program dynamically
improves itself. Almost all classification
techniques of machine learning have been
used in data mining. Only those classification
models that are not easily understandable by
human users (e.g. neural network techniques)
have been omitted.
Goals and DM Techniques

Two fundamental goals of data mining:
1. Prediction
2. Description
Prediction makes use of existing variables in the database in order
to predict unknown or future values of interest.
Description focuses on finding patterns describing the data and
subsequent presentation for user interpretation.
Classification of Techniques
1. User-guided or verification-driven data
mining
2. Discovery-driven or automatic discovery of
rules
Data Mining Techniques
• Verification Model: In this process of data mining, the user
makes a hypothesis and tests the hypothesis on the data to
verify its validity. The emphasis is on the user, who is
responsible for formulating the hypothesis.
• Discovery Model: The discovery model differs in its emphasis.
Here the system automatically discovers important information
hidden in the data. The data is sifted in search of frequently
occurring patterns, trends and generalizations about the data
without guidance from the user.
Discovery Driven Tasks
1. Discovery of association rules
2. Discovery of classification rules
3. Clustering
4. Discovery of frequent episodes
5. Deviation detection
Discovery of Association Rules
• An association rule has the form X ⇒ Y,
where X and Y are sets of items.
• The intuitive meaning of such a rule is that
transactions in the database that contain X
tend to also contain Y.
• Given a database, the goal is to discover all
rules whose support and confidence are
greater than or equal to the user-specified
minimum support and minimum confidence.
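To make the two thresholds concrete: the support of X ⇒ Y is the fraction of transactions containing every item of X and Y, and the confidence is that support divided by the support of X. The sketch below computes both by brute force. The transactions and the function names are hypothetical, and for brevity it only enumerates rules with a single item on each side; it is not an efficient miner such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def rules(transactions, min_support, min_confidence):
    """Brute-force search for single-item rules x => y meeting both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue  # the pair is too rare, whichever way the rule points
        for x, y in ((a, b), (b, a)):
            c = confidence({x}, {y}, transactions)
            if c >= min_confidence:
                found.append((x, y, s, c))
    return found

for x, y, s, c in rules(TRANSACTIONS, min_support=0.5, min_confidence=0.6):
    print(f"{x} => {y}  support={s:.2f}  confidence={c:.2f}")
```

Note that support is symmetric in X and Y while confidence is not, which is why both directions of each pair are checked.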
Classification

Classification involves finding rules that
partition the data into disjoint groups. The
input for the classification is the training data
set, whose class labels are already known.
Clustering
• Clustering is a method of grouping data into
different groups, so that the data in each group
share similar trends and patterns.
• Clustering constitutes a major class of data mining
algorithms.
• The objectives of clustering are:
1. To uncover natural groupings
2. To initiate hypotheses about the data
3. To find a consistent and valid organization of the
data
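As one concrete illustration of grouping similar data, the sketch below uses k-means. The slides do not name a specific clustering algorithm, so this is just one common choice. The points, the function name and the naive initialisation (first k points as centroids) are assumptions made for the example.

```python
# Hypothetical 2-D points forming two visibly separate groups.
POINTS = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(points[:k])  # naive deterministic initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid with the smallest squared Euclidean distance.
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c
            else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(POINTS, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No class labels are supplied here; the grouping emerges from the distances alone, which is what distinguishes clustering from classification.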
Discovery of Classification Rules
Classification involves finding rules that
partition the data into disjoint groups. The
input to classification is the training data
set, whose class labels are already known.
This is also termed supervised learning.
There are several classification discovery
models:
1. Decision Trees.
2. Neural Networks.
3. Genetic Algorithms.
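The simplest possible decision tree has a single split (sometimes called a decision stump), and learning it already shows how a rule is discovered from labelled training data. The sketch below is a minimal illustration under stated assumptions: the training set, the attribute (income in thousands) and the function names are hypothetical, there is one numeric attribute only, and the data must contain at least two distinct values.

```python
# Hypothetical training set: (income in thousands, known class label).
TRAINING = [
    (20, "reject"), (25, "reject"), (30, "reject"),
    (45, "approve"), (50, "approve"), (60, "approve"),
]

def learn_stump(samples):
    """Find the rule 'value <= t -> left, else right' with the fewest training errors."""
    values = sorted({v for v, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for left in labels:
            for right in labels:
                errors = sum(
                    (left if v <= t else right) != label for v, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    return best[1:]  # (threshold, left_label, right_label)

def predict(stump, value):
    """Apply the learned rule to a new, unlabelled value."""
    threshold, left, right = stump
    return left if value <= threshold else right

stump = learn_stump(TRAINING)
print("if income <= %s then %s else %s" % stump)
print(predict(stump, 28), predict(stump, 55))
```

The learned rule is directly readable by a human, which is the property the slides emphasise when contrasting data mining with less interpretable models.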
Frequent Episodes
Frequent episodes are sequences of
events that occur frequently and close to each
other in time; they are extracted from a time
sequence.
Let R be a set of event types.
An event is defined as a pair (A, t), where
A ∈ R is the type of the event and t is the
time at which it occurs.
A sequence of events (also called an event
sequence) on R is a triple (TS, TC, S),
where TS = starting time,
TC = ending time, and
S = {(A1, t1), (A2, t2), ..., (An, tn)} is the
ordered sequence of events, such that
Ai ∈ R and TS <= ti <= TC for all i = 1, 2, ..., n,
and ti <= ti+1 for all i = 1, 2, ..., n-1.
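The definition above maps directly onto a small data structure. This is a sketch only: the class and function names are hypothetical, and integer time stamps are an assumption made for the example.

```python
from typing import NamedTuple, Tuple

class Event(NamedTuple):
    kind: str  # the event type A, an element of R
    time: int  # the occurrence time t (integer time stamps are an assumption)

class EventSequence(NamedTuple):
    start: int                 # starting time of the sequence
    end: int                   # ending time of the sequence
    events: Tuple[Event, ...]  # the ordered events (A1, t1), ..., (An, tn)

def is_valid(seq, event_types):
    """Check the three conditions: known types, times in range, non-decreasing times."""
    times = [e.time for e in seq.events]
    return (
        all(e.kind in event_types for e in seq.events)
        and all(seq.start <= t <= seq.end for t in times)
        and all(a <= b for a, b in zip(times, times[1:]))
    )

R = {"A", "B", "C"}
good = EventSequence(0, 10, (Event("A", 1), Event("B", 3), Event("C", 3), Event("A", 7)))
unknown_type = EventSequence(0, 10, (Event("A", 1), Event("D", 2)))
out_of_range = EventSequence(0, 5, (Event("A", 1), Event("B", 6)))
print(is_valid(good, R), is_valid(unknown_type, R), is_valid(out_of_range, R))
```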
3 types of episodes
a) Serial episodes: The events occur in a fixed sequence.
b) Parallel episodes: There are no constraints on the relative
order of the event types.
c) Non-serial, non-parallel episodes: For example, the occurrences of A and B
precede an occurrence of C, and there is no
constraint on the relative order of A and B.
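The three episode types differ only in the ordering constraint they impose, which a short check makes explicit. In this sketch the input is assumed to be the list of event types inside one time window, already sorted by occurrence time; the function names are hypothetical, and counting how often an episode occurs across many windows is left out.

```python
from collections import Counter

def occurs_serial(kinds, episode):
    """True if the episode's event types appear in kinds in the given order."""
    remaining = iter(kinds)
    # 'k in remaining' consumes the iterator up to the match, so order is enforced.
    return all(k in remaining for k in episode)

def occurs_parallel(kinds, episode):
    """True if all the episode's event types appear in kinds, in any order."""
    have = Counter(kinds)
    return all(have[k] >= n for k, n in Counter(episode).items())

def occurs_ab_before_c(kinds):
    """Non-serial, non-parallel: A and B (in either order) precede some C."""
    return any(
        k == "C" and occurs_parallel(kinds[:i], ["A", "B"])
        for i, k in enumerate(kinds)
    )

window = ["B", "A", "C"]  # hypothetical event types within one window
print(occurs_serial(window, ["A", "B"]))    # order matters
print(occurs_parallel(window, ["A", "B"]))  # order does not matter
print(occurs_ab_before_c(window))
```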
Deviation Detection

Deviation detection identifies outlying
points in a particular data set and explains
whether they are due to noise or other
impurities present in the data, or due to
trivial reasons.
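The identification step can be illustrated with a z-score test, which flags values lying far from the mean in units of standard deviation. The slides do not prescribe a method, so this is one simple choice among many; the readings, the function name and the threshold of 2.0 are assumptions for the example. Explaining why a point deviates still requires domain knowledge.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings with one suspicious value.
READINGS = [10, 11, 9, 10, 12, 10, 11, 50]

def outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(outliers(READINGS))
```

A single extreme value inflates both the mean and the standard deviation, so on small data sets a robust alternative (for example, one based on the median) may be preferable.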
Mining Problems
1. Neural Networks
2. Genetic Algorithms
3. Rough Set Techniques
4. Support Vector Machines
Other Mining Problems:
• Sequence Mining: Sequence mining is concerned with mining
sequence data.
• Web Mining: The World Wide Web is a fertile
area for data mining research, given the
huge amount of information available online.
• Text Mining: Text documents are structured
by means of information extraction, text
categorization, etc.

• Spatial Data Mining: Spatial data mining is
the branch of data mining that deals with
spatial (location) data.
1. Geographically referenced data
2. Digital mapping
3. Remote sensing
DM Applications: case studies
1. Housing Loan Prepayment Prediction
2. Crime Detection
3. Customer Retention
4. Brand Loyalty
5. Banking
• Detection of patterns of fraudulent credit card
use.
• Identifying ‘loyal’ customers.
• Determining ‘credit card spending’ by
customer group.
6. Astronomy: Detection of unusual stars,
galaxies, nebulas or super galaxies may
lead to the discovery of previously unknown
phenomena and celestial bodies.

End of Chapter 2