Title : Data Mining
Author : V. Ramesh
College : Syed Ammal Engineering College
Introduction
Data mining is part of an iterative process that involves data source selection, preprocessing, transformation, data mining, and finally interpretation of results.
Graphical tools are used to illustrate data relationships.
Data mining is related to the subarea of statistics called exploratory data
analysis, which has similar goals and relies on statistical measures. It is also
closely related to the subareas of AI called knowledge discovery and machine
learning. The important distinguishing characteristic of data mining is that the
volume of data is very large. Although ideas from these related areas of study
are applicable to data mining problems, scalability with respect to data size is
an important new criterion. An algorithm is scalable if its running time grows in
proportion to the dataset size, given the available system resources. Old
algorithms must be adapted or new algorithms must be developed to ensure
scalability.
Itemset
An itemset is a collection of items purchased by a customer in a single customer
transaction. Given a database of transactions, we call an itemset frequent if it is
contained in at least a user-specified percentage of all transactions.
A priori property
The a priori property states that every subset of a frequent itemset is also
frequent. We can therefore identify frequent itemsets efficiently through a
bottom-up algorithm that first generates all frequent itemsets of size one, then
of size two, and so on, building candidates of each size only from the frequent
itemsets of the previous size.
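The bottom-up strategy can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation; the function name frequent_itemsets, the min_support parameter (a fraction between 0 and 1) and the sample transactions are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    `transactions` is a list of collections of items; `min_support` is the
    user-specified fraction of transactions (an assumed parameter name).
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Size one: every single item is a candidate.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(current)
    size = 2
    while current:
        previous = set(current)
        # Join frequent itemsets of the previous size into larger candidates.
        candidates = {a | b for a in previous for b in previous
                      if len(a | b) == size}
        # A priori pruning: drop a candidate if any smaller subset is infrequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, size - 1))}
        current = keep_frequent(candidates)
        result.update(current)
        size += 1
    return result


if __name__ == "__main__":
    baskets = [
        {"pen", "ink", "milk", "juice"},
        {"pen", "ink", "milk"},
        {"pen", "milk"},
        {"pen", "ink", "juice"},
    ]
    for itemset, support in frequent_itemsets(baskets, 0.7).items():
        print(sorted(itemset), support)
```

With these sample baskets and a 70% threshold, the candidate {pen, ink, milk} is never counted against the database, because its subset {ink, milk} is not frequent.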
Iceberg queries
Iceberg queries are SELECT, FROM, GROUP BY, HAVING queries with a condition
involving aggregation in the HAVING clause. Typically only a small fraction of
the groups satisfies the condition. Iceberg queries are amenable to the same
bottom-up strategy that is used for computing frequent itemsets.
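As an illustration, an iceberg query can be run from Python with the standard sqlite3 module. The purchases table, its columns (custid, item, qty) and the sample rows are hypothetical and exist only for this sketch.

```python
import sqlite3

# Hypothetical sample data: one row per purchase line.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (custid INTEGER, item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "pen", 2), (1, "pen", 4), (1, "ink", 1),
        (2, "pen", 1), (2, "milk", 3), (2, "milk", 3),
    ],
)


def iceberg(conn, threshold):
    """Return (custid, item, total) groups whose total quantity exceeds threshold."""
    return conn.execute(
        "SELECT custid, item, SUM(qty) FROM purchases "
        "GROUP BY custid, item "
        "HAVING SUM(qty) > ? "  # aggregate condition: the part above the waterline
        "ORDER BY custid, item",
        (threshold,),
    ).fetchall()


if __name__ == "__main__":
    print(iceberg(conn, 5))
```

The bottom-up idea carries over when quantities are non-negative: a (custid, item) group can exceed the threshold only if the total for that custid alone and the total for that item alone also exceed it, so the coarser groups can be computed first and used to discard candidates.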
Split selection method
A split selection method selects the splitting criterion at each node of a
decision tree. A relatively compact data structure, the AVC set, contains
sufficient information to let split selection methods decide on the splitting
criterion.
A sequential pattern is a sequence of itemsets purchased by the same customer.
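For illustration, the AVC set of a predictor attribute at a tree node can be viewed as the count of records for each combination of attribute value and class label. The sketch below assumes records are Python dictionaries; the function name avc_set and the sample attribute names (age, risk) are hypothetical.

```python
from collections import Counter


def avc_set(records, attribute, class_attribute):
    """Count the records at a node per (attribute value, class label) pair."""
    return Counter((r[attribute], r[class_attribute]) for r in records)


if __name__ == "__main__":
    # Hypothetical records reaching one node of the tree.
    node_records = [
        {"age": "young", "risk": "high"},
        {"age": "young", "risk": "high"},
        {"age": "young", "risk": "low"},
        {"age": "old", "risk": "low"},
    ]
    print(avc_set(node_records, "age", "risk"))
```

The size of this structure depends on the number of distinct attribute values and class labels, not on the number of records, which is why it is compact compared with the data at the node.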
Elements of data mining
Data mining consists of five major elements:
1. Extract, transform and load transaction data onto the data warehouse
system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology
professionals.
4. Analyse the data with application software.
5. Present the data in a useful format, such as a graph or table.
Levels of analysis in data mining:
1. Artificial Neural Networks
2. Genetic Algorithms
3. Decision Trees
4. Nearest Neighbour Method
5. Rule Induction
6. Data Visualization
Decision trees classifiers
The decision tree classifier is a widely used technique for
classification. As the name suggests, decision tree classifiers use a tree: each
leaf node has an associated class, and each internal node has a predicate
associated with it.
To classify a new instance, we start at the root, and traverse the tree to reach
a leaf; at an internal node we evaluate the predicate on the data instance, to
find which child to go to. The process continues till we reach a leaf node.
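The traversal can be sketched in Python as follows. This is a minimal illustration that assumes binary predicates; the class names Leaf and Internal, the field names, and the sample tree on age and car type are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class associated with this leaf


@dataclass
class Internal:
    predicate: Callable[[dict], bool]  # test evaluated on the data instance
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Internal]


def classify(node, instance):
    """Start at the root and follow predicate outcomes until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.if_true if node.predicate(instance) else node.if_false
    return node.label


# Hypothetical tree: young drivers of sports cars are classed as high risk.
tree = Internal(
    predicate=lambda x: x["age"] <= 25,
    if_true=Internal(
        predicate=lambda x: x["car"] == "sports",
        if_true=Leaf("high"),
        if_false=Leaf("low"),
    ),
    if_false=Leaf("low"),
)

if __name__ == "__main__":
    print(classify(tree, {"age": 22, "car": "sports"}))
```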
Decision tree construction algorithm
The main idea of decision tree construction is to evaluate different
attributes and different partitioning conditions, and pick the attribute and
partitioning condition that result in the maximum information gain ratio. The
same procedure works recursively on each of the sets resulting from the split,
thereby recursively constructing a decision tree. If the data can be perfectly
classified, the recursion stops when the impurity of a set is zero, that is,
when all of its elements belong to the same class. However, often data are
noisy, or a set may be so small that partitioning it further may not be
justified statistically. In this case, the recursion stops when the purity of a
set is sufficiently high, and the class of the resulting leaf is defined as the
class of the majority of the elements of the set. In general, different branches
of the tree could grow to different levels.
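A minimal Python sketch of this recursive procedure is given below. It assumes categorical attributes, measures impurity with entropy, and uses the fraction of majority-class elements as a simple stand-in for the purity of a set. The function names, the min_purity parameter and the tuple representation of the tree are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy in bits of the value distribution; zero when all values agree."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on `attribute`, divided by the split entropy."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build(rows, labels, attributes, min_purity=1.0):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    majority, count = Counter(labels).most_common(1)[0]
    # Stop when the set is pure enough or no attributes remain to split on.
    if count / len(labels) >= min_purity or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return majority  # no attribute separates the classes any further
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = build([r for r, _ in pairs], [l for _, l in pairs],
                                remaining, min_purity)
    return (best, children)


if __name__ == "__main__":
    rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
            {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
    labels = ["yes", "yes", "no", "no"]
    print(build(rows, labels, ["a", "b"]))
```

Lowering min_purity below 1.0 corresponds to the noisy case described above: a set is turned into a leaf labelled with its majority class as soon as that class accounts for the required fraction of its elements.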