Title: Data Mining
Author: V. Ramesh
College: Syed Ammal Engineering College

Introduction

Data mining is part of an iterative process that involves data source selection, preprocessing, transformation, data mining itself, and finally interpretation of results. Graphics tools are used to illustrate data relationships. Data mining is related to the subarea of statistics called exploratory data analysis, which has similar goals and relies on statistical measures. It is also closely related to the subareas of AI called knowledge discovery and machine learning. The important distinguishing characteristic of data mining is that the volume of data is very large. Although ideas from these related areas of study are applicable to data mining problems, scalability with respect to data size is an important new criterion. An algorithm is scalable if its running time grows in proportion to the dataset size, given the available system resources. Old algorithms must be adapted, or new algorithms must be developed, to ensure scalability.

Itemset

An itemset is a collection of items purchased by a customer in a single customer transaction. Given a database of transactions, we call an itemset frequent if it is contained in at least a user-specified percentage of all transactions.

A priori property

The a priori property is that every subset of a frequent itemset is also frequent. We can identify frequent itemsets efficiently through a bottom-up algorithm that first generates all frequent itemsets of size one, then of size two, and so on.

Iceberg queries

Iceberg queries are SELECT-FROM-GROUP BY-HAVING queries with a condition involving aggregation in the HAVING clause. Iceberg queries are amenable to the same bottom-up strategy that is used for computing frequent itemsets.

Split selection method

A split selection method selects the splitting criterion at each node of the tree. A relatively compact data structure, the AVC set, contains sufficient information to let split selection methods decide on the splitting criterion.
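Returning to the a priori property described above, the bottom-up search for frequent itemsets can be sketched in Python as follows. This is a minimal illustration and not a tuned implementation: the function name frequent_itemsets, the min_support parameter (a fraction of all transactions), and the sample purchases are assumptions made for the example, and the candidate counting is deliberately naive.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support.

    transactions: list of sets of items (hypothetical input format).
    min_support: user-specified fraction of all transactions, in (0, 1].
    """
    n = len(transactions)
    min_count = min_support * n

    def count(candidates):
        # One pass over the database counts every candidate of the current size.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is contained in the transaction
                    counts[c] += 1
        return {c: k for c, k in counts.items() if k >= min_count}

    # Size one: every single item is a candidate.
    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join frequent (k-1)-itemsets to form candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # A priori pruning: drop a candidate if any (k-1)-subset is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        result.update(current)
        k += 1

    return {s: c / n for s, c in result.items()}


# Hypothetical purchase data, one set per customer transaction.
purchases = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
for itemset, support in frequent_itemsets(purchases, 0.7).items():
    print(sorted(itemset), support)
```

With a 70% threshold on this sample, the candidate {ink, milk, pen} is never counted against the database, because its subset {ink, milk} is already known to be infrequent. That pruning step is what the a priori property buys.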
A sequential pattern is a sequence of itemsets purchased by the same customer.

Elements of data mining

Data mining consists of five major elements:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyse the data by application software.
5. Present the data in a useful format, such as a graph or table.

Levels of data mining:
1. Artificial Neural Networks
2. Genetic Algorithms
3. Decision Trees
4. Nearest Neighbour
5. Rule Induction
6. Data Visualization

Decision tree classifiers

The decision tree classifier is a widely used technique for classification. As the name suggests, decision tree classifiers use a tree: each leaf node has an associated class, and each internal node has a predicate associated with it. To classify a new instance, we start at the root and traverse the tree to reach a leaf; at an internal node we evaluate the predicate on the data instance to find which child to go to. The process continues until we reach a leaf node.

Decision tree construction algorithm

The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions, and to pick the attribute and partitioning condition that result in the maximum information gain ratio. The same procedure works recursively on each of the sets resulting from the split, thereby recursively constructing a decision tree. If the data can be perfectly classified, the recursion stops when a set is pure, that is, when its impurity measure (such as entropy) is zero because all of its elements belong to one class. However, data are often noisy, or a set may be so small that partitioning it further may not be justified statistically. In this case, the recursion stops when the purity of a set is sufficiently high, and the class of the resulting leaf is defined as the class of the majority of the elements of the set.
In general, different branches of the tree could grow to different levels.
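The recursive construction and the root-to-leaf classification described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes categorical attributes with one child per attribute value, uses entropy as the impurity measure, and interprets "information gain ratio" as information gain divided by split information. The names build_tree, classify, max_impurity, and min_size, and the sample credit data, are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels; zero means the set is pure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, divided by the split information."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs, max_impurity=0.0, min_size=2):
    """Return a class label (leaf) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the set is pure enough, too small, or no attributes remain.
    if entropy(labels) <= max_impurity or len(rows) < min_size or not attrs:
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return majority  # no attribute separates the classes any further
    children = {}
    remaining = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        # Recurse on each set produced by the split.
        children[value] = build_tree(list(sub_rows), list(sub_labels),
                                     remaining, max_impurity, min_size)
    return {"attr": best, "children": children, "default": majority}


def classify(tree, row):
    """Walk from the root to a leaf, following the branch chosen at each node."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical training data: predict credit risk from degree and income.
rows = [
    {"degree": "masters", "income": "high"},
    {"degree": "masters", "income": "low"},
    {"degree": "bachelors", "income": "high"},
    {"degree": "bachelors", "income": "low"},
    {"degree": "none", "income": "high"},
    {"degree": "none", "income": "low"},
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

tree = build_tree(rows, labels, ["degree", "income"])
print(classify(tree, {"degree": "bachelors", "income": "low"}))  # prints "bad"
```

On this sample the root splits on degree. The "masters" and "none" branches become leaves immediately, while the "bachelors" branch needs a further split on income, so the branches end at different depths. Raising max_impurity or min_size makes the recursion stop earlier and fall back to the majority class, which is the treatment of noisy or very small sets described above.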