Chapter 1: Introduction to Data Mining

Date: 26th February 2016
Special thanks: Han & Kamber

• Introduction
• Classification of Data Mining System
• Data Mining Architecture
• Data Mining Functionalities
• Major Issues in Data Mining
• Importance of Data Mining
• Application of Data Mining
• Social Impacts of Data Mining

Data? Information? Database? DBMS?

Data

Structured (DBMS):
  Dhaval Gohel     40  50  60
  Rishabh Chauhan  60  70  80
  Mayur Padiya     70  60  80
  Ankit Prajapati  30  40  50
  Viral Prajapati  80  90  70

Unstructured (text):
  Dhaval Gohel,40,50,60 Rishabh Chauhan 60,70,80

Semi-structured (XML):
  <Name>Dhaval Gohel</Name> <CA>40</CA> <IP>50</IP> <CS>60</CS>

Information
• Dhaval Gohel has 50% in the current semester.
• Viral Prajapati has the highest marks in Research Skill.
• Ankit Prajapati has the lowest marks in CA.

Database (related tables managed by a DBMS)

  120160107001  Dhaval Gohel         Dhaval Gohel     Dakor
  120160107002  Rishabh Chauhan      Rishabh Chauhan  Modasa
  120160107004  Mayur Padiya         Mayur Padiya     Nadiyad
  120160107007  Ankit Prajapati      Ankit Prajapati  Dehgam
  120160107008  Viral Prajapati      Viral Prajapati  Naroda

  Dhaval Gohel     40  50  60
  Rishabh Chauhan  60  70  80
  Mayur Padiya     70  60  80
  Ankit Prajapati  30  40  50
  Viral Prajapati  80  90  70

• Data: raw facts
• Information: processed data
• Database: an organized collection of related data
• DBMS: a set of software and tools used to manipulate the database
• Data Mining: "Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories."

Alternative names for data mining:
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• Business intelligence, etc.
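The step from data to information described above can be illustrated with a small sketch. It uses the marks table from the slides, takes the column names CA, IP, and CS from the XML example, and assumes (this is not stated in the slides) that each subject is scored out of 100. The helper names are hypothetical.

```python
# Raw data: the marks table from the slides.
# Column names CA, IP, CS follow the XML example.
marks = {
    "Dhaval Gohel": {"CA": 40, "IP": 50, "CS": 60},
    "Rishabh Chauhan": {"CA": 60, "IP": 70, "CS": 80},
    "Mayur Padiya": {"CA": 70, "IP": 60, "CS": 80},
    "Ankit Prajapati": {"CA": 30, "IP": 40, "CS": 50},
    "Viral Prajapati": {"CA": 80, "IP": 90, "CS": 70},
}

# Assumption: every subject is marked out of 100.
MAX_PER_SUBJECT = 100


def percentage(name):
    """Process one student's raw marks into an overall percentage."""
    scores = marks[name].values()
    return 100 * sum(scores) / (MAX_PER_SUBJECT * len(scores))


def lowest_in(subject):
    """Name of the student with the lowest mark in a subject."""
    return min(marks, key=lambda name: marks[name][subject])


def highest_in(subject):
    """Name of the student with the highest mark in a subject."""
    return max(marks, key=lambda name: marks[name][subject])


print(percentage("Dhaval Gohel"))  # 50.0
print(lowest_in("CA"))             # Ankit Prajapati
print(highest_in("CA"))            # Viral Prajapati
```

The dictionary holds only raw facts; the printed statements are information because they result from processing those facts.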
Database:
- Find all employees with a salary >= 50,000
- Find all students who had 0% attendance last month
- Find all students who have an Apple laptop

Data Mining:
- Find all employees who are contractual (Classification)
- Find all students who are attending lectures (Clustering)
- Find all students who have both an Apple laptop and an Apple phone (Association Rule)

Data mining draws on:
• Database technology
• Information science
• Statistics
• Machine learning
• Visualization
• Other disciplines

[Figure: Data Mining at the center, surrounded by Information Science, Machine Learning, Database Technology, Statistics, Visualization, and Algorithms]

Classification of data mining systems is based on:
• Kind of database mined
  - Data model: relational, transactional, object-relational, or data warehouse.
  - Special types of data handled: spatial, time-series, text, stream, or multimedia data, or a World Wide Web mining system.
• Kind of knowledge mined
  - Data mining functionalities: characterization and discrimination, mining frequent patterns, classification and prediction, cluster analysis, outlier analysis, evolution analysis.
  - Data regularities vs. data irregularities.
• Kinds of techniques utilized
  - Degree of user interaction involved, e.g., autonomous systems, interactive exploratory systems, query-driven systems.
  - Method of data analysis employed, e.g., database-oriented or data-warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on.
• Application adapted
  - Finance, telecommunication, DNA, stock markets, e-mail, and so on.

[Figure: the KDD process, from Databases through Data Cleaning and Data Integration (Preprocessed Data), Selection and Transformation (Task-relevant Data), and Data Mining (Patterns) to Pattern Evaluation]

• Cleaning: remove noise and inconsistent data.
• Integration: multiple data sources may be combined.
• Selection: data relevant to the analysis task are retrieved from the database.
• Transformation: data are transformed into a form appropriate for mining.
  (for example, through summary or aggregation operations)
• Data Mining: various techniques such as association rule mining, classification, and clustering are applied to identify and count patterns.
• Pattern Evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measure.
  - For example, support and confidence for association rule mining.
• Knowledge Presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.

Example: the same steps applied to web log data
• Cleaning: remove error logs.
• Integration: multiple logs may be combined.
• Selection: records with a valid status and media type are selected.
• Transformation: data are grouped day-wise or week-wise.
• Data Mining: identify patterns and count frequent accesses.
• Pattern Evaluation: display frequently accessed sequences.
• Knowledge Presentation: graph of user count per URL page; graph of number of pages visited per IP address.

Components:
1. Databases, data warehouse, World Wide Web, or other information repository
2. Database or data warehouse server
3. Knowledge base
4. Data mining engine
5. Pattern evaluation module
6. User interface

• Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
• Tasks: descriptive and predictive
• Descriptive: characterize the general properties of the data in the database.
• Predictive: perform inference (draw conclusions) on the current data in order to make predictions.

1. Characterization and Discrimination
2. Mining Frequent Patterns
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis

• Data Characterization is a summarization of the general characteristics or features of a target class of data.
• For example: to analyze the improvement of students who study in 2nd Semester ME in GECM and whose marks increased by 5% in the current semester.
• Display forms: pie charts, bar charts, multidimensional data cubes, etc.
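The web log walk-through of the KDD steps earlier in this part can be sketched as a small pipeline. This is only an illustration under stated assumptions: the log records, field layout, function names, and thresholds (status codes of 400 or above treated as errors, status 200 and media type "text/html" treated as valid, a minimum count of 2) are all hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (date, ip, url, status, media_type).
log_a = [
    ("2016-02-25", "10.0.0.1", "/index.html", 200, "text/html"),
    ("2016-02-25", "10.0.0.2", "/index.html", 200, "text/html"),
    ("2016-02-25", "10.0.0.1", "/logo.png", 200, "image/png"),
    ("2016-02-25", "10.0.0.3", "/missing.html", 404, "text/html"),
]
log_b = [
    ("2016-02-26", "10.0.0.1", "/index.html", 200, "text/html"),
    ("2016-02-26", "10.0.0.2", "/notes.html", 200, "text/html"),
    ("2016-02-26", "10.0.0.2", "/index.html", 500, "text/html"),
]


def clean(records):
    """Cleaning: drop error entries (assumed: status >= 400)."""
    return [r for r in records if r[3] < 400]


def integrate(*logs):
    """Integration: combine several logs into one list."""
    return [record for log in logs for record in log]


def select(records, media_type="text/html"):
    """Selection: keep records with a valid status and media type."""
    return [r for r in records if r[3] == 200 and r[4] == media_type]


def transform(records):
    """Transformation: group (ip, url) visits day-wise."""
    by_day = defaultdict(list)
    for date, ip, url, _status, _media in records:
        by_day[date].append((ip, url))
    return dict(by_day)


def mine(by_day):
    """Data mining: count how often each page is accessed."""
    counts = Counter()
    for visits in by_day.values():
        counts.update(url for _ip, url in visits)
    return counts


def evaluate(counts, min_count=2):
    """Pattern evaluation: keep only frequently accessed pages."""
    return {url: n for url, n in counts.items() if n >= min_count}


records = select(integrate(clean(log_a), clean(log_b)))
by_day = transform(records)
page_counts = mine(by_day)
frequent = evaluate(page_counts)

# Knowledge presentation: a plain text report instead of a graph.
for url, n in sorted(frequent.items()):
    print(url, n)
```

Each function corresponds to one step, so a step can be swapped out (for example, week-wise grouping in `transform`) without touching the others.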
• Data Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
• For example: faculty may want to compare the results of students who study in 2nd Semester ME in GECM and whose marks increased by 5% with those whose marks decreased by 5% in the current semester.
• Display forms: pie charts, bar charts, multidimensional data cubes, etc.

• Frequent patterns are patterns that occur frequently in a data set.
• Forms: frequent itemsets, subsequences, and substructures.
• Frequent itemset: e.g., milk and bread.
• Subsequence: e.g., a PC followed by software.
• Substructure: subgraphs, trees, or lattices.
• Association rule mining is a method used to find interesting frequent patterns in a large set of data items.
• computer => antivirus [support = 2%, confidence = 60%]
  - Support = 2% means that a computer and antivirus were purchased together in 2% of all transactions.
  - Confidence = 60% means that 60% of the customers who purchased a computer also purchased antivirus.

• Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
• The model is derived from the analysis of a set of training data and is used to predict the class label of objects whose class label is unknown.
• Classification is a two-phase process:
  1) Learning: training data are analyzed by a classification algorithm.
  2) Classification: the model is used to assign data to class labels.
• Prediction models continuous-valued functions, i.e., it is used to predict missing or unavailable numeric data values rather than class labels.
• Regression analysis is a statistical method used for numeric prediction.

  Dhaval Gohel     40  50  60  Pass
  Rishabh Chauhan  60  70  80  Pass
  Mayur Padiya     70  30  80  Fail
  Ankit Prajapati  30  40  50

[Figure: a record for Rishabh Chauhan with marks 70 and 80, illustrating Prediction (estimating a missing numeric value) and Classification (assigning the label Pass)]

• Clustering analyzes data objects without consulting class labels.
• Clustering can be used to generate class labels for a group of data that had none at the beginning.
• The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.

• Outliers are data objects that do not comply with the general behavior or model of the data.
• Many data mining techniques discard outliers or exceptions as noise. However, in some applications these rare events are more interesting than the regular ones.
• The analysis of outlier data is referred to as outlier mining or outlier analysis, e.g., fraud detection.

• Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
• This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data.
• Distinct features of such analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Importance of Data Mining
• Data collected in large data repositories become "data tombs".
• Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
• Data mining tools turn data tombs into "golden nuggets" of knowledge.

Application of Data Mining
• Market analysis
• Fraud detection
• Customer retention
• Production control
• Science exploration

Major Issues in Data Mining
1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms

Social Impacts of Data Mining
• Privacy
• Profiling
• Unauthorized use
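As a closing worked example, the support and confidence measures used for the association rule computer => antivirus can be computed directly from transaction counts. The five transactions below are hypothetical and chosen only to keep the arithmetic easy to follow; they do not reproduce the 2% and 60% figures quoted in the slides. The function names are likewise illustrative.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"computer", "printer"},
    {"antivirus"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent transactions that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return (count_containing(both, transactions)
            / count_containing(antecedent, transactions))


# Rule: computer => antivirus
print(support({"computer"}, {"antivirus"}, transactions))     # 0.4
print(confidence({"computer"}, {"antivirus"}, transactions))  # 0.5
```

Here 2 of 5 transactions contain both items (support 0.4), and 2 of the 4 transactions with a computer also contain antivirus (confidence 0.5).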
