
Chapter 2: Data Mining
Dr. Goutam Sarker, Fellow: IE(I), Fellow: IETE(I), Senior Member: IEEE(USA), Associate Professor, CSE, NITD
4/30/2017 11:29 AM Data Mining / CSE Department / Dr. Goutam Sarker

What is Data Mining?
- The term “data mining” refers to finding relevant and useful information in databases.

Definition 1
Data mining, or knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This encompasses a number of technical approaches, such as clustering, data summarization, classification and pattern recognition.

Definition 2
Data mining is the search for the relationships and global patterns that exist in large databases but are hidden among vast amounts of data.

Definition 3
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.

KDD vs. Data Mining
- Knowledge Discovery in Databases (KDD): the term was formalized in 1989 to refer to the broad, high-level process of seeking knowledge from data.
- Data mining: only one of the many steps involved in knowledge discovery in databases.
The steps in the knowledge discovery process include data selection, data cleaning and preprocessing, data transformation and reduction, data mining algorithm selection, and finally the post-processing and interpretation of the discovered knowledge. The KDD process tends to be highly iterative and interactive.

Stages of KDD
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation and Evaluation
6. Data Visualization

Stages of KDD (contd.)
1. Selection: This stage is concerned with selecting or segmenting the data that are relevant to some criteria.
2. Preprocessing: Preprocessing is the data cleaning stage, where unnecessary information is removed.
3. Transformation: The data is not merely transferred across, but transformed so that it is suitable for the task of data mining. In this stage, the data is made usable and navigable.
4. Data Mining: This stage is concerned with the extraction of patterns from the data.
5. Interpretation and Evaluation: The patterns obtained in the data mining stage are converted into knowledge, which in turn is used to support decision making.
6. Data Visualization: Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data.

DBMS vs. DM
- A DBMS supports query languages, which are useful for query-triggered data exploration, whereas data mining supports automatic data exploration.
- If we know exactly what information we are seeking, a DBMS query suffices; if we only vaguely know the possible correlations or patterns, data mining techniques are useful.
- One of the tasks of data mining is hypothesis testing, wherein we formulate a hypothesis and test it by sifting through the database.
- The data mining application goes where the data naturally reside. This avoids performance degradation and takes full advantage of database technology.

Related Areas
- Statistics
- Machine Learning
  1. Supervised Learning
  2. Unsupervised Learning

Artificial Intelligence (AI) vs. Data Mining
The task of automatically discovering patterns in data has so far been mostly the domain of Artificial Intelligence. Two main aspects differentiate DM from AI:
1. Data mining emphasizes the human understandability of discovered patterns, whereas in AI the discovered patterns are meant to be used by the machine itself.
2. Data mining techniques are meant to scale to huge stores of data such as the World Wide Web (WWW). In contrast, traditional AI approaches have mostly been researched using small “toy” data sets that fit in main memory.
Data mining has borrowed a good deal from AI, especially from the field of machine learning, in which a program dynamically improves itself. Almost all classification techniques of machine learning have been used in data mining. Only those classification models that are not easily understandable by human users (e.g. neural network techniques) have been omitted.

Goals and DM Techniques
- Two fundamental goals of data mining:
  1. Prediction
  2. Description
Prediction makes use of existing variables in the database in order to predict unknown or future values of interest. Description focuses on finding patterns that describe the data and on presenting them for user interpretation.

Classification of Techniques
1. User-guided or verification-driven data mining
2. Discovery-driven data mining, or automatic discovery of rules

Data Mining Techniques
- Verification Model: In this process of data mining, the user makes a hypothesis and tests it on the data to verify its validity. The emphasis is on the user, who is responsible for formulating the hypothesis.
- Discovery Model: The discovery model differs in its emphasis. Here the system automatically discovers important information hidden in the data.
The data is sifted in search of frequently occurring patterns, trends and generalizations about the data, without guidance from the user.

Discovery Driven Tasks
1. Discovery of association rules
2. Discovery of classification rules
3. Clustering
4. Discovery of frequent episodes
5. Deviation detection

Discovery of Association Rules
- An association rule has the form X ⇒ Y, where X and Y are sets of items.
- The intuitive meaning of such a rule is that transactions of the database that contain X tend to contain Y.
- Given a database, the goal is to discover all the rules whose support and confidence are greater than or equal to the minimum support and the minimum confidence.

Classification
- Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known.

Clustering
- Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns.
- Clustering constitutes a major class of data mining algorithms.
- The objectives of clustering are:
  1. To uncover natural groupings
  2. To initiate hypotheses about the data
  3. To find a consistent and valid organization of the data

Discovery of Classification Rules
Classification involves finding rules that partition the data into disjoint groups. The input to classification is the training data set, whose class labels are already known. This can also be termed supervised learning. There are several classification discovery models:
1. Decision Trees
2. Neural Networks
3. Genetic Algorithms
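
The support and confidence thresholds in the association-rule definition above can be illustrated with a small worked example. The sketch below assumes the usual definitions, which the slides do not spell out: the support of an itemset is the fraction of transactions containing it, and the confidence of X ⇒ Y is support(X ∪ Y) / support(X). The function names, the `BASKETS` data and the brute-force search are hypothetical choices made for illustration; they are not taken from the slides and are not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical toy transaction database (not from the slides).
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(x, y, transactions):
    """Assumed definition: support(X union Y) / support(X)."""
    sx = support(x, transactions)
    if sx == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / sx

def find_rules(transactions, min_support, min_confidence):
    """Brute-force search for all rules X => Y meeting both thresholds.

    Enumerates every itemset of size >= 2, so it only suits tiny
    examples with a non-empty transaction list.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s = support(itemset, transactions)
            if s < min_support:
                continue
            # Split the itemset into a non-empty X and a non-empty Y.
            for k in range(1, size):
                for x in combinations(itemset, k):
                    y = tuple(i for i in itemset if i not in x)
                    c = confidence(x, y, transactions)
                    if c >= min_confidence:
                        rules.append((set(x), set(y), s, c))
    return rules

if __name__ == "__main__":
    for x, y, s, c in find_rules(BASKETS, min_support=0.5, min_confidence=0.9):
        print(f"{sorted(x)} => {sorted(y)}  support={s:.2f}  confidence={c:.2f}")
```

On this toy data, {bread, butter} appears in 2 of 4 transactions (support 0.5), and every transaction containing butter also contains bread, so butter ⇒ bread has confidence 1.0 and is the only rule that meets both thresholds.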

Frequent Episodes
Frequent episodes are sequences of events that occur frequently, close to each other, and are extracted from a time sequence.
- R is a set of event types.
- A is a particular type of event; therefore A ∈ R.
- An event is defined as a pair (A, t), where A ∈ R as above and t is the time at which the event occurs.
- A sequence of events (also called an event sequence) S on R is a triple (Ts, Tc, S), where
  Ts = starting time
  Tc = ending time
  S = {(A1, t1), (A2, t2), ..., (An, tn)} is the ordered sequence of events, such that Ai ∈ R and Ts <= ti <= Tc for all i = 1, 2, ..., n, and ti <= t(i+1) for all i = 1, 2, ..., n-1.

3 types of episodes
a) Serial episodes: the events occur in sequence.
b) Parallel episodes: there are no constraints on the order in which the event types occur.
c) Non-serial, non-parallel episodes: for example, occurrences of A and B precede an occurrence of C, and there is no constraint on the relative order of A and B.

Deviation Detection
- Deviation detection is to identify outlying points in a particular data set, and explain whether they are due to noise or other impurities being present in the data or due to trivial reasons.

Mining Problems
1. Neural Networks
2. Genetic Algorithms
3. Rough Set Techniques
4. Support Vector Machines

Other Mining Problems
- Sequence Mining: concerned with mining sequence data.
- Web Mining: the World Wide Web is a fertile area for data mining research, given the huge amount of information available online.
- Text Mining: text documents are structured by means of information extraction, text categorization, etc.
- Spatial Data Mining: the branch of data mining that deals with spatial (location) data.
  1. Geographically referenced data
  2. Digital mapping
  3. Remote sensing

DM Applications: case studies
1. Housing Loan Prepayment Prediction
2. Crime Detection
3. Customer Retention
4. Brand Loyalty
5. Banking
   - Detection of patterns of fraudulent credit card use.
   - Identifying ‘loyal’ customers.
   - Determining ‘credit card spending’ by customer group.
6. Astronomy: Detection of unusual stars, galaxies, nebulas or super galaxies may lead to the discovery of previously unknown phenomena and celestial bodies.

End of Chapter 2
