MCA 204, Data Warehousing & Data Mining
UNIT-4
Data Mining Basics
© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63, by Dr. Deepali Kamthania
Learning Objective
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Evolution of Sciences: New Data Science Era
Before 1600: Empirical science
1600-1950s: Theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s: Computational science
 Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now: Data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes
 Data mining is a major new challenge!
What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
It is also known as Knowledge Discovery from Data.
What is Data Mining?
Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount
of data
 Data mining: a misnomer?
Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database
systems and data warehousing
communities
• Data mining plays an essential role in the
knowledge discovery process
Typical flow:
Databases → Data Cleaning & Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Example: A Web Mining Framework
Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into
knowledge-base
Data Mining in Business Intelligence
Layers, bottom to top, with increasing potential to support business decisions:
• Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Presentation: visualization techniques (Business Analyst)
• Decision Making (End User)
Example: Mining vs. Data Exploration
• Business intelligence view
 Warehouse, data cube, reporting but not much mining
• Business objects vs. data mining tools
• Supply chain example: tools
• Data presentation
• Exploration
KDD Process: A Typical View from ML and
Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Example: Medical Data Mining
• Health care & medical data mining often adopt such a view from statistics and machine learning.
• Preprocessing of the data (including feature extraction and
dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation
Multi-Dimensional View of Data Mining
Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social
and information networks
Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
 Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
 Scalable methods for computing (i.e., materializing) multidimensional aggregates
 OLAP (online analytical processing)
Multidimensional concept description: Characterization and
discrimination
 Generalize, summarize, and contrast data
characteristics, e.g. dry vs. wet region
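A roll-up over one dimension can be sketched in a few lines of Python (the sales records, region names, and amounts here are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

# Hypothetical sales records: (region, year, amount) -- illustrative only.
sales = [
    ("dry", 2023, 100),
    ("dry", 2024, 150),
    ("wet", 2023, 80),
    ("wet", 2024, 120),
]

def roll_up(records, dim):
    """Materialize one aggregate: total amount grouped by dimension
    `dim` (0 = region, 1 = year)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[dim]] += rec[2]
    return dict(totals)

by_region = roll_up(sales, 0)  # contrasts dry vs. wet regions
by_year = roll_up(sales, 1)
```

A full data cube would materialize every such group-by, including the grand total; this sketch computes one aggregate at a time.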
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
 A typical association rule:
Milk => Bread [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly
correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering, and
other applications?
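Counting frequent item pairs, the simplest case of frequent-itemset mining, can be sketched as follows (the transactions and the `frequent_pairs` helper are hypothetical, for illustration only):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions, illustrative only.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def frequent_pairs(baskets, min_support):
    """Count every item pair and keep those whose support (fraction of
    baskets containing the pair) meets the threshold."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.5)
```

Real miners (e.g., Apriori) prune the search space instead of enumerating all pairs, but the support computation is the same.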
Data Mining Function: (3) Classification
Classification and label prediction
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications:
 Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
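As a minimal stand-in for the classifiers listed above, a one-nearest-neighbor sketch in Python (the training data, feature names, and labels are made up for illustration):

```python
def nn_classify(train, point):
    """Predict the label of `point` from its single nearest training
    example (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda ex: dist2(ex[0], point))
    return label

# Hypothetical training examples: ((mpg, weight in lb), class label).
train = [
    ((35.0, 1800.0), "economy"),
    ((12.0, 4500.0), "gas-guzzler"),
    ((28.0, 2400.0), "economy"),
    ((15.0, 4100.0), "gas-guzzler"),
]

prediction = nn_classify(train, (30.0, 2000.0))
```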
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., Class label is unknown)
• Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing
interclass similarity
• Many methods and applications
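The intra-/inter-class similarity principle is what k-means optimizes; a minimal one-dimensional sketch (illustrative house-price data and hand-picked starting centroids):

```python
def kmeans(points, centroids, iterations=10):
    """Minimal 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical house prices (in $1000s) clustered into two groups.
centers, groups = kmeans([100, 110, 120, 400, 420], [100.0, 400.0])
```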
Data Mining Function: (5) Outlier Analysis
Outlier analysis
 Outlier: A data object that does not comply with the
general behavior of the data
 Noise or exception? ― One person’s garbage could be
another person's treasure
 Methods: by-product of clustering or regression
analysis, …
 Useful in fraud detection, rare events analysis
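A simple statistical stand-in for these methods is z-score-based outlier flagging (the transaction amounts and the 2.0 threshold are illustrative assumptions):

```python
def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean in standard
    deviations) exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical transaction amounts: one looks fraudulent.
suspicious = zscore_outliers([20, 22, 19, 21, 20, 500])
```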
Relationships
Let us take the case study of a fast food restaurant.
The combo meals that are available are designed after
applying data mining to the sales trends’ data over some
months or years.
• Data mining discovers relationships of this type. The
relationships may be between two or more different objects
along with the time dimension or between the attributes of
the same object.
• Discovery of knowledge is a key result of data mining.
Case Study
The Fast Food industry is highly competitive, one where a
very small change in operations can have a significant
impact on the bottom line.
For this reason, quick access to comprehensive
information for both standard and on demand reporting is
essential. Implement the various data mining techniques to address this requirement for ABC Corporation, a fast food franchisee operating approximately 80 outlets at different places.
The results should provide strategic and tactical
decision support to all levels of management within the
Corporation.
Time and Ordering: Sequential Pattern, Trend
and Evolution Analysis
Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
 Sequential pattern mining
e.g., first buy digital camera, then buy large SD memory
cards
 Periodicity analysis
 Motifs and biological sequence analysis
Approximate and consecutive motifs
 Similarity-based analysis
Mining data streams
 Ordered, time-varying, potentially infinite, data streams
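Sequential-pattern support can be sketched by checking whether an ordered pattern occurs as a (not necessarily contiguous) subsequence of each customer's purchase sequence (the purchase data below is invented for illustration):

```python
def is_subsequence(pattern, sequence):
    """True if `pattern`'s items appear in `sequence` in the same order,
    not necessarily contiguously."""
    it = iter(sequence)
    return all(item in it for item in pattern)

# Hypothetical per-customer purchase sequences, ordered by time.
purchases = [
    ["digital camera", "case", "SD card"],
    ["laptop", "digital camera", "SD card"],
    ["SD card", "digital camera"],
]
pattern = ["digital camera", "SD card"]

# Fraction of customers whose sequence contains the pattern.
support = sum(is_subsequence(pattern, s) for s in purchases) / len(purchases)
```

Note the third customer bought the items in the opposite order, so the pattern does not match there.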
Patterns
• Data mining tools mine the usage pattern of the customers
which helps the restaurant owner to launch different special
offers at different places at different times.
• This potential usage pattern also deduces results which
help in designing a marketing campaign.
Structure and Network Analysis
Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
 Social networks: actors (objects, nodes) and relationships (edges)
 e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks
 A person could be in multiple information networks: friends, family, classmates, …
 Links carry a lot of semantic information: Link mining
Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks
 Web community discovery, opinion mining, usage mining, …
Evaluation of Knowledge
Are all mined knowledge interesting?
 One can mine tremendous amount of “patterns” and knowledge
 Some may fit only certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
Data Mining: Confluence of Multiple Disciplines
Data mining sits at the confluence of multiple disciplines:
• Machine Learning
• Pattern Recognition
• Statistics
• Visualization
• Database Technology
• Algorithms
• High-Performance Computing
• Applications
Why Confluence of Multiple Disciplines?
Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes of data
High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
New and sophisticated applications
Steps in Data Mining
Typical distribution of effort across the steps:
• Determination of Business Objectives: 20%
• Selection and Preparation of Data: 45%
• Application of Suitable Data Mining Techniques: 15%
• Evaluation and Application of Results: 20%
Steps in Data Mining (Cont...)
We do not try to predict the knowledge we are going to
discover but define the business objectives of the
engagement.
Step 1: Define Business Objectives
• State why you need a data mining solution.
• Define your expectations and express how the final results will be used in the operational system.
Step 2: Prepare Data
• Consists of data selection, pre-processing of data and data
transformation.
• Use the business objectives to determine what data has to
be selected. The variables selected are called active
variables.
Steps in Data Mining (Cont...)
Pre-processing is meant to improve the quality of the selected data. It involves enriching the selected data with external data, removing noisy data, and handling missing values.
Step 3: Perform Data Mining
• The knowledge discovery engine applies the selected algorithm to the prepared data.
• The output from this step is a relationship or pattern.
Step 4: Evaluate Results
• In this step, all the resulting patterns are examined.
• A filtering mechanism is applied and only the promising
patterns are selected.
Cont...
Step 5: Present Discoveries
• This may be in the form of visual navigation, charts, graphs, or
free-form texts.
• It also includes storing of interesting discoveries in the knowledge
base for repeated use.
Step 6: Incorporate Usage of Discoveries
• This step is for using the results to create actionable items in the
business.
• The results are assembled in the best way so that they can be
exploited to improve the business.
OLAP versus Data Mining
Features | OLAP | Data Mining
Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening.
Data granularity | Summary data. | Detailed transaction-level data.
Number of business dimensions | Limited number of dimensions. | Large number of dimensions.
Number of dimension attributes | Small number of attributes. | Many dimension attributes.
Sizes of datasets for the dimensions | Not large for each dimension. | Usually very large for each dimension.
Analysis approach | User-driven interactive analysis. | Data-driven automatic knowledge discovery.
Analysis techniques | Multidimensional, drill-down, and slice-and-dice. | Prepare data, launch mining tool and sit back.
State of the technology | Mature and widely used. | Still emerging; some parts of the technology more mature.
Data Mining in the Data Warehouse Environment
(Diagram) Source operational systems feed a data staging area containing flat files with extracted and transformed data, and load-image files ready for loading into the data warehouse, with several options for data extraction. From the enterprise data warehouse, data is selected, extracted, transformed, and prepared for mining; the warehouse also feeds the OLAP system alongside the data mining engine.
Functions and Application Areas
Application Areas | Examples | Mining Processes | Mining Techniques
Fraud Detection | Credit card frauds; internal audits; warehouse pilferage | Determination of variations from norms | Data visualization; memory-based reasoning
Risk Assessment | Credit card upgrades; mortgage loans; customer retention; credit ratings | Detection and analysis of links | Decision trees; memory-based reasoning
Market Analysis | Market basket analysis; target marketing; cross selling; customer relationship marketing | Predictive modelling; database segmentation | Cluster detection; decision trees; link analysis; genetic algorithms
Applications of Data Mining
• Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
Major Issues in Data Mining (1)
Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining
methods
Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Summary
• Data mining: Discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology, in
great demand, with wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier analysis,
etc.
• Data mining technologies and applications
• Major issues in data mining
Cont…
• Motivation: the need to extract useful information and knowledge from large amounts of data (the data explosion problem)
• Data Mining tools perform data analysis and may uncover
important data patterns, contributing greatly to business
strategies, knowledge bases, and scientific and medical
research.
What is Data Mining?
• Data mining refers to extracting or “mining” knowledge
from large amounts of data. Also referred as Knowledge
Discovery in Databases.
• It is a process of discovering interesting knowledge from
large amounts of data stored either in databases, data
warehouses, or other information repositories.
Architecture of a Typical Data Mining System
• Graphical user interface
• Pattern evaluation
• Knowledge base
• Data mining engine
• Database or data warehouse server (data cleansing, data integration, filtering)
• Database / data warehouse
Cont…
• Misconception: Data mining systems can autonomously
dig out all of the valuable knowledge from a given large
database, without human intervention.
• If there were no user intervention, the system would uncover a large set of patterns that might even surpass the size of the database. Hence, user intervention is required.
• This user communication with the system is provided by
using a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which
can be specified in the form of a data mining query.
• Task Relevant Data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measure
• Presentation and visualization of discovered patterns
Task Relevant Data
• Data portion to be investigated.
• Attributes of interest (relevant attributes) can be
specified.
• Initial data relation
• Minable view
Example
• If a data mining task is to study associations between
items frequently purchased at AllElectronics by customers
in Canada, the task relevant data can be specified by
providing the following information:
• Name of the database or data warehouse to be used
(e.g., AllElectronics_db)
• Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and items_sold)
• Conditions for selecting the relevant data (e.g., retrieve
data pertaining to purchases made in Canada for the
current year)
• The relevant attributes or dimensions (e.g., name and
price from the item table and income and age from the
customer table)
Kind of Knowledge to be Mined
• It is important to specify the knowledge to be mined, as
this determines the data mining function to be
performed.
• Kinds of knowledge include concept description, association, classification, prediction and clustering.
• Users can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
Example
A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:

P(X: customer, W) ^ Q(X, Y) => buys(X, Z)

Metarules such as the following can be specified:

age(X, “30…39”) ^ income(X, “40K…49K”) => buys(X, “VCR”) [2.2%, 60%]
occupation(X, “student”) ^ age(X, “20…29”) => buys(X, “computer”) [1.4%, 70%]
Background Knowledge
• It is the information about the domain to be mined
• Concept hierarchy: is a powerful form of background
knowledge.
• Four major types of concept hierarchies:
 schema hierarchies
 set-grouping hierarchies
 operation-derived hierarchies
 rule-based hierarchies
Concept Hierarchies (1)
• Defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
• Allows data to be mined at multiple levels of abstraction.
• These allow users to view data from different perspectives, allowing further insight into the relationships.
• Example (location)
Example
(Concept hierarchy for location)
Level 0: all
Level 1: Canada, USA
Level 2: British Columbia, Ontario (Canada); New York, Illinois (USA)
Level 3: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)
Concept Hierarchies (2)
• Rolling Up - Generalization of data
 Allows to view data at more meaningful and explicit
abstractions.
 Makes it easier to understand
 Compresses the data
 Would require fewer input/output operations
• Drilling Down - Specialization of data
 Concept values replaced by lower-level concepts
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints
• Example:
 Regional sales manager may prefer the previous concept hierarchy
but marketing manager might prefer to see location with respect to
linguistic lines in order to facilitate the distribution of commercial
ads.
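Rolling up along such a hierarchy can be sketched as a table lookup plus aggregation (the city-to-province-to-country mapping and the sales figures are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical city -> (province/state, country) hierarchy; illustrative.
hierarchy = {
    "Vancouver": ("British Columbia", "Canada"),
    "Victoria": ("British Columbia", "Canada"),
    "Toronto": ("Ontario", "Canada"),
    "Buffalo": ("New York", "USA"),
    "Chicago": ("Illinois", "USA"),
}

def roll_up_sales(sales_by_city, level):
    """Generalize city-level sales upward: level 0 = province/state,
    level 1 = country."""
    totals = defaultdict(int)
    for city, amount in sales_by_city.items():
        totals[hierarchy[city][level]] += amount
    return dict(totals)

sales = {"Vancouver": 5, "Victoria": 3, "Toronto": 7, "Buffalo": 2}
by_province = roll_up_sales(sales, 0)
by_country = roll_up_sales(sales, 1)
```

Drilling down is the inverse direction: replacing each country total by its per-province (or per-city) components.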
Schema Hierarchies
• Schema hierarchy is the total or partial order among
attributes in the database schema.
• May formally express existing semantic relationships
between attributes.
• Provides metadata information.
• Example: location hierarchy
street < city < province/state < country
Set-grouping Hierarchies
• Organizes values for a given attribute into groups or sets
or range of values.
• Total or partial order can be defined among groups.
• Used to refine or enrich schema-defined hierarchies.
• Typically used for small sets of object relationships.
• Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20…39} ⊂ young
{40…59} ⊂ middle_aged
{60…89} ⊂ senior
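Mapping a raw age onto such a set-grouping hierarchy is a simple range lookup (a sketch, assuming the ranges young: 20…39, middle_aged: 40…59, senior: 60…89):

```python
def age_group(age):
    """Map a raw age onto the set-grouping hierarchy; ages outside
    every range fall through to None."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None
```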
Operation-derived Hierarchies
• Operation-derived hierarchies are based on specified operations, which may include decoding of information-encoded strings, information extraction from complex data objects, and data clustering.
• Example: a URL or email address such as [email protected] gives login name < dept. < univ. < country
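Decoding such an address into its hierarchy levels can be sketched as follows (the address used is hypothetical, since the slide's example address is redacted):

```python
def email_hierarchy(address):
    """Decode an address of the form login@dept.univ.country into its
    hierarchy levels, most specific first."""
    login, domain = address.split("@")
    return [login] + domain.split(".")

# Hypothetical address for illustration only.
levels = email_hierarchy("dmining@cs.iitd.in")
```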
Rule-based Hierarchies
• Rule-based hierarchies occur when either the whole or a
portion of a concept hierarchy is defined as a set of rules and
is evaluated dynamically based on the current database data and
the rule definitions.
• Example: The following rules are used to categorize items as
low_profit_margin, medium_profit_margin, and high_profit_margin.
• low_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) < 50)
• medium_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) ≥ 50) ^ ((P1−P2) ≤ 250)
• high_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) > 250)
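The rule set above can be sketched as a small function; the thresholds 50 and 250 come from the rules, while the function name and example values are illustrative.

```python
def profit_margin_category(price, cost):
    """Categorize an item by profit margin using the three rules above."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    elif margin <= 250:
        return "medium_profit_margin"
    return "high_profit_margin"

print(profit_margin_category(300, 280))  # low_profit_margin (margin 20)
print(profit_margin_category(500, 100))  # high_profit_margin (margin 400)
```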
Interestingness Measure (1)
• Used to limit the number of uninteresting patterns returned by
the mining process.
• Based on the structure of patterns and the statistics underlying them.
• Each measure is associated with a threshold that can be controlled by the user.
• Patterns not meeting the threshold are not presented to the user.
• Objective measures of pattern interestingness:
 Simplicity
 certainty (confidence)
 utility (support)
 novelty
Interestingness Measure (2)
• Simplicity
A pattern's interestingness is based on its overall
simplicity for human comprehension.
Example: Rule length is a simplicity measure
• Certainty (confidence)
Assesses the validity or trustworthiness of a pattern.
Confidence is a certainty measure:
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”)
means that 85% of all customers who purchased a computer also bought
software
Interestingness Measure (3)
• Utility (support)
Measures the usefulness of a pattern:
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
A support of 30% for the previous rule means that 30% of
all customers in the computer department purchased both
a computer and software.
• Association rules that satisfy both the minimum confidence
and minimum support thresholds are referred to as strong
association rules.
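As a sketch, both measures can be computed over a toy transaction set (the transactions and item names are illustrative):

```python
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

def support(transactions, itemset):
    # fraction of all transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # fraction of transactions containing the antecedent that also contain the consequent
    both = sum((antecedent | consequent) <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)
    return both / ante

print(support(transactions, {"computer", "software"}))       # 0.6
print(confidence(transactions, {"computer"}, {"software"}))  # 0.75
```

With a minimum support of 30% and minimum confidence of 70%, the rule computer => software would count as a strong association rule in this toy data.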
Interestingness Measure (4)
• Novelty
Patterns contributing new information to the given pattern
set are called novel patterns (example: a data exception).
Removing redundant patterns is a strategy for detecting
novelty.
Presentation and Visualization
• For data mining to be effective, data mining systems
should be able to display the discovered patterns in
multiple forms, such as rules, tables, crosstabs (crosstabulations), pie or bar charts, decision trees, cubes, or
other visual representations.
• The user must be able to specify the forms of presentation to
be used for displaying the discovered patterns.
Architectures of Data Mining System
• With the popular and diverse applications of data mining, it is
expected that a good variety of data mining systems will be
designed and developed.
• Comprehensive information processing and data analysis
will be continuously and systematically surrounded by
data warehouses and databases.
• A critical question in design is whether to integrate the data
mining system with a database (DB) or data warehouse (DW) system.
• This gives rise to four architectures:
 No coupling
 Loose coupling
 Semi-tight coupling
 Tight coupling
Cont.
• No coupling
 The DM system does not utilize any functionality of a DB or DW system
• Loose coupling
 The DM system uses some facilities of a DB or DW system, such as
storing the data in one of these systems and using them for
data retrieval
• Semi-tight coupling
 Besides linking the DM system to a DB/DW system, efficient
implementations of a few essential DM primitives are provided
• Tight coupling
 The DM system is smoothly integrated with the DB/DW system; each of
DM and DB/DW is treated as a main functional component of an
information retrieval system
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling,
…
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be
understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as
Bayesian formula or decision tree
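A minimal sketch of the attribute-mean strategy, assuming missing entries are marked as None (the income values are illustrative):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values
    (one of the automatic fill-in strategies above)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

incomes = [30000, None, 50000, None, 40000]
print(fill_missing_with_mean(incomes))
# [30000, 40000.0, 50000, 40000.0, 40000]
```

The class-conditional variant works the same way, but computes one mean per class label before filling.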
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
 smooth by fitting the data into regression functions
Clustering
 detect and remove outliers
Combined computer and human inspection
 detect suspicious values and check by human (e.g.,
deal with possible outliers)
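A minimal sketch of equal-frequency binning with smoothing by bin means (the sorted price list is a common textbook example, used here for illustration):

```python
def smooth_by_bin_means(data, n_bins):
    """Sort, partition into equal-frequency bins, then replace each
    value by the mean of its bin."""
    data = sorted(data)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # last bin absorbs any leftover values
        bin_ = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[(n_bins - 1) * size:]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value with the nearest of the bin's minimum and maximum.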
Data Cleaning as a Process
Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and
relationships, and detect violators (e.g., use correlation and clustering
to find outliers)
Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
Data Integration
Data integration:
 Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
Entity identification problem:
 Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple
databases
 Object identification: The same attribute or object may
have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Nominal Data)
Χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ² value, the more likely the variables are
related
• The cells that contribute the most to the Χ² value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction    50 (210)   1000 (840)       1050
Sum (col.)                 300         1200             1500

• Χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are
correlated in the group
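The calculation above can be verified in a few lines of plain Python:

```python
# observed and expected counts from the contingency table, cell by cell
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 507.94 (the slide truncates this to 507.93)
```

A value this large is far beyond the critical χ² value at any common significance level for 1 degree of freedom, so the two attributes are strongly correlated.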
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product
moment coefficient):
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)
 where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ aᵢbᵢ is the sum of the AB cross-products.
• If r(A,B) > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
• r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
(the attributes discourage each other)
Cont...
• Mean: Ā = (Σ A) / n
• Standard deviation: σ_A = sqrt( Σ (A − Ā)² / (n − 1) )
Visually Evaluating Correlation
[Figure: scatter plots showing correlations ranging from –1 to 1.]
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship
between objects
• To compute correlation, we standardize data
objects, A and B, and then take their dot product
a′ₖ = (aₖ − mean(A)) / std(A)
b′ₖ = (bₖ − mean(B)) / std(B)
correlation(A, B) = A′ · B′
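A sketch of this standardize-then-dot-product view; dividing the dot product by n − 1 matches the (n − 1) factor in the sample correlation formula above:

```python
import math

def pearson(a, b):
    """Pearson correlation as a standardized dot product: r = (A' . B') / (n - 1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    a_std = [(x - mean_a) / std_a for x in a]   # standardize A
    b_std = [(x - mean_b) / std_b for x in b]   # standardize B
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0 (perfect positive correlation)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0 (perfect negative correlation)
```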
Covariance (Numeric Data)
• Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n
Correlation coefficient: r(A,B) = Cov(A, B) / (σ_A σ_B)
 where n is the number of tuples, Ā and B̄ are the
respective mean or expected values of A and B, and σ_A and
σ_B are the respective standard deviations of A and B.
Covariance (Numeric Data)
• Positive covariance: If Cov(A,B) > 0, then A and B both
tend to be larger than their expected values.
• Negative covariance: If Cov(A,B) < 0, then if A is larger
than its expected value, B is likely to be smaller than its
expected value.
• Independence: If A and B are independent, Cov(A,B) = 0, but the converse is not true:
 Some pairs of random variables may have a
covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of
0 imply independence
Co-Variance: An Example
It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
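The stock example can be checked with the simplified formula Cov(A, B) = E(A·B) − Ā·B̄:

```python
A = [2, 3, 5, 4, 6]    # stock A's prices over the week
B = [5, 8, 10, 11, 14] # stock B's prices over the week
n = len(A)

mean_a = sum(A) / n  # E(A) = 4
mean_b = sum(B) / n  # E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b
print(round(cov, 2))  # 4.0 > 0, so the two prices tend to rise together
```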
Data Reduction Strategies
• Data reduction
Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
• Why data reduction?
• A database/data warehouse may store terabytes of data.
• Complex data analysis may take a very long time to run on
the complete data set.
Data Reduction Strategies
Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant
attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data
Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
 Data compression
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering,
outlier analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
Mapping Data to a New Space
 Fourier transform
 Wavelet transform
[Figure: two sine waves, the same waves with added noise, and their frequency-domain representations.]
What Is Wavelet Transform?
• Decomposes a signal into
different frequency subbands
 Applicable to n-dimensional signals
• Data are transformed to
preserve relative distance
between objects at different
levels of resolution
• Allow natural clusters to
become more distinguishable
• Used for image compression
Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better
lossy compression, localized in space
[Figure: Haar-2 and Daubechies-4 wavelets.]
• Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies the two functions recursively, until reaching the desired length
Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
Ŝ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
• Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
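The transform of S above can be reproduced with a minimal recursive Haar decomposition (pairwise averages for smoothing, pairwise half-differences for detail):

```python
def haar_decompose(signal):
    """Recursive Haar wavelet decomposition: at each level keep pairwise
    averages (smoothing) and pairwise half-differences (detail),
    until a single overall average remains."""
    detail_coeffs = []
    s = list(signal)
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        detail_coeffs = details + detail_coeffs  # coarser details go first
        s = averages
    return s + detail_coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Zeroing the small detail coefficients and inverting the transform gives the lossy compressed approximation described above.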
Why Wavelet Transform?
• Use hat-shape filters
 Emphasize region where points cluster
 Suppress weaker information in their boundaries
• Effective removal of outliers
 Insensitive to noise, insensitive to input order
• Multi-resolution
 Detect arbitrary shaped clusters at different scales
• Efficient
 Complexity O(N)
• Only applicable to low dimensional data
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation
in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors
of the covariance matrix, and these eigenvectors define the
new space.
[Figure: data points in the (x1, x2) plane with principal axis e.]
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
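The steps above can be sketched with NumPy (assuming NumPy is available; the data points are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional data onto its k strongest principal components."""
    X = X - X.mean(axis=0)                  # normalize: center each attribute
    cov = np.cov(X, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    components = eigvecs[:, order[:k]]      # keep the k strongest components
    return X @ components                   # each row is a linear combination of them

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
Z = pca(X, 1)      # reduce 2-D data to 1-D
print(Z.shape)     # (8, 1)
```

Reconstructing `Z @ components.T + X.mean(axis=0)` would give the approximation of the original data mentioned in the last step.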
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
 Duplicate much or all of the information contained in one
or more other attributes
 E.g., purchase price of a product and the amount of sales
tax paid
• Irrelevant attributes
 Contain no information that is useful for the data mining
task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
 Best single attribute under the attribute independence
assumption: choose by significance tests
 Best step-wise feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:
Use attribute elimination and backtracking
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than the
original ones
• Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
 Attribute construction
Combining features (see: discriminative frequent patterns
in Chapter 7)
Data discretization
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
 Ex.: Log-linear models: obtain the value at a point in
m-D space as the product of values on appropriate marginal
subspaces
• Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
• Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
 Approximates discrete multidimensional probability
distributions
Regression Analysis
• Regression analysis: a collective name for
techniques for the modeling and analysis of
numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or
more independent variables (aka.
explanatory variables or predictors)
• The parameters are estimated so as to
give a "best fit" of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
• Used for prediction (including forecasting of
time-series data), inference, hypothesis
testing, and modeling of causal relationships
[Figure: data points with fitted line y = x + 1; X1 is mapped to the prediction Y1′.]
Regress Analysis and Log-Linear Models
Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
 Useful for dimensionality reduction and data smoothing
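A minimal least-squares estimate of the two regression coefficients w and b in Y = wX + b, using the closed-form solution:

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b in Y = w*X + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope: covariance of X and Y divided by variance of X
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x   # intercept passes through the means
    return w, b

# Points lying exactly on y = x + 1 recover w = 1, b = 1
w, b = fit_line([1, 2, 3, 4], [2, 3, 4, 5])
print(w, b)  # 1.0 1.0
```

For numerosity reduction, only w and b need to be stored; the original (x, y) tuples can be discarded apart from possible outliers.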
Histogram Analysis
• Divide data into buckets and store the average (sum) for each
bucket
• Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): each bucket holds about the same number of tuples
[Figure: equal-width histogram over values 10,000 to 90,000.]
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
Sampling
• Sampling: obtaining a small sample s to represent the
whole data set N
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Key principle: choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling
• Note: Sampling may not reduce database I/Os (page at a
time)
Types of Sampling
• Simple random sampling
 There is an equal probability of selecting any particular
item
• Sampling without replacement
 Once an object is selected, it is removed from the
population
• Sampling with replacement
 A selected object is not removed from the population
• Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
 Used in conjunction with skewed data
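The three schemes can be sketched with Python's random module (the population and strata here are illustrative; the seed just makes the run reproducible):

```python
import random

random.seed(42)
population = list(range(100))

# Simple random sampling without replacement: selected items leave the population
srswor = random.sample(population, 10)

# Simple random sampling with replacement: an item may be drawn more than once
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw proportionally (here ~10%) from each partition
strata = {"young": list(range(30)), "middle_aged": list(range(30, 80)),
          "senior": list(range(80, 100))}
stratified = {name: random.sample(items, max(1, len(items) // 10))
              for name, items in strata.items()}

print(len(srswor), len(srswr))                      # 10 10
print({k: len(v) for k, v in stratified.items()})   # {'young': 3, 'middle_aged': 5, 'senior': 2}
```

Because each stratum is sampled proportionally, a skewed group like `senior` is still represented, which is exactly where simple random sampling tends to fail.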
Sampling: With or Without Replacement
[Figure: raw data with samples drawn without (SRSWOR) and with (SRSWR) replacement.]
Sampling: Cluster or Stratified Sampling
[Figure: raw data and its cluster/stratified sample.]
Data Cube Aggregation
• The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data cubes
 Further reduce the size of the data to deal with
• Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
• Queries regarding aggregated information should be
answered using data cube, when possible
Data Reduction 3: Data Compression
String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
 Typically short and varies slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
Data Compression
[Figure: Original Data reduced to Compressed Data; lossless compression recovers the Original Data exactly, while lossy compression yields only an approximated Original Data]
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified with
one of the new values
• Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]
  v' = (v - minA) / (maxA - minA) × (new_maxA - new_minA) + new_minA
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v - μA) / σA
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
  v' = v / 10^j
 Where j is the smallest integer such that Max(|v'|) < 1
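The three normalization formulas can be checked directly in Python; the small functions below (names are my own) reproduce the slide's worked income examples.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Map v linearly from [min_a, max_a] into [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # Express v as a number of standard deviations from the mean.
    return (v - mu) / sigma

def decimal_scaling(values):
    # Divide by the smallest power of 10 that pulls every |v'| below 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance `min_max(73600, 12000, 98000)` gives roughly 0.716 and `z_score(73600, 54000, 16000)` gives 1.225, matching the examples above.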
Discretization
• Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—numbers, e.g., integer or real values
• Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
 Binning (top-down split, unsupervised)
 Histogram analysis (top-down split, unsupervised)
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
• Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well
• Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
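A short Python sketch of the equal-frequency binning and smoothing steps above; it reproduces the three bins and both smoothed versions from the price example (it assumes, as the example does, that the value count divides evenly into the bins).

```python
def equal_frequency_bins(values, n_bins):
    # Sort, then cut into bins of equal size (assumes an even division).
    xs = sorted(values)
    size = len(xs) // n_bins
    return [xs[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Every value in a bin is replaced by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Each value moves to the nearer of its bin's min and max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```

Running these on the sorted prices 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 yields exactly the bins and smoothed bins shown above.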
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-frequency binning and by equal-interval-width binning; K-means clustering leads to better results]
Discretization by Classification & Correlation
Analysis
• Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
 Merge performed recursively, until a predefined stopping condition
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
• Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
 E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated
based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is placed
at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
country (15 distinct values)
 province_or_state (365 distinct values)
  city (3,567 distinct values)
   street (674,339 distinct values)
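The distinct-value heuristic can be sketched as follows (the toy address table and function name are my own; real data would be far larger). Sorting attributes by ascending distinct-value count recovers the country < state < city < street ordering.

```python
def auto_hierarchy(table, attrs):
    # Heuristic from the slide: the attribute with the fewest distinct
    # values goes to the top of the hierarchy, the most to the bottom.
    counts = {a: len({row[a] for row in table}) for a in attrs}
    return sorted(attrs, key=lambda a: counts[a])

# Toy address table (illustrative values).
addresses = [
    {"street": "12 Oak St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "1 King St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "3 Elm Ave", "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "4 Main St", "city": "Palo Alto", "state": "CA", "country": "USA"},
    {"street": "9 Pine Rd", "city": "Toronto",   "state": "ON", "country": "Canada"},
    {"street": "7 Lake Dr", "city": "Ottawa",    "state": "ON", "country": "Canada"},
]
```

As the slide notes, the heuristic has exceptions (weekday has only 7 distinct values but does not belong above month), so the result should be reviewed by a domain expert.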
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five-number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
  s² = 1/(n−1) · Σᵢ₌₁ⁿ (xᵢ − x̄)² = 1/(n−1) · [ Σᵢ₌₁ⁿ xᵢ² − (1/n)(Σᵢ₌₁ⁿ xᵢ)² ]
  σ² = 1/N · Σᵢ₌₁ⁿ (xᵢ − μ)² = 1/N · Σᵢ₌₁ⁿ xᵢ² − μ²
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
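A Python sketch of the dispersion measures above (all names are my own; the percentile rule here is a simple nearest-rank variant, so the quartiles are approximate rather than interpolated):

```python
def five_number_summary(xs):
    xs = sorted(xs)
    def pct(p):
        # Nearest-rank percentile; production code would interpolate.
        return xs[min(len(xs) - 1, int(p * len(xs)))]
    return xs[0], pct(0.25), pct(0.5), pct(0.75), xs[-1]

def iqr_outliers(xs):
    # Flag values more than 1.5 * IQR beyond the quartiles.
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

def sample_variance(xs):
    # Scalable one-pass form: (sum(x^2) - (sum x)^2 / n) / (n - 1)
    n = len(xs)
    s, sq = sum(xs), sum(x * x for x in xs)
    return (sq - s * s / n) / (n - 1)
```

The one-pass variance form matters for large data: it needs only running sums of x and x², so it can be computed in a single scan without first knowing the mean.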
Conclusions
• Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
 Entity identification problem
 Remove redundancies
 Detect inconsistencies
• Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
• Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Review Questions
Objective Questions:
1) The types of information that can be garnered from data mining
include:
a) sequences, classifications, and clusters.
b) model-driven and data-driven.
c) associations and forecasts.
d) a and c.
e) a, b and c.
2) The term “associations” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
Review Questions Cont...
3)DSS assist management by combining ________ into a single
powerful system to support unstructured decision-making.
a) hardware and the Internet
b) data, analytical models and tools, and user-friendly software
c) analytical models and tools and data from the Internet
d) group decision processes and electronics
e) data and people
4)DSS, GDSS, and ESS are part of a special category of information
systems that are explicitly designed to:
a) make decisions for managers.
b) enhance Web performance.
c) gather data and build data warehouses.
d) enhance managerial decision-making.
e) interpret data for management.
Review Questions Cont...
5)The term “sequences” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
6) The earliest DSS tended to:
a) rely on Internet data.
b) draw on small subsets of corporate data.
c) be heavily model-driven.
d) b and c.
e) a and c.
Review Questions Cont...
7)The term “classifications” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
8) Model-driven DSS:
a) analyze large pools of data.
b) are an outgrowth of data mining.
c) use TPS and OLAP.
d) begin with a given group of data and change variables.
e) use events linked over time.
Review Questions Cont...
9)The term “forecasting” is associated with:
a) Occurrences linked to a single event.
b) Classifications when no groups have been defined.
c) Pattern recognition describing the group to which an item belongs.
d) A series of existing values used to predict other values.
e) Events linked over time.
10) A goal of data mining includes which of the following?
a) To explain some observed event or condition
b) To confirm that data exists
c) To analyze data for expected relationships
d) To create a new data warehouse
e) None of these
Review Questions Cont...
Short answer type Questions
1. Define data mining in two or three sentences
2. How is data mining different from OLAP?
3. Is the data warehouse prerequisite for data mining? Does the data
warehouse help data mining? If so, in what ways?
4. Name the three common problems of link analysis technique?
5. What is market basket analysis? Give two examples of this application
in business.
6. Give three broad reasons why you think data mining is being used in
today’s businesses.
7. What business problems can data mining help solve?
8. What is Predictive Analytics?
9. What is the difference between data mining and online analytical
processing (OLAP)?
10. State various benefits of Data mining.
Review Questions Cont...
Long answer type Questions
1. Describe how decision trees work. Explain with the help of an example.
2. What do you mean by KDD? Explain all the steps of KDD in detail.
3. What are the basic principles of genetic algorithms? Use the example
to describe how this technique works
4. Describe cluster detection technique?
5. Discuss Data mining Application in the field of Banking and finance.
6. Do neural networks and genetic algorithms have anything in common?
Point out differences.
7. How does the memory-based reasoning technique work? What is the
underlying principle?
8. Explain Neural Network in detail?
9. What are the golden rules for data mining?
10. Discuss Data mining Application in the field of Retail Industry.
Suggested Reading/References
1. Kamber and Han, “Data Mining Concepts and Techniques”, Harcourt India P. Ltd., 2001
2. Paul Raj Poonia, “Fundamentals of Data Warehousing”, John Wiley & Sons, 2003
3. Sam Anahory, “Data Warehousing in the real world: A practical guide for building decision support systems”, John Wiley, 2004
4. W. H. Inmon, “Building the Operational Data Store”, 2nd Ed., John Wiley, 1999
5. E. Alpaydin, Introduction to Machine Learning, 2nd ed., MIT Press, 2011
6. S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002
Suggested Reading/References
7. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
8. T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, John Wiley & Sons, 2003
9. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
10. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
11. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd ed., 2006 (3rd ed. 2011)
References
12. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction,
2nd ed., Springer-Verlag, 2009.
13. B. Liu, Web Data Mining, Springer 2006.
14. T. M. Mitchell, Machine Learning, McGraw Hill, 1997
15. P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data
Mining, Wiley, 2005
16. S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan
Kaufmann, 1998
17. I. H. Witten and E. Frank,
Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations,
Morgan Kaufmann, 2nd ed. 2005
References
18. D. P. Ballou and G. K. Tayi. Enhancing data quality in data
warehouse environments. Comm. of ACM, 42:73-78, 1999
19. T. Dasu and T. Johnson. Exploratory Data Mining and Data
Cleaning. John Wiley, 2003
20. T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining
Database Structure; Or, How to Build a Data Quality Browser.
SIGMOD’02
21. H. V. Jagadish et al., Special Issue on Data Reduction
Techniques. Bulletin of the Technical Committee on Data
Engineering, 20(4), Dec. 1997
22. D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann,
1999
23. E. Rahm and H. H. Do. Data Cleaning: Problems and Current
Approaches. IEEE Bulletin of the Technical Committee on Data
Engineering. Vol.23, No.4.
References
24. V. Raman and J. Hellerstein. Potter’s Wheel: An Interactive
Framework for Data Cleaning and Transformation, VLDB’2001
25. T. Redman. Data Quality: Management and Technology.
Bantam Books, 1992
26. R. Wang, V. Storey, and C. Firth. A framework for analysis of
data quality research. IEEE Trans. Knowledge and Data
Engineering, 7:623-640, 1995
Data Mining Techniques
Learning Objective
• Data Mining Query Language
• Major Data Mining Techniques and Benefits
• Data Mining Applications
Data Mining Query Language
• There are two powerful tools:
• Database Management Systems
• Efficient and effective data mining algorithms and
frameworks
• Generally, this work asks:
 “How can we merge the two?”
 “How can we integrate data mining more closely with
traditional database systems, particularly querying?”
 The answer lies in Data Mining Query Language
(DMQL).
Data Mining Query Languages
• Data mining language must be designed to facilitate
flexible and effective knowledge discovery.
• Having a query language for data mining may help
standardize the development of platforms for data mining
systems.
• But designing a language is challenging because data
mining covers a wide spectrum of tasks, and each task has
different requirements.
• Hence, the design of a language requires a deep
understanding of the limitations and underlying
mechanisms of the various kinds of tasks.
Cont…
• So… how would you design an efficient query language?
• Based on the primitives discussed earlier.
• DMQL allows mining of different kinds of knowledge from
relational databases and data warehouses at multiple
levels of abstraction.
Cont…
• DMQL commands specify the following:
• The set of data relevant to the data mining task (the training set)
• The kinds of knowledge to be discovered
• Generalized relation
• Characteristic rules
• Discriminant rules
• Classification rules
• Association rules
DMQL
• Adopts SQL-like syntax
• Hence, can be easily integrated with relational query
languages
• Defined in BNF grammar
• [ ] represents 0 or one occurrence
• { } represents 0 or more occurrences
• Words in sans serif represent keywords
DMQL Syntax
• DMQL syntax for task-relevant data specification
• Names of the relevant database or data warehouse,
conditions and relevant attributes or dimensions must be
specified
• use database ‹database_name› or use data warehouse
‹data_warehouse_name›
• from ‹relation(s)/cube(s)› [where condition]
• in relevance to ‹attribute_or_dimension_list›
• order by ‹order_list›
• group by ‹grouping_list›
• having ‹condition›
Example
Syntax for Kind of Knowledge to be Mined
Characterization
‹Mine_Knowledge_Specification› ::=
mine characteristics [as ‹pattern_name›]
analyze ‹measure(s)›
Example:
• mine characteristics as customerPurchasing analyze count%
Discrimination
‹Mine_Knowledge_Specification› ::=
mine comparison [as ‹ pattern_name›]
for ‹target_class› where ‹target_condition›
{versus ‹contrast_class_i where ‹contrast_condition_i›}
analyze ‹measure(s)›
• Example:
 mine comparison as purchaseGroups
  for bigspenders where avg(I.price) >= $100
  versus budgetspenders where avg(I.price) < $100
  analyze count
Syntax for Kind of Knowledge to be Mined (2)
• Association:
‹Mine_Knowledge_Specification› ::=
mine associations [as ‹pattern_name›]
[matching ‹metapattern›]
• Example: mine associations as buyingHabits
 matching P(X: customer, W) ^ Q(X,Y) => buys (X,Z)
• Classification:
‹Mine_Knowledge_Specification› ::=
mine classification [as ‹pattern_name›]
analyze ‹classifying_attribute_or_dimension›
• Example: mine classification as classifyCustomerCreditRating
 analyze credit_rating
Syntax for Concept Hierarchy Specification
• More than one concept hierarchy per attribute can be specified
• Use hierarchy ‹hierarchy_name› for ‹attribute_or_dimension›
• Examples:
• Schema concept hierarchy (ordering is important)
 define hierarchy location_hierarchy on address as
 [street, city, province_or_state, country]
• Set-grouping concept hierarchy
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for Concept Hierarchy Specification (2)
operation-derived concept hierarchy
 define hierarchy age_hierarchy for age on customer as
 {age_category(1), ..., age_category(5)} := cluster (default,
age, 5) < all(age)
rule-based concept hierarchy
 define hierarchy profit_margin_hierarchy on item as
  level_1: low_profit_margin < level_0: all
   if (price - cost) < $50
  level_1: medium_profit_margin < level_0: all
   if ((price - cost) > $50) and ((price - cost) <= $250)
  level_1: high_profit_margin < level_0: all
   if (price - cost) > $250
Syntax For Interestingness Measure
Specification
• with [‹interest_measure_name›] threshold =
‹threshold_value›
• Example:
with support threshold = 5%
with confidence threshold = 70%
Syntax for Pattern Presentation and
Visualization Specification
• display as ‹result_form›
• The result form can be rules, tables, cubes, crosstabs, pie or bar
charts, decision trees, curves or surfaces.
• To facilitate interactive viewing at different concept levels or different
angles, the following syntax is defined:
‹Multilevel_Manipulation› ::= roll up on ‹attribute_or_dimension›
 | drill down on ‹attribute_or_dimension›
 | add ‹attribute_or_dimension›
 | drop ‹attribute_or_dimension›
Major Data Mining Techniques
 Cluster Detection
 Decision Trees
 Memory-Based Reasoning
 Link Analysis
 Neural Networks
 Genetic Algorithms
Cluster Detection
• Cluster means forming groups.
• Clustering helps you take specific and proper action for the
individual pieces that make up the cluster.
• The algorithm searches for groups or clusters of data elements
that are similar to one another. This is because similar customers
or similar products are expected to behave in the same way.
• It is not always easy to discern the meaning of every cluster the
data mining algorithm formed. If there are two or three dimensions
or variables, it is fairly easy to spot the clusters. But while dealing
with 500 variables from 100,000 records, a special tool is needed.
Cluster Detection
If there are two variables, then points in a 2-D graph represent the values of sets of these two variables.
[Figure: scatter plot of customers, x-axis “Total value to the enterprise”, y-axis “Number of years as customer”]
Cluster Detection
•But if we want the algorithm to use 50 different variables for each customer,
we’ll have to have a point in 50-dimensional space.
•Suppose that the number of clusters or groups is 15. So, for the K-means
clustering algorithm, we’ll set K=15.
•15 initial records (“seeds”) are chosen as the first set of centroids based on best
guesses.
•In the next step, the algorithm assigns each customer record in the database to a
cluster based on the seed to which it is the closest. Now, we have the first set of 15
clusters. The value of the cluster is taken to be the values of the 50 variables in
each centroid.
•In the next iteration, each customer record is re-matched with the new sets of
centroids and cluster boundaries are redrawn.
•After a few iterations, the final clusters emerge.
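The iteration described above is essentially K-means. A compact sketch (with toy 2-D points and K=2 standing in for the 50 variables and K=15 of the example; all names are my own):

```python
import math
import random

def kmeans(records, k, iters=10, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k seed records as the initial centroids.
    centroids = rng.sample(records, k)
    for _ in range(iters):
        # Step 2: assign every record to its nearest centroid,
        # using Euclidean distance as the comparison method.
        clusters = [[] for _ in range(k)]
        for r in records:
            i = min(range(k), key=lambda c: math.dist(r, centroids[c]))
            clusters[i].append(r)
        # Step 3: recompute each centroid as the mean of its cluster,
        # i.e. redraw the cluster boundaries for the next pass.
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

After a few passes the assignments stop changing, which is the “final clusters emerge” step of the slide.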
Cluster Detection
[Figure: (1) Initial cluster boundaries based on initial seeds. (2) Centroids of new clusters calculated. (3) Cluster boundaries redrawn at each iteration. Markers distinguish initial seeds from calculated centroids.]
Cluster Detection
• How does the algorithm redraw the cluster boundaries?
• What factors determine that one customer record is near one
centroid and not the other?
• Each implementation of the cluster detection algorithms adopts a
method for comparing the values of the variables in individual
records with those in the centroids.
• The algorithm uses these comparisons to calculate the distances
of individual customer records from the centroids. After
calculating the distances, the algorithm redraws the cluster
boundaries.
Decision Trees
• This technique applies to classification and prediction.
• By following a tree, we can decipher the rules and understand
why a record is classified in a certain way.
• A decision tree represents a series of questions. Each question
determines what follow-up question is best to be asked.
• The question at the root must be the one that best
differentiates among the target classes. The leaf node
determines the classification of the record.
• A tree showing a high level of correctness is more effective.
• Also, attention must be paid to the branches. Some paths are
better than others because the rules are better. By pruning the
incompetent branches, you can enhance the predictive
effectiveness of the whole tree.
Decision Trees
• How do the decision tree algorithms build the trees?
• First, the algorithm attempts to find the test that will split the
records in the best possible manner among the wanted
classifications.
• At each lower level node from the root, whatever rule works best
to split the subset is applied. This process of finding each
additional level of the tree continues.
• The tree is allowed to grow until you cannot find better ways to
split input records.
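The “find the best split” step can be sketched using entropy as the splitting criterion (as mentioned on the earlier classification-analysis slide). The data, attribute names, and function names below are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    # Impurity of a set of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, attr_count):
    # Try every attribute/value test and keep the one with the highest
    # information gain (the biggest drop in class entropy).
    base = entropy(labels)
    best = (0.0, None, None)  # (gain, attribute index, test value)
    for a in range(attr_count):
        for v in {r[a] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[a] == v]
            right = [l for r, l in zip(rows, labels) if r[a] != v]
            if not left or not right:
                continue
            rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            if base - rem > best[0]:
                best = (base - rem, a, v)
    return best
```

A tree algorithm applies this search at the root, then recursively to each resulting subset, stopping when no split improves the classification, which is exactly the growth process described above.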
Memory-Based Reasoning
MBR uses known instances of a model to predict unknown
instances.
This data mining technique maintains a dataset of known
records. The algorithm knows the characteristics of the
records in this training dataset.
When a new record arrives at the data mining tool, first the
tool calculates the “distance” between this record and the
records in the training dataset using its distance function.
The results determine which data records in the training
dataset qualify to be considered as neighbours to the
incoming data records.
Next, the algorithm uses a combination function to combine
the results of the various distance functions to obtain the final
answer.
Memory-Based Reasoning
• For solving a data mining problem using MBR, we are
concerned with three critical issues:
• Selecting the most suitable historical records to form the
training dataset.
• Establishing the best way to compose the historical record.
• Determining the two essential functions, namely, the distance
function and the combination function.
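Both essential functions can be shown in a small nearest-neighbour sketch. This is illustrative Python with an invented training dataset: Euclidean distance as the distance function and a majority vote as the combination function.

```python
import math
from collections import Counter

# Training dataset of known records: (age, income in thousands) -> class
training = [
    ((25, 30), "standard"),
    ((30, 45), "standard"),
    ((45, 80), "gold"),
    ((50, 95), "gold"),
]

def distance(a, b):
    """Distance function: Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(new_record, k=3):
    """Combination function: majority vote among the k nearest neighbours."""
    neighbours = sorted(training, key=lambda rec: distance(rec[0], new_record))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(classify((48, 85)))  # the nearest known records are mostly "gold"
```

Choosing how records are composed (which fields, how they are scaled) changes the distances and therefore the neighbours, which is why record composition is listed above as a critical issue.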
Link Analysis
• This algorithm is extremely useful for finding patterns from
relationships.
• The link analysis technique mines relationships and discovers
knowledge.
• For example, if the fast food restaurant owner in the case study
applies the link analysis technique to mine data from the data
warehouse, he might find that in more than 80% of the cases,
customers order a soft drink if they order a pizza. The restaurant
owner can then analyse the link between the two products and
promote them together.
• Depending upon the types of knowledge discovery, link analysis
techniques have three types of applications: associations
discovery, sequential pattern discovery and similar time
sequence discovery.
Link Analysis
Associations Discovery:
• These algorithms find combinations where the presence of one item suggests
the presence of another.
• When we apply these algorithms to the daily sales of the fast food restaurant,
they will uncover affinities among menu items that are likely
to be ordered together.
• Example rule (from the slide diagram): the rule body is "whenever the
customer orders a pizza"; the rule head is "the customer also orders a soft
drink", which holds in 65% of the cases (the confidence factor); the
pizza-and-soft-drink combination occurs in 20% of all orders (the support
factor).
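Support and confidence factors can be computed directly from order data. A minimal sketch in illustrative Python, with invented orders (so the percentages here differ from the 65%/20% in the diagram):

```python
# Each order is the set of items bought together in one visit
orders = [
    {"pizza", "soft drink"},
    {"pizza", "soft drink"},
    {"pizza", "garlic bread"},
    {"burger", "soft drink"},
    {"burger", "fries"},
    {"pizza", "soft drink", "fries"},
    {"salad"},
    {"fries", "soft drink"},
    {"pizza", "soft drink"},
    {"burger"},
]

def support(itemset):
    """Support factor: fraction of all orders containing every item."""
    return sum(1 for o in orders if itemset <= o) / len(orders)

def confidence(body, head):
    """Confidence factor: how often the head appears given the body appears."""
    return support(body | head) / support(body)

print(support({"pizza", "soft drink"}))       # 4 of 10 orders -> 0.4
print(confidence({"pizza"}, {"soft drink"}))  # 4 of 5 pizza orders -> 0.8
```

The 0.8 confidence mirrors the case-study finding that customers order a soft drink with a pizza in more than 80% of cases.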
Link Analysis
Sequential Pattern Discovery:
These algorithms discover patterns where one set of items follows another
specific set. Time plays a role in these patterns. When we select records for
analysis, we must have date and time as data items to enable discovery of
sequential patterns.
For example, consider the transaction data file given below:

SALE DATE    NAME OF CUSTOMER    PRODUCTS PURCHASED
15/11/2000   ABC                 Desktop PC, MP3 Player
15/11/2000   DEF                 Desktop PC, MP3 Player, Digital Camera
15/11/2000   EFG                 Laptop PC
19/12/2000   GHI                 Laptop PC
19/12/2000   ABC                 Digital Camera
19/12/2000   GHI                 Digital Camera
19/12/2000   EFG                 Digital Camera
20/12/2000   DEF                 Tape Backup Drive
20/12/2000   XYZ                 Desktop PC, MP3 Player
Link Analysis
Sequential Patterns -- Customer Sequences

NAME OF CUSTOMER    PRODUCT SEQUENCE FOR CUSTOMER
ABC                 Desktop PC, MP3 Player, Digital Camera
DEF                 Desktop PC, MP3 Player, Digital Camera, Tape Backup Drive
EFG                 Laptop PC, Digital Camera
GHI                 Laptop PC, Digital Camera
XYZ                 Desktop PC, MP3 Player

Sequential Patterns (Support Factor >60%)      Supporting Customers
Desktop PC, MP3 Player                         ABC, DEF, XYZ

Sequential Patterns (Support Factor >40%)      Supporting Customers
Desktop PC, MP3 Player, Digital Camera         ABC, DEF
Laptop PC, Digital Camera                      EFG, GHI
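The customer-sequence table above can be mined with a short subsequence check. A minimal sketch in illustrative Python: a pattern is supported by a customer if it occurs in that customer's purchase sequence in order (not necessarily contiguously).

```python
# Product sequence per customer, from the customer-sequence table
sequences = {
    "ABC": ["Desktop PC", "MP3 Player", "Digital Camera"],
    "DEF": ["Desktop PC", "MP3 Player", "Digital Camera", "Tape Backup Drive"],
    "EFG": ["Laptop PC", "Digital Camera"],
    "GHI": ["Laptop PC", "Digital Camera"],
    "XYZ": ["Desktop PC", "MP3 Player"],
}

def contains(seq, pattern):
    """True if pattern occurs in seq as an in-order subsequence."""
    it = iter(seq)                       # 'item in it' consumes the iterator,
    return all(item in it for item in pattern)  # so order is enforced

def supporters(pattern):
    """Customers whose purchase sequence supports the pattern."""
    return sorted(c for c, seq in sequences.items() if contains(seq, pattern))

pat = ["Desktop PC", "MP3 Player"]
sup = supporters(pat)
print(sup, len(sup) / len(sequences))  # ['ABC', 'DEF', 'XYZ'] with support 0.6
```

This reproduces the table: the pattern (Desktop PC, MP3 Player) is supported by ABC, DEF and XYZ, a support factor of 60%.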
Link Analysis
Typical discoveries include associations of the following types:
• Purchase of a digital camera is followed by purchase of a colour printer 60%
of the time
• Purchase of a desktop is followed by purchase of a tape backup drive 65% of
the time
Similar Time Sequence Discovery:
• This technique depends on the availability of time sequences.
• The results of the previous technique indicate sequential events over time.
This technique finds a sequence of events and then comes up with other
similar sequences of events.
Neural Networks
“A type of artificial intelligence that attempts to
imitate the way a human brain works”
Cont…
 Neural networks resemble the human brain in the following
two ways:
 A neural network acquires knowledge through learning.
 A neural network's knowledge is stored within inter-neuron
connection strengths known as synaptic weights.
Basic Neural Network Structure
Cont…
 Input Layer: Consists of neurons that receive input from
external environment
 Output Layer: Consists of neurons that communicate to
the user or external environment
 Hidden Layer: Consists of neurons that only communicate
with other layers of the network
Neural Network Models
• Supervised: network given facts about various cases along
with expected outputs
• Unsupervised: network receives only inputs and no
expected outputs
Data Mining Process Based on Neural Networks
Data Preparation
• Data Cleansing
• Data Option
• Data Processing
• Data Expression
Rule Extraction
• Extraction of hidden predictive information from large
databases
• Some rule-extraction methods: the LRE method, the Black Box
method, etc.
Rule Assessment
• Process of extracting and collecting evidence and making
judgments
• Tells how well a rule can achieve the intended output
Implementing Neural Networks Using MATLAB
MATLAB
• Matrix Laboratory
• High-level technical computing language
• Programming environment for algorithm development, data
analysis, visualization, and numerical computation
MATLAB Applications
• Signal and image processing
• Communications
• Control design
• Test and measurement
• Financial modeling and analysis
• Computational biology
• Neural networks
Neural Network Process Using MATLAB
Some Important Terms
• Training Function
• Adaption Learning Function
• Performance Function
• Transfer Function
• Network Simulation
• Feed-forward neural networks
• Back-propagation
Training Function
 Mathematical procedures used to automatically adjust the
network's weights and biases
 Some backpropagation training functions:
 TRAINLM
 TRAINOSS
 TRAINGDX
 TRAINBFG etc…
Adaption Learning Function
 Used for learning. It can be applied to individual weights
and biases within a network.
 Some functions:
 LEARNGDM
 LEARNHD
 LEARNPN
 LEARNSOM etc…
Performance Function
 Used for comparing the observed and inferred outputs for a
data sample
 Some of the functions:
 MAE: Mean absolute error performance function
 MSE: Mean squared normalized error performance function
 SSE: Sum squared error performance function
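The three error measures are straightforward to state in code. A minimal sketch in illustrative Python (plain formulas, not the Toolbox implementations), with invented target and output values:

```python
def mae(targets, outputs):
    """Mean absolute error."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)

def mse(targets, outputs):
    """Mean squared error."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def sse(targets, outputs):
    """Sum squared error."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

t = [0, 1, 1, 0]           # desired (observed) outputs
o = [0.1, 0.8, 0.9, 0.2]   # network (inferred) outputs
print(mae(t, o), mse(t, o), sse(t, o))  # approx. 0.15, 0.025, 0.1
```

Note SSE is simply MSE before dividing by the number of samples, which is why a small SSE goal (as set in the training walkthrough below) drives the fit very tight.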
Transfer Function
 Used to describe the system with all input-output pairs.
 Calculate a layer's output from its net input.
 Some functions:
 TANSIG
 LOGSIG
 PURELIN etc…
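The three transfer functions have simple closed forms; TANSIG is mathematically equivalent to the hyperbolic tangent. A minimal sketch in illustrative Python:

```python
import math

def tansig(n):
    """Hyperbolic tangent sigmoid: output in (-1, 1)."""
    return math.tanh(n)

def logsig(n):
    """Log-sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def purelin(n):
    """Linear transfer function: output equals the net input."""
    return n

print(tansig(0.0), logsig(0.0), purelin(2.5))  # 0.0 0.5 2.5
```

The choice matters: TANSIG in both layers (as in the XOR walkthrough below) lets the network output values near -1 and 1, while LOGSIG restricts outputs to (0, 1).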
Network Simulation
 Way of testing on the network to see if it meets our
expectations
Feed-forward Neural Networks
 First and simplest type of artificial neural network
 The information moves in only one direction, forward, from
the input nodes, through the hidden nodes (if any) and to
the output nodes
 There are no cycles or loops in the network
Back-propagation
 Back-propagation is a common method of training artificial
neural networks so as to minimize the objective function.
 It is a systematic method of training multi-layer artificial neural
networks.
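The mechanics can be sketched for the XOR problem used in the walkthrough below. This is illustrative Python implementing plain gradient-descent back-propagation on a 2-2-1 network (tanh hidden layer, linear output); the network size, learning rate, and initialization are assumptions, and this is simple gradient descent, not the Levenberg-Marquardt (TRAINLM) algorithm the GUI uses.

```python
import math
import random

random.seed(1)

# XOR training data: the same P and T used in the nntool walkthrough
P = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 0]

# 2-2-1 network with small random weights; each row is [w_in1, w_in2, bias]
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden
w2 = [random.uniform(-1, 1) for _ in range(3)]                      # output

def forward(x):
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    y = w2[0] * h[0] + w2[1] * h[1] + w2[2]  # linear output neuron
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(P, T)) / len(P)

initial = loss()
lr = 0.05
for _ in range(3000):
    for x, t in zip(P, T):
        h, y = forward(x)
        err = y - t
        # back-propagate: hidden deltas use the pre-update output weights
        deltas = [err * w2[j] * (1 - h[j] ** 2) for j in range(2)]
        for j in range(2):                     # update output-layer weights
            w2[j] -= lr * err * h[j]
        w2[2] -= lr * err
        for j in range(2):                     # update hidden-layer weights
            w1[j][0] -= lr * deltas[j] * x[0]
            w1[j][1] -= lr * deltas[j] * x[1]
            w1[j][2] -= lr * deltas[j]

final = loss()
print(round(initial, 4), round(final, 4))  # error falls as training proceeds
```

Each pass propagates the output error backwards through the layers and nudges every weight against its error gradient, which is exactly what the Toolbox's training functions do with more sophisticated step-size control.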
Process of XOR Network
 Design
 Training
 Simulation
Steps to Implement a XOR Network in MATLAB
 Open the MATLAB toolbox
 To begin using the NN GUI:
 >> nntool
Design Phase
Cont…
Let:
Input :
P = [0 0 1 1; 0 1 0 1]
Target/ Output:
T = [0 1 1 0]
Click on New Data
Cont…
Cont…
 Click on Create to confirm
 Now, to create an XORNet, click on New Network
 Set the parameters as follows:
 Network Type = Feedforward Backprop
 Input Ranges = [0 1; 0 1]
 Train Function = TRAINLM
 Adaption Learning Function = LEARNGDM
 Performance Function = MSE
 Number of Layers = 2
Cont…
 Set Layer 1 properties as:
 Number of Neurons = 2
 Transfer Function = TANSIG
 Set Layer 2 properties as:
 Number of Neurons = 1
 Transfer Function = TANSIG
 Confirm by hitting the create button
Cont…
Network Training
 Highlight XORNet with One click
 Click on Train button
 On Training Info, select P as Inputs and T as Targets
Cont…
Cont…
 On Training Parameters, set:
 epochs = 1000 (train the network for a longer duration)
 goal = 0.000000000000001 (for a precise result)
 max_fail = 50
 Hit Train Network
Cont…
Cont…
Cont…
 To confirm the XORNet structure and the values of the various
weights and biases of the trained network, click on View in the
Network/Data Manager window
Cont…
Network Simulation
 Create new test data S = [1; 0] and follow the same procedure
as before (as for input P)
Cont…
 Again click on XORNet and then click on Simulate button
on the Network Manager.
 Select S as the Inputs
 Type in XORNet_outputSim as Outputs
 Hit the Simulate Network button
Cont…
Cont…
 Check the result of XORNet_outputSim on the NN Network
Manager by clicking View
Neural Networks
 Neural networks mimic the human brain by learning from a
training dataset and applying the learning to generate
patterns for classification and prediction.
 These algorithms are effective when the data is shapeless
and lacks any apparent pattern.
 The basic unit of a neural network is called a node and is one
of the two main structures of the neural network model. The
other structure is the link between these nodes.
Cont…
Neural Network Model (figure): values for the input variables arrive at the
input nodes; the input values are weighted along the links; the output from
each node becomes the input to the next node; the output node delivers the
discovered value for the output variable.
Cont…
Neural Network for pre-approval of Gold Credit Card (figure): the input
Age = 35 is scaled to 0.35 and enters with weight 0.9; the input
Income = $75,000 is scaled to 0.75 and enters with weight 1.0. The node
combines the weighted inputs as 0.35 × 0.9 + 0.75 × 1.0 = 1.065, producing
the output "Upgrade to Gold Credit Card: Pre-approved".
Genetic Algorithms
 Genetic algorithms apply the principle of ‘natural selection
and survival of the fittest’ to data mining.
 This technique uses a highly iterative process of selection,
cross-over and mutation operators to evolve successive
generations of models.
 At each iteration, every model competes with every other by
inheriting traits from previous generations, until only the
most predictive model survives.
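The selection, cross-over and mutation loop can be sketched on the coupon question from the case study below. This is illustrative Python: the profit curve, population size, rates and ranges are all invented for the example.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def profit(coupons):
    """Hypothetical profit for a mailer with this many coupons: too few
    attract little extra business, too many give away revenue."""
    return 100 - (coupons - 13) ** 2

def evolve(generations=20, pop_size=10):
    # First generation: random coupon counts per mailer, 0 to 40
    pop = [random.randint(0, 40) for _ in range(pop_size)]
    history = [max(profit(c) for c in pop)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged
        pop.sort(key=profit, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = (a + b) // 2                  # cross-over: blend two parents
            if random.random() < 0.3:             # mutation: small random change
                child = max(0, child + random.randint(-3, 3))
            children.append(child)
        pop = survivors + children
        history.append(max(profit(c) for c in pop))
    return max(pop, key=profit), history

best, history = evolve()
print(best, profit(best))
```

Because the fittest candidates survive each generation unchanged, the best profit in the population can never decrease across generations, which is the sense in which only the most predictive model survives.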
Cont…
• Example: consider the fast food restaurant case study.
• Suppose the owner wants to do a promotional mailing and wants to
include free coupons in the mailing, with the goal of increasing profits. At
the same time, the promotional mailing must not produce the opposite result
of lost revenue.
• The question is: what is the optimum number of coupons to be placed in
each mailer to maximize profits?
• The slide diagram shows candidate coupon counts evolving over a First,
Second, and Third Generation until the best value survives.
Comparison
Data Mining Technique   | Underlying Structure                                         | Basic Process                                        | Validation Method
Cluster Detection       | Distance calculation in n-vector space                       | Grouping of values in the same neighbourhood         | Cross-validation to verify accuracy
Decision Trees          | Binary tree                                                  | Splits at decision points based on entropy           | Cross-validation
Memory-Based Reasoning  | Predictive structure based on distance and combination funcs | Association of unknown instances with known instances | Cross-validation
Link Analysis           | Based on linking of variables                                | Discover links among variables by their values       | Not applicable
Neural Networks         | Forward propagation network                                  | Weighted inputs of predictors at each node           | Not applicable
Genetic Algorithms      | Not applicable                                               | Survival of the fittest on mutation of derived values | Mostly cross-validation
Review Questions
Objective Questions:
1) The types of information that can be garnered from data mining
include:
a) sequences, classifications, and clusters.
b) model-driven and data-driven.
c) associations and forecasts.
d) a and c.
e) a, b and c.
2) The term “associations” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
Review Questions cont..
3)DSS assist management by combining ________ into a single
powerful system to support unstructured decision-making.
a) hardware and the Internet
b) data, analytical models and tools, and user-friendly software
c) analytical models and tools and data from the Internet
d) group decision processes and electronics
e) data and people
4)DSS, GDSS, and ESS are part of a special category of information
systems that are explicitly designed to:
a) make decisions for managers.
b) enhance Web performance.
c) gather data and build data warehouses.
d) enhance managerial decision-making.
e) interpret data for management.
Review Questions cont..
5)The term “sequences” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item
belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
6)The earliest DSS tended to:
a) rely on Internet data.
b) draw on small subsets of corporate data.
c) be heavily model-driven.
d) b and c.
e) a and c.
Review Questions cont..
7)The term “classifications” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item
belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
8)Model-driven DSS:
a) analyze large pools of data.
b) are an outgrowth of data mining.
c) use TPS and OLAP.
d) begin with a given group of data and change variables.
e) use events linked over time.
Review Questions cont..
9)The term “forecasting” is associated with:
a) Occurrences linked to a single event.
b) Classifications when no groups have been defined.
c) Pattern recognition describing the group to which an item
belongs.
d) A series of existing values used to predict other values.
e) Events linked over time.
10)A goal of data mining includes which of the following?
a) To explain some observed event or condition
b) To confirm that data exists
c) To analyze data for expected relationships
d) To create a new data warehouse
e) None of these
Review Questions cont..
Short answer type Questions
1. Define data mining in two or three sentences.
2. How is data mining different from OLAP?
3. Is the data warehouse a prerequisite for data mining?
Does the data warehouse help data mining? If so, in
what ways?
4. Name the three common problems of the link analysis
technique.
5. What is market basket analysis? Give two
examples of this application in business.
Review Questions cont..
6. Give three broad reasons why you think data
mining is being used in today’s businesses.
7. What business problems can data mining help
solve?
8. What is Predictive Analytics?
9. What is the difference between data mining and online
analytical processing (OLAP)?
10. State various benefits of Data mining.
Review Questions cont..
Long answer type Questions
1. Describe how decision trees work. Explain with
the help of an example.
2. What do you mean by KDD? Explain all the steps
of KDD in detail.
3. What are the basic principles of genetic
algorithms? Use an example to describe how this
technique works.
4. Describe the cluster detection technique.
5. Discuss Data mining Application in the field of
Banking and finance.
Review Questions cont..
6. Do neural networks and genetic algorithms have
anything in common? Point out differences.
7. How does the memory-based reasoning technique
work? What is the underlying principle?
8. Explain neural networks in detail.
9. What are the golden rules for data mining?
10. Discuss Data mining Application in the field of
Retail Industry.
Suggested Reading/References
1. Paulraj Ponniah, "Data Warehousing Fundamentals",
John Wiley & Sons, 2003.
2. Sam Anahory, "Data Warehousing in the Real World: A
Practical Guide for Building Decision Support Systems",
John Wiley, 2004.
3. W. H. Inmon, "Building the Operational Data Store", 2nd Ed.,
John Wiley, 1999.
4. Han and Kamber, "Data Mining: Concepts and Techniques",
Harcourt India P. Ltd., 2001.