DATA WAREHOUSING and Mining
May 7, 2017

This session

0. Introduction
- Evolution of database technology
- What is a data warehouse?
- Motivation: why data mining?
- What is data mining?
- Data mining: on what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Major issues in data mining

I. Data Preprocessing
- Need for preprocessing the data
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation

Evolution of Database Technology
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Short History of Data Mining
- 1989: the term KDD (Knowledge Discovery in Databases) appears at an IJCAI workshop
- 1991: a collection of research papers edited by Piatetsky-Shapiro and Frawley
- 1993: association rule mining proposed by Agrawal, Imielinski and Swami (the Apriori algorithm itself followed in 1994, from Agrawal and Srikant)
- 1996-present: KDD evolves as a conjunction of different knowledge areas (databases, machine learning, statistics, artificial intelligence) and the term "data mining" becomes popular

Of "Laws", Monsters, and Giants...
- Moore's law: processing "capacity" doubles every 18 months (CPU, cache, memory)
- Its more aggressive cousin: disk storage "capacity" doubles every 9 months
- What do the two "laws" combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.

[Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 TB (approaching an exabyte). Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Data, Data everywhere yet ...
- I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data is poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

[Figure: From Data to Knowledge, a series of steps over time (1970s, 1980s, 1990s, 2000): Data; Statistics & Reporting; DWH; OLAP/ROLAP; Data Mining; Pattern Warehousing; Knowledge Refinement]

What motivated data mining? Why is it so important?
- Data mining has attracted a great deal of attention in the information industry in recent years mainly because of the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge.
- Data mining can be viewed as a result of the natural evolution of information technology.
- That evolution has passed through the following stages: data collection and database creation; data management (including data storage and retrieval, and database transaction processing); and data analysis and understanding (involving data warehousing and data mining).

Evolution of Sciences
- Before 1600: empirical science
- 1600-1950s: theoretical science
  - Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
- 1950s-1990s: computational science
  - Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
  - Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
- 1990-now: data science
  - The flood of data from new scientific instruments and simulations
  - The ability to economically store and manage petabytes of data online
  - The Internet and computing Grid that make all these archives universally accessible
  - Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
- Source: Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: Data mining, data warehousing, multimedia databases, and Web databases
- 2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

Knowledge Discovery (KDD) Process
- Data mining is the core of the knowledge discovery process

[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps in KDD Process
1. Data cleaning: remove noise and inconsistent data
2. Data integration: multiple data sources may be combined
3. Data selection: data relevant to the analysis task are retrieved from the database
4. Data transformation: data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations
5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measures
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user

Architecture of a Typical Data Mining System

[Figure: layers, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine (with Knowledge base); Database or data warehouse server; Data cleaning, data integration and filtering; Databases and Data Warehouse]

- Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
- Knowledge base: the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
- Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
- Pattern evaluation module: typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.
- Graphical user interface: communicates between the users and the data mining system, allowing the user to interact with the system by specifying a query or task.

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") ^ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
  - contains(T, "computer") → contains(T, "software") [1%, 75%]
  - Support: the number of times an item set appears in the database (here given as a percentage of all transactions).
  - Confidence of the rule A → B: of the transactions in which A occurs, the share in which B also occurs. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred.

Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses

Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns: mining query optimization

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Classification of Data Mining Systems
- According to the kinds of databases mined: data models (relational, transactional, object-relational) and types of data
- According to the kinds of knowledge mined: association, classification, clustering, ...
- According to the kinds of techniques utilized: techniques can be described according to the degree of user interaction involved
- According to the applications adapted: finance, telecommunications, DNA, stock markets, e-mail, and so on

Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data security, integrity, and privacy

What is Data Warehousing?
- A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996]

[Figure: Data transformed into Information]

Very Large Data Bases
- Terabytes, 10^12 bytes: Walmart, 24 terabytes
- Petabytes, 10^15 bytes: geographic information systems
- Exabytes, 10^18 bytes: national medical records
- Zettabytes, 10^21 bytes: weather images
- Yottabytes, 10^24 bytes: intelligence agency videos

What is a Data Warehouse?
- A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

Data Warehousing: It Is a Process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database

What is a Data Warehouse?
- Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Data warehousing: the process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  - E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse
  - contains an element of time, explicitly or implicitly
  - but the key of operational data may or may not contain a "time element"

Data Warehouse: Non-Volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS
- Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of heterogeneous databases
  - Query-driven approach
    - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    - Complex information filtering, compete for resources
- Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date; detailed,         historical; summarized, multidimensional;
                     flat relational; isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write; index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response

Why Separate Data Warehouse?
- High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
- Different functions and different data:
  - missing data: decision support requires historical data, which operational DBs do not typically maintain
  - data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled

Typical Process Flow Within a Data Warehouse

[Figure: Process flow within a data warehouse: Source -> extract and load -> Warehouse (data transformation and movement; archive data) -> query -> Users]

1. Extract and load the data
2. Clean and transform data into a form that can cope with large data volumes and provide good query performance
3. Back up and archive data
4. Manage queries and direct them to the appropriate data sources

Extract and Load Process
1. Controlling the process
   - Determine when to start extracting the data
2. When to initiate the extract
   - Data should be in a consistent state
   - Start extracting data from a data source when it represents the same snapshot of time as all the other data sources
3. Loading the data
   - Do not execute consistency checks until all the data sources have been loaded into the temporary data store
4. Copy management tools and data cleanup

Clean and Transform Data
1. Clean and transform the data. Data needs to be cleaned and checked in the following ways:
   - Make sure data is consistent within itself
   - Make sure data is consistent with other data within the same source
   - Make sure data is consistent with data in the other source systems
   - Make sure data is consistent with the information already in the warehouse
2. Transforming into an effective structure
   - Once the data has been cleaned, convert the source data in the temporary data store into a structure that is designed to balance query performance and operational cost

Backup and Archive Process
- The data within the data warehouse is backed up regularly so that the data warehouse can always be recovered from data loss, software failure or hardware failure.

Query Management Process
- System process that manages the queries and speeds them up by directing queries to the most effective data source
- Directing queries to the suitable tables
- Maximizing system resources
- Query capture
  - Query profiles change on a regular basis
  - To accurately monitor and understand what the new query profiles are, it can be very effective to capture the physical queries that are being executed

Design of a Data Warehouse: A Business Analysis Framework
- Four views regarding the design of a data warehouse
  - Top-down view: allows selection of the relevant information necessary for the data warehouse
  - Data source view: exposes the information being captured, stored, and managed by operational systems
  - Data warehouse view: consists of fact tables and dimension tables
  - Business query view: sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From the software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with short turnaround time
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record

Multi-Tiered Architecture

[Figure: four tiers, left to right. Data Sources: operational DBs and other sources, fed through extract, transform, load, refresh. Data Storage: data warehouse and data marts, with metadata and a monitor & integrator. OLAP Engine: OLAP server serving the warehouse data. Front-End Tools: analysis, query, reports, data mining]

Three Data Warehouse Models
- Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
- Data mart
  - a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  - Independent vs. dependent (directly from warehouse) data mart
- Virtual warehouse
  - A set of views over operational databases
  - Only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

[Figure: define a high-level corporate data model; from it, build data marts and the enterprise data warehouse with model refinement at each step; these lead to distributed data marts and, finally, a multi-tier data warehouse]

OLAP Server Architectures
- Relational OLAP (ROLAP)
  - Use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  - Include optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - Specialized support for SQL queries over star/snowflake schemas

Why Data Mining?
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

What Is Data Mining?
- Data mining (knowledge discovery in databases):
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their "inside stories":
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering

Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information

Market Analysis and Management (2)
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports
  - statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
- Resource planning
  - summarize and compare the resources and spending
- Competition
  - monitor competitors and market directions
  - group customers into classes and set a class-based pricing procedure
  - set pricing strategy in a highly competitive market

Fraud Detection and Management (1)
- Applications
  - widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients and rings of doctors and rings of references

Fraud Detection and Management (2)
- Detecting inappropriate medical treatment
  - Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process

[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Data Mining and Business Intelligence

[Figure: pyramid with increasing potential to support business decisions from bottom to top. Layers, bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (statistical analysis, querying and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. Roles, bottom to top: DBA, Data Analyst, Business Analyst, End User]

Architecture of a Typical Data Mining System

[Figure: layers, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine (with Knowledge base); Database or data warehouse server; Data cleaning, data integration and filtering; Databases and Data Warehouse]

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") ^ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
  - contains(T, "computer") → contains(T, "software") [1%, 75%]

Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses

Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
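
The two objective measures named above, support and confidence, can be computed directly from a list of transactions. The following is a minimal sketch, not part of the original slides: the function names `support` and `confidence` and the four toy transactions are assumptions made for illustration, and support is taken here as a fraction of all transactions rather than a raw count.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)  # subset test
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Share of transactions containing the antecedent that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(transactions, a)
    if sup_a == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, both) / sup_a


# Hypothetical toy data, invented for this example.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer", "software", "printer"}),
    frozenset({"computer"}),
    frozenset({"printer"}),
]

# Rule: contains(T, "computer") -> contains(T, "software")
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data, two of the four transactions contain both items, and two of the three transactions containing "computer" also contain "software". A mining system would normally keep only rules whose support and confidence exceed user-chosen thresholds.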
Can We Find All and Only Interesting Patterns?
* Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
* Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approach 1: first generate all the patterns and then filter out the uninteresting ones
  - Approach 2: generate only the interesting patterns (mining query optimization)

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines.]

Data Mining: Classification Schemes
* General functionality
  - Descriptive data mining
  - Predictive data mining
* Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted

MULTIDIMENSIONAL DATA
* Analyze data by representing facts and dimensions within a multidimensional cube.
* Viewing information as a cube lends itself to statistical operations and aggregations, which are computed by applying functions against planes of the cube.
* For example, in a retail sales analysis data warehouse, products by store by day are represented by a three-dimensional cube.
[Figure: product by store by day cube, with axes Time, Location, and Product.]
* The point of intersection of all axes represents the actual number of sales for a specific product, in a specific store, on a specific day.

Some operations in the multidimensional data model
* Roll-up (drill-up): performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
* Drill-down: the reverse of roll-up. It navigates from less detailed data to more detailed data.
* Slice: performs a selection on one dimension of the given cube, resulting in a sub-cube.
* Dice: defines a sub-cube by performing a selection on two or more dimensions.
* Pivot (rotate): a visualization operation that rotates the data axes in a view in order to provide an alternative presentation of the data.

[Figure: example sales cube with dimensions Location (cities: Chicago, NY, Toronto, Vancouver), Time (quarters: Q1 to Q4), and Items (types: home entertainment, computer, phone, security). Panels show a dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”), a slice for time = “Q1”, and a pivot of the resulting view.]

[Figure: the same cube after roll-up on location (from cities to countries: USA, Canada) and after drill-down on time (from quarters to months).]

A Multi-Dimensional View of Data Mining Classification
* Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
* Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
* Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
* Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
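The cube operations described earlier (slice, dice, roll-up) can be sketched on a tiny in-memory cube. This is only an illustration: the cell values are made up, the cube is stored as a plain dictionary keyed by (location, time, item), and the helper names are hypothetical. A real OLAP engine would work on precomputed aggregates rather than scanning cells.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Cell = Tuple[str, str, str]          # (location, time, item)
Cube = Dict[Cell, int]               # cell -> units sold

DIMS = ("location", "time", "item")  # position of each dimension in a cell key

# Concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "NY": "USA",
                   "Toronto": "Canada", "Vancouver": "Canada"}

# Made-up sales figures, for illustration only.
CUBE: Cube = {
    ("Chicago", "Q1", "computer"): 10,
    ("NY", "Q1", "computer"): 20,
    ("Toronto", "Q1", "computer"): 30,
    ("Toronto", "Q2", "phone"): 5,
    ("Vancouver", "Q2", "computer"): 40,
}


def slice_cube(cube: Cube, dim: str, value: str) -> Cube:
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube: Cube, allowed: Dict[str, Set[str]]) -> Cube:
    """Dice: select sets of values on two or more dimensions."""
    idx = {DIMS.index(d): vals for d, vals in allowed.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in vals for i, vals in idx.items())}


def roll_up(cube: Cube, dim: str, hierarchy: Dict[str, str]) -> Cube:
    """Roll-up: climb one level of a concept hierarchy and sum the cells
    that collapse onto the same higher-level key."""
    i = DIMS.index(dim)
    out: Dict[Cell, int] = defaultdict(int)
    for key, value in cube.items():
        parent_key = key[:i] + (hierarchy[key[i]],) + key[i + 1:]
        out[parent_key] += value
    return dict(out)
```

For instance, `roll_up(CUBE, "location", CITY_TO_COUNTRY)` merges the Chicago and NY cells for Q1 computers into a single USA cell. Drill-down is not shown because it needs the more detailed data to be available, and pivot only changes how a view is laid out, not the cell values.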
OLAP Mining: An Integration of Data Mining and Data Warehousing
* Coupling of data mining systems, DBMS, and data warehouse systems
  - No coupling, loose coupling, semi-tight coupling, tight coupling
* On-line analytical mining of data
  - Integration of mining and OLAP technologies
* Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
* Integration of multiple mining functions
  - Characterized classification, first clustering and then association

Data Warehouse Usage
* Three kinds of data warehouse applications
  - Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  - Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations, slice-dice, drilling, pivoting
  - Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
* Differences among the three tasks

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
* Why online analytical mining?
  - High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data
  - Available information processing structure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks
* Architecture of OLAM

An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 4, user interface: user GUI API, which accepts mining queries and returns mining results. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine on top of the data cube API. Layer 2, MDDB: multidimensional database with metadata, populated by filtering and integration through the database API. Layer 1, data repository: databases and data warehouse, with data cleaning, data integration, and filtering.]

Major Issues in Data Mining (1)
* Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
* Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)
* Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
* Issues related to applications and social impacts
  - Application of discovered knowledge: domain-specific data mining tools, intelligent query answering, process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy

Summary
* Data mining: discovering interesting patterns from large amounts of data
* A natural evolution of database technology, in great demand, with wide applications
* A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
* Mining can be performed in a variety of information repositories
* Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
* Classification of data mining systems
* Major issues in data mining

Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
* Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data

Forms of data preprocessing
[Figure: forms of data preprocessing.]

Data Cleaning
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - values that were inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.

How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
* Fill in the missing value manually: tedious and often infeasible
* Use a global constant to fill in the missing value: e.g., “unknown” (which the mining algorithm may then treat as a new class)
* Use the attribute mean to fill in the missing value
* Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
* Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
* Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin median, by bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them
* Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
* Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B-A)/N
  - The most straightforward approach
  - But outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
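The two partitioning schemes and the smoothing step described above can be sketched in a few lines. This is a minimal illustration under simple assumptions: the function names are hypothetical, every bin is assumed to be non-empty (N no larger than the number of values), and ties in boundary smoothing go to the lower boundary, which is a choice made here rather than something the slides specify.

```python
from typing import List, Sequence


def equal_width_edges(values: Sequence[float], n: int) -> List[float]:
    """Edges of N equal-width intervals, using W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n
    return [a + i * width for i in range(n + 1)]


def equal_depth_bins(values: Sequence[float], n: int) -> List[List[float]]:
    """Sort the data and split it into N bins of (approximately) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value by the closer of the bin's min and max.
    Assumption: ties go to the lower boundary."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative sorted prices (in dollars).
PRICES = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
```

With `PRICES` and three bins, `equal_depth_bins` yields four values per bin, while `equal_width_edges` places the edges at 4, 14, 24, and 34, so the equal-width bins would hold unequal numbers of values.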
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Data Integration
* Data integration:
  - combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources
* Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
* Redundant data often occur when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a “derived” attribute in another table, e.g., annual revenue
* Redundant data may be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: values scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
* min-max normalization
  $$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
* z-score normalization
  $$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$
* normalization by decimal scaling
  $$v' = \frac{v}{10^{j}}$$
  where $j$ is the smallest integer such that $\max(|v'|) < 1$

Data Reduction Strategies
* A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Discretization and Concept Hierarchy
* Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
* Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).

Discretization
* Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
* Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes.
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and concept hierarchy generation for numeric data
* Binning
* Histogram analysis
* Clustering analysis
* Entropy-based discretization
* Segmentation by natural partitioning

Concept hierarchy generation for categorical data
* Specification of a partial ordering of attributes explicitly at the schema level by users or experts
* Specification of a portion of a hierarchy by explicit data grouping
* Specification of a set of attributes, but not of their partial ordering
* Specification of only a partial set of attributes

Specification of a set of attributes
* A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
[Figure: generated hierarchy, top to bottom.]
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values

Summary
* Data preparation is a big issue for both warehousing and mining
* Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
* Many methods have been developed, but this is still an active area of research
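The rule given under “Specification of a set of attributes” (order the attributes by their number of distinct values, most distinct at the bottom) can be sketched as follows. The function names are hypothetical, and the counts are the ones quoted in the example hierarchy.

```python
from typing import Dict, Iterable, List, Mapping


def count_distinct(rows: Iterable[Mapping[str, str]],
                   attributes: Iterable[str]) -> Dict[str, int]:
    """Count the distinct values of each attribute across the rows."""
    seen = {a: set() for a in attributes}
    for row in rows:
        for a in seen:
            seen[a].add(row[a])
    return {a: len(vals) for a, vals in seen.items()}


def hierarchy_from_counts(distinct_counts: Mapping[str, int]) -> List[str]:
    """Return attributes from the top of the hierarchy (fewest distinct
    values) to the bottom (most distinct values)."""
    return sorted(distinct_counts, key=lambda a: distinct_counts[a])


# Distinct-value counts from the location example.
LOCATION_COUNTS = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}
```

Here `hierarchy_from_counts(LOCATION_COUNTS)` returns country, province_or_state, city, street. The heuristic is not foolproof: an attribute such as day of the week has only seven distinct values without being a high-level concept, so a generated hierarchy may still need adjustment by a user or expert.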