OLAP and Data Warehouses

Introduction
• Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
• Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; the data is fairly static.
  – Contrast such On-Line Analytic Processing (OLAP) with traditional On-Line Transaction Processing (OLTP): mostly long queries, instead of short update transactions.

OLTP Compared With OLAP
• On-Line Transaction Processing (OLTP)
  – Maintains a database that is an accurate model of some real-world enterprise. Supports day-to-day operations. Characteristics:
    • Short, simple transactions
    • Relatively frequent updates
    • Transactions access only a small fraction of the database
• On-Line Analytic Processing (OLAP)
  – Uses information in the database to guide strategic decisions. Characteristics:
    • Complex queries
    • Infrequent updates
    • Transactions access a large fraction of the database
    • Data need not be up-to-date

Three Complementary Trends
• Data Warehousing: Consolidate data from many sources in one large repository.
  – Loading, periodic synchronization of replicas.
  – Semantic integration.
• OLAP:
  – Complex SQL queries and views.
  – Queries based on spreadsheet-style operations and a "multidimensional" view of data.
  – Interactive and "online" queries.
• Data Mining: Exploratory search for interesting trends and anomalies.

Data Warehousing
[Figure: external data sources feed the data warehouse through an extract/transform/load/refresh pipeline; a metadata repository describes the warehouse, which in turn supports OLAP and data mining.]
• Integrated data spanning long time periods, often augmented with summary information.
• Several gigabytes to terabytes common.
• Interactive response times expected for complex queries; ad-hoc updates uncommon.

Warehousing Issues
• Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.
• Heterogeneous Sources: Must access data from a variety of source formats and repositories.
  – Replication capabilities can be exploited here.
• Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.
• Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.

OLAP, Data Mining, and Analysis
• The "A" in OLAP stands for "Analytic".
• Many OLAP and data mining applications involve sophisticated analysis methods from the fields of mathematics, statistical analysis, and artificial intelligence.
• Our main interest is in the database aspects of these fields, not the sophisticated analysis techniques.

Fact Tables
• Many OLAP applications are based on a fact table.
• For example, a supermarket application might be based on a table

    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

• The table can be viewed as multidimensional:
  – Market_Id, Product_Id, and Time_Id are the dimensions that represent specific supermarkets, products, and time intervals.
  – Sales_Amt is a function of the other three.

A Data Cube
• Fact tables can be viewed as an N-dimensional data cube (3-dimensional in our example).
  – The entries in the cube are the values for Sales_Amt.

A Sample Data Cube
[Figure: a cube with dimensions Product (TV, PC, VCR), Date (1Qtr–4Qtr), and Country (U.S.A., Canada, Mexico), plus a "sum" plane along each dimension; one marked cell holds the total annual sales of TVs in the U.S.A.]
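The cube view is easy to mimic in code. Below is a minimal Python sketch (the cell values are invented for illustration) that stores fact-table cells keyed by their three dimension values and answers the marked query from the figure — total annual TV sales in the U.S.A. — by aggregating over the Date dimension:

    # A minimal sketch; the cell values are invented for illustration.
    # The fact table maps (Country, Product, Quarter) to Sales_Amt.
    cube = {
        ("U.S.A.", "TV", "1Qtr"): 120, ("U.S.A.", "TV", "2Qtr"): 95,
        ("U.S.A.", "TV", "3Qtr"): 110, ("U.S.A.", "TV", "4Qtr"): 105,
        ("U.S.A.", "PC", "1Qtr"): 310, ("Canada", "TV", "1Qtr"): 60,
    }

    # The marked cell in the figure: fix Country and Product,
    # sum over the Date dimension.
    tv_usa = sum(amt for (country, product, qtr), amt in cube.items()
                 if country == "U.S.A." and product == "TV")
    print(tv_usa)  # 430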
Multidimensional Data Model
• Collection of numeric measures, which depend on a set of dimensions.
  – E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).
• The slice locid = 1, with pid as rows and timeid (1, 2, 3) as columns:

    pid 11:  25   8  15
    pid 12:  30  20  50
    pid 13:   8  10  10

• The same data in relational form, as rows (pid, timeid, locid, sales):

    (11, 1, 1, 25), (11, 2, 1, 8),  (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 1, 20), (12, 3, 1, 50),
    (13, 1, 1, 8),  (13, 2, 1, 10), (13, 3, 1, 10),
    (11, 1, 2, 35), ...

Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy (coarsest to finest):

    PRODUCT:  category → pname
    TIME:     year → quarter → month → date; weeks also roll up to dates
    LOCATION: country → state → city

Dimension Tables
• The dimensions of the fact table are further described with dimension tables.
• Fact table:

    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

• Dimension tables:

    Market (Market_Id, City, State, Region)
    Product (Product_Id, Name, Category, Price)
    Time (Time_Id, Week, Month, Quarter)

Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions and measures
  – Star schema: a fact table in the middle connected to a set of dimension tables.
  – Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake.
  – Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema or fact constellation.

Star Schema
• The fact and dimension relations can be displayed in an E-R diagram, which looks like a star and is called a star schema.

Example of Star Schema
• Sales fact table: time_key, item_key, branch_key, location_key, plus the measures units_sold, dollars_sold, avg_sales.
• Dimension tables:

    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, province_or_state, country)

Example of Snowflake Schema
• The same Sales fact table and measures, but some dimensions are normalized:

    item (item_key, item_name, brand, type, supplier_key)
    supplier (supplier_key, supplier_type)
    location (location_key, street, city_key)
    city (city_key, city, province_or_state, country)

Example of Fact Constellation
• Two fact tables share the time, item, and location dimension tables:

    Sales fact table: time_key, item_key, branch_key, location_key;
        measures units_sold, dollars_sold, avg_sales
    Shipping fact table: time_key, item_key, shipper_key, from_location,
        to_location; measures dollars_cost, units_shipped
    shipper (shipper_key, shipper_name, location_key, shipper_type)

OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services.
  – Greater scalability.
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse-matrix techniques).
  – Fast indexing to pre-computed summarized data.
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational; high level: array.

Aggregation
• Many OLAP queries involve aggregation of the data in the fact table.
• For example, to find the total sales (over time) of each product in each market, we might use

    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id

• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data.
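For intuition, here is a small Python sketch of what the GROUP BY above computes; the Sales rows are hypothetical:

    from collections import defaultdict

    # Hypothetical Sales rows: (Market_Id, Product_Id, Time_Id, Sales_Amt).
    sales = [("M1", "P1", 1, 100), ("M1", "P1", 2, 150),
             ("M1", "P2", 1, 80),  ("M2", "P1", 1, 40)]

    # SELECT Market_Id, Product_Id, SUM(Sales_Amt)
    # FROM Sales GROUP BY Market_Id, Product_Id
    totals = defaultdict(int)
    for market, product, time_id, amt in sales:
        totals[(market, product)] += amt   # Time_Id is aggregated away

    print(dict(totals))
    # {('M1', 'P1'): 250, ('M1', 'P2'): 80, ('M2', 'P1'): 40}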
Aggregation over Time
• The output of the previous query is a cross-tabulation, one SUM(Sales_Amt) value for each market and product:

            P1     P2     P3     P4     P5   ...
    M1    3003   1503   6003   2402   4503
    M2     ...
    M3     ...
    M4     ...

Typical OLAP Operations
• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
• Slice and dice:
  – project and select
  – Slice performs a selection on one dimension
  – Dice defines a subcube by performing a selection on two or more dimensions
• Pivot (rotate):
  – reorient the cube; visualization; 3-D to a series of 2-D planes

Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy:

    Market_Id → City → State → Region

• Executing a series of queries that moves down a hierarchy (e.g., from aggregation over regions to aggregation over states) is called drilling down.
  – Requires the use of the fact table or information more specific than the requested aggregation (e.g., cities).
• Executing a series of queries that moves up the hierarchy (e.g., from states to regions) is called rolling up.
  – Note: In a rollup, coarser aggregations can be computed using prior queries for finer aggregations.

Drilling Down
• Drilling down on market: from Region to State

    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
    Market (Market_Id, City, State, Region)

    1.  SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
        FROM Sales S, Market M
        WHERE M.Market_Id = S.Market_Id
        GROUP BY S.Product_Id, M.Region

    2.  SELECT S.Product_Id, M.State, SUM (S.Sales_Amt)
        FROM Sales S, Market M
        WHERE M.Market_Id = S.Market_Id
        GROUP BY S.Product_Id, M.State

Rolling Up
• Rolling up on market, from State to Region
  – If we have already created a table, State_Sales, using

    1.  SELECT S.Product_Id, M.State, SUM (S.Sales_Amt)
        FROM Sales S, Market M
        WHERE M.Market_Id = S.Market_Id
        GROUP BY S.Product_Id, M.State

    then we can roll up from there to:

    2.  SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)
        FROM State_Sales T, Market M
        WHERE M.State = T.State
        GROUP BY T.Product_Id, M.Region

Pivoting
• Aggregation on selected dimensions.
  – E.g., pivoting on Location and Time yields this cross-tabulation:

             WI    CA   Total
    1995     63    81    144
    1996     38   107    145
    1997     75    35    110
    Total   176   223    399

Pivoting
• When we view the data as a multi-dimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes.
  – Pivoting on dimensions D1, ..., Dk in a data cube D1, ..., Dk, Dk+1, ..., Dn means that we use GROUP BY A1, ..., Ak and aggregate over Ak+1, ..., An, where Ai is an attribute of the dimension Di.
  – Example: Pivoting on Product and Time corresponds to grouping on Product_Id and Quarter and aggregating Sales_Amt over Market_Id:

    SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)   -- pivot
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
    GROUP BY S.Product_Id, T.Quarter

Slicing-and-Dicing
• When we use WHERE to specify a particular value for an axis (or several axes), we are performing a slice.
  – Slicing the data cube in the Time dimension (choosing sales only in week 12), then pivoting to Product_Id (aggregating over Market_Id):

    SELECT S.Product_Id, SUM (S.Sales_Amt)              -- pivot
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
      AND T.Week = 'Wk-12'                              -- slice
    GROUP BY S.Product_Id
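A Python sketch of the slice-then-pivot query above; the Sales rows and the Time lookup table are hypothetical:

    from collections import defaultdict

    # Hypothetical Sales rows and a Time dimension table (Time_Id -> Week).
    sales = [("M1", "P1", 7, 10), ("M2", "P1", 7, 20),
             ("M1", "P2", 7, 5),  ("M1", "P1", 8, 9)]
    week_of = {7: "Wk-12", 8: "Wk-13"}

    # Slice: T.Week = 'Wk-12'; pivot: GROUP BY Product_Id,
    # aggregating Sales_Amt over Market_Id.
    by_product = defaultdict(int)
    for market, product, time_id, amt in sales:
        if week_of[time_id] == "Wk-12":   # the slice
            by_product[product] += amt    # the pivot/aggregation

    print(dict(by_product))  # {'P1': 30, 'P2': 5}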
Slicing-and-Dicing
• Typically slicing and dicing involves several queries to find the "right slice." For instance, change the slice and the axes:
• Slicing on the Time and Market dimensions, then pivoting to Product_Id and Week (in the time dimension):

    SELECT S.Product_Id, T.Week, SUM (S.Sales_Amt)      -- pivot
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
      AND T.Quarter = 4                                 -- slice
      AND S.Market_Id = 12345                           -- slice
    GROUP BY S.Product_Id, T.Week

The CUBE Operator
• To construct the following table would take three queries (next slide): one SUM(Sales_Amt) entry per market and product, plus a Total row and a Total column:

            P1     P2     P3     P4   Total
    M1    3003   1503   6003   2402    ...
    M2     ...
    M3     ...
    Total  ...

The Three Queries
• For the table entries, without the totals (aggregation over time):

    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id

• For the row totals (aggregation over time and supermarkets):

    SELECT S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Product_Id

• For the column totals (aggregation over time and products):

    SELECT S.Market_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id

Definition of the CUBE Operator
• Doing these three queries is wasteful.
  – The first does much of the work of the other two: if we could save that result and aggregate over Market_Id and Product_Id, we could compute the other queries more efficiently.
• The CUBE clause is part of SQL:1999.
  – GROUP BY CUBE (v1, v2, ..., vn)
  – Equivalent to a collection of GROUP BYs, one for each of the 2^n subsets of v1, v2, ..., vn.

Example of CUBE Operator
• The following query returns all the information needed to make the previous products/markets table:

    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY CUBE (S.Market_Id, S.Product_Id)
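To make the "2^n GROUP BYs" reading concrete, here is a rough Python emulation of GROUP BY CUBE on two dimensions. The rows are invented, and the string "ALL" plays the role SQL gives to NULL for a dimension that was aggregated over:

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical rows with time already aggregated away:
    # (Market_Id, Product_Id, total Sales_Amt).
    rows = [("M1", "P1", 30), ("M1", "P2", 15), ("M2", "P1", 24)]
    n = 2  # two CUBE dimensions: Market_Id, Product_Id

    # One GROUP BY per subset of the dimensions (2^n = 4 groupings);
    # "ALL" marks a dimension that was aggregated over.
    cube = defaultdict(int)
    for r in range(n + 1):
        for keep in combinations(range(n), r):
            for row in rows:
                key = tuple(row[i] if i in keep else "ALL"
                            for i in range(n))
                cube[key] += row[2]

    for key in sorted(cube):
        print(key, cube[key])
    # ('ALL', 'ALL') 69 ... ('M1', 'ALL') 45 ... ('M1', 'P1') 30 ...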
ROLLUP
• ROLLUP is similar to CUBE except that instead of aggregating over all subsets of the arguments, it creates subsets moving from right to left.
• GROUP BY ROLLUP (A1, A2, ..., An) is a series of these aggregations:
  – GROUP BY A1, ..., An-1, An
  – GROUP BY A1, ..., An-1
  – ...
  – GROUP BY A1, A2
  – GROUP BY A1
  – No GROUP BY
• ROLLUP is also in SQL:1999.

Example of ROLLUP Operator

    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY ROLLUP (S.Market_Id, S.Product_Id)

  – first aggregates with the finest granularity: GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity: GROUP BY S.Market_Id
  – then the grand total is computed with no GROUP BY clause

ROLLUP vs. CUBE
• The same query with CUBE:
  – first aggregates with the finest granularity: GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity: GROUP BY S.Market_Id and GROUP BY S.Product_Id
  – then the grand total with no GROUP BY

Materialized Views
• The CUBE operator is often used to precompute aggregations on all dimensions of a fact table and then save them as materialized views to speed up future queries.

Incremental Updates
• The large volume of data in a data warehouse makes loading and updating a significant task.
• For efficiency, updating is usually incremental.
  – Different parts are updated at different times.
• Incremental updates might result in the database being in an inconsistent state.
  – Usually not important, because queries involve only statistical summaries of the data, which are not greatly affected by such inconsistencies.

Introduction to Data Mining

Outline
• What's Data Mining?
• Data Mining Techniques
  – Association Rule Mining
  – Classification
  – Clustering
  – Other Data Mining Techniques
• New Challenges in Data Mining

Data Mining
• Data Mining, also referred to as KDD (Knowledge Discovery in Databases), is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data in large databases.
• Such useful information can take the form of knowledge rules, constraints, or regularities.

Data Mining: A KDD Process
[Figure: databases are combined through data cleaning and data integration into a data warehouse; selection produces the task-relevant data; data mining extracts patterns, which pass through evaluation and presentation to become knowledge.]

Tasks Solved by Data Mining
• Prediction/Classification
• Clustering
• Association Analysis
• Outlier Detection
• Evolution Analysis

Data Mining Applications
• Customer Relationship Management (CRM)
  – Customer retention
• Market analysis
  – Target market
  – Market segmentation
  – Cross selling
• Fraud detection
  – Health care fraud detection
  – Credit card fraud detection
  – Telecom fraud detection

Data Mining – a Multidisciplinary Field
• Database technology
• Data warehousing
• Artificial intelligence
• Machine learning
• Statistics
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• Data visualization
• High-performance computing

Association Rule Mining
• An association is a correlation between certain values in a database (in the same or different columns).
• Originally proposed on market basket data (items purchased on a per-transaction basis).
  – A rule such as "70% of transactions that purchase bread also purchase butter".
  – In a convenience store in the early evening, a large percentage of customers who bought diapers also bought beer.

Confidence and Support
• To determine whether an association exists, the system computes the confidence and support for that association.
• Confidence in A => B
  – The percentage of transactions (recorded in the database) that contain B among those that contain A.
    • Diapers => Beer: the percentage of customers who bought beer among those who bought diapers.
• Support
  – The percentage of transactions that contain both items among all transactions.
    • 100 * (customers who bought both diapers and beer) / (all customers)

Formal Definition
• Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items (an itemset). An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
• X is called the antecedent, Y the consequent.
• The rule X => Y has support s if s% of the transactions in D contain X ∪ Y.
• The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Support indicates the frequency of the occurring pattern, while confidence measures the rule's strength.

Example

    TID   Items
    1     Bread, Milk
    2     Beer, Diaper, Bread, Eggs
    3     Beer, Coke, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Coke, Bread, Diaper, Milk

Rule: {Diaper, Milk} => Beer

    Support    = σ(Diaper, Milk, Beer) / |D|              = 2/5 = 0.4
    Confidence = σ(Diaper, Milk, Beer) / σ(Diaper, Milk)  = 2/3 ≈ 0.67

Two Steps of Association Rule Mining
• Given a user-specified minimum support threshold (minsup) and minimum confidence threshold (minconf), the problem can be solved in two steps:
  – Find the frequent itemsets, which have support above minsup.
  – From each frequent itemset, derive all the rules that have confidence above minconf.
• The overall performance is determined by step 1.
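The support and confidence of the {Diaper, Milk} => Beer rule can be checked mechanically. A minimal Python sketch over the five example transactions above:

    # The five transactions from the example table above.
    D = [{"Bread", "Milk"},
         {"Beer", "Diaper", "Bread", "Eggs"},
         {"Beer", "Coke", "Diaper", "Milk"},
         {"Beer", "Bread", "Diaper", "Milk"},
         {"Coke", "Bread", "Diaper", "Milk"}]

    def sigma(itemset):
        # Number of transactions that contain every item in itemset.
        return sum(1 for t in D if itemset <= t)

    X, Y = {"Diaper", "Milk"}, {"Beer"}
    support = sigma(X | Y) / len(D)       # 2/5 = 0.4
    confidence = sigma(X | Y) / sigma(X)  # 2/3 ≈ 0.67
    print(support, confidence)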
Apriori Algorithm
• Step-wise (levelwise) algorithm.
• Key observation: any subset of a frequent itemset is also frequent.
• Candidate set generation exploits this observation to prune the search.

Apriori Algorithm

    k-itemset   An itemset having k items
    Lk          Set of frequent k-itemsets (those with minimum support)
    Ck          Set of candidate k-itemsets (potentially frequent itemsets)

    1)  L1 = {frequent 1-itemsets};
    2)  for (k = 2; Lk-1 ≠ ∅; k++) do begin
    3)      Ck = apriori-gen(Lk-1);         // new candidates
    4)      forall transactions t ∈ D do begin
    5)          Ct = subset(Ck, t);         // candidates contained in t
    6)          forall candidates c ∈ Ct do
    7)              c.count++;
    8)      end
    9)      Lk = {c ∈ Ck | c.count >= minsup}
    10) end
    11) Answer = ∪k Lk;

Example (minsup = 50%, minconf = 60%)

    Database D
    TID   Items
    100   A C D
    200   B C E
    300   A B C E
    400   B E

    Candidate 1-itemsets (C1)          Frequent 1-itemsets (L1)
    {A}: 2  {B}: 3  {C}: 3             {A}: 2  {B}: 3
    {D}: 1  {E}: 3                     {C}: 3  {E}: 3

    Candidate 2-itemsets (C2)          Frequent 2-itemsets (L2)
    {A,B}: 1  {A,C}: 2  {A,E}: 1       {A,C}: 2  {B,C}: 2
    {B,C}: 2  {B,E}: 3  {C,E}: 2       {B,E}: 3  {C,E}: 2

    Candidate 3-itemsets (C3)          Frequent 3-itemsets (L3)
    {B,C,E}: 2                         {B,C,E}: 2

• Rules derived from the frequent itemset {B, C, E} (each with support = 50%):
  – B ∧ C => E   with confidence = 100%
  – B ∧ E => C   with confidence = 66.7%
  – C ∧ E => B   with confidence = 100%
  – B => C ∧ E   with confidence = 66.7%
  – C => B ∧ E   with confidence = 66.7%
  – E => B ∧ C   with confidence = 66.7%
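A compact, runnable Python sketch of the Apriori loop — a straightforward reading of the pseudocode above, not an optimized implementation. On the example database it reproduces L1 through L3:

    from itertools import combinations

    # Example database from above; minsup = 50% of 4 transactions = 2.
    D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

    def apriori(transactions, minsup):
        items = {i for t in transactions for i in t}
        # L1: frequent 1-itemsets
        L = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup}
        answer, k = set(L), 2
        while L:
            # apriori-gen: join step (unions of size k) ...
            C = {a | b for a in L for b in L if len(a | b) == k}
            # ... and prune step: every (k-1)-subset must be frequent
            C = {c for c in C
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
            # count support and keep the frequent candidates
            L = {c for c in C
                 if sum(c <= t for t in transactions) >= minsup}
            answer |= L
            k += 1
        return answer

    for itemset in sorted(apriori(D, 2), key=lambda s: (len(s), sorted(s))):
        print(set(itemset))   # ends with the frequent 3-itemset {B, C, E}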
Topics on Association Rule Mining
• Quantitative association rule mining
  – Partition continuous data into intervals
• Mining multilevel association rules
  – Concept hierarchy
    • 80% of customers that purchase milk may also purchase bread
    • 70% of people buy wheat bread if they buy 2% milk
• Constraint-based rule mining
• Mining high-confidence rules
• Rule interestingness
• Removing redundant and misleading rules
• From association rule mining to correlation analysis

Classification
• Each tuple belongs to a class. The class of a tuple is indicated by the value of a goal attribute (or class-label attribute); the other attributes are predicting attributes.
• The task is to discover the relationship between the predicting attributes and the goal attribute.
• The discovered knowledge can be used to predict the class of a new, unknown-class tuple.
• Classification and regression:
  – Classification: on categorical data
  – Regression: on continuous data

Classification Accuracy
• Split the data into a training data set and a testing data set.
• For each tuple in the test data, the class predicted by the discovered rules is compared with the actual class.
• Accuracy is the ratio of test-data tuples correctly classified by the discovered rules.

Example 1

    Gender   Country   Age   Buy (goal)
    Male     France    25    Yes
    Male     England   21    Yes
    Female   France    23    Yes
    Female   England   34    Yes
    Female   France    30    No
    Male     Germany   21    No
    Male     Germany   20    No
    Female   Germany   18    No
    Female   France    34    No
    Male     France    55    No

IF-THEN classification rules discovered from the input data:

    If (Country = "Germany") then (Buy = "no")
    If (Country = "England") then (Buy = "yes")
    If (Country = "France" and Age <= 25) then (Buy = "yes")
    If (Country = "France" and Age > 25) then (Buy = "no")

Example 2

    No.   Risk       Credit History   Debt   Collateral   Income
    1     high       bad              high   none         $0 to $15k
    2     high       unknown          high   none         $15k to $35k
    3     moderate   unknown          low    none         $15k to $35k
    4     high       unknown          low    none         $0 to $15k
    5     low        unknown          low    none         over $35k
    6     low        unknown          low    adequate     over $35k
    7     high       bad              low    none         $0 to $15k
    8     moderate   bad              low    adequate     over $35k
    9     low        good             low    none         over $35k
    10    low        good             high   adequate     over $35k
    11    high       good             high   none         $0 to $15k
    12    moderate   good             high   none         $15k to $35k
    13    low        good             high   none         over $35k
    14    high       bad              high   none         $15k to $35k

Example 2 (cont.) — Decision Trees
[Figure: a decision tree. The root tests Income. "$0 to $15k" leads to high risk. "$15k to $35k" leads to a Credit history test: unknown → a Debt test (high debt → high risk, low debt → moderate risk); bad → high risk; good → moderate risk. "Over $35k" leads to a Credit history test: unknown → low risk; bad → moderate risk; good → low risk.]

Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote tests on attributes
  – Branches represent outcomes of the tests
  – Leaf nodes represent class labels or class distributions
• Two phases of decision tree generation
  – Tree construction
  – Tree pruning

Algorithm for Decision Tree Induction
• Basic algorithm
  – The tree is constructed in a top-down recursive manner.
  – At the start, all the training examples are at the root.
  – Test attributes are selected based on a heuristic or statistical measure (e.g., information gain in ID3/C4.5, or the Gini index in IBM Intelligent Miner).
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class.
  – There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
  – There are no samples left.
• DTI algorithms: ID3, C4.5, CART, SPRINT, SLIQ

Attribute Selection
• Splitting on Income partitions the examples into
  – $0 to $15k: examples {1, 4, 7, 11}
  – $15k to $35k: examples {2, 3, 12, 14}
  – over $35k: examples {5, 6, 8, 9, 10, 13}
• Why select the "Income" attribute at the root level? The information content of the whole training set (6 high, 3 moderate, 5 low out of 14) is

    I = -(6/14) log2(6/14) - (3/14) log2(3/14) - (5/14) log2(5/14)
      = 1.531 bits

  and the expected information still needed after splitting on income is

    E(income) = (4/14) * ( -(4/4) log2(4/4) )
              + (4/14) * ( -(2/4) log2(2/4) - (2/4) log2(2/4) )
              + (6/14) * ( -(5/6) log2(5/6) - (1/6) log2(1/6) )
              = 0.564 bits

  so

    Information_gain(income)         = 1.531 - 0.564 = 0.967 bits
    Information_gain(credit history) = 0.266 bits
    Information_gain(debt)           = 0.063 bits
    Information_gain(collateral)     = 0.206 bits

  Income has the highest gain, so it is chosen as the root test.

Other Classification Methods
• Bayesian classification
  – Naïve Bayesian classification
  – Bayesian belief networks
• Neural networks
  – Classification by backpropagation
• Nearest-neighbor classification
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approach

Clustering
• Data clustering groups a set of data (without a predefined class attribute) based on the conceptual clustering principle: maximize the intra-class similarity and minimize the inter-class similarity.
• Classification is supervised learning, while clustering is unsupervised learning.

Similarity Metrics
Suppose i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects.
• Manhattan distance:

    d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

• Euclidean distance:

    d(i, j) = sqrt( |xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2 )

• Minkowski distance:

    d(i, j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q)

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets.
  2. Compute the seed point (mean) of each cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to step 2; stop when no assignment changes.
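These four steps translate almost line-for-line into code. A minimal Python sketch on made-up 2-D points (a real implementation would add better seeding; an iteration cap is included here as a safeguard):

    import random

    def kmeans(points, k, max_iters=100):
        # Step 1/2: seed the k cluster centers from the data itself.
        centroids = random.sample(points, k)
        for _ in range(max_iters):
            # Step 3: assign each object to the nearest seed point.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                d2 = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
                clusters[d2.index(min(d2))].append((x, y))
            # Step 2: recompute the seed point (mean) of each cluster.
            new_centroids = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            # Step 4: stop when no assignment (hence no mean) changes.
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters

    # Hypothetical 2-D data with two visible groups.
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centers, groups = kmeans(pts, k=2)
    print(centers)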
Example of K-Means Clustering Method
[Figure: four scatter plots of the same 2-D points showing successive k-means iterations; the cluster assignments and the two cluster means shift until the assignment stabilizes.]

k-Means and k-Medoids Clustering
• Partitioning methods
  – k-means: each cluster is represented by the center (mean) of the cluster.
  – k-medoids: each cluster is represented by one of the objects in the cluster, called the medoid, which is its most centrally located object.

Other Clustering Methods
• Hierarchical methods
  – Top-down (starts with all the objects in the same cluster)
  – Bottom-up (starts with each object forming a separate cluster)
• Density-based methods
  – A neighborhood has to contain at least a minimum number of points (density threshold).
• Grid-based methods
  – Quantize the object space into a finite number of cells that form a grid structure.
• Model-based clustering methods
  – Neural networks, Self-Organizing Maps (SOM)

Other Data Mining Techniques
• Sequential pattern mining and time-series similarity analysis
  – Stock analysis
• Outlier detection
  – Fraud detection
  – Medical analysis
• Text mining
• Web mining

New Challenges in Data Mining
• Incremental data mining
  – Data mining on data streams, such as supermarket transactions and phone call records
• Scalable data mining methods; parallel and distributed data mining
• New application areas, such as bioinformatics
• Privacy protection and information security in data mining

New Challenges in Data Mining (cont.)
• Incorporation of domain knowledge
• Interestingness measures
• Data mining query languages
• Handling noisy or incomplete data
• Data mining on complex types of data
  – Spatial data, web data, multimedia data, temporal data, XML data