Decision support tools: Data warehousing, OLAP and data mining
Sunita Sarawagi

Data explosion
• Banks, companies, websites, retail stores, scientific labs --- all hold terabytes of data, and the volume is continually growing.
• Storage and processing are getting cheaper
• A wealth of information is hidden in the flood of data
• Conventional querying/analysis methods do not scale
• Need new ways of interaction: data warehousing and data mining

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

Why a Warehouse?
• Large organizations have complex internal structures, with data stored at different locations on different operational systems
• Data sources often store only current data, not historical data
• Corporate decision making requires a unified view of all organizational data, including historical data
• A data warehouse
– Greatly simplifies querying, permits study of historical trends
– Shifts decision support query load away from transaction processing systems

Decision support tools
[Figure: architecture diagram. Operational data sources (Bombay, Delhi and Calcutta branch databases on Oracle and IMS, plus GIS and census data) feed a staging + ETL layer (merge, clean, summarize) into a relational data warehouse, e.g. Redbrick. On top sit direct query, reporting tools (Crystal Reports), OLAP (Essbase), and mining tools (Intelligent Miner, SAS).]

MTNL: case study
MTNL operates in 9 zones, each with 5-7 exchanges.

Database | Size (GB)
CSMS: Customer Service Management System | 40x9
FRS: Fault Repair System | 200
FMS: Finance Management System | 20
DQS: Directory Query System |
PMS: Personal Management System | 10
IGS: Interactive Graphic System | 5
TCMIS: Traffic Control Information Management System | 10
INTERNET: Internet Database System | 70

Data Warehouse vs.
Operational DBMS
• OLTP (on-line transaction processing)
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• Data warehouse system
– Data analysis and decision making
• Distinct features
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries

Data warehouse construction
• Heterogeneous schema integration
– merge from various sources, fuzzy matches
– remove inconsistencies
• Data cleaning:
– missing data, outliers, clean fields e.g. names/addresses
• Data loading: efficient parallel loads
• Products: Prism Warehouse Manager, Platinum Info Refiner, Info Pump, QDB, Vality

Warehouse maintenance
• Data refresh
– when to refresh, what form to send updates?
• Materialized view maintenance with batch updates
• Query evaluation using materialized views
• Monitoring and reporting tools
– HP Intelligent Warehouse Advisor

Warehouse Schemas
• Typically warehouse data is multidimensional, with very large fact tables
– Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier
– Examples of measures: number of items sold, price of items
• Dimension values are usually encoded using small integers and mapped to full values via dimension tables
– The resultant schema is called a star schema
• More complicated schema structures
– Snowflake schema: multiple levels of dimension tables
– Constellation: multiple fact tables

OLAP
Fast, interactive answers to large aggregate queries.
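The star-schema layout just described, and the kind of aggregate query a warehouse answers over it, can be sketched in Python. This is a minimal illustration; all table names and contents below are invented:

```python
# Star schema in miniature: dimension values are stored as small integer keys
# in the fact table and decoded via a dimension table. All data is invented.
from collections import defaultdict

item_dim = {1: "TV", 2: "PC", 3: "VCR"}   # dimension table: key -> item name
facts = [(1, 3), (1, 2), (2, 1)]          # fact table: (item_id, units_sold)

def units_by_item(facts, item_dim):
    """Join the fact table to the item dimension and total the measure."""
    totals = defaultdict(int)
    for item_id, units in facts:
        totals[item_dim[item_id]] += units
    return dict(totals)

print(units_by_item(facts, item_dim))  # {'TV': 5, 'PC': 1}
```

Keeping only small integer keys in the (very large) fact table, as the slide notes, saves space; the join to the dimension table recovers the readable values at query time.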
• Multidimensional model: dimensions with hierarchies
– Dim 1: Bank location: branch --> city --> state
– Dim 2: Customer: sub profession --> profession
– Dim 3: Time: month --> quarter --> year
• Measures: loan amount, #transactions, balance

Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
– Product: industry --> category --> product
– Location: region --> country --> city --> office
– Time: year --> quarter --> month/week --> day

A Sample Data Cube
[Figure: a 3-D cube with dimensions Date (1Qtr-4Qtr plus sum), Product (TV, PC, VCR plus sum) and Country (India, Canada, Mexico plus sum); one marginal cell gives the total annual sales of TV in India.]

Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from a higher level summary to a lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube; visualization; 3D to a series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

OLAP products
• About 30 OLAP vendors
• Dominant ones:
– Oracle Express: largest market share
– Hyperion: technology leader
– Microsoft Plato: introduced late last year, rapidly taking over...
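A roll-up along one hierarchy, as in the operations above, amounts to re-aggregating measures under a coarser grouping. A minimal sketch with invented branch-level data (the branch names and the branch-to-city mapping are made up for illustration):

```python
# Roll-up on the location hierarchy: branch-level totals re-aggregated to city
# level by mapping each branch through the hierarchy. All data is invented.
from collections import defaultdict

# Base facts: (branch, month, loan_amount)
facts = [("Andheri", "Jan", 10), ("Dadar", "Jan", 20), ("Karol Bagh", "Jan", 5)]
branch_to_city = {"Andheri": "Bombay", "Dadar": "Bombay", "Karol Bagh": "Delhi"}

def roll_up(facts, mapping):
    """Climb one level of the hierarchy and re-sum the measure."""
    out = defaultdict(int)
    for branch, month, amount in facts:
        out[(mapping[branch], month)] += amount
    return dict(out)

print(roll_up(facts, branch_to_city))
# {('Bombay', 'Jan'): 30, ('Delhi', 'Jan'): 5}
```

Drill-down is the reverse: it cannot be computed from the city totals alone, which is why OLAP systems keep (or can reach) the detailed base data.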
Part 2: Data mining

Data mining
• The process of semi-automatically analyzing large databases to find patterns that are:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the pattern
– understandable: humans should be able to interpret the pattern
• Other names: knowledge discovery in databases, data analysis

Relationship with other fields
• Overlaps with machine learning, statistics, artificial intelligence, databases and visualization, but with more stress on
– scalability in the number of features and instances
– algorithms and architectures, whereas the foundations of the methods and formulations are provided by statistics and machine learning
– automation for handling large, heterogeneous data

Applications
• Banking: loan/credit card approval
– predict good customers based on old customers
• Customer relationship management:
– identify those who are likely to leave for a competitor
• Targeted marketing:
– identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
– from an online stream of events, identify the fraudulent ones
• Manufacturing and production:
– automatically adjust knobs when process parameters change

Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
– analyze patient disease history: find relationships between diseases
• Molecular/pharmaceutical: identify new drugs
• Scientific data analysis:
– identify new galaxies by searching for sub-clusters
• Web site/store design and promotion:
– find the affinity of visitors to pages and modify the layout

Mining technology today
[Figure: a mining architecture. Data is extracted from the data warehouse via ODBC into preprocessing utilities (sampling, attribute transformation), then fed to scalable algorithms (association, classification, clustering, sequence mining) and visualization tools.]
• Vendors (IDC 1999):
– SAS: 29%
– SPSS: 13.5%
– IBM: 6%

Mining operations, grouped by the techniques each uses:
• Itemset mining: association rules, causality
• Clustering: hierarchical,
density based, EM
• Classification: regression, classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines, meta learning methods (bagging, boosting)

Classification
• Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customers (age, salary, profession, location, customer type) train a classifier; the resulting decision rules (e.g. salary > 5L and prof. = exec) label a new applicant's data good or bad.]

Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
• Regression: linear or any other polynomial
• Decision tree classifier: divide the decision space into piecewise constant regions
• Neural networks: partition by non-linear boundaries
• Probabilistic/generative models
• Lazy learning methods: nearest neighbor
• Support vector machines: boundary to maximally separate classes

Decision tree classifiers
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else rules
• Approximates a function by piecewise constant regions
• Does not require any prior knowledge of the data distribution; works well on noisy data
• Has been applied to:
– classify medical patients based on the disease,
– equipment malfunction by cause,
– loan applicants by likelihood of payment.

Decision trees
• A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: an example tree splitting on salary < 1M, prof = teacher, and age < 30, with leaves labelled good or bad.]

Algorithm for tree building
• Greedy top-down construction.
Gen_Tree(node, data):
  should node be made a leaf?
  if yes: stop
  otherwise: find the best attribute and the best split on that attribute (selection criteria);
  partition the data on the split condition;
  for each child j of node: Gen_Tree(node_j, data_j)

Nearest neighbor
• Define proximity between instances, find neighbors of a new instance
– k-NN approach: assign the majority class among the k nearest neighbours
– weighted regression: learn a new regression equation by weighting each training instance based on its distance from the new instance
• Pros
+ Fast training
• Cons
– Slow during application
– No feature selection
– Notion of proximity vague

Neural networks
• Useful for learning complex data like handwriting, speech and image recognition
[Figure: decision boundaries produced by linear regression, a classification tree, and a neural network.]

Bayesian learning
• Assume a probability model on the generation of data.
• Apply Bayes' theorem to find the most likely class:
  c = argmax_cj p(cj | d) = argmax_cj p(d | cj) p(cj) / p(d)
• Naïve Bayes: assume attributes are conditionally independent given the class value:
  c = argmax_cj [p(cj) / p(d)] * prod_{i=1..n} p(ai | cj)

Meta learning methods
• No single classifier is good in all cases
• Difficult to evaluate the conditions in advance
• Meta learning: combine the effects of the classifiers
– Voting: sum up the votes of the component classifiers
– Combiners: learn a new classifier on the outcomes of the previous ones
– Boosting: staged classifiers
• Disadvantage: interpretation is hard
– Knowledge probing: learn a single classifier to mimic the meta classifier

What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

Applications
• Customer segmentation, e.g.
for targeted marketing
– Group/cluster existing customers based on the time series of their payment history so that similar customers fall in the same cluster
– Identify micro-markets and develop policies for each
• Image processing
• Text clustering, e.g. scatter/gather
• Compression

Distance functions
• Numeric data: Euclidean, Manhattan distances
– Minkowski metric: [sum_i |xi - yi|^m]^(1/m)
– Larger m gives higher weight to larger distances
• Categorical data: 0/1 to indicate presence/absence
– Euclidean distance: equal weight to 1-1 and 0-0 matches
– Hamming distance (# of dissimilarities)
– Jaccard coefficient: # of matching 1s / (# of 1s); 0-0 matches not important
– data dependent measures: the similarity of A and B depends on co-occurrence with C
• Combined numeric and categorical data: weighted normalized distance

Clustering methods
• Hierarchical clustering
– agglomerative vs. divisive
– single link vs. complete link
• Partitional clustering
– distance-based: k-means
– model-based: EM
– density-based

Agglomerative hierarchical clustering
• Given: a matrix of similarities between every point pair
• Start with each point in a separate cluster and merge clusters based on some criterion:
– Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is least
– Complete link: merge two clusters such that all points in one cluster are "close" to all points in the other

Partitional methods: k-means
• Criterion: minimize the sum of squared distances
– between each point and the centroid of its cluster,
– or between each pair of points in the cluster
• Algorithm:
– Select an initial partition with K clusters: random, first K, K separated points
– Repeat until stabilization:
• Assign each point to the closest cluster center
• Generate new cluster centers
• Adjust clusters by merging/splitting

Association Rules
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– Rule form: "Body --> Head [support, confidence]"
– E.g., 98% of people who have a checking account and a PPF also apply for a credit card
– Rule measures: support and confidence
[Figure: Venn diagram of the customers buying each of two items and both.]
• Find all rules X & Y --> Z with minimum confidence and support
– support, s: probability that a transaction contains {X, Y, Z}
– confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID | Items bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
– A --> C (50%, 66.6%)
– C --> A (50%, 100%)

Applications of fast itemset counting
Find correlated events:
• Applications in medicine: find redundant tests
• Cross selling in retail, banking
• Improve the predictive capability of classifiers that assume attribute independence
• New similarity measures for categorical attributes [Mannila et al, KDD 98]

Mining market
• Around 20 to 30 mining tool vendors
• Major tool players:
– SAS's Enterprise Miner,
– IBM's Intelligent Miner,
– SGI's MineSet
• All offer pretty much the same set of tools
• Many embedded products:
– fraud detection
– electronic commerce applications
– health care
– customer relationship management: Epiphany

Summary
• Need for new decision support tools
• Data warehousing
– Data integration, loading, cleaning
– Interactive data analysis/navigation: OLAP
• Data mining: definition and an overview of the various operations:
– Classification: regression, nearest neighbour, neural networks, Bayesian
– Clustering: distance based (k-means), hierarchical
– Itemset counting
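As a closing worked example, the support and confidence figures quoted in the association-rule slides (A --> C at 50% support and 66.6% confidence over the four-transaction table) can be re-derived in a few lines of Python:

```python
# Recompute support and confidence for the transaction table in the
# association-rule slides: TIDs 2000, 1000, 4000, 5000.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """support(body | head) / support(body): P(head | body)."""
    return support(body | head, transactions) / support(body, transactions)

print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...
print(confidence({"C"}, {"A"}, transactions))   # 1.0
```

This matches the slides: both rules have 50% support; A --> C has confidence 2/3 (two of the three transactions containing A also contain C), while C --> A has confidence 100%.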