Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI 1 EXTENSIONS TO DSS BI systems (aka EIS) Geographical Information Systems (GIS) Collaborative Systems (formerly GDSS) Expert Systems OLAP/Data mining/warehousing DATA WAREHOUSES 3 DATA WAREHOUSE A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining) DATA WAREHOUSING USERS APPLICATIONS WAREHOUSE User1 Prodn. Mktg. Fin. Acctg. S E R V E R User2 User3 User4 THE DATA LIFE CYCLE Dashboard/BI Internal Data Data Visualization Data Marts OLAP External Data Internal Data Data Warehouse Decisions Data Marts Data Mining Knowledge The activities taking place with respect to data for warehouse/OLAP/mining INTEGRATION OF DATA Integration Operational databases application A – m,f application B – 1,0 application C – x,y application D – male, female Data Warehouse M/F 7 CHARACTERISTICS OF WAREHOUSES Data warehouses have a number of characteristics Subject-oriented Integrated Time-variant Non-volatile 8 KEY FEATURES OF DATA WAREHOUSES… Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time Nonvolatile: The data in a warehouse is permanent. DESIGN OF WAREHOUSE Design of warehouses is similar to databases: Schema not radically different W/H designed primarily for queries Database will be converted to cube at run time. 10 ORGANIZING THE W/H STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension. SNOWFLAKE SCHEMA: A variant of the star schema, Where some dimension tables are normalized, thereby splitting the data into additional tables. CONSTELLATION SCHEMA: A collection of stars. STAR SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS What are the dimensions here? SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS CREDIT DEBIT CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI) BRANCH DEPARTMENTS ACCOUNTS EMPLOYEES TRANSACTIONS CREDIT DEBIT EXAMPLE OF WAREHOUSE & MARTS Weekly sales by region 2007-2013 Weekly sales by state 2005-2013 Sales by Product Line 2010-2013 Data Mart Data Mart Warehouse Sales Detail Sales Detail 2003-2013 Weekly sales by product 2008-2013 OLAP & DATA WAREHOUSES 16 ONLINE ANALYTICAL PROCESSING OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December? Aggregation (“data summarization”) Dimension Slice & Dice Cube organization (Multidimensional database) AGGREGATION & CUBE Region North. E South E. South W. Midwest South W. North E. Units 40 20 30 20 50 65 NE SE SW 65 20 30 40 50 sales in the Northern region? 18 DIMENSIONS & CONCEPT HIERARCHIES A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable. Dimensions can have hierarchies (or various levels of aggregations) A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts CONCEPT HIERARCHY Mfr. dimension iPhone5 GAL GALII 30,000 13,600 iPhone4 150,600 G 15,067 G2 FX 18,240 51,326 MULTI-DIMENSIONAL ORGANIZATION Sales, costs etc. nw Regions sw Products (tables, desks, lamps..) Cube organization supports slice & dice MULTI-DIMENSIONAL ORGANIZATION.. LG G LG G G2 MW G2 LG G FX G2 FX TOTAL FX TOTAL TOTAL SW TOTAL MW SW SE NW NE TOTAL 3,780 MW 4,893 SW 7,494 SE6,520 NW 2,450 NE 25,137 TOTAL 3,780 4,893 7,494 6,520 2,450 25,137 2,342 1,200 950 7,570 3,780 1,400 4,893 1,678 7,494 6,520 2,450 25,137 2,342 1,200 1,400 1,678 950 7,570 3,780 7,893 2,342 5,647 1,200 6,493 1,400 1,678 7,839 31,652 950 7,570 3,780 7,893 5,647 6,493 7,839 31,652 9,902 13,986 7,893 14,541 5,647 14,691 11,239 64,359 3,780 6,493 7,839 31,652 9,902 13,986 14,541 14,691 11,239 64,359 9,902 SE 13,986 NW 14,541 NE 14,691 11,239 March sales February sales January sales 64,359 shows multi-dimensional/cube organization 22 CUBE ORGANIZATION Data from warehouse imported into memory A sophisticated 3D representation is created Referred to as “sparse matrix” Sides of cube are dimensions Allows “slice & dice” Answers to high level queries/reports DATA MINING 24 DATA MINING Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data. e.g. sequence/association, classification, and clustering Buyer non-buyer Success/failure of an operation Customers who spend more Preferences of target groups 25 SOME DATA MINING APPLICATIONS Predicting the probability of default for consumer loans Predicting audience response to TV advertisements Predicting the probability that a cancer patient will respond to radiation therapy. Predicting the probability that an offshore well will produce oil A phone company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles 26 DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY Evaluation and Presentation Knowledge Data Mining Selection and Transformation Cleaning and Integration Database Data Warehouse Flat files Patterns DATA MINING PROCESS Cleaning & integration – data is brought in from multiple sources Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin. Data mining - process of extracting data patterns, using statistical or AI techniques. Pattern evaluation - identifying patterns useful and relevant to the organizational context. Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user. 28 OLAP, MINING & WAREHOUSING... Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers. OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb? Data mining refers to identification of patterns from data. DATA MINING TECHNIQUES 30 TYPES OF ANALYSES Sequence -- Activities occurring one after another e.g. loan after buying car, warranty. Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat) Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent shoppers) Clustering -- Identifying natural characteristics of data (what major areas are cust. coming from?) SEQUENTIAL ANALYSIS 32 SEQUENTIAL ANALYSIS Predict purchasing 40% of customers buy a gray skirt six months after buying a red jacket Identified by time-series analysis, Neural networks or genetic algorithms 33 IDENTIFYING SEQUENTIAL PATTERNS.. applications in forecasting exchange rates, meat consumption , bankruptcies etc. 34 ASSOCIATIVE ANALYSIS 35 ASSOCIATION Identifies items purchased together Customers who bought items A, B, C also buy Y 85 percent of customers who buy a certain wine brand also buy a certain type of pasta Many men buy a six-pack when they purchase diapers Enhances product placement in stores 36 ANALYSIS TECHNIQUES ASSOCIATION TID Items 100 ACD 200 BCE 300 ABCE 400 BE Database Algorithm A-priori list all items – 1 item set (C1) filter by min. transaction support (L1) identify 2-item sets (L1* L1) = (C2 ) filter and generate L2 repeat process... Min. transaction support is the number (sometimes given as %) of transactions in which the item must occur. 37 A-PRIORI ALGORITHM TID Items Itemset Sup. {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Itemset Itemset Sup. Itemset Sup. {A B} {A B} 1 {A C} 2 {A C} 2 {B C} 2 {A E} 1 {B E} 3 {B C} {B C} 2 {C E} 2 {B E} {B E} 3 {C E} 2 Itemset Sup. Itemset Sup. {B C E} 2 {B C E} 2 Scan Database 100 A C D 200 B C E 300 A B C E 400 B E {A C} {A E} {C E} Itemset {B C E} C3 Scan Database C2 Scan Database minimum Support: 2 C1 minimum Support: 2 Itemset Sup. {A} 2 {B} 3 {C} 3 {E} 3 L1 L2 C2 L3 C3 38 DISCUSSION Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two. Client1 Client2 Client3 Client4 AAPL, ORCL, WFC, BAC BAC, WTU GOOG, IBM, C WTU, WFC, C, IBM 39 CLASSIFICATION ANALYSIS 40 CLASSIFICATION A technique for grouping data into pre-defined classes using certain attributes of the data. E.g. defaulter or not, cruise customer, 4G subscriber or not etc. Define classes Identify classification criteria Select suitable technique application 41 CLASSIFICATION TECHNIQUES Discriminant analysis (categorical membership) Bayesian classification Logistic regression (probabilistic membership) Neural networks 42 DISCRIMINANT ANALYSIS* A method of classification that uses a Discriminant Function to decide classes DF based on attributes and weights. Calculated as a weighted average of attributes and weights (of training data) (Zave score) E.g. (GMAT + 200 * UGPA) > 1200 Given a new set of data, calculate Z score Compare with Zav *a simplified version DF – Discriminant Function 43 CLASSIFICATION WITH BAYESIAN ANALYSIS 44 REFRESHER ON PROBABILITY Probability is the chance that an event/outcome will take place. E.g. Cubs had 43 wins 36 losses in one season Probability of winning? 45 PRIOR (CONDITIONAL) PROBABILITIES Prior probabilities are knowledge of other events which may help improve predictions E.g. p(IPO success) = 0.33 Suppose we know that a big company behind IPO p(IPO success/big company) = 0.99 Which is higher? P(successful cellphone call) or P(successful call/subscriber in service area) 46 CONDITIONAL PROBABILITIES.. College # of students CBT 1,800 COAS 2,300 COFAC 1,200 COE 6,700 Total 12,000 If we see a student in the union and he/she is a WIU student What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE? 47 BAYESIAN ANALYSIS Bayes theorem can be exploited for classification Bayes theorem P(A/B) = [P(B/A) x P(A)]/P(B) Classify into class I or II based on conditional probability P(class I/x) > P(class II/x)? * E.g. wiu student in union, which college are they? A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events *x is some condition e.g. surgery or being a shopper in a retail chain 48 INTRODUCTION TO BAYESIAN ANALYSIS We are interested in p(person becoming a manager/mba)* How can we use Baye’s theorem? ASSUME 300 m population; 100 m employees 500,000 are managers 10,000 managers go to college for an MBA 20 m go to college 100,000 do MBA *you need to write formula using terms from the problem 49 THIS SLIDE IS INTENTIONALLY BLANK 50 BAYESIAN ANALYSIS - EXAMPLE PROBLEM: An observer has collected information about Eagles & Hawks for a long time. If a new bird is spotted with a certain wingspan (x), need to know whether eagle or hawk From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski 51 BAYESIAN ANALYSIS - EXAMPLE CLASSIFICATION PROBLEM: Compare p(eagle/x) and p(hawk/x) Whichever is higher unknown bird We don’t know p(eagle/x) or p(hawk/x) Aquila rapax 52 BAYESIAN ANALYSIS - EXAMPLE Bayes‘ theorem p( x | eagle) P(eagle) P(eagle | x) p ( x) p( x | hawk ) P(hawk ) P(hawk | x) p( x) Decision rule? 53 PROBABILITY DENSITY FUNCTION Shows from observations of birds, the probability of a bird having a particular wingspan P(x|eagle), P(x| hawk) 54 PROBABILITY OF EAGLE OR HAWK N = Number of birds = neagle + nhawk P(eagle) neagles N = 0.8 P(hawk) nhawk N = 0.2 55 USING BAYES THEOREM New bird’s size = 45 cm p(45|eagle) = 2.22 x 10-2 p(45|hawk) = 1.10 x 10-2 (from known probability density functions) p( x | eagle) * P(eagle) vs p( x | hawk) * P(hawk) 2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2 0.01776 > 0.0021 Decision rule predicts eagle 56 CLUSTERING 57 DATA MINING FUNCTIONS Clustering Goal is to identify natural groupings of data. applicationications in market segmentation, discovering affinity groups, and defect analysis based on euclidean distance similar to centroid trial & error approach need to guess # of clusters 58 CLUSTER ANALYSIS OF CUSTOMER LOCATIONS IN A CITY CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS Income: Medium Children: 2 Income: High Children: 1 Car: Luxury Car: Sedan and Car: Truck Income: Medium Children:3 Cluster 1 Cluster 4 Cluster 3 Income: Low Children: 0 Car: Compact Cluster 2 60 A SIMPLE EXAMPLE ACCT# MONTHLY SPENDING 109905 $200 112343 $30 124205 $1,510 125687 $11,078 130984 $341 131987 $543 140001 $178 140387 $2,000 150933 $1,750 167912 $230 61 63 DISCUSSION QUESTIONS What kind of data is stored in a data warehouse? What are the characteristics of a data warehouse? How is data organized in a warehouse? What tool(s) can we use to analyze data in a warehouse? What type of multidimensional organization of data is also called a “sparse matrix”? What is a concept hierarchy? What technique can you use to predict the probability that a cancer patient will respond to radiation therapy? What type of mining identifies natural characteristics of data? What technique is useful for predicting trends? What is minimum transaction support? How is Bayes theorem used for classification? 64