Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI 1 EXTENSIONS TO DSS BI systems (aka EIS) Geographical Information Systems (GIS) Collaborative Systems (formerly GDSS) Expert Systems OLAP/Data mining/warehousing Big Data DATA WAREHOUSES 3 DATA WAREHOUSE A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining) DATA WAREHOUSING USERS APPLICATIONS WAREHOUSE User1 Prodn. Mktg. Fin. Acctg. S E R V E R User2 User3 User4 THE DATA LIFE CYCLE Dashboard/BI Internal Data Data Visualization Data Marts OLAP External Data Internal Data Data Warehouse Decisions Data Marts Data Mining Knowledge The activities taking place with respect to data for warehouse/OLAP/mining INTEGRATION OF DATA Integration Operational databases application A – m,f application B – 1,0 application C – x,y application D – male, female Data Warehouse M/F 7 DESIGN OF WAREHOUSE Design of warehouses is similar to databases: Schema not radically different W/H designed primarily for queries Database will be converted to cube at run time. 8 ORGANIZING THE W/H STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension. SNOWFLAKE SCHEMA: A variant of the star schema, Where some dimension tables are normalized, thereby splitting the data into additional tables. CONSTELLATION SCHEMA: A collection of stars. STAR SCHEMA OF A DATA WAREHOUSE BRANCH ACCOUNTS TRANSACTIONS What are the dimensions here? SNOWFLAKE SCHEMA OF A DATA WAREHOUSE BRANCH ACCOUNTS TRANSACTIONS CREDIT DEBIT CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING BRANCH DEPARTMENTS ACCOUNTS EMPLOYEES TRANSACTIONS CREDIT DEBIT EXAMPLE OF WAREHOUSE & MARTS Weekly sales by region 2010-2015 Weekly sales by state 2010-2015 Sales by Product Line 2010-2015 Data Mart Data Mart Warehouse Sales Detail Sales Detail 2010-2015 Weekly sales by product 2012-2015 OLAP & DATA WAREHOUSES 14 ONLINE ANALYTICAL PROCESSING OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December? Aggregation (“data summarization”) Dimension Slice & Dice Cube organization (Multidimensional database) AGGREGATION & CUBE Region Units North E. 40 South E. 20 South W. 30 Midwest 20 North E. 65 North E. 80 NE SE SW 65 20 30 40 50 80 sales in the Northern region? 16 DIMENSIONS & CONCEPT HIERARCHIES A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable. Dimensions can have hierarchies (or various levels of aggregations) A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts CONCEPT HIERARCHY Mfr. dimension Iphone 6S Plus GAL S7 10 mil GAL S6 38 mil Iphone 6S 48.7 mil G5 G4 G3 15,000 11 mil 10 mil MULTI-DIMENSIONAL ORGANIZATION Sales, costs etc. nw Regions sw Products (tables, desks, lamps..) Cube organization supports slice & dice MULTI-DIMENSIONAL ORGANIZATION.. LG MW SW SE NW NE TOTAL LG MW SW SE NW NE TOTAL G5 3,780 4,893 7,494 6,520 2,450 25,137 LG MW SW SE NW NE TOTAL G5 3,780 4,893 7,494 6,520 2,450 25,137 G4 2,342 1,200 1,400 1,678 950 7,570 G5 3,780 4,893 7,494 6,520 2,450 25,137 G4 2,342 1,200 1,400 1,678 950 7,570 G3 3,780 7,893 7,839 31,652 G4 2,342 5,647 1,200 6,493 1,400 1,678 950 7,570 G3 3,780 7,893 5,647 6,493 7,839 31,652 TOTAL G3 9,902 13,986 7,893 14,541 5,647 14,691 11,239 64,359 3,780 6,493 7,839 31,652 TOTAL 9,902 13,986 14,541 14,691 11,239 64,359 TOTAL 9,902 13,986 14,541 14,691 11,239 64,359 March sales February sales January sales shows multi-dimensional/cube organization 20 CUBE ORGANIZATION Data from warehouse imported into memory A sophisticated 3D representation is created Referred to as “sparse matrix” Sides of cube are dimensions Allows “slice & dice” Answers to high level queries/reports DATA MINING 22 DATA MINING Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data. e.g. sequence/association, classification, and clustering Buyer non-buyer Success/failure of an operation Customers who spend more Preferences of target groups 23 SOME DATA MINING APPLICATIONS Predicting the probability of default for consumer loans Predicting audience response to banner ads Predicting the probability that a cancer patient will respond to radiation therapy. Predicting the probability that an offshore well will produce oil A cellular company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles 24 DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY Evaluation and Presentation Knowledge Data Mining Selection and Transformation Cleaning and Integration Database Data Warehouse Flat files Patterns DATA MINING PROCESS Cleaning & integration – data is brought in from multiple sources Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin. Data mining - process of extracting data patterns, using statistical or AI techniques. Pattern evaluation - identifying patterns useful and relevant to the organizational context. Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user. 26 OLAP, MINING & WAREHOUSING... Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers. OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb? Data mining refers to identification of patterns from data. DATA MINING TECHNIQUES 28 TYPES OF ANALYSES Sequence -- Activities occurring one after another e.g. loan after buying car, warranty. Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat) Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent shoppers) Clustering -- Identifying natural characteristics of data (what major areas are cust. coming from?) SEQUENTIAL ANALYSIS 30 SEQUENTIAL ANALYSIS Predict purchasing 40% of customers buy a gray skirt six months after buying a red jacket Identified by time-series analysis, Neural networks or genetic algorithms 31 IDENTIFYING SEQUENTIAL PATTERNS.. applications in forecasting exchange rates, meat consumption , bankruptcies etc. 32 ASSOCIATIVE ANALYSIS 33 ASSOCIATION Identifies items purchased together Customers who bought items A, B, C also buy Y 85 percent of customers who buy a certain wine brand also buy a certain type of pasta Many men buy a six-pack when they purchase diapers Enhances product placement in stores 34 ANALYSIS TECHNIQUES ASSOCIATION TID Items 100 ACD 200 BCE 300 ABCE 400 BE Database Algorithm A-priori list all items – 1 item set (C1) filter by min. transaction support (L1) identify 2-item sets (L1* L1) = (C2 ) filter and generate L2 repeat process... Min. transaction support is the number (sometimes given as %) of transactions in which the item must occur. 35 A-PRIORI ALGORITHM TID Items Scan Database Itemset Sup. {A} 2 100 ACD 200 BCE {B} 3 ABCE {C} 3 {D} 1 {E} 3 Itemset Itemset Sup. {A B} {A B} 1 300 400 BE {A C} Scan Database minimum Support: 2 C1 minimum Support: 2 Itemset Sup. {A} 2 {B} 3 {C} 3 {E} 3 Itemset Sup. {A C} 2 {B C} 2 {B E} 3 {C E} 2 {A C} 2 {A E} 1 {B C} {B C} 2 {B E} {B E} 3 {C E} 2 Itemset Sup. Itemset Sup. {B C E} 2 {B C E} 2 {A E} {C E} Itemset {B C E} C3 C2 Scan Database L1 L2 C2 L3 C3 36 DISCUSSION Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two. Client Stocks purchased Client1 AAPL, ORCL, WFC, BA Client2 BA, WTU Client3 GOOG, IBM, C Client4 WTU, WFC, C, IBM 37 CLASSIFICATION ANALYSIS 38 CLASSIFICATION A technique for grouping data into pre-defined classes using certain attributes of the data. E.g. defaulter or not, cruise customer, 4G subscriber or not etc. Define classes Identify classification criteria Select suitable technique application 39 CLASSIFICATION TECHNIQUES Discriminant analysis (categorical membership) Bayesian classification Logistic regression (probabilistic membership) Neural networks 40 DISCRIMINANT ANALYSIS* A method of classification that uses a Discriminant Function to decide classes DF based on attributes and weights. Calculated as a weighted average of attributes and weights (of training data) (Zave score) E.g. (GMAT + 200 * UGPA) > 1200 Given a new set of data, calculate Z score Compare with Zav *a simplified version DF – Discriminant Function 41 CLASSIFICATION WITH BAYESIAN ANALYSIS 42 REFRESHER ON PROBABILITY Probability is the chance that an event/outcome will take place. E.g. A baseball team had 44 wins 36 losses in one season Probability of winning? 43 PRIOR (CONDITIONAL) PROBABILITIES Prior probabilities are knowledge of other events which may help improve predictions E.g. p(IPO success) = 0.33 Suppose we know that a big company behind IPO p(IPO success/big company) = 0.99 Which is higher? P(successful cellphone call) or P(successful call/subscriber in service area) 44 CONDITIONAL PROBABILITIES.. College # of students CBT 1,800 COAS 2,300 COFAC 1,200 COE 6,700 Total 12,000 If we see a student in the union and he/she is a WIU student What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE? 45 BAYESIAN ANALYSIS Bayes theorem can be exploited for classification Bayes theorem P(A/B) = [P(B/A) x P(A)]/P(B) Classify into class I or II based on conditional probability P(class I/x) > P(class II/x)? * E.g. wiu student in union, which college are they? A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events *x is some condition e.g. surgery or being a shopper in a retail chain 46 INTRODUCTION TO BAYESIAN ANALYSIS We are interested in p(person becoming a manager/mba)* How can we use Baye’s theorem? ASSUME 300 m population; 100 m employees 500,000 are managers 10,000 managers go to college for an MBA 20 m go to college 100,000 do MBA *you need to write formula using terms from the problem 47 THIS SLIDE IS INTENTIONALLY BLANK 48 BAYESIAN ANALYSIS - EXAMPLE PROBLEM: An observer has collected information about Eagles & Hawks for a long time. If a new bird is spotted with a certain wingspan (x), need to know whether eagle or hawk From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski 49 BAYESIAN ANALYSIS - EXAMPLE CLASSIFICATION PROBLEM: Compare p(eagle/x) and p(hawk/x) Whichever is higher unknown bird We don’t know p(eagle/x) or p(hawk/x) Aquila rapax 50 BAYESIAN ANALYSIS - EXAMPLE Bayes‘ theorem p( x | eagle) P(eagle) P(eagle | x) p ( x) p( x | hawk ) P(hawk ) P(hawk | x) p( x) Decision rule? 51 PROBABILITY DENSITY FUNCTION Shows from observations of birds, the probability of a bird having a particular wingspan P(x|eagle), P(x| hawk) 52 PROBABILITY OF EAGLE OR HAWK N = Number of birds = neagle + nhawk P(eagle) neagles N = 0.8 P(hawk) nhawk N = 0.2 53 USING BAYES THEOREM New bird’s size = 45 cm p(45|eagle) = 2.22 x 10-2 p(45|hawk) = 1.10 x 10-2 (from known probability density functions) p( x | eagle) * P(eagle) vs p( x | hawk) * P(hawk) 2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2 0.01776 > 0.0021 Decision rule predicts eagle 54 CLUSTERING 55 DATA MINING FUNCTIONS Clustering Goal is to identify natural groupings of data. applicationications in market segmentation, discovering affinity groups, and defect analysis based on euclidean distance similar to centroid trial & error approach need to guess # of clusters 56 CLUSTER ANALYSIS OF CUSTOMERS CLUSTERING OF CAR BUYERS Income: Medium Children: 2 Income: High Children: 1 Car: Luxury Car: Sedan and Car: Truck Income: Medium Children:3 Cluster 1 Cluster 4 Cluster 3 Income: Low Children: 0 Car: Compact Cluster 2 58 A SIMPLE EXAMPLE ACCT# MONTHLY SPENDING 109905 $200 112343 $30 124205 $1,510 125687 $11,078 130984 $341 131987 $543 140001 $178 140387 $2,000 150933 $1,750 167912 $230 59 BIG DATA Big Data: name given to large volumes of data (in petabytes and exabytes) Characteristics Volume (PB, EB) Velocity (produced in real time) Variety (emails, social media posts, sensors etc) Technologies Apache Hadoop (open source) Hadoop cluster servers Hadoop Distributed File System (HDFS) for data storage MapReduce for distributed processing of queries 61 DISCUSSION QUESTIONS What kind of data is stored in a data warehouse? What are the characteristics of a data warehouse? How is data organized in a warehouse? What tool(s) can we use to analyze data in a warehouse? What type of multidimensional organization of data is also called a “sparse matrix”? What is a concept hierarchy? What technique can you use to predict the probability that a cancer patient will respond to radiation therapy? What type of mining identifies natural characteristics of data? What technique is useful for predicting trends? What is minimum transaction support? How is Bayes theorem used for classification? 62