* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Data Management for Decision Support
Survey
Document related concepts
Transcript
Data Management for Decision Support Session-2 Prof. Bharat Bhasker What is a Data Warehouse? A Practitioners Viewpoint “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” -- Barry Devlin, IBM Consultant Introduction- Business Problem Definition Business Application domain of Decision Support • Customer Retention- modeling those who defected to identify patterns that led their defection. Apply it to present set and devise prevention. • Sales & Customer Service - By aggregating the proper information to frontline sales and service professionals. Based on customer profile, rule based recommendation of products, service plans. Cross-selling • Marketing- Accurate information for retention campaigns, lifetime value analysis, trending, targeted promotions • Risk assessment and fraud detection- Accessible customer base reduces the risk. A mail-order company can identify payment patterns from a mailing address. Insurance company can aggregate all drawn out policies, thus total exposure. A bank can identify risky companies Introduction- Business Problem Definition Class of Problems • Retrospective Analysis - Analysis of past and present events. Analysis of zone wise, product wise sales • Predictive Analysis- Predicting certain events or behavior based on historical data. Attrition rate of customers due to competition Introduction- Business Problem Classifications Application Techniques for Business Problems • Classification- based on predefined criteria, examples credit ratings of individuals • Clustering and Segmentation - Segment a database in subsets or clusters based on a set of attributes. In the process of understanding customer base, an organization may segment the known population. Thus, discover hitherto unknown attributes (school attended, # of vacations per year). Clustering may utilize statistical or AI techniques • Associations- These techniques identify affinity among the collection reflected in the examined records. These affinities are often referred to as rules. Foe example, 60% of all record that contain A & B also contain C & D. Product affinity in Market basket analysis • Sequencing- Identifies patterns over time,. For example a analysis of customer visits may reveal that a person who buys engine oil & filter in one visit, buys gas additive in next visit. ====> Introduction- Business Problem Classifications The organizations are faced with a wealth of data stored in archives,. It is the inability to discover, often previously unknown, information hidden in data prevents them from deriving the knowledge and value for the organization. The objective, therefore, should be to extract valid, previously unknown, and comprehensible knowledge/information from large database and use it for profit (Competitive advantage). To meet these objectives- – Capture and integrate both external and internal data into a comprehensive view that encompasses whole organization – Organize the present data and extracted information in ways that expedites the decision making – Mine the integrated data for information Introduction- Operational Data Stores For variety of On Line Transaction Processing (OLTP) systems (Financial, Order processing, Point of sale applications) the organizations maintain an operational data store. For example, bank withdrawals and deposits. – – – – – – – – Organized by application Daily business processing on a detailed transactional level Update intensive Current data Optimized for high performance Access is few records per transaction, often on primary key Large number of short duration transactions Large number of concurrent applications Introduction- Informational Data Stores Need to organized around subjects such as customers, vendors, products. Focuses on answering questions like “ What two products resulted in most frequent calls on hotline?”. Tend to have redundant and non-updateable. Characteristics exhibited include: – Data Model- to meet End-use analysis needs. No focus on ACID – Data Access- Ad hoc queries – Time base- Recent, Aggregated, derived and historical data – Data changes- Periodic, scheduled batch updates. – Unit of Work- Queries rather than concurrent updates – Record range accessed- Millions for a query vs. few for operational – Number of Concurrent users- Low for IDS vs. High for ODS. – Transaction Volumes- Low for IDS vs. High for ODS. – Type of users- Analytical & Managerial – Number of Indexes- Often many complex, compound vs. few, simple. Introduction- Managing Data for DS Data Warehouse Common characteristics: – Database designed to meet analytical tasks comprising of data from multiple applications – Small number of users with intense and long interactions – Read intensive usage – Periodic updates to the contents – Consists of current as well as historical data – Relatively fewer but large tables – Queries results is large results sets, involving full table scan and joins spanning several tables – Aggregation, vector operation and summarization are common – The data frequently resides in external heterogeneous sources Introduction- Terminology Current Detail Data- data acquired directly from operational databases, often representing entire enterprise Old Detail Data- Aged current detail data, historical data organized by subjects, it helps in trend analysis Data Marts- A large data store for informational needs where scope is limited to a department, SBUs etc., In a phased implementation data marts are a way to build a warehouse. Summarized Data- Aggregated data along the lines required for executive reporting,trend analysis and decision support. Drill Down- Ability of knowledge worker to perform business analysis in a top down fashion,starting from the highly summarized data to the current and old detail data. Metadata- It is data about the data, description of contents, location, structure, end-user views, identification of authoritative data, history of updates, security authorizations Introduction- Architecture Management External Meta data Current Information Delivery System Report, Query & EIs Data Mining Tools Extract, Cleanup & Load Realized or Virtual MDDB Currentl Repository OLAP Tools What is a Data Warehouse? An Alternative Viewpoint “A DW is a – – – – subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.” -- W.H. Inmon, Building the Data Warehouse, 1992 A Data Warehouse is... • Stored collection of diverse data – A solution to data integration problem – Single repository of information • Subject-oriented – Organized by subject, not by application – Used for analysis, data mining, etc. • Optimized differently from transactionoriented db • User interface aimed at executive … Cont’d • Large volume of data (Gb, Tb) • Non-volatile – Historical – Time attributes are important • Updates infrequent • May be append-only • Examples – All transactions ever at WalMart – Complete client histories at insurance firm – Stockbroker financial information and portfolios Summary Business Information Guide Data Warehouse Catalog Business Information Interface Data Warehouse Data Warehouse Population Enterprise Modeling Operational Systems Warehouse is a Specialized DB • • • • • • • Standard DB Mostly updates Many small transactions Mb - Gb of data Current snapshot Index/hash on p.k. Raw data Thousands of users (e.g., clerical users) • • • • • • • Warehouse Mostly reads Queries are long and complex Gb - Tb of data History Lots of scans Summarized, reconciled data Hundreds of users (e.g., decisionmakers, analysts) Warehousing and Industry • Warehousing is big business – $2 billion in 1995 – $3.5 billion in early 1997 – $8 billion in 1998 • WalMart has largest warehouse – 900-CPU, 2,700 disk, 23 TB Teradata system – ~7 TB in warehouse – 40-50GB per day