Download CS 345: Topics in Data Warehousing

CS 345: Topics in Data Warehousing Tuesday, November 23, 2004 Review of Thursday’s Class • Association Rule Mining – Market basket analysis – What is an association rule? – Frequent itemsets • Association rule mining algorithms – A-priori algorithm – Speeding up A-priori using hashing – One- and two-pass algorithms Overview of Today’s Class • Clustering – Types of clustering – Clustering large data sets • Administrative details – Final exam – Assignment #3 and #4 – Course Project • Miscellaneous topics – – – – Multiple fiscal calendars Heterogeneous products Sparse facts Audit dimensions Clustering • One of the most popular data mining tasks – Very general-purpose • Divide records into groups of “similar” records • What is a good clustering? – Records in same cluster are similar – Records in different clusters are dissimilar • There are many ways to do it – – – – Many notions of similarity have been proposed Many clustering algorithms have been proposed No consensus on which is “best” Often quality of results is subjective Clustering • Are there 2 clusters or 3 clusters? Types of Clustering Algorithms • Hierarchical – Agglomerative • Initially, each record is its own cluster • Repeatedly combine the 2 most similar clusters – Divisive • Initially, all records are in a single cluster • Repeatedly split 1 cluster into 2 smaller clusters • Partition-based – Partition records into k clusters – Optimize objective function measuring goodness of clustering – Heuristic optimization via “hill-climbing” • Random choice of initial clusters • Iteratively refine clusters by re-assigning records when doing so improves objective function • Eventually converge to local maximum Dendrograms • A hierarchical clustering can be represented by a dendrogram • Many different clusterings in one – Viewed at different levels of detail Clustering and Data Warehouses • Typically clustering is performed externally to database server – Similar to association rule mining – Data mining process: • • • • • Apply filters in data warehouse Export data records that pass filters Apply clustering algorithm to exported file Assign each record to a cluster Import cluster assignments back into warehouse • Scaling clustering algorithms to large data sets – Data summarization • • • • Grow clusters incrementally in single pass Summarize characteristics of each cluster using concise synopsis Decide to split / merge clusters based on synopses Example: BIRCH (Balanced Iterative Clustering and Reducing using Hierarchies) – Sampling 1 • Perform clustering on a random sample of the data – Sampling 2 • Represent cluster by a sample group of representative points • Example: CURE (Clustering Using REpresentatives) Final Exam • Final exam: – Wednesday, Dec. 8 – 7:00pm-10:00pm – Room TBD • Open book / open notes • Goals: – Test understanding of major concepts – Practice thinking critically about design alternatives • Practice final handed out next Tuesday – Format of real final will closely resemble practice final Assignment #3 • Assignment #3 was very time-consuming for many students – I misjudged the difficult of the assignment – Sorry about that! – Assignment #4 will be much easier • Grading policy for Assignment #3: – Reasonable effort = full credit – You won’t be penalized if your program wasn’t completely finished – Reward for finishing = don’t have to do Assignment #4 Assignment #4 • Two options: 1. Finish Assignment #3 2. Write decision tree learning program • • 2nd option is described in handout Can do both for extra credit – Not advised unless your course project is finished Project Presentations • Short presentations about course projects – Next Tuesday and Thursday in class – Sign up for your preferred day (handout) • What is expected? – 5-10 minute presentation • Give an overview of your project • Describe your results (preliminary results if necessary) – Implementation projects: • Overview of architecture • Give a demo (if possible) – Literature surveys: • Summarize the papers being surveyed – Informal presentation is fine • Not meant to be a big deal • No need to spend hours and hours preparing! – Project doesn’t need to be complete by presentation date • But you should have made good progress • At least enough progress to make a decent presentation! Project Write-Up • Project write-up – Main project deliverable – Literature survey: written report – Implementation project: whatever is appropriate • • • • Document describing project design Sample outputs Executable code or link to demo-able system, if feasible Contact the instructor if you’re not sure what to submit • Due date: December 8 – Extensions available if needed • Contact the instructor if you need one – Recommended time frame: • • • • Finish the project by your project presentation date Submit your write-up when you do your presentation Enjoy a low-stress finals week! This may or may not be possible depending on your schedule… Miscellaneous Topics • Covered in Ch. 7-15 of Data Warehouse Toolkit – These chapters are not required reading – Density of new material is low – Lots of application examples • • • • • Multiple calendars Heterogeneous products Unstructured text fields Sparse facts Audit dimension (data lineage) Multiple Fiscal Calendars • Businesses operate on fiscal calendars – Used for accounting purposes – Fiscal year may not correspond to calendar year • Stanford’s fiscal year: Sept. 1 – Aug. 31 • US Government fiscal year: Oct. 1 – Sept. 30 – Fiscal month may not correspond to calendar month • 5-4-4 calendar • ISO week • Date dimension includes fiscal calendar attributes – Fiscal year, fiscal month, fiscal week • Some organizations may use multiple fiscal calendars – Subsidiaries or individual lines of business may report differently – E.g. for historical reasons, following a merger – How to handle this in the date dimension? Three Options for Multiple Calendars • Three options – Include all attributes in Date dimension – Use outrigger tables – Use extended dimension tables • 1. Attributes for all fiscal calendars in Date dimension – E.g. FiscalYear, FederalFiscalYear, OldFiscalYear – Advantage: Simple data model (no new tables) – Disadvantage: Wide/cluttered data dimension • 2. Outrigger table for each non-standard fiscal calendar – 1-to-1 relationship between outrigger row and date dim. row – Advantage: Date dimension stays thin and simple – Disadvantages: • Extra tables • Queries using non-standard attributes require extra join Extended Dimension Tables • 3. Create several versions of the Date dimension table – The basic Date dimension contains all the “core” date attributes – Alternative versions of the Date dimension contain: • Core date attributes • Additional attributes from a particular fiscal calendar • All versions share the same surrogate keys – E.g. surrogate key of Nov. 23, '04 = 1287 • In Date dimension... • And also in all alternative Date dimensions • Using alternative dimensions – Most queries join fact to basic Date dimension – Some queries join fact table to extended Date dimension instead – Advantages: • Basic Date dimension stays thin and simple • No extra joins required – Disadvantages: • Additional management overhead to synchronize keys • Extra tables / extra copies of data Heterogeneous Products • Some businesses have a variety of dissimilar products – Example: Financial services – American Express products include: • • • • Credit cards Checking and Savings Brokerage accounts Insurance •Annuities •Financial Planning •Traveler’s checks •…and many more • Relevant attributes vary from one product to another – Mortgages have very different attributes from credit cards – Facts associated with each product may differ too • Two desired types of analysis – Global • Unified view of all customer interactions across all lines of business – Specific • Detailed analysis of individual lines of business Handling Heterogeneous Products • Design “core” and “custom” schemas • Core schema includes common data elements – Useful for global analyses that cut across product lines – Core facts include measures relevant to most or all product lines – Core dimensions have attributes relevant to most/all product lines • Custom schemas are specialized for specific product lines – One custom schema for mortgages, another for credit cards, etc. – Useful for detailed analyses of specific product lines – Custom facts include: • Core measurements – E.g. Balance • Product-specific measurements – E.g. Finance charge, interest rate, fees – Custom dimensions include: • Core attributes • Product-specific attributes Same Three Options Apply • All possible attributes included in “monster” dimension and fact tables – Infeasible if product lines are too diverse – Data becomes overly sparse, wastes space – Most entries in most tables will be NULL or N/A • Outrigger tables with product-specific dimension attributes – Separate outrigger table for each product line – Could have “fact outrigger tables” as well • Core measurements are stored in main fact table • “Outrigger fact tables” have additional measurements for specific product lines • Detailed analysis requires joining outrigger fact to main fact • Saves space compared to storing separate core + extended fact tables • Core and extended dimension and fact tables Option 1: Include All Attributes Extended measurements Extended attributes Dimension Table Fact Table Dimension Table Core measurements Dimension foreign keys Core attributes Option 2: Use Outriggers Extended attributes Extended measurements Dim. Table Fact Table Dim. Table Core attributes Core measurements Dimension foreign keys Option 3: Core + Extended Extended attributes Extended measurements Core attributes Fact Dim. Dim. Table Fact Table Dim. Fact Dim. Dim. Dim. Table Fact Dim. Dim. Core measurements Dimension foreign keys Core attributes Core measurements Dimension foreign keys Unstructured Text Fields • Often data is collected as free-form text – Or “semi-structured” text – Examples: • Comments fields • Text of product reviews • Résumés • Querying free-form text requires information retrieval techniques – Keyword search • Special keyword indexing techniques (inverted indexes) – Stemming • Derived = derive = deriving – “Sounds like” operator (Soundex) • Smith = Smythe • Ascroft = Ashcraft • Sometimes query answering paradigm is different – Relational query: Exact match – IR query: Ranked results, fuzzy match – Combining IR & relational query styles = research area Sparse Facts • Some applications have: – – – – Many possible measurements Only a few of them are made per fact record Which ones varies from record to record Examples: • Data from scientific experiments • Medical diagnostics for hospital patients • Standard approach – Include all measurement columns in fact – Most measured values are 0 or NULL – Can be wasteful of space Sparse Facts • Alternative approach – One fact row with n measurements → n fact rows with one measurement each • Grain of fact table is changed – Include just a single measurement column • Or one column per data type: int_value, money_value, decimal_value – Special “fact type” dimension • Indicates which measurement is being recorded • All queries must filter to a single fact type or aggregates are meaningless! – Advantage: • Saves space • Queries are faster – Disadvantages: • Users must take care not to SUM incompatible fact types – Temperature + Blood Pressure = Meaningless • Difficult to compare multiple measures associated with same event Audit Dimension • Data warehouses are never perfect – Mistakes can be made during load – Incorrect data can be discovered later and corrected • Modeling change history of warehouse data can be useful – Where did this row come from? – Has it been updated since it was first created? • Known as the data lineage or data provenance problem – Accurate tracking can be difficult when data from multiple sources is combined and transformed – Active area of research • One solution: add audit dimension to fact – Audit dimension includes data about when and how the fact record was generated – Goal: Understand and record process by which data gets into the warehouse – Answer questions like: • Why did the answer to this query change between last week and this week? • What fact rows were produced by last Thursday’s load? The data was bad! – Sample attributes: SourceSystem, ExtractDate, AllocationPolicy, ChangeDate, ChangeReason, IsOutlierValue

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CS 345: Topics in Data Warehousing