D A T A B A S E

Data Warehouse
  Operations data: OLTP database (3NF tables) and flat files
  Daily data transfer
  Data warehouse: star configuration
  Predefined reports
  Interactive data analysis

Data Warehouse Goals
  Existing databases are optimized for Online Transaction Processing (OLTP).
  Online Analytical Processing (OLAP) requires fast retrievals and only bulk writes.
  Different goals require different storage, so build a separate data warehouse to use for queries.
  Extraction, Transformation, Transportation (ETT)
  Data analysis
    Ad hoc queries
    Statistical analysis
    Data mining (specialized automated tools)

Extraction, Transformation, and Transportation (ETT)
  Transaction data comes from diverse systems.
  Customers: convert Client to Customer
  Apply standard product numbers
  Convert currencies
  Fix region codes
  Data warehouse: all data must be consistent.

OLTP v. OLAP
  Category         OLTP                              OLAP
  Data storage     3NF tables                        Multidimensional cubes
  Indexes          Few                               Many
  Joins            Many                              Minimal
  Duplicated data  Normalized, limited duplication   Denormalized DBMS
  Updates          Constant, small data              Overnight, bulk
  Queries          Specific                          Ad hoc

Multidimensional Cube
  Pet Store item sales
  Amount = Quantity * Sale Price
  Customer Location

Sales Date: Time Hierarchy
  Levels: Year, Quarter, Month, Week, Day
  Roll-up: to get higher-level totals
  Drill-down: to get lower-level details

Star Design
  Fact table: Sales (Quantity, Amount = SalePrice * Quantity)
  Dimension tables: Products, Sales Date, Customer Location

Snowflake Design
  Merchandise (ItemID, Description, QuantityOnHand, ListPrice, Category)
  Sale (SaleID, SaleDate, EmployeeID, CustomerID, SalesTax)
  OLAPItems (SaleID, ItemID, Quantity, SalePrice, Amount)
  City (CityID, ZipCode, City, State)
  Customer (CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID)
  Dimension tables can join to other dimension tables.
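
A minimal sketch of the star design and the time hierarchy described above, using Python's built-in sqlite3 module. The table names (DimProduct, DimDate, FactSales), their columns, and the sample rows are hypothetical; only the shape comes from the slides: a fact table holding Quantity and Amount = SalePrice * Quantity, joined to dimension tables. The same fact rows are rolled up to quarter level, drilled down to month level, and totaled by product category.

```python
import sqlite3

# Hypothetical star design: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductID INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE DimDate (DateID INTEGER PRIMARY KEY,
                      Year INTEGER, Quarter INTEGER, Month INTEGER);
CREATE TABLE FactSales (
    ProductID INTEGER REFERENCES DimProduct(ProductID),
    DateID    INTEGER REFERENCES DimDate(DateID),
    Quantity  INTEGER,
    SalePrice REAL,
    Amount    REAL   -- SalePrice * Quantity, computed per row at load time
);
INSERT INTO DimProduct VALUES (1, 'Bird'), (2, 'Cat');
INSERT INTO DimDate VALUES (1, 2001, 1, 1), (2, 2001, 1, 2), (3, 2001, 2, 4);
INSERT INTO FactSales VALUES
    (1, 1, 3,  5.0, 15.0),
    (1, 2, 2,  4.0,  8.0),
    (2, 2, 1, 10.0, 10.0),
    (2, 3, 2, 10.0, 20.0);
""")

# Roll-up: higher-level totals by quarter.
by_quarter = conn.execute("""
    SELECT d.Year, d.Quarter, SUM(f.Amount)
    FROM FactSales AS f JOIN DimDate AS d ON d.DateID = f.DateID
    GROUP BY d.Year, d.Quarter
    ORDER BY d.Year, d.Quarter
""").fetchall()

# Drill-down: lower-level detail by month within each quarter.
by_month = conn.execute("""
    SELECT d.Year, d.Quarter, d.Month, SUM(f.Amount)
    FROM FactSales AS f JOIN DimDate AS d ON d.DateID = f.DateID
    GROUP BY d.Year, d.Quarter, d.Month
    ORDER BY d.Year, d.Quarter, d.Month
""").fetchall()

# A different dimension: totals per product category.
by_category = conn.execute("""
    SELECT p.Category, SUM(f.Amount)
    FROM FactSales AS f JOIN DimProduct AS p ON p.ProductID = f.ProductID
    GROUP BY p.Category
    ORDER BY p.Category
""").fetchall()

print(by_quarter)
print(by_month)
print(by_category)
```

Each query touches the fact table plus exactly one dimension table, which is the point of the star layout: few joins for analytical queries. Storing Amount per row means every total is a sum of row-level products.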

OLAP Computation Issues
         Quantity   Price   Quantity*Price
         3          5.00    15.00
         2          4.00     8.00
  Sum    5          9.00    45.00 or 23.00
  Compute Quantity*Price in the base query, then add to get $23.00.
  If you use a calculated measure in the cube, it will add first and multiply second to get $45.00, which is wrong.

OLAP Data Browsing

Microsoft Excel Pivot Table

Excel Pivot Table Reports
  Column fields: Quarter, Month
  LastName   EmployeeID  Data                Quarter 1  Quarter 2  Quarter 3  Quarter 4  Grand Total
  Carpenter  8           Sum of Animal        1,668.91     606.97     426.39       7.20     2,709.47
                         Sum of Merchandise     324.90      78.30      99.00     128.70       630.90
  Eaton      6           Sum of Animal          522.37                341.85     562.50     1,426.72
                         Sum of Merchandise      30.60                 54.90     107.10       192.60
  Farris     7           Sum of Animal        5,043.36   1,059.70                796.47     6,899.53
                         Sum of Merchandise     826.92     188.10                306.00     1,321.02
  Gibson     2           Sum of Animal        4,983.51   1,549.83              2,556.10     9,089.44
                         Sum of Merchandise     668.25     238.50                450.90     1,357.65
  Hopkins    4           Sum of Animal        3,747.96   1,194.88     372.65     128.41     5,443.90
                         Sum of Merchandise     476.91     252.90     121.50       7.20       858.51
  James      5           Sum of Animal        3,282.77   2,373.08     437.88     150.11     6,243.84
                         Sum of Merchandise     505.89     693.45      99.00      99.00     1,397.34
  O'Connor   9           Sum of Animal        2,643.69     180.91     510.12                3,334.72
                         Sum of Merchandise     263.70      83.70      55.80                  403.20
  Reasoner   3           Sum of Animal        4,577.43     625.74     589.68   2,500.24     8,293.09
                         Sum of Merchandise     762.30      89.10     116.80     396.90     1,365.10
  Reeves     1           Sum of Animal        1,120.93                                      1,120.93
                         Sum of Merchandise     263.88                                        263.88
  Shields    10          Sum of Animal        1,008.76                162.15                1,170.91
                         Sum of Merchandise      62.10                 22.50                   84.60
  Total                  Sum of Animal       28,599.69   7,591.11   2,840.72   6,701.03    45,732.55
                         Sum of Merchandise   4,185.45   1,624.05     569.50   1,495.80     7,874.80
  Data can be placed in rows or columns.
  By grouping months, you can instantly get quarterly or monthly totals.

OLAP in SQL 99
  GROUP BY two columns gives you totals for each month within each category.
  You do not get superaggregate totals for the category, the month, or the overall total.
  Category  Month  Amount
  Bird      1      $135.00
  Bird      2       $45.00
  Bird      3      $202.50
  Bird      6       $67.50
  Bird      7       $90.00
  Bird      9       $67.50
  Cat       1      $396.00
  Cat       2      $113.85
  Cat       3      $443.70
  Cat       4        $2.25

    SELECT Category, Month(SaleDate) AS Month,
           Sum(Quantity*SalePrice) AS Amount
    FROM Sale INNER JOIN
         (Merchandise INNER JOIN SaleItem
          ON Merchandise.ItemID = SaleItem.ItemID)
         ON Sale.SaleID = SaleItem.SaleID
    GROUP BY Category, Month(SaleDate);

SQL ROLLUP
    SELECT Category, Month…, Sum …
    FROM …
    GROUP BY ROLLUP (Category, Month...)

  Category  Month   Amount
  Bird      1        135.00
  Bird      2         45.00
  …
  Bird      (null)   607.50
  Cat       1        396.00
  Cat       2        113.85
  …
  Cat       (null)  1293.30
  …
  (null)    (null)  8451.79

Missing Values Cause Problems
  If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.
  Category  Month   Amount
  Bird      1        135.00
  Bird      2         45.00
  …
  Bird      (null)    32.00   Missing date
  Bird      (null)   607.50   Super-aggregate
  Cat       1        396.00
  Cat       2        113.85
  …
  Cat       (null)  1293.30
  …
  (null)    (null)  8451.79

GROUPING Function
    SELECT Category, Month…, Sum …,
           GROUPING (Category) AS Gc,
           GROUPING (Month) AS Gm
    FROM …
    GROUP BY ROLLUP (Category, Month...)

  Category  Month   Amount   Gc  Gm
  Bird      1        135.00  0   0
  Bird      2         45.00  0   0
  …
  Bird      (null)    32.00  0   0
  Bird      (null)   607.50  0   1
  Cat       1        396.00  0   0
  Cat       2        113.85  0   0
  …
  Cat       (null)  1293.30  0   1
  …
  (null)    (null)  8451.79  1   1

CUBE Option
    SELECT Category, Month…, Sum …,
           GROUPING (Category) AS Gc,
           GROUPING (Month) AS Gm
    FROM …
    GROUP BY CUBE (Category, Month...)
  Category  Month   Amount   Gc  Gm
  Bird      1        135.00  0   0
  Bird      2         45.00  0   0
  …
  Bird      (null)    32.00  0   0
  Bird      (null)   607.50  0   1
  Cat       1        396.00  0   0
  Cat       2        113.85  0   0
  …
  Cat       (null)  1293.30  0   1
  (null)    1       1358.80  1   0
  (null)    2       1508.94  1   0
  (null)    3       2362.68  1   0
  …
  (null)    (null)  8451.79  1   1

GROUPING SETS: Hiding Details
    SELECT Category, Month, Sum
    FROM …
    GROUP BY GROUPING SETS ( ROLLUP (Category), ROLLUP (Month), () )

  Category  Month   Amount
  Bird      (null)   607.50
  Cat       (null)  1293.30
  …
  (null)    1       1358.80
  (null)    2       1508.94
  (null)    3       2362.68
  …
  (null)    (null)  8451.79

SQL OLAP Analytical Functions
  VAR_POP, VAR_SAMP            variance
  STDDEV_POP, STDDEV_SAMP      standard deviation
  COVAR_POP, COVAR_SAMP        covariance
  CORR                         correlation
  REGR_R2                      regression r-square
  REGR_SLOPE, REGR_INTERCEPT   regression data (many)

SQL RANK Functions
    SELECT Employee, SalesValue,
           RANK() OVER (ORDER BY SalesValue DESC) AS rank,
           DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
    FROM Sales
    ORDER BY SalesValue DESC, Employee;

  Employee  SalesValue  rank  dense
  Jones     18,000      1     1
  Black     16,000      2     2
  Smith     16,000      2     2
  White     14,000      4     3
  DENSE_RANK does not skip numbers.

SQL OLAP Windows
    SELECT Category, SaleMonth, MonthAmount,
           AVG(MonthAmount) OVER (PARTITION BY Category
                                  ORDER BY SaleMonth ASC
                                  ROWS 2 PRECEDING) AS MA
    FROM qryOLAPSQL99
    ORDER BY SaleMonth ASC;

  Category  SaleMonth  MonthAmount  MA
  Bird      200101     1500.00
  Bird      200102     1700.00      1600.00
  Bird      200103     2000.00      1850.00
  Bird      200104     2500.00
  …
  Cat       200101     4000.00
  Cat       200102     5000.00      4500.00
  Cat       200103     6000.00      5500.00
  Cat       200104     7000.00
  …
  The MA values shown equal the average of the current and previous month.
  With ROWS 2 PRECEDING the frame covers up to three rows, so the third Bird row would be (1500 + 1700 + 2000) / 3 = 1733.33.

Ranges: OVER
    SELECT SaleDate, Value,
           SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
           SUM(Value) OVER (ORDER BY SaleDate
                            RANGE BETWEEN UNBOUNDED PRECEDING
                            AND CURRENT ROW) AS running_sum2,
           SUM(Value) OVER (ORDER BY SaleDate
                            RANGE BETWEEN CURRENT ROW
                            AND UNBOUNDED FOLLOWING) AS remaining_sum
    FROM …;

  Sum1 (running_sum) computes the total from the beginning through the current row.
  Sum2 (running_sum2) does the same thing, but states the frame explicitly.
  Sum3 (remaining_sum) computes the total from the current row through the end of the query.

LAG and LEAD Functions
  LAG or LEAD: (Column, # rows, default)
    SELECT SaleDate, Value,
           LAG (Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
           LEAD (Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
    FROM …
    ORDER BY SaleDate

  SaleDate    Value  prior_day  next_day
  1/1/2003    1000   0          1500
  1/2/2003    1500   1000       2000
  1/3/2003    2000   1500       2300
  …
  1/31/2003   3500   3200
  prior_day is 0 on the first row because of the default value 0.
  Not part of the standard yet? But they are in SQL Server and Oracle.

Data Mining
  Goal: To discover unknown relationships in the data that can be used to make better decisions.
  Databases: transactions and operations
  Reports
  Queries: specific ad hoc questions
  OLAP: aggregate, compare, drill down
  Data mining: unknown relationships

Exploratory Analysis
  Data mining usually works autonomously.
    Supervised/directed
    Unsupervised
  Often called a bottom-up approach that scans the data to find relationships.
  Some statistical routines are used, but they are not sufficient.
    Statistics relies on averages.
    Sometimes the important data lies in more detailed pairs.

Common Techniques
  Classification/Prediction/Regression
  Association Rules/Market Basket Analysis
  Clustering
    Data points
    Hierarchies
  Neural Networks
  Deviation Detection
  Sequential Analysis
    Time series events
    Websites
  Textual Analysis
  Spatial/Geographic Analysis

Classification Examples
  Which borrowers/loans are most likely to be successful?
  Which customers are most likely to want a new item?
  Which companies are likely to file bankruptcy?
  Which workers are likely to quit in the next six months?
  Which startup companies are likely to succeed?
  Which tax returns are fraudulent?

Classification Process
  Clearly identify the outcome/dependent variable.
  Identify potential variables that might affect the outcome.
    Supervised (modeler chooses)
    Unsupervised (system scans all/most)
  Use sample data to test and validate the model.
  System creates weights that link independent variables to outcome.
  Income  Married  Credit History  Job Stability  Success
  50000   Yes      Good            Good           Yes
  25000   Yes      Bad             Bad            No
  75000   No       Good            Good           No

Classification Techniques
  Regression
  Bayesian Networks
  Decision Trees (hierarchical)
  Neural Networks
  Genetic Algorithms
  Complications
    Some methods require categorical data.
    Data size is still a problem.

Association/Market Basket
  Examples
    What items are customers likely to buy together?
    What Web pages are closely related?
    Others?
  Classic (early) example: Analysis of convenience store data showed customers often buy diapers and beer together.
  Importance: Consider putting the two together to increase cross-selling.

Association Details (two items)
  Rule evaluation (A implies B)
  Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
  Confidence of the rule is measured by the share of transactions with A that also contain B: P(B | A)
  Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive: P(A ∩ B) / ( P(A) P(B) ) = P(B|A) / P(B)
  Example: Diapers implies Beer
    P(D) = .7   P(B) = .5
    Support: P(D ∩ B) = .6
    Confidence: P(B|D) = P(D ∩ B)/P(D) = .6/.7 = .857
    Lift: P(B|D) / P(B) = .857 / .5 = 1.714

Association Challenges
  If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.
  Item       Freq.    Item            Freq.
  1" nails   2%       Hardware        15%
  2" nails   1%       Dim. lumber     20%
  3" nails   1%       Plywood         15%
  4" nails   2%       Finish lumber   15%
                      Lumber          50%
  Some relationships are obvious: burger and fries.
  Some relationships are meaningless. A hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?
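
The support, confidence, and lift definitions above can be checked on a small list of baskets. This is a sketch only: the function names (rule_metrics, to_categories) and every basket below are made up for illustration and do not reproduce the figures on the slides. The second part illustrates the rare-item problem and the suggested fix of combining items into categories.

```python
def rule_metrics(baskets, a, b):
    """Return (support, confidence, lift) for the rule 'a implies b'."""
    n = len(baskets)
    if n == 0:
        raise ValueError("need at least one basket")
    p_a = sum(1 for basket in baskets if a in basket) / n
    p_b = sum(1 for basket in baskets if b in basket) / n
    p_ab = sum(1 for basket in baskets if a in basket and b in basket) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both items must appear in at least one basket")
    support = p_ab                # P(A and B)
    confidence = p_ab / p_a       # P(B | A)
    lift = confidence / p_b       # P(B | A) / P(B)
    return support, confidence, lift


def to_categories(baskets, category_of):
    """Replace each item by its category; unmapped items are kept as-is."""
    return [{category_of.get(item, item) for item in basket}
            for basket in baskets]


# Hypothetical convenience-store baskets.
store = [
    {"diapers", "beer"}, {"diapers", "beer"},
    {"diapers", "beer"}, {"diapers", "beer"},
    {"diapers", "milk"}, {"beer", "chips"},
    {"milk"}, {"bread"}, {"milk", "bread"}, {"chips"},
]
support, confidence, lift = rule_metrics(store, "diapers", "beer")
print(support, confidence, lift)

# Hypothetical hardware-store baskets: each nail size is rare on its own.
hardware = [
    {"2in nails", "plywood"}, {"3in nails", "plywood"},
    {"4in nails"}, {"plywood"}, {"finish lumber"},
]
rare_item = rule_metrics(hardware, "2in nails", "plywood")

category_of = {"2in nails": "nails", "3in nails": "nails", "4in nails": "nails"}
grouped = rule_metrics(to_categories(hardware, category_of), "nails", "plywood")
print(rare_item, grouped)
```

In the hardware example the single-item rule has confidence 1.0 on the strength of one basket, while the category-level rule rests on more transactions and shows a smaller lift.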

Cluster Analysis
  Examples
    Are there groups of customers? (If so, we can cross-sell.)
    Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
    Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)
  Problem: Many dimensions and large datasets
  Large intercluster distance
  Small intracluster distance

Geographic/Location
  Examples
    Customer location and sales comparisons
    Factory sites and cost
    Environmental effects
  Challenge: Map data, multiple overlays
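
For the cluster-analysis slide, the two distances it names can be made concrete with a short sketch. The points, cluster labels, and helper names below are hypothetical, and the choice of measures (mean pairwise distance inside a cluster, centroid-to-centroid distance between clusters) is one common convention assumed here, not something the slides specify.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def mean_intracluster_distance(points):
    """Average distance over all pairs of points within one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical customers described by two numeric attributes.
clusters = {
    "A": [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    "B": [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
}

intra = {name: mean_intracluster_distance(pts) for name, pts in clusters.items()}
inter = dist(centroid(clusters["A"]), centroid(clusters["B"]))

print(intra)
print(inter)
```

A grouping looks well separated when the intercluster figure is large relative to every intracluster figure, as it is for these two made-up groups.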