All Powder Board and Ski
Oracle 9i Workbook
Chapter 8: Data Warehouses and Data Mining
Jerry Post
Copyright © 2003
1
Oracle Relational Approach
Meta-Data
Relational Tables: Customer, Sale, SaleItem
Materialized Views
Star Design: a fact table (Sale + measure) surrounded by dimensions, including Customer and Item
2
Desired Sales Cube Dimensions
Sales Dimensions
State (ship)
Month
Category
Style
SkillLevel
Size
Color
Manufacturer
BindingStyle
WeightMax?
ItemMaterial?
WaistWidth?
3
Early Data: Spreadsheets
4
External Tables: Attach to CSV
create or replace directory csv_dir as
'D:\students\BuildAllPowder\AllPowderSampleDataCSV';
create table OldSale_Ext
( SaleID          INTEGER,
  SaleDate        DATE,
  ShipState       VARCHAR2(50),
  ShipZIP         VARCHAR2(50),
  PaymentMethod   VARCHAR2(50),
  SKU             VARCHAR2(50),
  QuantitySold    INTEGER,
  SalePrice       NUMBER(10,2),
  ModelID         VARCHAR2(250),
  ItemSize        NUMBER,
  ManufacturerID  INTEGER,
  Category        VARCHAR2(50),
  Color           VARCHAR2(50),
  ModelYear       INTEGER,
  Graphics        VARCHAR2(50),
  ItemMaterial    VARCHAR2(50),
  ListPrice       NUMBER(10,2),
  Style           VARCHAR2(50),
  SkillLevel      INTEGER,
  WeightMax       NUMBER,
  WaistWidth      NUMBER,
  BindingStyle    VARCHAR2(50)
)
Continued on next slide
Warning: currency columns cannot have $ symbols or commas
5
External File Definition
organization external (
type oracle_loader
default directory csv_dir
access parameters (
records delimited by newline
fields terminated by ','
optionally enclosed by '"' lrtrim
missing field values are null
(
SaleID,
SaleDate char date_format date mask "mm/dd/yyyy",
ShipState, ShipZIP, PaymentMethod, SKU, QuantitySold, SalePrice,
ModelID, ItemSize, ManufacturerID, Category, Color, ModelYear,
Graphics, ItemMaterial, ListPrice, Style, SkillLevel, WeightMax,
WaistWidth, BindingStyle
)
)
location ('Lab 08-01 Early Sales.csv')
)
reject limit unlimited;
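The warning about currency columns means the CSV may need cleaning before the external table can read it. A minimal sketch of one way to do that in Python follows. The sample text, the header row, and the function names are assumptions made for illustration; they are not part of the lab files.

```python
import csv
import io

def clean_currency(text):
    """Strip '$' and thousands separators so the value can load as a NUMBER."""
    return text.replace("$", "").replace(",", "").strip()

def clean_rows(reader, currency_columns):
    """Yield each row with the named currency columns cleaned; other cells are untouched."""
    for row in reader:
        for name in currency_columns:
            if row.get(name):          # skip empty or missing cells
                row[name] = clean_currency(row[name])
        yield row

# Made-up sample with a header row (an assumption; the real export may have none).
sample = 'SaleID,SalePrice,ListPrice\n1,"$1,250.00",$1300\n2,89.5,\n'
reader = csv.DictReader(io.StringIO(sample))
cleaned = list(clean_rows(reader, ["SalePrice", "ListPrice"]))
print(cleaned)
```

Empty cells are left empty, which matches the "missing field values are null" setting in the access parameters.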
6
Create Customer and Employee
CustomerID and EmployeeID are missing from the old data.
Instead of relying on blank cell values, create a new customer called
“Walk-in” and a new employee called “Employee”.
Write down the ID numbers generated for these anonymous entries.
If you use SQL, you can assign a value of zero to these entries.
INSERT INTO Customer (CustomerID, LastName)
VALUES (0, 'Walk-in');
INSERT INTO Employee (EmployeeID, LastName)
VALUES (0, 'Staff');
7
Extract Model Data
SELECT DISTINCT OldSale_ext.ModelID, OldSale_ext.ManufacturerID,
OldSale_ext.Category, OldSale_ext.Color, OldSale_ext.ModelYear,
OldSale_ext.Graphics, OldSale_ext.ItemMaterial, OldSale_ext.ListPrice,
OldSale_ext.Style, OldSale_ext.SkillLevel, OldSale_ext.WeightMax,
OldSale_ext.WaistWidth, OldSale_ext.BindingStyle
FROM OldSale_ext;
8
UNION Query for Models
SELECT DISTINCT ModelID, ManufacturerID, Category, …
FROM OldSale_ext
UNION
SELECT DISTINCT ModelID, ManufacturerID, Category, …
FROM OldRental_ext
9
Insert Model Data into ItemModel
INSERT INTO ItemModel (ModelID, ManufacturerID, Category, Color, ModelYear,
  Graphics, ItemMaterial, ListPrice, Style, SkillLevel, WeightMax, WaistWidth,
  BindingStyle)
SELECT DISTINCT qryOldModels.ModelID, qryOldModels.ManufacturerID,
  qryOldModels.Category, qryOldModels.Color, qryOldModels.ModelYear,
  qryOldModels.Graphics, qryOldModels.ItemMaterial, qryOldModels.ListPrice,
  qryOldModels.Style, qryOldModels.SkillLevel, qryOldModels.WeightMax,
  qryOldModels.WaistWidth, qryOldModels.BindingStyle
FROM qryOldModels;
10
Insert SKU Data into Inventory
CREATE VIEW qryOldInventory AS
SELECT DISTINCT ModelID, SKU, ItemSize
FROM OldSale_ext
UNION
SELECT DISTINCT ModelID, SKU, ItemSize
FROM OldRental_ext;
INSERT INTO Inventory (ModelID, SKU, ItemSize, QuantityOnHand)
SELECT DISTINCT qryOldInventory.ModelID, qryOldInventory.SKU,
qryOldInventory.ItemSize, 0 As QuantityOnHand
FROM qryOldInventory;
Note the constant 0 with the column alias QuantityOnHand, which
forces a zero value for QuantityOnHand in each row.
11
Copy Sales Data
INSERT INTO Sale (SaleID, SaleDate, ShipState, ShipZIP, PaymentMethod)
SELECT DISTINCT OldSale_ext.SaleID, OldSale_ext.SaleDate,
  OldSale_ext.ShipState, OldSale_ext.ShipZIP, OldSale_ext.PaymentMethod
FROM OldSale_ext;
Note that if you have added data to your Sale table, your existing
SaleID values might conflict with these.
You can solve the problem by adding a number to these values so they
are all larger than your highest ID.
INSERT INTO Sale (SaleID, SaleDate, ShipState, ShipZIP, PaymentMethod)
SELECT DISTINCT OldSale_ext.SaleID+5000, OldSale_ext.SaleDate,
  OldSale_ext.ShipState, OldSale_ext.ShipZIP, OldSale_ext.PaymentMethod
FROM OldSale_ext;
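The size of the number to add (5000 in the query above) depends on the IDs already in your table. A small sketch of how such an offset could be chosen is shown here. The function name, the rounding step, and the sample IDs are hypothetical and not from the workbook.

```python
def id_offset(existing_ids, incoming_ids, round_to=1000):
    """Return an offset that moves every incoming ID above every existing ID.

    The result is rounded up to a multiple of round_to so shifted IDs stay
    easy to recognize. Returns 0 when no shift is needed.
    """
    if not existing_ids or not incoming_ids:
        return 0
    gap = max(existing_ids) - min(incoming_ids) + 1
    if gap <= 0:
        return 0                             # incoming IDs are already larger
    return -(-gap // round_to) * round_to    # ceiling division, then scale

# Made-up example: the highest current SaleID is 4200, old data starts at 1.
offset = id_offset([1, 2, 4200], [1, 2, 3])
print(offset)
```

Whatever offset is used for the parent rows must also be applied to the matching child rows, or the foreign keys will no longer line up.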
12
Copy SaleItem Rows
INSERT INTO SaleItem (SaleID, SKU, QuantitySold, SalePrice)
SELECT DISTINCT OldSale_ext.SaleID+5000, OldSale_ext.SKU,
OldSale_ext.QuantitySold, OldSale_ext.SalePrice
FROM OldSale_ext;
If you transformed the SaleID in the prior step for the Sale data,
you must do the exact same calculation for SaleID in the SaleItem table.
13
Copy Rental Data
INSERT INTO Rental (RentID, RentDate, ExpectedReturn, PaymentMethod)
SELECT DISTINCT OldRental_ext.RentID+5000, OldRental_ext.RentDate,
OldRental_ext.ExpectedReturn, OldRental_ext.PaymentMethod
FROM OldRental_ext;
INSERT INTO RentItem (RentID, SKU, RentFee, ReturnDate)
SELECT DISTINCT OldRental_ext.RentID+5000, OldRental_ext.SKU,
OldRental_ext.RentFee, OldRental_ext.ReturnDate
FROM OldRental_ext;
14
Discoverer Administrator:
Load Business Area
Schema
Tables
and views
Select tables
15
Load Wizard Options: LOV
Most options are selected by default.
Select the LOV (list of values) option to have Discoverer build lookup lists.
16
Discoverer: Business Area
Tables are shown as folders and named so managers understand them.
Columns are shown as items.
Add a calculated item.
17
Create a Data Hierarchy
Select Category and Style from the SkiBoardStyle lookup table.
18
Discoverer Desktop: New Workbook
Select the dimensions and the fact item.
19
Initial Crosstab Layout
Page area
Column area
Row area
20
Discoverer Crosstab Browser
Totals
Select all items
Format options
21
Time Series Analysis:
Moving Average
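As a reminder of what the moving average computes, here is a minimal sketch of a trailing moving average in Python. The window length and the monthly totals are made-up values for illustration, and the spreadsheet or Discoverer tool on the slide may align or label its output differently.

```python
def moving_average(values, window):
    """Trailing moving average: mean of the current value and the window-1 before it.

    Positions without a full window yet are returned as None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        result.append(running / window if i >= window - 1 else None)
    return result

# Made-up monthly sales totals, smoothed over 3 months.
print(moving_average([10, 20, 30, 40, 50], 3))
```

A longer window smooths more but reacts more slowly to a change in trend.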
22
Time Series Analysis: Discoverer
23
Sales by State for Regression
Note that some states are missing from the list.
24
Regression Data Query
CREATE VIEW StateSales2004 AS
SELECT StateName, Income2001, Pop2002,
Sum(SalePrice*QuantitySold) AS Sales2004
FROM Sale INNER JOIN StateDemographics
ON Sale.ShipState = StateDemographics.StateCode
INNER JOIN SaleItem ON Sale.SaleID = SaleItem.SaleID
WHERE ShipState IS NOT NULL AND SaleDate BETWEEN
DATE '2004-01-01' AND DATE '2004-12-31'
GROUP BY StateName, Income2001, Pop2002
ORDER BY StateName;
25
Regression Setup
You should include the label row, but be sure to check the box
to show you included it.
26
Regression Results
Relatively high R-square
Population is a significant
predictor, Income is not
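To make R-square concrete, the sketch below fits a least-squares line with a single predictor and reports R-square. It is a simplification: the slide's regression uses two predictors (population and income) in a spreadsheet tool, and the numbers here are made up so the fit is exact. They are not the StateSales2004 data.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)   # share of variance in y explained by x
    return slope, intercept, r_squared

# Made-up values: "population" as x, "sales" as y, lying exactly on y = 2x + 1.
slope, intercept, r_squared = simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, r_squared)
```

An R-square near 1 means the predictor explains most of the variation in sales. Whether a predictor is statistically significant is a separate question, answered by the t-statistic or p-value in the regression output.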
27
Association Rules/Market Basket
Locate folders
Item to find          Possible location
Data mining samples   D:\Oracle\ora92\dm\demo\sample
ORACLE_HOME           D:\Oracle\ora92
JAVA_HOME             C:\OracleData\Ora92DS\jdk
28
Copy Files to Protect Original
compileSampleCode.bat
executeSampleCode.bat
Sample_AssociationRules.java
Sample_AssociationRules_Transactional.property
Sample_Global.property
29
Edit Sample_Global.property File
miningServer.url=jdbc:oracle:thin:@YourServerName:1521:DBName
miningServer.userName=odm
miningServer.password=password
inputDataSchemaName=powder
outputSchemaName=powder
timeout=120
If necessary, use Enterprise Manager to unlock and assign
new passwords to the accounts odm and odm_mtr.
30
Create New Table To Hold
Transaction Basket Data
CREATE TABLE MARKET_BASKET_TX_BINNED
( SEQUENCE_ID     INTEGER,
  ATTRIBUTE_NAME  VARCHAR2(35),
  VALUE           NUMBER
);
GRANT SELECT ON MARKET_BASKET_TX_BINNED TO odm;
commit;
If you use these names, you do not have to edit the Transactional.property file.
31
Copy SaleItem Data
INSERT INTO MARKET_BASKET_TX_BINNED
(SEQUENCE_ID, ATTRIBUTE_NAME, VALUE)
SELECT SaleID,
ItemModel.Category || '_' || ItemModel.Style AS AName,
1 As Value
FROM SaleItem Inner Join Inventory
ON SaleItem.SKU = Inventory.SKU
Inner Join ItemModel
ON Inventory.ModelID = ItemModel.ModelID
GROUP BY SaleID, ItemModel.Category || '_' || ItemModel.Style;
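The query turns each sale into one row per distinct Category_Style combination, with a value of 1 meaning "present in this basket". The same transformation can be sketched in Python as follows. The input tuples are made-up stand-ins for the joined SaleItem, Inventory, and ItemModel rows. A missing Style is treated as an empty string, which is how Oracle's || operator handles NULL and why names such as Boots_ can appear.

```python
def basket_rows(sale_items):
    """Build (sequence_id, attribute_name, value) rows, one per sale and category_style.

    sale_items: iterable of (sale_id, category, style) tuples, one per item sold.
    Duplicates within a sale collapse to a single row, like the GROUP BY.
    """
    seen = set()
    for sale_id, category, style in sale_items:
        seen.add((sale_id, f"{category}_{style or ''}"))
    return [(sale_id, name, 1) for sale_id, name in sorted(seen)]

# Made-up items: sale 1 has two half-pipe boards and boots; sale 2 has clothes.
items = [
    (1, "Board", "Half-Pipe"),
    (1, "Board", "Half-Pipe"),
    (1, "Boots", None),
    (2, "Clothes", None),
]
rows = basket_rows(items)
print(rows)
```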
32
Copy Sale Data
INSERT INTO MARKET_BASKET_TX_BINNED
(SEQUENCE_ID, ATTRIBUTE_NAME, VALUE)
SELECT SaleID, 'ID', SaleID
FROM Sale;
commit;
33
Remove Dashes from Attribute
UPDATE MARKET_BASKET_TX_BINNED
SET ATTRIBUTE_NAME =
substr(ATTRIBUTE_NAME,1,instr(ATTRIBUTE_NAME,'-')-1)
|| '_' || substr(ATTRIBUTE_NAME,instr(ATTRIBUTE_NAME,'-')+1)
WHERE instr(ATTRIBUTE_NAME,'-') > 0;
commit;
Run at least twice, until you get zero changes, because each pass
replaces only the first dash and a row might have more than one.
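The reason for repeating the UPDATE is that instr finds only the first dash, so each pass fixes one dash per row. The sketch below mirrors that logic in Python and counts the passes needed. The second attribute name is a made-up example with two dashes.

```python
def replace_first_dash(name):
    """Replace only the first '-' with '_', like the substr/instr expression."""
    pos = name.find("-")
    if pos < 0:
        return name
    return name[:pos] + "_" + name[pos + 1:]

def remove_dashes(name):
    """Apply the single-dash replacement until none remain; return (name, passes)."""
    passes = 0
    while "-" in name:
        name = replace_first_dash(name)
        passes += 1
    return name, passes

print(remove_dashes("Board_Half-Pipe"))        # one dash, one pass
print(remove_dashes("Ski_All-Mountain-Pro"))   # hypothetical name, two passes
```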
34
Limit Size of Attribute_Name
UPDATE MARKET_BASKET_TX_BINNED
SET ATTRIBUTE_NAME = substr(ATTRIBUTE_NAME,1,20);
commit;
This step is critical, although the need for it is probably due to a bug
in Oracle’s code. There is a slight chance it arises because of the
30-character name limitation in Oracle.
35
Compile and Run the Code
SET ORACLE_HOME=D:\Oracle\ora92
SET JAVA_HOME=C:\OracleData\ora92DS\jdk
compileSampleCode.bat Sample_AssociationRules.java
executeSampleCode.bat Sample_AssociationRules Sample_AssociationRules_Transactional.property
Do not put spaces around the = sign in the SET commands.
Type the execute command all on one line; do not hit <Enter> until the end.
To redirect the output to a file, at the end, add:
>myfile.txt
36
Sample Results
Getting top 5 rules for model: Sample_AR_Model_tx sorted by support.
Rule 124: If Boots_=1 then Clothes_=1 [support: 0.17285714, confidence: 0.44814816]
Rule 38: If Clothes_=1 then Boots_=1 [support: 0.17285714, confidence: 0.35276967]
Rule 101: If Board_Half_Pipe=1 then Clothes_=1 [support: 0.11357143, confidence: 0.4622093]
Rule 9: If Clothes_=1 then Board_Half_Pipe=1 [support: 0.11357143, confidence: 0.23177843]
Rule 100: If Ski_Freestyle=1 then Clothes_=1 [support: 0.09785714, confidence: 0.48070174]
Get rules by support: Sample_AR_Model_tx, with minimum support of 0.16.
Rule 124: If Boots_=1 then Clothes_=1 [support: 0.17285714, confidence: 0.44814816]
Rule 38: If Clothes_=1 then Boots_=1 [support: 0.17285714, confidence: 0.35276967]
Get rules by confidence: Sample_AR_Model_tx, with confidence of 0.56 or more.
Investigate and think about the results.
Do you have too many clothes targeted to half-pipe
boards and freestyle skiers, or not enough?
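When reading the rules, recall that support is the share of all baskets containing every item in the rule, and confidence is that support divided by the support of the "if" side. That is why two rules over the same pair of items share one support value but differ in confidence. A minimal sketch with made-up baskets (not the workbook's data) follows.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)

def confidence(baskets, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)

# Four made-up baskets using attribute names in the style of the output above.
baskets = [
    {"Boots_", "Clothes_"},
    {"Boots_"},
    {"Clothes_"},
    {"Clothes_", "Board_Half_Pipe"},
]
print(support(baskets, ["Boots_", "Clothes_"]))        # same for both directions
print(confidence(baskets, ["Boots_"], ["Clothes_"]))   # if Boots then Clothes
print(confidence(baskets, ["Clothes_"], ["Boots_"]))   # if Clothes then Boots
```

In this toy data, half of the boot buyers also buy clothes, but only a third of clothes buyers also buy boots, so the direction of the rule matters.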
37
GIS: Microsoft MapPoint
The Discoverer worksheet places the data into rows and columns.
A dynamic copy of this sheet is used to remove the top rows.
38
MapPoint Data Wizard
39
GIS Analysis of Sales
40