Database Technologies For Electronic Commerce
Rakesh Agrawal, Ramakrishnan Srikant, Yirong Xu
Searching with Numbers
Reflectivity
Empirical Results
[Figure: Precision plot, panel "Non-Reflective"]
Catalog Database
Dell Computer
700 MHz Celeron,
256 MB SDRAM, …
IBM Thinkpad
750 MHz Pentium 3,
196 MB DRAM, …
• If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
– Likelihood ≈ Non-reflectivity (of data)
• Let
– D: dataset, n_i: co-ordinates of point x_i,
– reflections(x_i): permutations of n_i
– η(n_i): # of points within distance r of n_i
– ρ(n_i): # of reflections within distance r of n_i

$$\text{Reflectivity} = 1 - \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)}$$

[Figure: Precision against Query Size; panels "Low Reflectivity" and "High Reflectivity"; datasets: Trans, Wine, Auto, DRAM, Credit, LCD, Glass, Proc, Housing]
800 200
IBM Thinkpad
(750 MHz, 196 MB)
…
Dell (700 MHz, 256 MB)
800 200 3 lb
• Non-overlapping attributes ⇒ Non-reflective.
– Memory: 64 - 512 MB, Disk: 10 - 40 GB
• Correlations or Clustering ⇒ Low reflectivity.
– Memory: 64 - 512 MB, Disk: 10 - 100 GB
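The reflectivity measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: distance is taken to be Euclidean, and the count of reflections is read as the number of points in D that have at least one permutation of their co-ordinates within distance r of n_i (the poster does not pin either choice down). The function name `reflectivity` and both toy datasets are hypothetical.

```python
from itertools import permutations
from math import dist


def reflectivity(points, r):
    """Return 1 - (1/|D|) * sum(eta(n_i) / rho(n_i)) over all points in D."""
    total = 0.0
    for n_i in points:
        # eta: points of D within distance r of n_i (n_i itself counts).
        eta = sum(1 for p in points if dist(p, n_i) <= r)
        # rho (assumed reading): points of D with at least one reflection,
        # i.e. a permutation of their co-ordinates, within distance r of n_i.
        rho = sum(
            1 for p in points
            if any(dist(q, n_i) <= r for q in permutations(p))
        )
        total += eta / rho
    return 1.0 - total / len(points)


# Hypothetical data: (memory, disk) pairs drawn from non-overlapping ranges.
non_overlapping = [(m, d) for m in (64, 128, 256, 512) for d in (10, 20, 40)]
# Hypothetical data: two points that are exact reflections of each other.
mirrored = [(1, 2), (2, 1)]

print(reflectivity(non_overlapping, r=5))  # 0.0
print(reflectivity(mirrored, r=0.5))       # 0.5
```

With non-overlapping ranges no swapped point lands near an original point, so η equals ρ everywhere and the reflectivity is 0. In the mirrored pair every point has one neighbour but two reflections nearby, giving 0.5.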
R. Agrawal and R. Srikant, “Searching with Numbers”, WWW 2002
eCommerce Applications
Storage & Querying of eCommerce Data
SELECT name, output FROM H
Data stored in conventional way
name     monitor  recharge  output   scan  …
PANL75   7 inch   Built-in  Digital  -     …
KLH 221  -        -         S-Video  -     …
1. Problem with Conventional Schema
• Large number of columns
• Sparsity
• Constant schema evolution
• Performance
Query Parsing
Query Mapping Layer
Transformation
Optimized Operator Implementation
Pure SQL-92 Transform:
SELECT V1.val, V2.val
FROM V V1, V V2
WHERE V1.key = 'name'
  AND V2.key = 'output'
  AND V1.oid = V2.oid

Vertical Table (V)
oid  key       val
0    name      PANL75
0    monitor   7 inch
0    recharge  Built-in
0    output    Digital
…    …         …
1    name      KLH 221
1    output    S-Video
…    …         …
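The self-join transform can be tried end to end with Python's built-in `sqlite3` module. This is a sketch rather than the poster's implementation: the in-memory database, the lower-cased key names, and the added `ORDER BY` (for a stable row order) are assumptions; the rows are the ones shown in the vertical table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (oid INTEGER, key TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO V VALUES (?, ?, ?)",
    [
        (0, "name", "PANL75"),
        (0, "monitor", "7 inch"),
        (0, "recharge", "Built-in"),
        (0, "output", "Digital"),
        (1, "name", "KLH 221"),
        (1, "output", "S-Video"),
    ],
)

# One self-join of V per requested attribute, tied together on oid.
rows = conn.execute(
    """
    SELECT V1.val, V2.val
    FROM V V1, V V2
    WHERE V1.key = 'name'
      AND V2.key = 'output'
      AND V1.oid = V2.oid
    ORDER BY V1.oid
    """
).fetchall()

print(rows)  # [('PANL75', 'Digital'), ('KLH 221', 'S-Video')]
```

Each extra attribute in the SELECT list costs another self-join of V, which is one reason hand-writing such queries becomes painful.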
2. Advantages of Vertical Schema
[Figure: Master Catalog; labels Cat1, b, c, d, e, f, x, y, z]
Goal
• Use affinity information in new catalog.
– Products in same category are similar.
• Accuracy boost depends on match between two categorizations.
[Figure: New Catalog; labels ICs, Mem., Logic, a, b, c, d, e, f, w, x, y]
Stores for XML, RDF, LDAP and Data Mining
R. Agrawal, A. Somani and Y. Xu, “Storage and Querying of E-Commerce Data”, VLDB 2001
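Enhanced Naïve Bayes classifier

$$\Pr(C_i \mid d, S) = \frac{\Pr(C_i \mid S)\,\Pr(d \mid C_i)}{\Pr(d \mid S)}$$

$$\Pr(C_i \mid S) = \frac{|C_i| \times (\#\text{ docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_j \left( |C_j| \times (\#\text{ docs in } S \text{ predicted to be in } C_j)^{w} \right)}$$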
• After integration:
[Figure: labels Cat2, DSP]
Other Applications:
• Objects can have a large number of attributes
• Handles sparseness well
• Easy schema evolution
Problem Statement
Recommendations for Database Vendors:
Partial Indices
Enhanced Table Functions (TF)
First Class treatment of TF
Native Support for v2h operation
• Writing SQL is painful
• B2B electronics portal: 2000 categories, 200K datasheets
But …
Catalog Integration
• Hides complexity of vertical representation
• Fast performance
• Given
  – master categorization M:
    • categories C1, C2, …, Cn
    • set of documents in each category
  – new categorization N:
    • categories S1, S2, …, Sn
    • set of documents in each category
• Standard Alg: Compute Pr(Ci | d)
• Enhanced Alg: Compute Pr(Ci | d, S)
• Use tuning set to determine w.
– Defaults to standard Naïve Bayes if w = 0.
• Only affects classification of borderline documents.
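The effect of the weight w on a borderline document can be sketched as follows. The function names, category names, category sizes, prediction counts and likelihood values are all hypothetical; only the two formulas for Pr(Ci | S) and Pr(Ci | d, S) come from the poster.

```python
def category_prior(cat_sizes, predicted_in_s, w):
    """Pr(Ci | S): |Ci| * (# docs in S predicted to be in Ci)^w, normalized."""
    weights = {
        c: cat_sizes[c] * predicted_in_s.get(c, 0) ** w for c in cat_sizes
    }
    total = sum(weights.values())
    return {c: v / total for c, v in weights.items()}


def enhanced_posterior(likelihoods, cat_sizes, predicted_in_s, w):
    """Pr(Ci | d, S) = Pr(Ci | S) * Pr(d | Ci) / Pr(d | S)."""
    prior = category_prior(cat_sizes, predicted_in_s, w)
    joint = {c: prior[c] * likelihoods[c] for c in cat_sizes}
    evidence = sum(joint.values())  # plays the role of Pr(d | S)
    return {c: v / evidence for c, v in joint.items()}


# Hypothetical numbers: two equally sized master categories, and a new-catalog
# category S whose documents were mostly predicted to belong to "DSP".
cat_sizes = {"DSP": 100, "Logic": 100}
predicted_in_s = {"DSP": 9, "Logic": 1}
# Hypothetical Pr(d | Ci) for one borderline document d from S.
likelihoods = {"DSP": 0.4, "Logic": 0.6}

standard = enhanced_posterior(likelihoods, cat_sizes, predicted_in_s, w=0)
enhanced = enhanced_posterior(likelihoods, cat_sizes, predicted_in_s, w=1)
print(max(standard, key=standard.get))  # Logic
print(max(enhanced, key=enhanced.get))  # DSP
```

With w = 0 the count term drops out and the prior depends only on |Ci|, which is the standard Naïve Bayes case; with w = 1 the affinity of S for "DSP" flips this borderline document.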
Accuracy Improvement on Pangea Data
[Figure: Accuracy plot; series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base]
3. Solution: Query Mapping Layer
SELECT name, output FROM HV
Horizontal View (HV)
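One simple way to get a horizontal view over the vertical table in a stock SQL engine is a view built from left outer joins, sketched here with Python's `sqlite3` module. This is a stand-in for illustration only: it is not the poster's query mapping layer or its optimized operator, and the fixed column list (name, monitor, recharge, output) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (oid INTEGER, key TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO V VALUES (?, ?, ?)",
    [
        (0, "name", "PANL75"),
        (0, "monitor", "7 inch"),
        (0, "recharge", "Built-in"),
        (0, "output", "Digital"),
        (1, "name", "KLH 221"),
        (1, "output", "S-Video"),
    ],
)

# Assumed stand-in for the horizontal view: one left outer join per column,
# so objects that lack an attribute get NULL instead of being dropped.
conn.execute(
    """
    CREATE VIEW HV AS
    SELECT o.oid AS oid,
           n.val AS name,
           m.val AS monitor,
           r.val AS recharge,
           t.val AS output
    FROM (SELECT DISTINCT oid FROM V) AS o
    LEFT JOIN V AS n ON n.oid = o.oid AND n.key = 'name'
    LEFT JOIN V AS m ON m.oid = o.oid AND m.key = 'monitor'
    LEFT JOIN V AS r ON r.oid = o.oid AND r.key = 'recharge'
    LEFT JOIN V AS t ON t.oid = o.oid AND t.key = 'output'
    """
)

# The application-level query is now written against HV, not V.
rows = conn.execute("SELECT name, output FROM HV ORDER BY oid").fetchall()
print(rows)  # [('PANL75', 'Digital'), ('KLH 221', 'S-Video')]
```

The application query stays as short as it was against the conventional table, while the data remains in vertical form underneath.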
R. Agrawal and R. Srikant, “On Integrating Catalogs”, WWW 2001
[Figure x-axis: Weight (1, 2, 5, 10, 25, 50, 100, 200)]