Web Mining and Visualization
for E-Commerce
Presented By
Vandana Janeja
Presentation Outline
• Website Usage Data
  JDK 1.3, JavaScript, Java Servlets, Java-based web servers, MS Access database
• Data Mining
  Algorithms: K-Means, Apriori, Text Mining
• Visualization for Website Management
  Java3D, JDK 1.3
Outline
Gather Data
• Web Crawler
• Servlets - for server-side data
• JavaScript and Java programs - for client-side data
Analyze Data
• Data Mining
• Text Mining
• Clustering
• Decision Support System - Reporting System
Visualize Data
• Java3D Visualization Algorithm
• Simulation Programs
Web Site Management
Client Side
• Web site reading component -> encrypted data -> matrix structure -> 3D representation of the static web site
Server Side
• User tracking and log-file reading components -> encrypted data -> matrix structure -> 3D representation of web site usage
• Other server-side components, such as the web site remediation model
Gather Data
• Web Crawler
• Servlets - for server-side data
• JavaScript and Java programs - for client-side data
Analyze Data
• Data Mining
• Text Mining
• Clustering
• Decision Support System - Reporting System
Visualize Data
• Java3D Visualization Algorithm
• Simulation Programs
• Collaboration
Data Gathering
• Users interact with the web site through a browser.
• Client-side programs (JavaScript and Java) capture data in the browser.
• Server-side programs on the application server capture user log files plus information from programs.
• All gathered data is written to a database for storage and data mining.
Static Site Map: http://www.library.njit.edu/etd/njit-mt2001-010/thesis.html
Usage Map: http://www.visualinsights.com
Usage Database
The UsageDB database is built from three inputs - servlet data, JavaScript data, and client-side website parsing - and holds the following tables:
• UsageDataTable - from server-side servlet data
• Cookies
• UserAgent - from client-side JavaScript data
• RouterInfo - host names traced, including the intermediary hosts along the connection path
• PrefRouterInfo - host names with more than one hit; these hosts are pinged, and the results of host pinging (done 4x per day) are stored
• Url; Scripts; Meta; Applets; ... - from client-side website parsing
Reports are generated from these tables.
Outline
Gather Data
• Web Crawler
• Servlets - for server-side data
• JavaScript and Java programs - for client-side data
Analyze Data
• Data Mining
• Text Mining
• Clustering
• Decision Support System - Reporting System
Visualize Data
• Java3D Visualization Algorithm
• Simulation Programs
• Collaboration
Visualization
The objective of the project was to develop a 3-Dimensional (3-D)
visualization tool from an adjacency matrix representing
connectivity between elements and usage of connectivity paths
between these elements.
Connectivity can be visualized for elements such as routers and web sites.
Web Crawler / Web Site Link Reader
Index.html links to: Url1, Url2, Url3
Url1.html links to: Url4, Url5, Url6
URL2.html links to: Url7, Url8, Url9
Url3.html links to: Url10, Url11, Url12
Matrix Structure
Adjacency Matrix:
1 : [2,3,4]
2 : [5]
3 : [6]
4 : [7,8,1]
5 : [1]
6 : [9]
7 : []
8 : []
9 : []

As an N x N matrix (rows = source node, columns = target node; the diagonal is set to 1):

      1  2  3  4  5  6  7  8  9
  1   1  1  1  1  0  0  0  0  0
  2   0  1  0  0  1  0  0  0  0
  3   0  0  1  0  0  1  0  0  0
  4   1  0  0  1  0  0  1  1  0
  5   1  0  0  0  1  0  0  0  0
  6   0  0  0  0  0  1  0  0  1
  7   0  0  0  0  0  0  1  0  0
  8   0  0  0  0  0  0  0  1  0
  9   0  0  0  0  0  0  0  0  1

Web Page Connectivity / Hyperlink
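The construction of the N x N matrix from the adjacency list can be sketched as follows. This is a minimal illustration, not the project's actual code; the class and method names are invented, and it assumes (as in the matrix above) that the diagonal is set to 1.

```java
import java.util.Arrays;

public class GMatrix {
    // ADJ[i] lists the nodes reachable from node i+1 (1-based, as in the slides)
    static final int[][] ADJ = {
        {2, 3, 4}, {5}, {6}, {7, 8, 1}, {1}, {9}, {}, {}, {}
    };

    static int[][] build(int[][] adj) {
        int n = adj.length;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            m[i][i] = 1;                    // the example matrix keeps 1 on the diagonal
            for (int j : adj[i]) m[i][j - 1] = 1;
        }
        return m;
    }

    public static void main(String[] args) {
        for (int[] row : build(ADJ)) System.out.println(Arrays.toString(row));
    }
}
```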
Example 2:
Adjacency Matrix:
1 : [2,6]
2 : [3,7]
3 : [4,8]
4 : [5,9]
5 : [1,10]
6 : [8]
7 : [9]
8 : [10]
9 : [6]
10: [7]
Generating the N x N G-matrix for the Petersen graph:
Adjacency Matrix:
1 : [2,6]
2 : [3,7]
3 : [4,8]
4 : [5,9]
5 : [1,10]
6 : [8]
7 : [9]
8 : [10]
9 : [6]
10: [7]

      1  2  3  4  5  6  7  8  9 10
  1   0  1  0  0  0  1  0  0  0  0
  2   0  0  1  0  0  0  1  0  0  0
  3   0  0  0  1  0  0  0  1  0  0
  4   0  0  0  0  1  0  0  0  1  0
  5   1  0  0  0  0  0  0  0  0  1
  6   0  0  0  0  0  0  0  1  0  0
  7   0  0  0  0  0  0  0  0  1  0
  8   0  0  0  0  0  0  0  0  0  1
  9   0  0  0  0  0  1  0  0  0  0
 10   0  0  0  0  0  0  1  0  0  0
3D Representation as a Cylinder
The same matrix as in the first example (rows = source node, columns = target node):

      1  2  3  4  5  6  7  8  9
  1   1  1  1  1  0  0  0  0  0
  2   0  1  0  0  1  0  0  0  0
  3   0  0  1  0  0  1  0  0  0
  4   1  0  0  1  0  0  1  1  0
  5   1  0  0  0  1  0  0  0  0
  6   0  0  0  0  0  1  0  0  1
  7   0  0  0  0  0  0  1  0  0
  8   0  0  0  0  0  0  0  1  0
  9   0  0  0  0  0  0  0  0  1
Possible Applications
• Ad Placement
• Network Diagnostics
• Collaboration
• Detecting Anomalies
Viewer usage is measured indirectly. The advantage of Internet advertising is increased feedback to advertisers through greater interactivity, targeting, and precise measurement of user behavior. The pricing models currently in use are:
• cost per thousand impressions (CPM) and the related flat-fee/sponsorship mechanism;
• click-through and per-action pricing (CPC, CPL);
• hybrid models;
• outcome-based pricing.
Cost Per Thousand and Flat Fee / Sponsorship
One look at the banner = 1 impression; the cost of an advertisement is priced per 1000 impressions.
Factors: usage traffic and profiles - higher traffic commands a higher CPM.
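The CPM arithmetic above can be sketched in a few lines. The $25 rate and 50,000-impression figure are assumed for illustration only; they are not from the slides.

```java
public class CpmCost {
    // cost = (impressions / 1000) * CPM rate
    static double cost(long impressions, double cpmRate) {
        return impressions / 1000.0 * cpmRate;
    }

    public static void main(String[] args) {
        // 50,000 banner impressions at an assumed $25 CPM
        System.out.println(cost(50000, 25.0)); // prints 1250.0
    }
}
```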
Network Diagnostic
UML Model of Network Diagnostic:
• «input»: Time/Date and ResponseIndex
• The UsageDatabase drives a «process» that generates the most-preferred-user report
• The Connectivity Program works against the ConnectivityDatabase, together with HistoryCheck and RouterList
Collaboration
Website collaboration based on the Affiliate Model: Web Site A serves as the entry point and source; the user exits to Web Site B.
1. Consolidated central schema (Web Sites A, B, C): the user crosses over to Site B, and a complete dataset of the user's activity at Web Site A is passed to Web Site B, and so on. The consolidated datasets of the user's transactions across web sites are written to a central database.
2. Cooperating central schema (Web Sites A, B, C): a distributed central database. The database is the same for all web sites, but it can be made available in the form of distributed elements to each web site.
The session ID can be passed between collaborating sites in several ways:
1> URL rewriting for the single-window scenario (where the link appears on the URL): the SessionID is carried from Web Site A's URL 1A to Web Site B's URL 1B by URL rewriting.
2> Object pool for multiple windows: the object containing the entire data about the session is passed as a bean to the collaborating site. Web Site A sends the SessionID in a bean, along with other data, from URL 1A to Web Site B's URL 1B.
3> Cookies for multiple windows, with a cookie table in a shared pool: both collaborating sites (Web Site A's URL 1A and Web Site B's URL 1B) can access the cookies for both web sites.
4> A table for the entire log file (generated by the servlet programs), along with a session ID for each user. It can be used either as a shared pool or as an element in a join query on the databases, for example:

select * from SiteATable, SiteBTable where SiteATable.SessionID = SiteBTable.SessionID

The servlet programs write LogFile SiteA to DatabaseA and LogFile SiteB to DatabaseB; the join query produces a temporary table of the user's combined activity.
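The join on SessionID in the fourth approach can be sketched in plain Java, with in-memory maps standing in for the two sites' log tables. The class, table contents, and session IDs here are invented for illustration; they do not come from the project's schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionJoin {
    // Joins two log tables on SessionID, like
    // SiteATable.SessionID = SiteBTable.SessionID in the query above.
    static List<String> join(Map<String, String> siteA, Map<String, String> siteB) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> a : siteA.entrySet()) {
            String b = siteB.get(a.getKey());   // match on SessionID
            if (b != null) rows.add(a.getKey() + ": " + a.getValue() + " | " + b);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> siteA = new HashMap<>();
        siteA.put("S100", "visited URL 1A");
        Map<String, String> siteB = new HashMap<>();
        siteB.put("S100", "visited URL 1B");
        siteB.put("S200", "visited URL 2B");
        // only S100 appears in both logs, so one combined row results
        System.out.println(join(siteA, siteB));
    }
}
```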
Collaboration Reports
Outline
Gather Data
• Web Crawler
• Servlets - for server-side data
• JavaScript and Java programs - for client-side data
Analyze Data
• Data Mining
• Text Mining
• Clustering
• Decision Support System - Reporting System
Visualize Data
• Java3D Visualization Algorithm
• Simulation Programs
• Collaboration
Text Mining and Association Rule Mining on the Web
Some Types of Text Data Mining
• Keyword-based association analysis
• Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
• Link analysis: unusual correlations between entities
• Anomaly detection: find information that violates usual patterns
Test Case: njit.edu
Inputs: the HTML text of the pages traversed, the list of pages traversed, the keyword list after pruning, and the count of keywords for each HTML page.
Sample Apriori rules (support, confidence):
3 <- 2 (70.0%, 85.7%)
2 <- 3 (70.0%, 85.7%)
2 <- 1 (60.0%, 83.3%)
4 <- 5 (30.0%, 100.0%)
3 <- 2 1 (50.0%, 80.0%)
2 <- 3 1 (40.0%, 100.0%)
4 <- 3 5 (10.0%, 100.0%)
4 <- 1 5 (10.0%, 100.0%)
2 <- 3 4 1 (20.0%, 100.0%)
Mining Association Rules - An Example

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Min. support 50%; min. confidence 50%.

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A,C}            | 50%

For the rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Reference: http://www.cs.sfu.ca/~han/DM_Book.html
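The support and confidence figures above can be recomputed directly from the four transactions. This is a minimal sketch (the class name is invented), not an Apriori implementation; it only evaluates support counts the way the example does.

```java
import java.util.Arrays;

public class AprioriExample {
    // The four transactions from the example table
    static final String[][] TXNS = {
        {"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}
    };

    // Fraction of transactions containing every item in the itemset
    static double support(String... items) {
        int hits = 0;
        for (String[] t : TXNS) {
            if (Arrays.asList(t).containsAll(Arrays.asList(items))) hits++;
        }
        return (double) hits / TXNS.length;
    }

    public static void main(String[] args) {
        System.out.println(support("A"));                      // 0.75
        System.out.println(support("A", "C"));                 // 0.5
        System.out.println(support("A", "C") / support("A"));  // confidence of A => C, about 0.667
    }
}
```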
Data Mining
Clustering Using K-Means
In K-Means, clusters are formed on the basis of distance from a centroid.
• K-means cluster analysis uses Euclidean distance.
• Initial cluster centers are chosen in a first pass of the data; each additional iteration then groups observations by nearest Euclidean distance to the mean of the cluster.
• Thus the cluster centers change at each pass.
• The process continues until the cluster means shift by no more than a given cut-off value or the iteration limit is reached.
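The iterative procedure above can be sketched for one-dimensional data: assign each point to its nearest centroid, recompute each centroid as its cluster's mean, and stop when the centroids no longer move. The data values and starting centroids below are illustrative, not from the test cases.

```java
import java.util.Arrays;

public class KMeans1D {
    static double[] cluster(double[] data, double[] centroids) {
        double[] c = centroids.clone();
        boolean moved = true;
        while (moved) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double x : data) {
                int best = 0;                       // nearest centroid by Euclidean distance
                for (int k = 1; k < c.length; k++)
                    if (Math.abs(x - c[k]) < Math.abs(x - c[best])) best = k;
                sum[best] += x;
                count[best]++;
            }
            moved = false;
            for (int k = 0; k < c.length; k++) {
                if (count[k] == 0) continue;        // leave an empty cluster's centroid in place
                double mean = sum[k] / count[k];
                if (mean != c[k]) { c[k] = mean; moved = true; }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] result = cluster(new double[]{2, 3, 4, 5}, new double[]{2, 5});
        System.out.println(Arrays.toString(result)); // [2.5, 4.5]
    }
}
```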
The K-Means Clustering Method
• Test Case: 0 - 2,3; 1 - 4,5
• Test Case: 0 - 2,6; 1 - 4,5
But what if the number of clusters changes?
• Test Case (case in which K changes): 0 - 3,5; 1 - 6,2
Text Mining and Visualization:
• The web site is inherently organized as a directory structure, which is essentially a tree. This is an inherent similarity-based grouping: all related pages are kept in one directory.
• The web pages can also be grouped or clustered based on other similarity features, which can be generated by text mining.
• Web pages can be similar to each other through the appearance of certain keywords, which can be extracted and pruned using text mining algorithms. Once this is done, the pages can be grouped logically in a "Bottom Up Approach": a set of pages is fed into the text mining engine, which finds the most similar pages based on the appearance of keywords (themselves gathered by an algorithm).
• The engine works on each directory and subdirectory. Subsequently, "X" such web pages can be grouped together, forming a hierarchy of sets of "X" pages.
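One common way to score the keyword-based similarity described above is cosine similarity over each page's keyword-count vector; the slides do not name a specific measure, so this is an assumed choice, and the keyword counts below are invented for illustration.

```java
public class PageSimilarity {
    // Cosine similarity of two keyword-count vectors over the same pruned keyword list
    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        int[] page1 = {4, 0, 2};
        int[] page2 = {2, 0, 1};   // proportional keyword counts: similarity near 1.0
        int[] page3 = {0, 5, 0};   // disjoint keywords: similarity 0.0
        System.out.println(cosine(page1, page2));
        System.out.println(cosine(page1, page3));
    }
}
```

Pages whose pairwise similarity exceeds a threshold would be candidates for the same cluster of "X" pages.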
Cylinder Visualization of Very Large Sites
• Individual pages are clustered based on a similarity measure.
• Clusters of "X" such pages sit at the same level, based on the similarity measure.
• The highest level is a cluster of clusters.
Putting It All Together
Data gathered from different sources is fed into mining; the mining result is then visualized.
References:
• S. Guha, R. Rastogi, K. Shim. A clustering algorithm for categorical attributes. Technical report, Bell Laboratories, Murray Hill, 1997.
• S. Guha, R. Rastogi, K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. Proceedings of the IEEE Conference on Data Engineering, 1999.
Discussion on K-Means:
• R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
• O. Egecioglu and H. Ferhatosmanoglu. Circular data-space partitioning for similarity queries and parallel disk allocation. Proc. of IASTED International Conference on Parallel and Distributed Computing and Systems, pages 194-200, November 1999.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Math. Stat. and Prob., volume 1, pages 281-297, 1967.
• http://www.cs.sfu.ca/~han/DM_Book.html
• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00, Dallas, TX, pages 11-20, May 2000.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, pages 67-73, Newport Beach, California.
• H. Toivonen. Sampling large databases for association rules. VLDB'96, pages 134-145, Bombay, India, Sept. 1996.
Acknowledgements and Disclaimers
Advisors:
• Dr. Manikopoulos - Associate Professor, Electrical and Computer Engineering Department, New Jersey Institute of Technology
• Dr. Jay Jorgenson - Professor, Mathematics Department, City University of New York
• Software development team at Network Security Solutions: some of the material is a copyright of NSS, Inc. and SiteGain, Inc.
• The thesis work in visualization was done during the Master's at NJIT.