ACCTG 6910, Spring 2003
DESB, University of Utah
Project Milestone 5 (April 3 – 17)
Question 1 (75%): Discover access patterns in web logs.
The supervisory council for University of Utah’s web portal has contacted the e.bis
Research Lab to discover user access patterns from its web logs.
As a volunteer in the Lab, you have been asked to perform association rule and
sequential pattern mining tasks on a small sample web log. It contains 4736 users,
10000 sessions and 11042 visits, with the following attributes:
Columns   Attribute
1-5       user id
7-11      session id
13-17     URL id
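As a minimal sketch of reading this layout, the snippet below slices one record by position. It assumes that the ranges 1-5, 7-11, and 13-17 are 1-based, inclusive character columns in each line of weblog.txt; the Visit type, the parse_visit function, and the sample record are hypothetical and are not taken from the data set.

```python
from typing import NamedTuple


class Visit(NamedTuple):
    user_id: str
    session_id: str
    url_id: str


def parse_visit(line: str) -> Visit:
    """Slice one fixed-width record (assumed 1-based, inclusive columns)."""
    return Visit(
        user_id=line[0:5].strip(),      # columns 1-5
        session_id=line[6:11].strip(),  # columns 7-11
        url_id=line[12:17].strip(),     # columns 13-17
    )


# Made-up record for illustration only; not a line from weblog.txt.
sample = "00012 00345 06153"
visit = parse_visit(sample)
print(visit)
```

Keeping the ids as strings preserves the leading zeros, which matters when matching them against the codes in urlmapping.txt.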
Step 1: Download from the Project section in the class website the data set –
weblog.txt and a text file – urlmapping.txt that describes mappings of URL codes in
weblog.txt to URLs in UU’s web site.
Step 2: Use IBM Intelligent Miner to mine the data set for large item sets, association
rules and large sequential patterns. Use a support level of 0.3% for association rule
and sequential pattern mining and a confidence level of 50% for association rule
mining. Then mine the data set again using two different support levels for both
association rule and sequential pattern mining.
Step 3: Report and analyze the results. Please identify 10 interesting association
rules and 10 large sequential patterns respectively. Use urlmapping.txt to help find
the URLs that match the URL ids in the rules/patterns. Write up a short analysis (one to
two paragraphs) of these rules/patterns and any actions you recommend the
supervisory council consider.
1. Select five (instead of ten) different association rules with multiple items on the
left-hand side to interpret. If you don't have enough qualified rules, please adjust
your support and/or confidence to find sufficient rules. Repeat steps 2 and 3 for each
selected rule.
2. Think of an explanation for why the rule might exist (e.g., for A, B -> C, think of why
UU website visitors would tend to access page C if they visit A and B).
3. Discuss your assessment of whether your explanation is or is not interesting.
4. Select five (instead of ten) large 3-sequences. If you don't have enough large
3-sequences, please adjust the support level to find sufficient qualified sequences.
For each selected large sequence, repeat steps 2 and 3.
At 0.3% support and 50% confidence level, IM discovered 76 rules and 139 item-sets.
At 0.2% support and 50% confidence level, IM discovered 128 rules and 235
item-sets.
At 0.1% support and 50% confidence level, IM discovered 1072 rules and 781
item-sets.
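The way rule and item-set counts grow as the support threshold drops can be illustrated with a minimal sketch. This is not how Intelligent Miner works internally. It assumes that each transaction is the set of URL ids seen together, that support is the fraction of transactions containing every item in a rule, and that confidence is that support divided by the support of the left-hand side. The function mine_rules and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence, max_size=3):
    """Brute-force item-set counting; fine for a toy sample, not for real logs."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for subset in combinations(sorted(items), size):
                counts[frozenset(subset)] += 1

    # Large item sets: those whose support meets the threshold.
    large = {s: c / n for s, c in counts.items() if c / n >= min_support}

    rules = []
    for itemset, support in large.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # The antecedent is also large, since it occurs at least as often.
            confidence = support / large[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Toy transactions with placeholder page names, for illustration only.
toy = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A"}, {"C"}]
rules = mine_rules(toy, min_support=0.4, min_confidence=0.6)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", consequent, f"{support:.2f}", f"{confidence:.2f}")
```

Lowering min_support admits more item sets, and every additional large item set can contribute several rules, which is consistent with the growth in counts reported above.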
Note: The objective of this milestone is for you to better understand what you
may or may not expect from data mining and the effort and domain knowledge
required to interpret and leverage data mining results. Think about how much
time it took you to work on milestone 5. (Software like Link Selector can
save a web master a lot of time in interpreting user access patterns for website
redesign decisions.) Was it hard to interpret the results and make
recommendations without some knowledge of website design/administration?
When the data mining task is somewhat data-driven initially, you must find
experts or acquire the relevant knowledge to analyze and leverage the patterns.
Here is some relevant knowledge, along with ways to interpret and analyze
association rules and sequential patterns from a web log:
Web content management decisions include which links to include in a page
(especially the portal page).
An association rule A, B -> C may suggest the following linkage, because visits to
A and B tend to go through C:
[Figure: link diagram connecting pages A, C, and B]
A sequential pattern <{A}, {B}, {C}> may suggest the following linkage, because
users tend to reference these URLs across sessions:
[Figure: link diagram connecting pages A, B, and C]
Interpretations of interestingness of association rules and sequential patterns:
1. Uninteresting if the patterns are induced by the design of a website only.
2. Somewhat interesting if the patterns show common user interests that are
well recognized and supported by the website design because they validate
the effectiveness of the design.
3. Interesting if the patterns show common user interests that are not well
recognized and supported by the website design because some redesign
actions may follow.
Association Rule Analysis:
1) [06155]+[06165]+[06128]==>[06153]
support = 0.3552% confidence = 100% lift = 266.68
corresponding URLs are
[/upap/main.html]+[/upap/top.html]+[/upap/left.html]==>[/upap]
This rule exists because when users visited the page /upap (Utah Physician Assistant
Program) on the Utah website, the website automatically loaded /upap/main.html,
/upap/top.html, and /upap/left.html and combined them into one HTML page as the
response. Uninteresting.
2) [00489]+[02049] ==> [02091]
support = 0.3158% confidence = 80% lift = 29.37
corresponding URLs are
[/academics/index.html]+[/graduate_school/admissions.html]
==>[/graduate_school/index.html]
The rule exists because users who visited the academic program index page
(/academics/index.html) and the graduate school admissions page
(/graduate_school/admissions.html) most probably used the graduate school index
page (/graduate_school/index.html) to navigate. Somewhat interesting.
3) [02087]+[02085] ==> [02091]
support = 0.2368% confidence = 85.71% lift = 31.47
corresponding URLs are
[/graduate_school/graduate_handbook/handbook.html]+[/graduate_school/graduate_handbook/grad.degrees.html]
==>[/graduate_school/index.html]
The rule exists because users clicked the link on the graduate school home page
[/graduate_school/index.html] to browse the graduate handbook
[/graduate_school/graduate_handbook/handbook.html], and then clicked the
hyperlink on the handbook page to view the degrees available at UU. Somewhat
interesting.
4) [00489]+[00680] ==> [00687]
support = 0.2171% confidence = 68.75% lift = 31.96
corresponding URLs are
[/academics/index.html]+[/calendar/index.html]==>[/calendar/oct2002.html]
The rule exists because users may reach the October 2002 events page
[/calendar/oct2002.html] by clicking the event calendar link [/calendar/index.html] on
the academics program home page [/academics/index.html]. Somewhat interesting.
5) [04560]+[00566] ==> [03773]
support = 0.1184% confidence = 50% lift = 32.07
corresponding URLs are
[/students/index.html]+[/alumni_visitors/index.html]==>[/quicklinks/index.html]
The rule exists because users may switch between the student index page
[/students/index.html] and the alumni visitor index page [/alumni_visitors/index.html]
through the quick links index page [/quicklinks/index.html]. However, the quick links
index page has since been removed, and users who still visit it are redirected to the
UU home page. The UU home page also contains links to the student and
alumni visitor home pages. Somewhat interesting.
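The lift figures in the five rules can be read with a short worked calculation. The sketch below assumes the standard definition, lift = confidence / support(right-hand side). The source does not state that Intelligent Miner uses this definition, so the implied right-hand-side supports are an inference, not reported values. The function name is hypothetical, while the confidence and lift numbers are copied from the rules above.

```python
def implied_consequent_support(confidence: float, lift: float) -> float:
    """Rearranged from lift = confidence / support(consequent) (assumed definition)."""
    return confidence / lift


# rule number -> (right-hand-side URL id, confidence, lift), as reported above
reported = {
    1: ("06153", 1.0000, 266.68),
    2: ("02091", 0.8000, 29.37),
    3: ("02091", 0.8571, 31.47),
    4: ("00687", 0.6875, 31.96),
    5: ("03773", 0.5000, 32.07),
}

for number, (url_id, confidence, lift) in reported.items():
    share = implied_consequent_support(confidence, lift)
    print(f"rule {number}: [{url_id}] appears in about {share:.2%} of transactions")
```

Rules 2 and 3 share the right-hand side [02091], and both imply roughly the same support for it (about 2.72%), which is what the assumed definition predicts. The very high lift of rule 1 reflects how rarely /upap is visited overall: under the same assumption, it appears in well under 1% of transactions.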
With a minimum support of 0.1%, IM mined 773 sequential patterns. Five of them are
selected and explained as follows.
1) <{[04560]}, {[00489]}, {[00472]}>
support = 0.313%
corresponding URLs are
<{[/students/index.html]}, {[/academics/index.html]}, {[/a_z/index.html]}>
The pattern exists because users may check student information on the student index
page and browse academic information on the academics index page, then use the A-Z
index to quickly locate specific information they are interested in. Interesting.
2) <{[04560]}, {[04560]}, {[05955]}>
support = 0.281%
corresponding URLs are
<{[/students/index.html]}, {[/students/index.html]}, {[/unews/releases/02/oct/cauldron.html]}>
The pattern exists because users may notice and follow the hot news link on the student
index page after they have visited the student index page twice. Interesting.
3) <{[00489]}, {[00489]}, {[00489]}>
support = 8.13%
corresponding URLs are
<{[/academics/index.html]}, {[/academics/index.html]}, {[/academics/index.html]}>
The pattern exists because some users habitually use the academics index page to
locate academic information in different sessions.
3) <{[00472]}, {[00489]}, {[04560]}>
support = 0.219%
corresponding URLs are
<{[/a_z/index.html]}, {[/academics/index.html]}, {[/students/index.html]}>
The pattern exists because some users may find it hard to locate the information
they need through the A-Z index page. They may therefore choose to use the academics
and student index pages to locate the information they want in the following
sessions. Interesting.
4) <{[00472]}, {[00472]}, {[00680] [00687]}>
support = 0.188%
corresponding URLs are
<{[/a_z/index.html]}, {[/a_z/index.html]},
{[/calendar/index.html] [/calendar/oct2002.html]}>
The pattern exists because users may like to review UU news and events after they have
browsed the specific information they wanted through the A-Z index page. We can also
infer that the web logs may have been collected in October 2002, since users visit the
calendar page for that month. Interesting.
5) <{[00489]}, {[01681]}, {[00489]}>
support = 0.188%
corresponding URLs are
<{[/academics/index.html]}, {[/employment/index.html]}, {[/academics/index.html]}>
The pattern exists because students may want to find a job at UU to support their
academic studies. Interesting.
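The support of a sequential pattern such as those above can be sketched as follows. This is an illustration only. It assumes that each user's sessions are ordered in time, that each element of a pattern must be contained in a separate, later session, and that support is the fraction of users whose session sequence contains the pattern. The function names and the toy users are hypothetical, and only the URL ids are taken from the patterns above.

```python
def contains_pattern(sessions, pattern):
    """True if the pattern's elements occur, in order, in distinct sessions."""
    matched = 0
    for session in sessions:
        # Greedy matching: take the earliest session that covers the next element.
        if matched < len(pattern) and pattern[matched] <= session:
            matched += 1
    return matched == len(pattern)


def sequence_support(user_sequences, pattern):
    """Fraction of users whose session sequence contains the pattern."""
    hits = sum(contains_pattern(s, pattern) for s in user_sequences.values())
    return hits / len(user_sequences)


# Toy users (made up); each list holds that user's sessions in time order.
users = {
    "u1": [{"04560"}, {"00489"}, {"00472"}],
    "u2": [{"04560", "00489"}, {"00472"}],  # only two sessions: no match
    "u3": [{"00472"}, {"00489"}, {"04560"}],  # wrong order: no match
    "u4": [{"04560"}, {"06153"}, {"00489", "00680"}, {"00472"}],
}

pattern = [{"04560"}, {"00489"}, {"00472"}]
print(sequence_support(users, pattern))
```

Under these assumptions a pattern like <{[00489]}, {[00489]}, {[00489]}> requires three separate sessions that each include the academics index page, which matches the reading of that pattern as repeated use across sessions.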
Question 2 (25%): If the data file includes referrer and visit duration information for
each visit, please discuss how you might use clustering to help identify clusters in the
data file.
Note: Clustering uses more than one attribute and doesn't specify in advance what the
clusters (e.g., clusters of URLs with average visit duration longer than 1 minute)
should be.
Each of the following clusterings could group pages, users, or sessions with similar
patterns. To analyze further how the members of a cluster are similar, additional data
mining such as association rule and sequential pattern mining, more granular
clustering, and classification may be applied.
Cluster by URL, with attributes: average visit duration, top 3 referrers, average number
of visitors per day, number of in-links, and number of out-links.
Cluster by visitor, with attributes: location (e.g., zip code, coordinates, or wireless
cell), frequent visit time of day (e.g., early am, mid am, late am, early pm, mid pm
and late pm), average session duration, average page visit duration, top n large item
sets, and top n sequential patterns.
Cluster by session, with attributes: location of visitor, average session duration, time
of day, average number of links, and top n large item sets.
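As a rough sketch of the cluster-by-URL idea, the snippet below runs a small k-means over three numeric attributes per page. k-means is only one possible clustering method and is not named in the answer above. The feature values are invented for illustration, and categorical attributes such as referrers are left out because they would first need to be encoded as numbers.

```python
import math


def min_max_scale(rows):
    """Rescale each column to [0, 1] so no attribute dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def k_means(points, k, iterations=20):
    """Plain k-means with a naive start: the first k points are the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented features per URL: (average visit duration in seconds, in-links, out-links).
pages = {
    "/academics/index.html": (12, 40, 60),
    "/calendar/oct2002.html": (95, 3, 5),
    "/students/index.html": (15, 35, 55),
    "/a_z/index.html": (10, 50, 80),
    "/unews/releases/02/oct/cauldron.html": (120, 2, 4),
}

labels = k_means(min_max_scale(list(pages.values())), k=2)
by_url = dict(zip(pages, labels))
for url, label in by_url.items():
    print(label, url)
```

With these made-up numbers the index pages (short visits, many links) fall into one cluster and the content pages (long visits, few links) into the other. Nothing in the input said what the clusters should be, which is the point made in the note above.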