Download A Novel Approach for Web Intelligence

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
ISSN (Online) : 2319 - 8753
ISSN (Print) : 2347 - 6710
International Journal of Innovative R esearch in Science, E ngineering and T echnology
Volume 3, Special Issue 3, March 2014
2014 IEEE International Conference on Innovations in Engineering and Technology (ICIET’14)
On 21st & 22nd March Organized by
K.L.N. College of Engineering, Madurai, Tamil Nadu, India
A Novel Approach for Web Intelligence
M. Ambika, Dr. K. Latha
Dept of CSE, Anna University, BIT Campus, Tiruchirappalli, Tamil Nadu, India.
Dept of CSE, Anna University, BIT Campus, Tiruchirappalli, Tamil Nadu, India.
Abstract— Web Intelligence is a fast-growing area of
research that combines multiple disciplines including
artificial intelligence, machine learning, data mining,
natural language processing. Making the web intelligent is
the art of customizing items in response to the needs of
the users. Predicting users’ behaviors will expedite and
enhance browsing experience, which could be achieved
through personalization. Web Intelligence provides a
platform which empowers internet users to find the most
appropriate and best information for their interest. This
paper proposes a novel approach for making the web
adroit.
Keywords— web mining, web usage mining, web
personalization.
I.INTRODUCTION
In the today’s era of information technology, the Web
provides every Internet citizen with access to the
abundance of information. We can say that “We are
drowning in data, but starving for knowledge.” Thus
considering the impressive variety of the web, retrieving
interesting, relevant and required information has become
a very difficult task. A popular and successful technique
which has shown many promises is web mining.
Web mining has become a hot research topic, which
combines two of the prominent research areas comprising
the data mining and the World Wide Web. “Web mining is
a technique to explore, collate and extract patterns in the
data content of web sites by using traditional data mining
techniques”. More over web mining research is a
multidisciplinary field from several research communities,
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
such as database, information retrieval, and artificial
intelligence research communities, especially from
machine learning and natural language processing.One
main issue to be considered in web mining is “Reach the
Right Person with the Right Message at the Right Time”.
The Web should have the ability to sense
and adapt to the needs and preference of the user. One
effective approach to this is to make the web
intelligent. Here each and every individual visitor will be
treated as individuals with their own goal and target when
searching for information on the web. In the paper we
proposed a novel approach to perform web
personalization with intelligence.This paper is structured
as follows: Section II, give an overview about web
mining, Section III, shows deep and intense study of web
intelligence. Section IV and V elaborates the proposed
methodology and evaluation techniques. Finally, we
concluded the paper in section VI and outlined some
promising area of future research.
II. WEB MINING
According to Oren Etzioni (1996), web mining is the
application of data mining techniques to automatically
discover and extract information from World Wide Web
documents. The information gathered through web mining
is evaluated by using traditional data mining parameters
such as clustering and classification, association, and
examination of sequential patterns. Web mining can also
be used to understand customer behaviour, evaluate the
effectiveness of a particular web site. Some of the issues
related to the web [1] are information retrieval,
information extraction, personalization of the web
information, learning the consumers and individual user’s
behaviour which can be used for user modelling and
profiling.
Web mining deals with the discovery and analysis of
useful information from the www. As suggested by
Kosala and Blockeel (2000) and Qingyu Zhang et al.
(2008), Fig. 1 depicts the Web mining sub tasks. The
1796
A Novel Approach for Web Intelligence
relevant data are retrieved from the web and then
selection and pre-processing are done. Machine language
or data mining techniques can be used for discovering the
patterns, and finally, it is validated and visualization [1]
[13].
Web
Data
Information
Retrieval
Preprocessing
Generalization
Analysis
Visualization
Knowledge
Fig. 1 Web mining task
Web mining process is of three different types based
on the nature of the data such as [1][3][5],
Web Content mining: extract useful information or
knowledge from the web page. It deals with text, image,
records, etc.
Web Structure Mining: is the specialization of web
content mining except it is tag content, mainly deals with
the hyperlink structure of the web sites.
Web Usage mining: deals with the data mining techniques
applied to web usage information such as web server logs,
proxy logs, http logs, etc. It can be used to predict user
behavior while the user interacts with the web.
III. WEB INTELLIGENCE
The Web has been considered as the largest repository
of information. In web based learning gathering accurate,
effective and useful information or knowledge has
become a challenging issue. Each and every user has their
goal when searching for information on the Web [19]
[20]. Some of the issues in the current web search engines
are,
 Lack of user adaption
 Retrieve results based on web popularity rather
than user's interests
 Relevant results beyond first few pages have a
much lower chance of being visited by user
 Provides the same result for the same query.
 Relevance estimate is system centered approach.
One solution to above issues is to make the web,
intelligent with the ability to sense and adapt to the users
need. Every individual visitor is treated as individuals,
with targeted content and offers that appeal to their
implicit or explicit needs [9]. As a result it can achieve
 tailoring search results to individuals based on
knowledge of their interests
 identify relevant documents and put them on top
of the result list
 filter irrelevant search results
 relevance is crucial
 making visitor’s time more productive and
engaging.
Thus Web intelligence is an art, which can make the
web as Wisdom web. Ning Zhong et.al (2000) coined the
term Web Intelligence. It is the area of study and research
of the application of artificial intelligence and information
technology on the web in order to create the next
generation of web empowered system [22]. It can be
defined as a process of helping users by providing
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
customized or relevant information on the basis of web
experience to a particular user or set of users. It can also
be defined as a recommendation system that performs
information filtering with intelligence. The general phases
of web intelligence process are (Fig. 2),
 Learning phase,
 Pattern extraction phase
 Information retrieval or Recommendation.
Learning
Phase
PreProcessing
Pattern
Extraction
Recommendation
Fig. 2 The phases of web intelligence
A. Learning Phase
Data collection phase is also known as learning phase.
This phase is used to develop a user profile definition and
user segments. A user profile is a collection of attributes
that user will need to either maintain or derive in order to
support personalization [17]. It is very important phase.
Other phase depend on this phase. The user profile can be
collected by two ways.
 implicit profile
 explicit profile
Implicit profile: The attributes can be derived from
browsing patterns, cookies, and other sources. No
feedback from the user is collected. It can also be derived
from Geo Locations, Behavioural Learning, Contextual
Learning, Social Collaborative Learning, and Simulated
Feedback Learning. It is based on the characteristic of an
individual’s such as Interest, Social Category, Context,
role, functional area. It can also make use of current
session activities, past session activities.
Explicit profile: The attributes come from online
questionnaires, registration forms, integrated CRM or
sales force automation tools, and legacy or existing
databases. External feedback is collected from the user
which provides rating or preferences. User personal
information like age, occupation, religion, economic,
environment etc can be used to generate user profile.
Pre-processing consists of elaborating the raw web
access logs to produce data in a format usable by the
Pattern Discovery phase.
B. Pattern Discovery
Pattern discovery aims to detect interesting patterns
in the pre-processed Web usage data by deploying various
data mining techniques. These methods usually consist of
(Eirinaki & Vazirgiannis, 2003):
 Association rule mining: A technique used for finding
frequent patterns, associations and correlations
between sets of items. In the Web personalization
domain, this method may indicate correlations
between pages not directly connected and reveal
previously unknown associations between groups of
users with same interests.
 Clustering: a method used for grouping together
items that have similar characteristics. In our case
items may either be users (that demonstrate similar
online behavior) or pages (that are similarity utilized
by users).
1797
A Novel Approach for Web Intelligence


Classification: A process that assigns data items to
one of several predefined classes. Classes usually
represent different user profiles.
Sequential pattern discovery: An extension to the
association rule mining technique, used for revealing
patterns of co-occurrence, thus incorporating the
notion of time sequence. A pattern in this case may
be a Web page or a set of pages accessed immediately
after another set of pages.
C. Recommendation Phase
The aim of a recommender system is to determine
which Web pages are more likely to be accessed by the
user in the future [12]. In this phase active user’s
navigation history is compared with the discovered
Navigation patterns in order to recommend a new page or
pages to the user in real time. Generally not all the items
in the active session path are taken into account while
making a recommendation. A very earlier page that the
user visited is less likely to affect the next page since
users generally make the decision about what to click by
the most recent pages. Therefore the concept of window
count is introduced. Window count parameter ‘n’ defines
the maximum number of previous page visits to be used
while recommending a new page.
IV. PROPOSED METHODOLOGY
This paper proposes a new novel approach to perform
web intelligence. The entire process is parted into two
phase such as online phase and offline phase. The
subsequent fig. 3 gives the general framework of the
proposed method.
Browsing
histories
Web server
Preprocessing
User
profile
Pattern
Extraction
Download
histories
Learning Phase
Offline Process
User
Search
Engine
Search
Results
ReRanking
Personalized
Results
Online Process
Fig. 3 The framework of web personalization
The user profile is generated by collecting data from
various sources and by performing necessary
transformation and pre-processing. Personalized and useful
patterns are extracted in the off line process. During online
phase, the current active session of the user are taken into
account, and based on the patterns generated from the uses
previous experience, the personalized information will be
provided to the user.
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
A. Learning Phase or profile generation
In our investigation the effectiveness of personalized
search is based upon the generation of user profile. The
user profile has information that can distinguish one user
from a multitude of other users. Profiles normally include
topics of interests but may also include topics of disinterest
by taking into account relevant and non-relevant
documents [26].
1) Creation of user profile: There are two different ways
to collect users’ interest.
 Implicit feedback
 Explicit feedback
Our system uses the implicit feedback approach to
decipher the interests of a user. The information sources
currently used are:
1. The documents on the user's machine are taken as being
of interest to him/her. The assumption is that if a user
keeps a document on his/her machine there is a strong
possibility that the user is interested in those documents.
2. The browsing history is another source of information.
Although not all web pages browsed are of interest to the
user but pages that are frequently visited and those on
which more time is spent can be taken as useful
documents for the user's interest.
3. The bookmarked pages are definitely interesting for a
user.
4. The download history can be used as a starting point for
deciding favorite directories for download. This would be
more useful in case of those users who organize files on
their machines properly.
We believe that this approach has several advantages over
previous work in which user profiles were built from user
browsing hierarchies [18].
2) User profile representation: A profile constitutes the
user's long-term information desires. There are two main
distinct types of user profiles [27]:
a) Content-based profile: the user profile is depicted
similar to a query, as a vector of terms.
b) Collaborative profile: this approach is based on the
rating patterns of similar users. It is assumed that
individuals with similar rating patterns seem to like the
same kind of information - “like minded people”. Hence,
a collaborative profile may be expressed as a list of
similar users.
3) Vector Space Model:
In our approach content based profile generation is
used. In this model the documents are represented as
feature vectors where features are the keywords extracted
from a given set of the documents. If a term is more
frequent in a document it is expected to be given more
weight than a less frequent word as it represents the
document better. In the term frequency model the value of
a feature is term frequency divided by the total number of
words in the document [29]. The term frequency tfij of a
term ti in a document dj is defined as:
tf ij 
number of occurencesof ti in d j
sum of number of occurencesof all terms in d j
(1)
The inverse document frequency of a term ti in a set of
documents D is defined as:
1798
A Novel Approach for Web Intelligence

number of documents in D
IDFi  log
number
of documentsin D containing ti




(2)
A TF-IDF score wij is the product of tfij and idfi. The
weight is calculated using the equation (3):
wij  tf ij  IDFi
(3)
B. Pre - Processing
Pre-processing involves interpreting raw user data
and selecting which data to use so that is makes better
sense for the rest of the system. The input data can be of
different types (pdf, doc, ppt, html etc). They are first
converted to text and then preprocessed. The
preprocessing step involves stop-word removal and
stemming. These are then converted to feature vectors
where the features are the terms in the documents after the
preprocessing step.
Filtered user data from this process is then the input
for initial pattern extraction for new users. If the user is
not new this filtered data can be combined with existing
user data for discovery of new user patterns.
C. Pattern Extraction
The unique behavior of web user has to be analyzed.
Clustering technique is used to find common patterns,
group similar objects, or to organize them in hierarchies.
The grouping of similar objects should be such that
members within a group are closer to each other than to
members of a different group.
Evolutionary particle swarm optimization based
clustering EPSO-Clustering [28] is based on the idea of
the generation based evolution of the swarm. The swarm
evolves through different intermediate generations to
reach a final generation. Particles are initialized in the first
generation and after each generation the swarm evolves to
a stronger swarm by consuming the weaker particles of
that generation by the stronger ones. The stronger the
particle is, the greater its chance of survival to the next
generation. Stronger particles make mature and stable
generations. The strongest generation reached with an
optimal number of clusters and lowest intra-cluster
distance represent an optimal solution of the problem. The
pseudo code of the process is given in Algorithm 1.
Algorithm 1: Evolutionary Particle Swarm Clustering
1. Initialization of the particles
i. Initialize Vi (t) , Xi (t) , Vmax ,φ, q1 and q2
ii. Initialize swarm size, generation
iii. Initialize particles to input data
2. Iterate generation
i. Iterate swarm
a. Find won data vectors
b. Update velocity and position
ii. Evaluate the strength of the swarm
a. Iterate generation
b. Consume the weaker particles
c. Recalculated positions
3. Exit on number of generation exceed or stopping
criteria fulfill
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
Where, Vi (t) is the current velocity, Vi (t +1) is the new
velocity, w is the inertia weight, q1 and q2 are the vectors
which weigh the cognitive and social components and r1
and r2 are two randomly generated numbers ranging from
0 to 1. The range for the velocities of the particles is from
-Vmax toVmax .
After each iteration the swarm adjusts the positions of
all particles by associating the nearest data vectors
calculated using Euclidian distance equation 4,
recalculates the attributes of the particles, organizing itself
according to the new data vectors won by each particle.
(4)
d ( xn , zi )  xn  zi
D. Recommendation Phase
The user query has been processes based on the user
profile and the personalized results have been suggested
to the use. It is also known as retrieval process. During
web search the results by the search engine are converted
to feature vectors using the same pre-processing
techniques that were used for the user profile. Thus each
search result is represented by a feature vector. These
feature vectors are then passed to Similarity Scorer which
assigns them scores based on their similarity to interest
vectors. Each result is assigned a score equal to the
maximum of the similarity scores with each interest
vector. The results along with their scores are passed to
Re-Ranker which sorts the results based on the scores
assigned and the modifies the ordering that is ultimately
presented to the user.
V. RESULTS AND EVALUATION
To assess the performance of the approach we tested
the algorithm on the NASA web log file and analyzed the
logs containing one day of HTTP requests after passing
the log through all the preprocessing steps. For our
experiment we have considered 100 data vectors, which
have to be categories under 3 clusters. The results are
depicted in the following table1.
TABLE I
sEPSO AND K-MEANS – CLUSTERING
COMPARISON
EPSO
K- Means
Clusters No of
Avg.
Clusters No of
Avg.
nos.
Data
Distance
nos.
Data
Distance
vectors
from
vectors
from
centroid
centroid
1
89
33.1459
1
89
33.15978
2
9
87.1016
2
9
88.73784
3
2
123..5667
3
2
123.5658
Fitness (sum of
243.8142
245.46342
distance)
Mean intra-cluster
81.2712
81.82114
distance
1799
A Novel Approach for Web Intelligence
[3]
[4]
[5]
[6]
Fig. 4 Comparitive graph for EPSO and K-Means
clustering
[7]
[8]
For comparison purpose the intra distance within the
cluster was taken. EPSO algorithm has been compared
with the K-Mean cluster algorithm. Compared to the Kmeans the average distance has been reduced such that
members within a group are closer to each other than to
members of a different group. Thus the swarm was
uniformly distributed along the input data space. Our
proposed method can also be evaluated based on the
following factors.
Precision: It is the ratio of the number of recommended
pages actually viewed by the user to the total number of
recommended pages.
No of recommended pages viewed
(5)
precision 
Total no of recommended pages
Coverage: It is the ratio of the number of recommended
pages actually viewed by the user to the number of
remaining pages in the user transaction leaving the active
session pages.
No of recommended pagesviewed
(6)
cov erage 
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
No of remaining pages
F1- measure: It is the combination of both precision and
recall.
( 2 * precision* recall)
(7)
F1  measure 
precision  recall
VI. CONCLUSION
Web is growing rapidly, but on the other hand the
user’s capability to access web content remains constant.
Currently, Web personalization is the most promising
approach to alleviate this problem and to provide users
with tailored experiences. We built a system that creates
user profiles based on implicitly collected information:
User browsing histories and download histories thereby
improving their performance by addressing the individual
needs and preferences of each user, increasing satisfaction
of user. Thus achieving 100% personalization is not
possible. But we can reduce the searching time to mine
massive amount of data by shifting internet from search
based to find based.
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
REFERENCES
[1]
[2]
Raymond Kosala, Hendrik Blockeel, Web Mining Research: A
Survey, ACM SIGKDD: Explorations: Newsletter of the Special
Interest Group (SIG) on Knowledge Discovery and Data Mining,
Vol. 2, Issue 1 – page 1, July 2000.
Sanghamitra Bandyopadhyay . Sankar K.Pal, Classification and
Learning with Genetic algorithms – Application in Bioinformatics
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
[25]
[26]
and Web Intelligence, ISSN – 1619-7127, Springer Berlin
Heidelberg – 2007 (pg: 242 – 256).
Kleinberg, J.M., Authoritative sources in a hyperlinked
environment. In Proceedings of ACM-SIAM Symposium on
Discrete Algorithms, 1998, pages 668-677 – 1998.
P. Ravi Kumar and Ashutosh Kumar Singh, Web Structure
Mining: Exploring Hyperlinks and Algorithms for Information
Retrieval, American Journal of Applied Sciences 7 (6): 840-845,
2010, ISSN 1546-9239.
Srivastava J, Desikan P and V Kumar, Web Mining - Concepts,
Applications & Research Direction, in 2002 Conference.
Rochester Institute of Technology, Web Usage Mining: Data
Preprocessing, Pattern Discovery and Pattern Analysis on the RIT
Web Data, MS Project Report, Rochester Institute of Technology,
2008.
Han, J., Kamber, M. Kamber. Data mining: concepts and
techniques. Morgan Kaufmann Publishers, 2000.
Manoj Swami, Manasi Kulkarni, Understanding Web
Personlization with Web Usage Mining and its Application :
Recommender System, International Journal of Emerging
Techonolgy and Advanced Engineering, vol. 3,Issue 5, May 2013,
pp.726-730.
P. Markellou, Maria Rigou, Spiros S., Mining for Web
Personalization,
Honghua Dai, Bamshad Mobasher, Integrating Semantic
Knowledge with Web Usage Mining for Personalization.
C. Ramesh, Dr. K. V. Chalapati Rao, Dr. A. Goverdhan, A
Semantically Enriched Web Usage Based Recommendation
Model. International Journal of Computer Science and Information
Technology (IJCSIT) Vol 3, No 5, Oct 2011.
Daniar Asanov, Algorithms and Methods in Recommender
Systems.
Emmanouil Vozalis, K.G. Margaritis, Analysis of Recommender
Systems’ Algorithms. Parallel and Distributed Processing
Laboratory.
A.C.M. Fong, B. Zhou, Jie Tang, Guan Y. Hong, Generation of
Personalized Ontology Based on Consumer Emotion and Behavior
Analysis. IEEE Transactions on Affective Computing, Vol 3, No 2,
April-June 2012.
Nizar R. Mabroukeh, C. I. Ezeife, Ontology-based Web
Recommendation from Tags. ICDE Workshop 2011. 2011 IEEE.
A.C.M. Fong, B. Zhou, Jie Tang, Guan Y. Hong, Web Content
Recommender System based on Consumer Behavior Modeling.
IEEE Transactions on Consumer Electronics, Vol. 57, No. 2, May
2011
Micro Speretta, Susan Gauch, Personalized Search Based on User
Serach Histories, Proceesings of IEEE/WIC/ACM International
Conference on Web Intelligence, 2005.
J. Chaffee, S.Gauch, Personal Ontologies for Web Navigation. In
proceedings of the 9th International Conference on Information and
Knowledge Management, pp 227-234, 2000.
Honghua Dai, Bamshad Mobasher, A road map to more web
personalization: Integrating Domain Knowledge with web usage
mining.
Gabriella Pasi, Issue in personalizing information retrieval, IEEE
Intelligent Informatics Bulletin, Vol. 11, No.1, 2010.
Cristóbal Romero, Sebastián Ventura, Amelia Zafra, Paul de Bra,
Applying Web usage mining for personalizing hyperlinks in Webbased adaptive educational systems, Computers & Education,
Elsevier, 53 (2009) 828–840.
Zhong, Ning, Liu Yao, Jiming, Yao, Y.Y. Ohsuga, S. (2000), Web
Intelligence (WI), Web Intelligence, Computer Software and
Applications Conference, 2000. COMPSAC 2000. The 24th
Annual International, p. 469, doi:10.1109/CMPSAC.2000.884768,
ISBN 0-7695-0792-1
Dingqi Yang, Daqing Zhang, Zhiyong Yu, Zhu Wang, A
Sentiment-Enhanced Personalized Location Recommendation
System, 24th ACM Conference on Hypertext and Social Media,
May 2013.
Tricia Rambharose, Alexander Nikov, Computational intelligencebased personalization of interactive web systems, WSEAS
Transactions on Information Science and Applications, ISSN:
1790-0832, Issue 4, Volume 7, April 2010.
Supiya Ujjin and Peter J. Bentley, Particle Swarm Optimization
Recommender System.
Susan Gauch, Mirco Speretta, Aravind Chandramouli, and
Alessandro Micarelli. User profiles for personalized information
access. pages 54-89, 2007
1800
A Novel Approach for Web Intelligence
[27] Tsvi Kuflik and Peretz Shoval, Generation of User Profiles for
Information Filtering – Research Agenda, SIGIR, ACM, 2007
[28] Shafiq Alam, Gillian Dobbie, Patricia Riddle, An Evolutionary
Particle Swarm Optimization Algorithm for Data Clustering, IEEE
Swarm Intelligence Symposium, 2008.
[29] B. Everitt, “Cluster Analysis,” 2nd Edition, Halsted Press, New
York, 1980.
[30] Sarabjot Singh Anand and Bamshad Mobasher, Intelligent
Techniques for Web Personalization, LNAI 3169, pp. 1–36,
Springer-Verlag Berlin Heidelberg 2005
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
1801
A Novel Approach for Web Intelligence
M.R. Thansekhar and N. Balaji (Eds.): ICIET’14
1802