Personalization Services in CADAL
Zhang Yin, Zhuang Yuting, Wu Jiangqin
College of Computer Science, Zhejiang University
November 19, 2006
Outline

Introduction
The Architecture of Personalization Services
Personalized Search
Recommendation based on the Information Filtering techniques
Future plan
Background

The number of digital books meeting the OEB standard is 1,023,425.
Finding useful information and knowledge in the large CADAL digital collection is a time-consuming process.
Personalization services are provided to help users quickly locate the items they are interested in within the CADAL collection.
The Architecture of Personalization Services

[Figure: architecture diagram. Labeled components: Users; Personal Portal; Personal Agent Services; User Metadata; Personalized Services; Search; Recommendation Services; Link Generation Services; Repository Services; Query Service; Modification Service; Repositories A, B and C, each with its own Metadata.]
Query Expansion

Many users submit only one or two keywords as a query.
Search results can be improved by expanding the query with additional search keywords.
Query expansion relies on natural language processing (NLP) techniques and relevance feedback methods.
Keyword Expansion – The Trigger Pairs Model

If a word S is significantly correlated with another word T, then (S, T) is considered a trigger pair, with S the trigger and T the triggered word.
When S appears in a document, we expect T to appear after S with some confidence.
Trigger pairs selection algorithm (1)

Let the query keywords be $K_1, K_2, \ldots, K_m$ and let $N$ be the expected number of refinement words. Initialize $n = m$ and let $S$ be the empty set.
1. Build the trigger set of each keyword:
$S_1 = \{s_{11}, s_{12}, \ldots, s_{1i}\} \rightarrow K_1$ is the trigger set of $K_1$, where $s_{11}, s_{12}, \ldots, s_{1i}$ are sorted in decreasing order of mutual information;
$S_2 = \{s_{21}, s_{22}, \ldots, s_{2j}\} \rightarrow K_2$ is the trigger set of $K_2$;
$\ldots$
$S_m = \{s_{m1}, s_{m2}, \ldots, s_{mk}\} \rightarrow K_m$ is the trigger set of $K_m$.

Trigger pairs selection algorithm (2)

2. $S = S \cup (S_p \cap S_q \cap \cdots \cap S_r)$, where $(S_p, S_q, \ldots, S_r)$ is one of the combinations of $n$ sets out of $m$. The words in $S$ are sorted in decreasing order of mutual information.
3. If $|S| \ge N$, let the top $N$ words in $S$ be the refinement words and stop.
4. Otherwise, let $n = n - 1$ and continue from step 2.
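The selection procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `select_refinement_words`, the dictionary layout of the trigger sets, and the choice to rank a word by the sum of its mutual information over the intersected sets are not given on the slides. The sketch also folds in every combination of $n$ sets at each level before checking the size of $S$.

```python
from itertools import combinations


def select_refinement_words(trigger_sets, n_words):
    """Pick up to n_words refinement words from per-keyword trigger sets.

    trigger_sets: one dict per keyword, mapping a triggered word to its
    mutual information with that keyword (hypothetical data layout).
    """
    m = len(trigger_sets)
    selected = {}  # word -> best score seen so far (assumed ranking rule)
    # Start with words triggered by all m keywords, then relax to m-1, ...
    for n in range(m, 0, -1):
        for combo in combinations(trigger_sets, n):
            common = set(combo[0]).intersection(*combo[1:])
            for word in common:
                score = sum(ts[word] for ts in combo)
                selected[word] = max(score, selected.get(word, float("-inf")))
        if len(selected) >= n_words:
            break  # enough candidates: stop relaxing n
    ranked = sorted(selected, key=lambda w: (-selected[w], w))
    return ranked[:n_words]
```

With two keywords, a word triggered by both is picked first, and words triggered by only one keyword are added when more refinement words are requested.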
Implemented Information filtering techniques


A Content-based filtering method
A Collaborative filtering method
LR_Rocchio algorithm

The user profile is represented as a vector of indicative words extracted from the contents of all digitized books.
The LR_Rocchio algorithm sets a Bayesian prior on the logistic regression model parameter using the user profile calculated by the Rocchio algorithm.
Incremental Rocchio algorithm

A widely used user profile updating algorithm is the incremental Rocchio algorithm, which can be generalized as:

$$Q' = \alpha Q + \beta \frac{\sum_{x_i \in R} x_i}{|R|} - \gamma \frac{\sum_{x_i \in NR} x_i}{|NR|}$$

where $Q$ is the initial profile vector, $Q' = (w_{r1}, \ldots, w_{rk})$ is the new profile vector, $R$ is the set of relevant documents, $NR$ is the set of irrelevant documents, and $\alpha$, $\beta$, $\gamma$ are weighting coefficients.
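A minimal sketch of one Rocchio update step, assuming documents and profiles are plain lists of term weights of equal length. The function name `rocchio_update` and the default values of `alpha`, `beta` and `gamma` are illustrative and do not come from the slides.

```python
def rocchio_update(profile, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return the new profile Q' from the old profile Q and feedback documents."""

    def centroid(docs):
        # Mean vector of a document set; zero vector when the set is empty.
        if not docs:
            return [0.0] * len(profile)
        return [sum(column) / len(docs) for column in zip(*docs)]

    rel = centroid(relevant)
    irr = centroid(irrelevant)
    # Move toward the relevant centroid and away from the irrelevant one.
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(profile, rel, irr)]
```

Calling the function again with the returned profile and the next batch of feedback gives the incremental behavior described above.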
Logistic regression

Logistic regression is a widely used statistical algorithm that provides an estimate of the posterior probability $P(y \mid x)$ of an unobserved variable $y$ given an observed variable $x$:

$$P(y = \pm 1 \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$$

where $w$ is the $K$-dimensional logistic regression model parameter learned from the training data.
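The posterior above can be evaluated directly. The sketch below assumes labels in {+1, -1} and list-valued vectors; the name `posterior` is illustrative. It uses a numerically stable form of the same expression so that large negative scores do not overflow.

```python
import math


def posterior(y, x, w):
    """P(y | x, w) for a label y in {+1, -1} under logistic regression."""
    score = y * sum(wi * xi for wi, xi in zip(w, x))
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # Equivalent form that avoids exp() of a large positive number.
    e = math.exp(score)
    return e / (1.0 + e)
```

The probabilities of the two labels sum to one for any `x` and `w`, which is a quick sanity check on the formula.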
LR prior (1)

Bayesian learning algorithms often begin with a prior belief $p(w)$ about the distribution of the logistic regression model parameter.
Gaussian distribution: $N(m_w, v_w^{-1})$
A classifier learned with a non-informative prior usually overfits the training data.
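To illustrate how a Gaussian prior pulls the learned parameter toward its mean, here is a hedged sketch of maximum a posteriori estimation by plain gradient ascent. It assumes a single scalar precision shared by all dimensions, labels in {+1, -1}, and hypothetical values for the step size and iteration count; none of these choices are specified on the slides.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def map_estimate(data, prior_mean, precision, steps=500, lr=0.1):
    """Maximize sum_i log sigmoid(y_i w.x_i) - (precision / 2) * |w - m|^2.

    data: list of (x, y) pairs, x a list of floats and y in {+1, -1}.
    """
    w = list(prior_mean)
    for _ in range(steps):
        # Gradient of the log prior: pulls w back toward the prior mean.
        grad = [-precision * (wj - mj) for wj, mj in zip(w, prior_mean)]
        # Gradient of the log likelihood.
        for x, y in data:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coeff = (1.0 - sigmoid(margin)) * y
            for j, xj in enumerate(x):
                grad[j] += coeff * xj
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

A larger precision keeps the estimate closer to the prior mean, while a precision near zero approaches the non-informative case in which the parameter follows the training data alone.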
LR prior (2)

A prior mean $m_w$ that encodes Rocchio's suggestion about the decision boundary can be learned via constrained maximum likelihood estimation:

$$m_w = \arg\max_w \sum_{i=1}^{t} \log \frac{1}{1 + \exp(-y_i w^T x_i)}$$
Under the constraint:
cos  w, wR   0
The Approaches of Collaborative Filtering

Memory-based: Pearson correlation coefficients
Model-based: clustering, aspect model
Hybrid
A hybrid approach using cluster-based smoothing

1. Create the user clusters $C$ using the k-means method.
2. Given the user $u_a$ with $i$ rated items, an item $t$, and an integer $K$ (the number of nearest neighbors), choose $s$ users into $G$ from the groups that are most similar to user $u_a$.
3. Calculate the similarity $sim(u, u_a)$ for each $u$ in $G$, where the rating of user $u$ is the combination of $R_u(t)$ and the cluster rating $R_{C_u}(t)$.
4. Select the top-$K$ most similar users as neighbors.
5. Predict the rating of item $t$ for $u_a$ from the behavior of the $K$ nearest neighbors.
Symbol definition

$T = \{t_1, t_2, \ldots, t_m\}$ is the set of items.
$U = \{u_1, u_2, \ldots, u_n\}$ is the set of users.
Each triple $(u_i, i_i, r_i)$ indicates that item $i_i$ is rated $r_i$ by user $u_i$.
$R_u(t)$ denotes the rating of item $t$ by user $u$.
$\bar{R}_u$ denotes the average rating of user $u$.
The clustering results of the users $U = \{u_1, u_2, \ldots, u_n\}$ are represented as $\{C_u^1, C_u^2, \ldots, C_u^n\}$.
$u_a$ denotes the active user, for whom the recommender service is provided.
Similarity measure function

The Pearson correlation coefficient is taken as the similarity measure function.
The similarity between user $u$ and user $u'$ is defined as:

$$sim_{u,u'} = \frac{\sum_{t \in T(u) \cap T(u')} \left(R_u(t) - \bar{R}_u\right)\left(R_{u'}(t) - \bar{R}_{u'}\right)}{\sqrt{\sum_{t \in T(u) \cap T(u')} \left(R_u(t) - \bar{R}_u\right)^2}\,\sqrt{\sum_{t \in T(u) \cap T(u')} \left(R_{u'}(t) - \bar{R}_{u'}\right)^2}}$$

where $T(u)$ is the set of items rated by user $u$.
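A small sketch of the Pearson similarity between two users, assuming each user's ratings are stored as a dictionary from item to rating and that each user's mean is taken over all items that user rated. The function name and the convention of returning 0.0 when the similarity is undefined are assumptions.

```python
import math


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over the items rated by both users."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[t] - mean_u) * (ratings_v[t] - mean_v) for t in common)
    den_u = math.sqrt(sum((ratings_u[t] - mean_u) ** 2 for t in common))
    den_v = math.sqrt(sum((ratings_v[t] - mean_v) ** 2 for t in common))
    if den_u == 0.0 or den_v == 0.0:
        return 0.0  # a flat rating profile carries no correlation signal
    return num / (den_u * den_v)
```

Two users whose ratings rise and fall together score close to 1, and users with opposite tastes score close to -1, regardless of how generous each one is on average.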
Reducing Data Sparsity

In the early stage of system operation, the collected rating data is sparse. To fill the missing values in the data set, the user clusters are explicitly exploited to smooth the sparse data:

$$R_u(t) = \begin{cases} R_u(t) & \text{if user } u \text{ rated item } t \\ \hat{R}_u(t) & \text{otherwise} \end{cases}$$

$$\hat{R}_u(t) = \bar{R}_u + \Delta R_{C_u}(t)$$

$$\Delta R_{C_u}(t) = \frac{\sum_{u' \in C_u(t)} \left(R_{u'}(t) - \bar{R}_{u'}\right)}{|C_u(t)|}$$

where $C_u(t) \subseteq C_u$ is the set of users in cluster $C_u$ that have rated item $t$, and $|C_u(t)|$ is the number of users in the cluster who have rated item $t$.
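The smoothing rule can be sketched as below. The data layout (a dictionary from user to that user's item ratings, restricted to one cluster) and the function names are assumptions, and every user is assumed to have rated at least one item.

```python
def user_mean(ratings):
    """Average rating of one user over the items that user rated."""
    return sum(ratings.values()) / len(ratings)


def cluster_deviation(cluster_ratings, item):
    """Average deviation from their own mean of cluster members who rated item."""
    devs = [r[item] - user_mean(r)
            for r in cluster_ratings.values() if item in r]
    return sum(devs) / len(devs) if devs else 0.0


def smoothed_rating(user, item, cluster_ratings):
    """Original rating if it exists, otherwise the cluster-smoothed estimate."""
    own = cluster_ratings[user]
    if item in own:
        return own[item]
    return user_mean(own) + cluster_deviation(cluster_ratings, item)
```

For a user who has not rated an item, the estimate is that user's own average shifted by how far the rest of the cluster tends to rate the item above or below their averages.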
Increasing System Scalability

The user clusters are used in neighbor selection to increase system scalability.
The centroid of a cluster is represented as the average rating over the cluster. The similarity between the cluster $C$ and the active user $u_a$ is defined as:

$$sim_{u_a,C} = \frac{\sum_{t \in T(u_a) \cap T(C)} \Delta R_C(t)\left(R_{u_a}(t) - \bar{R}_{u_a}\right)}{\sqrt{\sum_{t \in T(u_a) \cap T(C)} \left(\Delta R_C(t)\right)^2}\,\sqrt{\sum_{t \in T(u_a) \cap T(C)} \left(R_{u_a}(t) - \bar{R}_{u_a}\right)^2}}$$

After calculating the similarity, the users in the most similar clusters are taken as candidates, and their similarity with the active user is recalculated on the smoothed data.
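A sketch of the cluster pre-selection step, assuming each cluster centroid is stored as a dictionary from item to the average rating deviation of the cluster on that item. The names `cluster_similarity` and `most_similar_cluster` are illustrative.

```python
import math


def cluster_similarity(active_ratings, centroid):
    """Similarity between the active user's ratings and one cluster centroid.

    active_ratings: item -> rating given by the active user.
    centroid: item -> average rating deviation of the cluster on that item.
    """
    mean_a = sum(active_ratings.values()) / len(active_ratings)
    common = set(active_ratings) & set(centroid)
    num = sum(centroid[t] * (active_ratings[t] - mean_a) for t in common)
    den_c = math.sqrt(sum(centroid[t] ** 2 for t in common))
    den_a = math.sqrt(sum((active_ratings[t] - mean_a) ** 2 for t in common))
    if den_c == 0.0 or den_a == 0.0:
        return 0.0
    return num / (den_c * den_a)


def most_similar_cluster(active_ratings, centroids):
    """Return the key of the centroid most similar to the active user."""
    return max(centroids,
               key=lambda c: cluster_similarity(active_ratings, centroids[c]))
```

Comparing the active user against a handful of centroids first means the more expensive user-to-user similarity only has to be computed for members of the closest clusters.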
Weighting

Different weights $w_{ut}$ are placed on the original data and the smoothed data when calculating the similarity between the cluster users and the active user:

$$w_{ut} = \begin{cases} 1 - \lambda & \text{if user } u \text{ rated item } t \\ \lambda & \text{otherwise} \end{cases}$$

where $\lambda$ is the tuning parameter that balances the original rating against the group rating; its value varies from 0 to 1.
Reformed similarity measure function

The system selects the top $K$ most similar users based on the following similarity function:

$$sim_{u_a,u} = \frac{\sum_{t \in T(u_a)} w_{ut}\left(R_u(t) - \bar{R}_u\right)\left(R_{u_a}(t) - \bar{R}_{u_a}\right)}{\sqrt{\sum_{t \in T(u_a)} w_{ut}^2\left(R_u(t) - \bar{R}_u\right)^2}\,\sqrt{\sum_{t \in T(u_a)} \left(R_{u_a}(t) - \bar{R}_{u_a}\right)^2}}$$
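The weighted similarity can be sketched as follows. The sketch assumes a weight of 1 - lam for ratings the neighbor gave personally and lam for ratings filled in by smoothing, squared weights in the neighbor's normalization term, and a neighbor mean taken over personally rated items only. The function name, the data layout and the default value of `lam` are hypothetical.

```python
import math


def weighted_similarity(active, neighbor, rated_by_neighbor, lam=0.35):
    """Similarity between the active user and a candidate neighbor.

    active: item -> rating given by the active user.
    neighbor: item -> rating of the neighbor after smoothing.
    rated_by_neighbor: set of items the neighbor rated personally.
    lam: weight of smoothed ratings; original ratings get 1 - lam.
    """
    rated = [neighbor[t] for t in rated_by_neighbor if t in neighbor]
    if not active or not rated:
        return 0.0
    mean_a = sum(active.values()) / len(active)
    mean_u = sum(rated) / len(rated)  # average over the neighbor's own ratings
    num = den_u = den_a = 0.0
    for t, r_a in active.items():
        if t not in neighbor:
            continue  # neither an original nor a smoothed rating is available
        w = 1.0 - lam if t in rated_by_neighbor else lam
        dev_u = neighbor[t] - mean_u
        dev_a = r_a - mean_a
        num += w * dev_u * dev_a
        den_u += (w * dev_u) ** 2
        den_a += dev_a ** 2
    if den_u == 0.0 or den_a == 0.0:
        return 0.0
    return num / (math.sqrt(den_u) * math.sqrt(den_a))
```

With `lam` set to 0.5 every rating carries the same weight and the result reduces to the ordinary Pearson form over the active user's items.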
Prediction for the active user

After the neighbor selection, a weighted aggregate of the deviations from each neighbor's mean is used to generate the prediction for the active user:

$$R_{u_a}(t) = \bar{R}_{u_a} + \frac{\sum_{i=1}^{K} w_{u_i t} \cdot sim_{u_a,u_i} \cdot \left(R_{u_i}(t) - \bar{R}_{u_i}\right)}{\sum_{i=1}^{K} w_{u_i t} \cdot sim_{u_a,u_i}}$$
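The prediction step can be sketched in a few lines. The tuple layout for neighbors and the fallback to the active user's own mean when the weights sum to zero are assumptions made for this illustration.

```python
def predict_rating(mean_active, neighbors):
    """Predict the active user's rating of one item.

    mean_active: the active user's average rating.
    neighbors: list of (weight, similarity, rating, mean_rating) tuples,
        one per selected neighbor, all referring to the same item.
    """
    num = sum(w * s * (r - m) for w, s, r, m in neighbors)
    den = sum(w * s for w, s, _, _ in neighbors)
    if den == 0:
        return mean_active  # no usable neighbor information
    return mean_active + num / den
```

Because only deviations from each neighbor's mean are aggregated, a neighbor who rates everything highly does not inflate the prediction.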

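The books a user has collected can be found on the home page shown after the user logs in, as in the figure below:

[Figure: screenshot of the user's home page, with these callouts: "Modify the user's information"; "Set the rule"; "The complete list of the user's collections"; "My bookshelf: the books the user has collected".]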
Future Plan

Extend the architecture of the personalization services to incorporate semantic web techniques.
Put more effort into web usage mining techniques to discover user patterns from the web data.
Thanks!
Email: [email protected]