VIRTUAL PRESENCE
Authors:
Voislav Galić, [email protected]
Dušan Zečević, [email protected]
Đorđe Đurđević, [email protected]
Veljko Milutinović, [email protected]
http://galeb.etf.bg.ac.yu/~vm/tutorial
DEFINITION
Virtual presence is a term
with various shades of meanings in different industries,
but its essence remains constant;
it is a new tool that enables some form of telecommunication
in which the individual may substitute their physical presence
with an alternate, typically electronic, presence
SUMMARY
- Introduction to Virtual Presence
- Data Mining for Virtual Presence
- A New Software Paradigm
- Selected Case Studies
INTRODUCTION TO VP
- Definitions
- VP applications
- Psychological aspects
DATA MINING FOR VP
- Why Data Mining?
- What can Data Mining do?
- Growing popularity of Data Mining
- Algorithms
SOFTWARE AGENTS
- A new software paradigm
- Standardization
- FIPA specifications
- Agent management
- Agent Communication Language
GoodNews (CMU*)
- Categorization of financial news articles
- Co-located phrases
- Domain Experts
- Implementation and results
* Carnegie Mellon University, Pittsburgh, USA
iMatch (MIT*)
- The idea
- associate MIT students and staff
in order to ease their cooperation;
- help students find resources they need
- Implementation
- advanced, agent-based system architecture
- Tomorrow?
* Massachusetts Institute of Technology, USA
“Tourist city” (ETF*)
• A qualitative step forward
in the domain of maximization of customer satisfaction
• Technologies:
• Data Mining
• Software Agents (mobile)
* Faculty of Electrical Engineering, University of Belgrade, Serbia and Montenegro
CONCLUSION
This tutorial will attempt to familiarize you with:
- The concept of VP (Virtual Presence)
as a new technological challenge
- The new paradigms and technologies
that will bring the VP to everyday life:
- Data Mining
- Software Agents
INTRODUCTION
Virtual presence will arguably be
one of the most important aspects of personal
communication in the twenty-first century
Essence of VP
• The usefulness and reliability of virtual presence
• The ability to conduct everyday tasks by being virtually
or electronically present
How to Accomplish it?
• The presence is accomplished through the Internet, video,
or other communications, perhaps even psychically one day
• Technological advances will make virtual presence more sophisticated,
altering the very meaning of the word “presence”
VP Applications
• VP in government
– “Sunshine laws”
– Voting
VP Applications
• VP in business
– Online board meetings
– Shareholder voting online
VP Applications
• VP in education
– interactive lectures and courses
VP Applications
• VP in medicine
– Telemedicine
• Diagnostics
• Remote surgery
– Risks
• Privacy
VP Applications
• VP in everyday life
– Telecommuting/Telework
– Software agents as our virtual “shadows”
Psychological Aspects
• Cyberspace and Mind
• Presence in Virtual Space
• Communal Mind and Virtual Community
DATA MINING
Knowledge discovery is a non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data
Many Definitions
• Data mining is also called data or knowledge discovery
• It is a process of inferring knowledge
from large oceans of data
• Search for valuable information in large volumes of data
• Analyzing data from different perspectives
and summarizing it into useful information
Why Data Mining ?
• DM allows you to extract knowledge from historical data
and predict outcomes of future situations
• Optimize business decisions
and improve customers’ satisfaction with your services
• Analyze data from many different angles, categorize it,
and summarize the relationships identified
• Reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage
What Can Data Mining Do?
• Identify your best prospects
and then retain them as customers
• Predict cross-sell opportunities and make recommendations
• Learn parameters influencing trends in sales and margins
• Segment markets and personalize communications
etc.
The Power of Data Mining
• Having a database is one thing,
making sense of it is quite another
• It does not rely on narrow human queries to produce results,
but instead uses AI related technology and algorithms
• Inductive reasoning
• Using more than one type of algorithm
to search for patterns in data
• Data mining usually produces more general (= more powerful)
results than those obtained by traditional techniques
• Relational DB storage and management technology is adequate
for data mining applications smaller than 50 gigabytes
Reasons for the Growing
Popularity of Data Mining
• Growing Data Volume
• Low Cost of Machine Learning
• Limitations of Human Analysis
…
Tasks Solved by Data Mining
• Predicting
• Classification
• Detection of relations
• Explicit modeling
• Clustering
• Market basket analysis
• Deviation detection
Algorithms
• Generally, their complexity is around O(n log n),
where n is the number of records
• Data mining includes three major components,
with corresponding algorithms:
– Clustering (Classification)
– Association Rules
– Sequential Analysis
Classification Algorithms
• The aim is to develop a description or model
for each class in a database, based on the features
present in a set of class-labeled “training data”
• Data classification methods:
– Statistical algorithms
– Neural networks
– Genetic algorithms
– Nearest neighbor method
– Rule induction
– Data visualization
Classification-rule Learning
• Data abstraction
• Classification-rule learning – finding rules or decision trees
that partition given data into predefined classes
– Hunt’s method
• Decision tree building algorithms:
– ID3 / C4.5 algorithm
– SLIQ / SPRINT algorithm (IBM)
• Other algorithms
Parallel Algorithms
• Basic Idea: N training data items are randomly distributed
to P processors. All the processors cooperate
to expand the root node of the decision tree
• There are two approaches for future progress
(the remaining nodes):
– Synchronous approach
– Partitioned approach
Association Rule Algorithms
• An association rule implies a certain association relationship
among a set of objects in a database
• These objects “occur together”, or “one implies the other”
• Formally: X ⇒ Y, where X and Y are sets of items (itemsets)
• Key terms
– Confidence
– Support
• The goal – to find all association rules
that satisfy user-specified minimum support
and minimum confidence constraints.
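The two key terms can be illustrated with a minimal Python sketch. It assumes the usual definitions, which the slide does not spell out: the support of an itemset is the fraction of transactions that contain it, and the confidence of X ⇒ Y is the support of X ∪ Y divided by the support of X. The basket data and the function names are invented for this illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical market-basket data, invented for this illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "milk"})          # rule bread => milk
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

A rule would be reported only if both values reach the user-specified minimum thresholds.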
Association Rule Algorithms
• Apriori algorithm and its variations
– AprioriTid
– AprioriHybrid
– FT (Fault-tolerant) Apriori
• Distributed / Parallel algorithms (FDM, …)
Sequential Analysis
• Sequential Patterns
• The problem – finding all sequential patterns
with user-specified minimum support
• Elements of a sequential pattern need not be:
– consecutive
– simple items
• Algorithms for finding sequential patterns
– “count-all” algorithms
– “count-some” algorithms
Conclusion
• Drawbacks of existing algorithms
– Data size
– Data noise
• There are two critical technological drivers:
– Size of the database
– Query complexity
• The infrastructure has to be significantly enhanced
to support larger applications
• Solutions
– Adding extensive indexing capabilities
– Using new HW architectures
to achieve improvements in query time
THE NEW SOFTWARE
PARADIGM
All software agents are programs, but not all
programs are agents
Many Definitions
• Computational systems that inhabit some dynamic environment,
sense and act autonomously and realize a set of goals or tasks
for which they are designed
• Hardware or (more usually) software-based computer system
that enjoys the following properties:
- Reactive (sensing and acting)
- Autonomous
- Goal-oriented (pro-active, purposeful)
- Temporally continuous
- Communicative (socially able)
- Learning (adaptive)
- Mobile
- Flexible
- Character
Interesting Topic of Study
• They draw on and integrate many diverse disciplines
of computer science and other areas:
– objects and distributed object architectures
– adaptive learning systems
– artificial intelligence and expert systems
– collaborative online social environments
– security
– knowledge based systems, databases
– communications networks
– cognitive science and psychology
– …
What Problems Do Agents Solve?
• Client/server network bandwidth problem
• In the design of a client/server architecture
• The problems created by intermittent
or unreliable network connections
• Attempts to get computers to do real thinking for us
The New Software Paradigm
• Unless special care has been taken in the design of the code,
two software programs cannot interoperate
• The promise of agent technology is to move
the burden of interoperability from software programmers
to programs themselves
This can happen if two conditions are met:
– A common language (Agent Communication Language – ACL)
– An appropriate architecture
The Need for Standards
• Anywhere, anytime consumer access to the Universal bouquet
of information and services is the new goal of the information
revolution
• The scope of Internet standards
makes the scope of choices extreme
• The Foundation for Intelligent Physical Agents (FIPA),
established in 1996 in Geneva
• international non-profit association of companies and
organizations
• specifications of generic agent technologies.
FIPA Specifications
• Agent Management
• Agent Communication Language
• Agent/Software Integration
• Agent Management Support for Mobility
• Human-Agent Interaction
• Agent Security Management
• Agent Naming
• FIPA Architecture
• Agent Message Transport
etc.
Agent Management
• Provides the normative framework within which FIPA agents
exist and operate
• Establishes the logical reference model for the creation,
registration, location, communication, migration
and retirement of agents
- The entities contained in the
reference model are logical
capability sets and do not imply
any physical configuration
- Additionally, the implementation
details of individual APs and agents
are the design choices of the
individual agent system developers
Components of the Model
• Agent
- computational process
- fundamental actor on an AP
- as a physical software process has a life cycle
that has to be managed by the AP
• Directory Facilitator
- yellow pages to other agents
- supported functions are: register, deregister, modify, search
• Agent Management System
- white pages services to other agents
- maintains a directory of AIDs which contain transport addresses
- supported functions are: register, deregister, modify, search,
get-description, operations for underlying AP
• Message Transport Service
- communication method between agents
• Agent Platform
- physical infrastructure in which agents can be deployed
• Software
- all non-agent, executable collections of instructions
accessible through an agent
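The yellow-pages role of the Directory Facilitator can be pictured with a small in-memory sketch. This is only an illustration of the four listed functions, not the FIPA interface: the class name, the method signatures and the use of plain strings for agent names and services are all assumptions made here.

```python
class DirectoryFacilitator:
    """In-memory yellow pages: maps an agent name to the services it offers."""

    def __init__(self):
        self._entries = {}

    def register(self, agent_name, services):
        if agent_name in self._entries:
            raise ValueError(f"{agent_name} is already registered")
        self._entries[agent_name] = set(services)

    def deregister(self, agent_name):
        del self._entries[agent_name]

    def modify(self, agent_name, services):
        if agent_name not in self._entries:
            raise KeyError(agent_name)
        self._entries[agent_name] = set(services)

    def search(self, service):
        """Return the names of all agents advertising `service`, sorted."""
        return sorted(n for n, s in self._entries.items() if service in s)


# Hypothetical agents and service names, invented for this illustration.
df = DirectoryFacilitator()
df.register("j", {"reserve-ticket"})
df.register("k", {"reserve-ticket", "weather"})
df.modify("j", {"weather"})
found = df.search("weather")
df.deregister("k")
```

In a real platform these operations would be requested through ACL messages rather than direct method calls.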
Agent Life Cycle
• FIPA agents exist physically on an AP and utilize the facilities
offered by the AP for realising their functionalities
• In this context, an agent, as a physical software process,
has a physical life cycle that has to be managed by the AP
• The state transitions of agents can be described as:
- create
- invoke
- destroy
- quit
- suspend
- resume
- wait
- wake up
- move*
- execute*
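The life cycle can be read as a small state machine. The sketch below is a hedged illustration: the slide names only the transitions, so the state names (initiated, active, suspended, waiting, transit) are assumptions borrowed from the FIPA agent management model, and treating destroy and quit as valid from any state, as well as reading the asterisk as "mobile agents only", are assumptions too.

```python
# State names are assumptions taken from the FIPA agent management model;
# the slide lists only the transition names.
TRANSITIONS = {
    ("initiated", "invoke"): "active",
    ("active", "suspend"): "suspended",
    ("suspended", "resume"): "active",
    ("active", "wait"): "waiting",
    ("waiting", "wake up"): "active",
    ("active", "move"): "transit",      # assumed: mobile agents only
    ("transit", "execute"): "active",   # assumed: mobile agents only
}
TERMINATING = {"destroy", "quit"}       # assumed valid from any state


class AgentLifeCycle:
    def __init__(self):
        # "create" brings the agent into existence in the initiated state.
        self.state = "initiated"

    def apply(self, transition):
        if self.state == "terminated":
            raise ValueError("agent no longer exists")
        if transition in TERMINATING:
            self.state = "terminated"
        elif (self.state, transition) in TRANSITIONS:
            self.state = TRANSITIONS[(self.state, transition)]
        else:
            raise ValueError(f"{transition!r} not allowed in state {self.state!r}")
        return self.state


agent = AgentLifeCycle()
history = [agent.apply(t) for t in
           ["invoke", "wait", "wake up", "move", "execute", "quit"]]
```

Keeping the transitions in a table makes it easy for the platform to reject requests that are invalid in the agent's current state.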
Agent Communication Language
• The specification consists of a set of message types
and the description of their meanings
• Requirements:
– Implementing a subset of the pre-defined message types
and protocols
– Sending and receiving the not-understood message
– Correct implementation of communicative acts
defined in the specification
– Freedom to use communicative acts with other names,
not defined in the specification
– Obligation of correctly generating messages in the transport form
– Language must be able to express propositions,
objects and actions
– The use of Agent Management Content Language and ontology
ACL Syntax Elements
• Pre-defined message parameters:
:sender, :receiver, :content, :reply-with, :in-reply-to, :envelope,
:language, :ontology, :reply-by, :protocol, :conversation-id
• Communicative acts:
accept-proposal, agree, cancel, cfp, confirm, disconfirm, failure,
inform, inform-if, inform-ref, not-understood, propose, query-if,
query-ref, refuse, reject-proposal, request, request-when,
request-whenever, subscribe
Communication Examples
- Agent i asks agent j if j is registered with domain server d1:
(query-if
  :sender i
  :receiver j
  :content (registered (server d1) (agent j))
  :reply-with r09
)
(inform
  :sender j
  :receiver i
  :content (not (registered (server d1) (agent j)))
  :in-reply-to r09
)
- Agent i confirms to agent j that it is, in fact, true that it is snowing today:
(confirm
  :sender i
  :receiver j
  :content "weather(today, snowing)"
  :language Prolog
)
- Agent i, believing that agent j thinks that a shark is a mammal,
attempts to change j's belief:
(disconfirm
  :sender i
  :receiver j
  :content (mammal shark)
)
- Agent i asks agent j for its available services:
(query-ref
  :sender i
  :receiver j
  :content (iota ?x (available-services j ?x))
  …)
- j replies that it can reserve trains, planes and automobiles:
(inform
  :sender j
  :receiver i
  :content (= (iota ?x (available-services j ?x))
              ((reserve-ticket train)
               (reserve-ticket plane)
               (reserve automobile)))
  …)
- Agent j refuses to reserve a ticket for i, since there are
insufficient funds in i's account:
(refuse
  :sender j
  :receiver i
  :content ((action j (reserve-ticket LHR, MUC, 27-sept-97))
            (insufficient-funds ac12345))
  :language sl
)
- Agent i did not understand a query-if message
because it did not recognize the ontology:
(not-understood
  :sender i
  :receiver j
  :content ((query-if :sender j :receiver i …)
            (unknown (ontology www)))
  :language sl
)
- Auction bid:
(inform
  :sender agent_X
  :receiver auction_server_Y
  :content (price (bid good02) 150)
  :in-reply-to round-4
  :reply-with bid04
  :language sl
  :ontology auction
)
Agent/Software Integration
• Integration of services provided by non-agent software
into a multi-agent community
• Definition of the relationship between agents
and software systems
• Allowing agents to describe, broker and negotiate
over software systems
• Allowing new software services to be dynamically introduced
into an agent community
• Defining how software resources can be described,
shared and dynamically controlled in an agent community
New Agent Roles
• To support specification, two new agent roles
have been identified:
– Agent Resource Broker (ARB)
– WRAPPER Agent
GoodNews
A system that automatically categorizes
news reports that reflect positively or negatively
on a company’s financial outlook
Introduction
• Correlation between news reports on a company’s financial outlook
and its attractiveness as an investment
• Volume of such reports is huge
• A new text classification algorithm – “Domain Experts”
with “self-confident” sampling technique
• Two types of data
– (Human-)labeled
– Unlabeled
• The algorithm classifies financial news
into the predefined five categories
– (good), (good, uncertain), (neutral), (bad, uncertain), (bad)
Introduction
• Text categorization task
• FCP (Frequently Co-located Phrase) the building element
for the categorization algorithm
• Text categorization – very difficult domain
for the use of machine learning
– Very large number of input features
– High level of attribute and class noise
– Large percent of irrelevant features
• Very expensive labeled data,
while unlabeled data are cheaply available
Categorization
• The algorithm categorizes each given news article
into the predefined categories
in terms of referred company’s financial well-being
• GOOD – strong and explicit evidence
of the company’s financial status
– …shares of ABC company rose 2 percent to $24-15/16…
• GOOD, UNCERTAIN – predictions and forecasts
of future profitability
– … ABC company predicts fourth-quarter earnings will be high…
Categorization
• NEUTRAL – nothing is mentioned
about the financial well-being of the company
– … ABC announced plans to focus on products based on recycled
materials…
• BAD, UNCERTAIN – predictions of future losses
– … ABC announced today that fourth-quarter results could
fall short of expectations…
• BAD – explicitly bad evidence
– … shares of ABC fell $0.57 to $44.65 in early NY trading…
• Problems with construction of the training (i.e. labeled)
data set – “inter-indexer inconsistency”
Co-located Phrase
• The proposed algorithm labels the “unlabeled” news articles
through a voting process among experts, which are FCPs
• Definition – a co-located phrase is a sequence of nearby,
but not necessarily consecutive words
– … shares of ABC rose 8.5%… (shares, rose): GOOD
– …ABC presented its new product… (present, product): NEUTRAL
• Contextual information
• The use of heuristics to cope with enormous “phrase space”
(amount of possible phrases)
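The definition of a co-located phrase can be made concrete with a short matcher. This is a sketch under stated assumptions, not the GoodNews implementation: the window size of five tokens, the lower-case tokenization and the prefix comparison standing in for stemming (so that "present" matches "presented") are all choices made here for illustration.

```python
import re


def contains_colocated(text, phrase, window=5):
    """True if the words of `phrase` occur in order, each within `window`
    tokens after the previous one (prefix match stands in for stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())

    def match_from(start, remaining):
        if not remaining:
            return True
        for i in range(start, min(start + window, len(tokens))):
            if tokens[i].startswith(remaining[0]) and match_from(i + 1, remaining[1:]):
                return True
        return False

    return any(
        tokens[i].startswith(phrase[0]) and match_from(i + 1, phrase[1:])
        for i in range(len(tokens))
    )


good = contains_colocated("shares of ABC rose 8.5%", ("shares", "rose"))
neutral = contains_colocated("ABC presented its new product", ("present", "product"))
```

A phrase that matches an article could then cast a vote for its associated category; how votes are weighted and combined is not described on the slide and is left out here.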
Naive Bayes vs. Domain Experts
• Naive-Bayes with EM (Expectation Maximization)
• Problems with small sets of labeled (training) data;
• EM (Expectation Maximization) – a class of iterative algorithms
for maximum likelihood estimation
in problems with incomplete data
• Domain Experts algorithm is able to deal with
inconsistent hypotheses
• Iterative building of the training set
Implementation and Results
• The experiment focused on two performance criteria:
– Using unlabeled data for improving categorization accuracy
– The categorization itself
• The accuracy is around 75% (total of 2000 news articles);
• Comparison of a few different methods (picture)
Conclusions
• Domain Experts with SC sampling
outperform naive Bayes with EM
– collocation property and vote entropy
are appropriate to such a domain
• The accuracy of around 75% is the limit
with the techniques used
• Better performance could be achieved
by using some natural language processing techniques
• Such techniques are still rudimentary today
iMatch
The vision of each MIT student
having a personal software agent,
which helps to manage its owner's academic life
Introduction
• The aim: bring together MIT students and staff who may
usefully collaborate with each other
• This collaboration can have several goals:
– completing final projects
– studying for exams
– tutoring one another
• iMATCH agents are supposed to facilitate students and faculty
matching for:
– Research
– Teaching
– Internship
opportunities within and across campuses
iMatch Agent Architecture
• iMatch agents are situated within an environment
• Sensors of the agent convert environmental inputs
into representations that can be manipulated within the agent
• Effectors translate actions planned by the agent
into executable statements for the environment
• The action planner selects the action with the highest utility
according to the owner’s preference specification
Impacts and Benefits
• MIT
– Benefit MIT students by matching them to appropriate resources
– Aid the recruitment of student researchers
– Help students manage their lives
– Use iMATCH in Medical Computing
• GLOBAL
– Facilitate Cross Community Collaboration
Research Topics
• Knowledge representation
– preference specification
• Multi-agent systems
– reputation management system
– static interest matching
– dynamic interest matching
• Infrastructure
– distributed security infrastructure
Ceteris Paribus Preference
• Ceteris paribus relations express a preference over sets of
possible outcomes
• All possible outcomes are considered to be describable by
some (large) set of binary features (true or false)
– The specified features are instantiated to either true or false
– Other features are ignored
I prefer train
I prefer ice cream
I prefer airplane
I prefer chocolate
I prefer cell phone
I prefer e-mail
CPP Agent Configuration
• Specify a domain for preference
– Agent methods of communication and notification
– Different security settings of different servers
• Preference statements themselves
– How to get users to easily adjust C.P. rules (graphical interface)
– Pose hypothetical preference questions to user to help complete
the preferences of an ambivalent user
• People will only put down their true profile, if they know that
the system is secure
Static Interest Matching
• Group together similar users for specific context
• This enables viewing a human user as a resource for dynamic
resource discovery
(locate experts, enthusiasts,...)
• The approach:
– Keyword matching
– Ontological matching using Kullback-Leibler (KL) distance
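As a rough illustration of KL-based matching, the sketch below turns each user's interest keywords into a smoothed probability distribution and compares distributions with the KL divergence D(P || Q) = Σ P(x) log(P(x) / Q(x)). How iMatch actually derives distributions from an ontology is not described on the slide; the keyword profiles, the add-one smoothing and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def interest_distribution(keywords, vocabulary, smoothing=1.0):
    """Smoothed relative frequency of each vocabulary term in a profile."""
    counts = Counter(keywords)
    total = len(keywords) + smoothing * len(vocabulary)
    return {term: (counts[term] + smoothing) / total for term in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)


# Hypothetical interest profiles, invented for this illustration.
alice = ["agents", "agents", "mining", "security"]
bob = ["agents", "mining", "mining", "security"]
carol = ["tourism", "tourism", "bluetooth", "bluetooth"]

vocabulary = sorted(set(alice) | set(bob) | set(carol))
p_alice = interest_distribution(alice, vocabulary)
p_bob = interest_distribution(bob, vocabulary)
p_carol = interest_distribution(carol, vocabulary)

alice_to_bob = kl_divergence(p_alice, p_bob)
alice_to_carol = kl_divergence(p_alice, p_carol)
```

A smaller value indicates more similar interests. The divergence is not symmetric, so a matcher might average both directions; smoothing keeps every probability non-zero so the logarithm is always defined.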
Dynamic Interest Matching
• Location and/or temporal specific resource matching
• As students and their agents move from one physical location
to another, iMatch services for matching the closest resources
can be offered
• The idea: anything worthwhile is locatable
• The approach:
– Intentional naming scheme
– Reputation based resource discovery
Technology
• Components
– Distributed Multi-Agent Infrastructures
– Ceteris Paribus preference-based Interest Matching
– Reputation Management Infrastructure
• Technology
– Microsoft .NET
– Bluetooth
– IEEE 802.11
– Smartcards (PC/SC)
– INS (Intentional Naming System)
Conclusion
• Benefit MIT students
by matching them to appropriate resources
• Static interest matching
– Group together similar users for specific context
– This enables viewing a human user as a resource
for dynamic resource discovery (locate experts, enthusiasts,...)
• Dynamic interest matching
– Location and/or temporal specific resource matching
– As students and their agents move from one physical location to another,
iMatch services for matching the closest resources can be offered
• Help students manage their lives
The near future…
The focus of the research is on e-tourism
after the year 2005, but the applications
of the proposed infrastructure are multifold
Introduction
• The assumptions:
– after the year 2005, each tourist in Europe will be equipped with a
cell phone with processing power equal to or better than that of a Pentium IV
– whenever a tourism-based service or product is purchased, a
mobile agent is assigned to that cell phone PC, to monitor the
behaviour of the customer
– all tourist cell phone PCs create an ad hoc network
around the tourist attractions, and link to a data mine
that collects all information of interest
How to accomplish it?
• The information of interest is not collected by asking the
customer to fill out forms, but by monitoring the
behaviour of the customer
• The collected information, sorted in the data mine,
is made available to other tourists, as an on-line,
owner-independent source of information about the given services
and/or products
What can be done…
• If a tourist would like to know, at that very moment, what
restaurant has good food/atmosphere and happy customers,
he/she can access the data mine (via the Internet) and obtain
the information that is linked to that very moment, and is not
created by the owner of the business, but by the customers
themselves
• Accessing the given restaurant’s website has two drawbacks:
– the information is not fresh, since it is only periodically updated
– the information is made by the owner of the restaurant,
and therefore not completely objective
Conclusion
• Consequently, the proposed approach works much better,
and represents a qualitative step forward
in the domain of maximization of customer satisfaction
• This may mean that the privacy of the person is jeopardized;
however, if the monitored behaviour is non-personalized,
and if the customer obtains a discount in exchange for accepting
mobile agents, privacy ceases to be an issue,
and people will sign up voluntarily
Appendix
A Survey of the Data Mining Algorithms
Apriori Algorithm
• The task – mining association rules by finding large itemsets
and translating them to the corresponding association rules;
• A ⇒ B, or A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn, where A ∩ B = ∅
• The terminology
– Confidence
– Support
– k-itemset – a set of k items;
– Large itemsets – the large itemset {A, B} corresponds to the
following rules (implications): A ⇒ B and B ⇒ A;
Apriori Algorithm
• The ⋈ (join) operator definition
– n = 1: S2 = S1 ⋈ S1 = {{A}, {B}, {C}} ⋈ {{A}, {B}, {C}} =
{{AB}, {AC}, {BC}}
– n = k: Sk+1 = Sk ⋈ Sk = {X ∪ Y | X, Y ∈ Sk, |X ∩ Y| = k-1}
– X and Y must have the same number of elements, and must have
exactly k-1 identical elements;
– Every k-element subset of any resulting set element (an element
is actually a k+1 element set) has to belong to the original set of
itemsets;
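The operator defined above can be sketched in Python as follows. This is one possible reading of the definition, with itemsets represented as frozensets; the function name `join` is chosen here and does not come from the slides.

```python
from itertools import combinations


def join(itemsets):
    """Join a set of k-itemsets with itself to obtain candidate (k+1)-itemsets."""
    itemsets = {frozenset(s) for s in itemsets}
    if not itemsets:
        return set()
    k = len(next(iter(itemsets)))
    candidates = set()
    for x, y in combinations(itemsets, 2):
        if len(x & y) == k - 1:  # X and Y share exactly k-1 elements
            union = x | y
            # keep the union only if every k-element subset is in the input
            if all(frozenset(sub) in itemsets for sub in combinations(union, k)):
                candidates.add(union)
    return candidates


s2 = join([{"A"}, {"B"}, {"C"}])
c3 = join([{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}])
```

In the second call, {ABC} and {ACE} are generated by the union step but discarded by the subset check, because {AB} and {AE} are not in the input.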
Apriori Algorithm
• Example:
TID   elements
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
Apriori Algorithm
• Step 1 – generate a candidate set of 1-itemsets C1
– Every possible 1-element set from the database is potentially a
large itemset, because we don’t know the number of its
appearances in the database in advance (a priori);
– The task amounts to identifying (counting) all the different
elements in the database; every such element forms a 1-element
candidate set;
– C1 = {{A}, {B}, {C}, {D}, {E}}
Apriori Algorithm
• Now, we are going to scan the entire database, to count the
number of appearances for each one of these elements (i.e.
one-element sets);
itemset   count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3
Apriori Algorithm
• Step 2 – generate a set of large 1-itemsets L1
– Each element in C1 whose support reaches the adopted
minimum support (for example 50%, i.e. two of the four transactions)
becomes a member of L1;
– L1 = {{A}, {B}, {C}, {E}}
and we can omit D in further steps (if D doesn’t have
enough support alone, there is no way it could
satisfy the requested support in a combination with some
other element(s));
Apriori Algorithm
• Step 3 – generate a candidate set of 2-itemsets, C2
– C2 = L1 ⋈ L1 = {{AB}, {AC}, {AE}, {BC}, {BE}, {CE}}
– Count the corresponding appearances:
itemset   count
{AB}      1
{AC}      2
{AE}      1
{BC}      2
{BE}      3
{CE}      2
• Step 4 – generate a set of large 2-itemsets, L2
– Eliminate the candidates without minimum support;
– L2 = {{AC}, {BC}, {BE}, {CE}}
Apriori Algorithm
• Step 5 (C3)
– C3 = L2 ⋈ L2 = {{BCE}}
– Why not {ABC} and {ACE}? Because their 2-element subsets
{AB} and {AE} are not elements of the large 2-itemset set L2
(the calculation follows the definition of the join operator ⋈);
• Step 6 (L3)
– L3 = {{BCE}}, since {BCE} satisfies the required support of 50%
(two appearances);
• There can be no further steps in this particular case,
because L3 ⋈ L3 = ∅;
• Answer = L1 ∪ L2 ∪ L3;
Apriori Algorithm
L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);            // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);            // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end;
    Lk = {c ∈ Ck | c.count ≥ minsup}
end;
Answer = ∪k Lk
Apriori Algorithm
• Enhancements to the basic algorithm
• Scan-reduction
– The most time-consuming operation in the Apriori algorithm is the
database scan; it is originally performed after each candidate set
generation, to determine the frequency of each candidate in the
database;
– Scan number reduction – counting candidates of multiple sizes in
one pass;
– Rather than counting only candidates of size k in the kth pass, we
can also generate and count the candidates C’k+1, where C’k+1 is
generated from Ck (instead of Lk), using the * operator;
Apriori Algorithm
– Compare: C’k+1 = Ck * Ck
           Ck+1 = Lk * Lk
– Note that C’k+1 ⊇ Ck+1
– This variation can pay off in later passes, when the cost of
counting and keeping in memory additional C’k+1 - Ck+1 candidates
becomes less than the cost of scanning the database;
– There has to be enough space in main memory for both Ck and
C’k+1;
– Following this idea, we can make further scan reduction:
• C’k+1 is calculated from Ck for k > 1;
• There must be enough memory space for all Ck’s (k > 1);
– Consequently, only two database scans need to be performed (the
first to determine L1, and the second to determine all the other
Lk’s);
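A minimal sketch of the two-scan variant follows, assuming the same four-record example database. The names gen and apriori_two_scans are hypothetical, and all candidate sets are kept in memory at once, which is the precondition stated above.

```python
from itertools import combinations

def gen(prev, k):
    """Candidate k-itemsets whose (k-1)-element subsets all belong to prev."""
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

def apriori_two_scans(transactions, min_count):
    # Scan 1: determine L1.
    singles = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            singles[key] = singles.get(key, 0) + 1
    l1 = {c for c, n in singles.items() if n >= min_count}
    # Build C2 from L1, then each C'(k+1) from the previous candidate set,
    # without reading the database in between.
    all_candidates, level, k = set(), gen(l1, 2), 2
    while level:
        all_candidates |= level
        k += 1
        level = gen(level, k)
    # Scan 2: count the candidates of every size in a single pass.
    tally = {c: 0 for c in all_candidates}
    for t in transactions:
        for c in all_candidates:
            if c <= t:
                tally[c] += 1
    large = {c: n for c, n in singles.items() if n >= min_count}
    large.update({c: n for c, n in tally.items() if n >= min_count})
    return large, len(all_candidates)

D = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
large, n_candidates = apriori_two_scans(D, min_count=2)
print(len(large), n_candidates)  # 9 11
```

The same nine large itemsets are found, but eleven candidates are counted in the second scan, because {ABC}, {ABE}, {ACE} and {ABCE} can no longer be pruned before counting. This is the memory-for-scans trade-off described above.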
Apriori Algorithm
• Abstraction levels
– Higher level associations are stronger (more powerful), but also
less certain;
– A good practice would be adopting different thresholds for
different abstraction levels (higher thresholds for higher levels of
abstraction)
DHP Algorithm
• DHP = Direct Hashing and Pruning – another algorithm for
mining association rules;
• Based on the Apriori algorithm (Ck/Lk generation in the kth
step);
• Empirical analysis of the Apriori algorithm shows that
candidate sets (Ck) are much larger than the corresponding sets
of large itemsets (Lk), especially in the first few iterations;
• DHP introduces a more efficient candidate set generation
method;
• The idea is to insert into Ck only those candidate sets that are
likely to become large itemsets;
DHP Algorithm
• Additional improvement is accomplished through “two-dimensional”
search base reduction – “length” (number of records in the search
base) and “width” (number of relevant attributes in a record);
• Large itemsets’ characteristics:
– Every non-empty subset of a large itemset is a large itemset as
well,
for example, {BCD} ∈ L3 ⇒ {{BC}, {CD}, {BD}} ⊆ L2;
– It implies that a record is relevant for discovering large
(k+1)-itemsets only if it contains at least k+1 large k-itemsets;
DHP Algorithm
– During the Ck → Lk phase we can count the large k-itemsets in each
record; if their number in a particular record is less than k+1, we
omit that record during the Ck+1 generation;
– Similarly, if a record contains one or more large (k+1)-itemsets,
each element (item) of these itemsets appears in at least k
candidates from Ck;
• Hashing
– Hashing boosts the performance of the DHP algorithm;
– The algorithm does not specify any particular hash function; the
choice depends on the application;
– Likewise, it does not specify the size of the hash table (number of
groups/addresses);
DHP Algorithm
• Application example
TID   elements
10    A C D
20    B C E
30    A B C E
40    B E
DHP Algorithm
• Step 1 – generate a candidate set of 1-itemsets C1
– C1 = {{A}, {B}, {C}, {D}, {E}}
– Simultaneously with counting each element’s support, a hash tree
is generated that contains all the elements from the database, in
order to improve the counting performance;
• For each new element, DHP checks whether the element is already in
the tree or not;
• If yes, DHP increments the current number of appearances for that
element; otherwise, the element is added to the hash tree, and the
number of its appearances is set to 1;
DHP Algorithm
• After the appearances of each C1 element are counted, all possible
2-element subsets of each record are generated and inserted into the
H2 hash table;
TID   2-element subsets
10    {AC}, {AD}, {CD}
20    {BC}, {BE}, {CE}
30    {AB}, {AC}, {AE}, {BC}, {BE}, {CE}
40    {BE}
– The address of a particular subset can be calculated from the
positions of its elements in the C1 candidate set, using a chosen
hash function h(x, y);
DHP Algorithm
– For example, let’s adopt the following hash function:
h({x y}) = (posC1(x)*10 + posC1(y)) mod 7,
where posC1(x) is the position of x in C1, counted from 1
(A = 1, B = 2, C = 3, D = 4, E = 5);
• The corresponding H2 hash table is shown below:
address   weight   2-element subsets
0         3        {AD}, {CE}, {CE}
1         1        {AE}
2         2        {BC}, {BC}
3         0
4         3        {BE}, {BE}, {BE}
5         1        {AB}
6         3        {AC}, {CD}, {AC}
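The H2 table can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, not part of the DHP specification: the 1-based positions are an assumption that matches the addresses listed in the table, and the names pos, buckets and weights are illustrative.

```python
from itertools import combinations

C1 = ["A", "B", "C", "D", "E"]
# Assumption: positions in C1 are counted from 1.
pos = {item: i + 1 for i, item in enumerate(C1)}

def h(x, y):
    """Hash function adopted in the example."""
    return (pos[x] * 10 + pos[y]) % 7

# Example database from the slides (TID -> elements)
D = {10: "ACD", 20: "BCE", 30: "ABCE", 40: "BE"}

buckets = {address: [] for address in range(7)}
for tid, items in D.items():
    for x, y in combinations(sorted(items), 2):
        buckets[h(x, y)].append(x + y)

weights = {address: len(subsets) for address, subsets in buckets.items()}
for address in range(7):
    print(address, weights[address], buckets[address])
```

For instance, h(A, D) = 14 mod 7 = 0 and h(C, E) = 35 mod 7 = 0, so {AD} and both occurrences of {CE} land at address 0, giving it weight 3.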
DHP Algorithm
• Whenever a new element is added to the hash table, the
weight of the corresponding address is increased by one;
• C2 is generated from L1 (just as in the Apriori case);
• In addition, only those elements that map to addresses
whose weight is greater than or equal to the specified minimum
support (let the minimum support be 50%, i.e. two appearances) are
taken into consideration during the C2 generation;
• C2 = {{AC}, {BC}, {BE}, {CE}};
• It contains two fewer elements (!) than the C2 set generated by
the Apriori algorithm for the same example database;
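The filtering step can be sketched as follows. The H2 weights are copied from the example table, the positions in C1 are assumed to be counted from 1, and the variable names are illustrative only.

```python
from itertools import combinations

pos = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}         # 1-based positions in C1 (assumed)
weights = {0: 3, 1: 1, 2: 2, 3: 0, 4: 3, 5: 1, 6: 3}   # H2 address weights from the example
L1 = ["A", "B", "C", "E"]
min_count = 2                                          # 50% of four records

def h(x, y):
    """Hash function adopted in the example."""
    return (pos[x] * 10 + pos[y]) % 7

# Apriori: every pair of large 1-itemsets becomes a candidate.
apriori_c2 = [x + y for x, y in combinations(L1, 2)]
# DHP: a pair is kept only if the weight of its H2 address reaches the minimum support.
dhp_c2 = [x + y for x, y in combinations(L1, 2) if weights[h(x, y)] >= min_count]

print(apriori_c2)  # ['AB', 'AC', 'AE', 'BC', 'BE', 'CE']
print(dhp_c2)      # ['AC', 'BC', 'BE', 'CE']
```

{AB} and {AE} map to addresses 5 and 1, whose weights are both 1, so they are dropped before any counting takes place. In general an address weight is only an upper bound on the support of the sets mapped to it, so a surviving candidate still has to be counted.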
DHP Algorithm
• In general, the Hk hash table is used for the Ck candidate set
generation in the kth step of the algorithm; Hk is created in the
previous, (k-1)th, step;
• Each address of the Hk hash table holds a number of k-element
subsets; its weight denotes the number of these subsets;
• If an address does not satisfy the minimum support requirement,
none of the elements (sets) mapped to that address can satisfy
the requirement on its own, so all the elements (sets) at such Hk
addresses are omitted from the Ck generation;
• During the kth step, Ck is generated starting from Lk-1, with the
restrictions described above;
DHP Algorithm
• Conclusions:
– DHP outperforms Apriori for the same input data;
– The time spent generating the hash tables (especially H2) is
outweighed by the greatly reduced candidate sets (C2, …);
– The improvements applied to Apriori may be applied here as
well (scan reduction, abstraction levels, …)
References
• http://www.marconi.com
• http://www.blueyed.com
• http://www.fipa.org
• http://www.rpi.edu
• http://research.microsoft.com
• http://imatch.lcs.mit.edu
THE END
Quatenus nobis denegatum diu vivere,
relinquamus aliquid, quo nos vixisse testemur