VIRTUAL PRESENCE
Authors:
Voislav Galić
Dušan Zečević
Đorđe Đurđević
Veljko Milutinović
http://galeb.etf.bg.ac.yu/~vm/tutorial
SUMMARY
- Introduction to Virtual Presence
- Data Mining for Virtual Presence
- A New Software Paradigm
- Selected Case Studies
INTRODUCTION TO VP
- Definitions
- VP applications
- Psychological aspects
DATA MINING FOR VP
- Definitions
- What can Data Mining do?
- Growing popularity of Data Mining
- Algorithms
SOFTWARE AGENTS
- A new software paradigm
- Standardization
- FIPA specifications
- Agent management
- Agent Communication Language
CASE STUDIES
• GoodNews (CMU*)
– Categorization of financial news articles
• iMatch (MIT**)
– help students find resources they need
– advanced, agent-based system architecture
• “Tourist city” in the future (ETF***)
– represents a qualitative step forward in the domain of
maximization of customer satisfaction
– technologies:
• Data Mining
• Software Agents (mobile)
* Carnegie Mellon University, Pittsburgh, USA
** Massachusetts Institute of Technology, USA
*** Faculty of Electrical Engineering, University of Belgrade, Serbia and Montenegro
CONCLUSION
This tutorial will attempt to familiarize you with:
- The concept of VP (Virtual Presence)
as a new technological challenge
- The new paradigms and technologies
that will bring the VP to everyday life:
- Data Mining
- Software Agents
INTRODUCTION
Virtual presence will arguably be
one of the most important aspects of personal
communication in the twenty-first century
Definition
Virtual presence is a term
with various shades of meanings in different industries,
but its essence remains constant;
it is a new tool that enables some form of telecommunication
in which the individual may substitute their physical presence
with an alternate, typically, electronic presence
How to Accomplish it?
• The presence is accomplished through the Internet, video,
or other communications, perhaps even psychically one day
• Technological advances will make virtual presence more sophisticated,
altering the very meaning of the word “presence”
• The ability to conduct everyday tasks by being virtually
or electronically present
VP Applications
• in government
– “Sunshine laws”
– Voting
• in business
– Online board meetings
– Shareholder voting online
• in education
– interactive lectures and courses
• in medicine
– Telemedicine (Diagnostics, Remote surgery)
– Risks (Privacy)
• in everyday life
– Telecommuting/Telework
– Software agents as our virtual “shadows”
Psychological Aspects
• Cyberspace and Mind
• Presence in Virtual Space
DATA MINING
Knowledge discovery is a non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data
Many Definitions
• Data mining is also called data or knowledge discovery
• It is a process of inferring knowledge
from large oceans of data
• Search for valuable information in large volumes of data
• Analyzing data from different perspectives
and summarizing it into useful information
What Can Data Mining Do?
• DM allows you to extract knowledge from historical data
and predict outcomes of future situations
• Optimize business decisions
and improve customers’ satisfaction with your services
• Analyze data from many different angles, categorize it,
and summarize the relationships identified
• Reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage
• Predict cross-sell opportunities
and make recommendations
etc.
The Power of Data Mining
• Having a database is one thing,
making sense of it is quite another
• It does not rely on narrow human queries to produce results,
but instead uses AI-related technology and algorithms
• Data mining usually produces more general (= more powerful)
results than those obtained by traditional techniques
• Using more than one type of algorithm
to search for patterns in data
Reasons for the Growing Popularity of Data Mining
• Growing Data Volume
• Low Cost of Machine Learning
• Limitations of Human Analysis
…
Tasks Solved by Data Mining
• Predicting
• Classification
• Detection of relations
• Explicit modeling
• Clustering
• Market basket analysis
• Deviation detection
Data mining includes three major components,
with corresponding algorithms:
– Clustering (Classification)
– Association Rules
– Sequential Analysis
Classification Algorithms
• Statistical algorithms
• Neural networks algorithms
• Genetic algorithms
• Nearest neighbor method
• Rule induction
• Data visualization
• Decision tree building algorithms
• Parallel algorithms
Association Rule Algorithms
• An association rule implies a certain association relationship
among a set of objects in a database
• These objects “occur together”, or “one implies the other”
• Formally: X ⇒ Y, where X and Y are sets of items (itemsets)
• Key terms
– Confidence
– Support
• The goal – to find all association rules
that satisfy user-specified minimum support
and minimum confidence constraints
• Apriori algorithm and its variations
• Distributed / Parallel algorithms
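The two key terms can be made concrete with a small sketch. It assumes the usual textbook definitions: support is the fraction of transactions containing every item of an itemset, and confidence of X ⇒ Y is the support of X ∪ Y divided by the support of X. The function names and the toy transaction database are hypothetical, and this is not the Apriori algorithm itself, only the test it applies to a candidate rule.

```python
from typing import FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Itemset]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Itemset]) -> float:
    """Support of X ∪ Y divided by support of X (0.0 if X never occurs)."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


def rule_holds(x: Iterable[str], y: Iterable[str],
               transactions: List[Itemset],
               min_support: float, min_confidence: float) -> bool:
    """True if X => Y meets both user-specified minimum thresholds."""
    both = frozenset(x) | frozenset(y)
    return (support(both, transactions) >= min_support
            and confidence(x, y, transactions) >= min_confidence)


# Hypothetical toy database of four market baskets.
TRANSACTIONS = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]
```

Here {bread} ⇒ {milk} has support 2/4 and confidence (2/4)/(3/4) = 2/3, so it passes a 0.6 confidence threshold but fails a 0.7 one.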
Sequential Analysis
• Sequential Patterns
• The problem – finding all sequential patterns
with user-specified minimum support
• Elements of a sequential pattern need not be:
– consecutive
– simple items
• Algorithms for finding sequential patterns
– “count-all” algorithms
– “count-some” algorithms
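A minimal sketch of what "supporting" a sequential pattern means, under the assumption that a customer sequence is an ordered list of itemsets and that a pattern is contained in it when each pattern element is a subset of some later-and-later element of the sequence. This shows why elements need not be consecutive and need not be simple items. The names and sample data are hypothetical, and the sketch is not a "count-all" or "count-some" algorithm, only the containment test such algorithms rely on.

```python
from typing import List, Set

Sequence = List[Set[str]]


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    pos = 0
    for element in pattern:
        # Skip forward to the first remaining itemset that covers `element`.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def pattern_support(pattern: Sequence, sequences: List[Sequence]) -> float:
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


# Hypothetical customer sequences (each set is one transaction).
SEQUENCES = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a"}, {"d"}],
    [{"b"}, {"a"}],
]
```

With this data the pattern <{a}, {d}> is supported by the first two sequences, even though another transaction sits between a and d in the first one.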
Conclusion
• Various applications (market, banking, sports)
• Drawbacks of existing algorithms
– Data size
– Data noise
– Query complexity
• The infrastructure has to be significantly enhanced
to support larger applications
• Solutions
– Adding extensive indexing capabilities
– Using new HW architectures
to achieve improvements in query time
THE NEW SOFTWARE PARADIGM
All software agents are programs, but not all
programs are agents
Many Definitions
• Computational systems that inhabit some dynamic environment,
sense and act autonomously and realize a set of goals or tasks
for which they are designed
• Hardware or (more usually) software-based computer system
that enjoys the following properties:
- Reactive (sensing and acting)
- Autonomous
- Goal-oriented (pro-active, purposeful)
- Temporally continuous
- Communicative (socially able)
- Learning (adaptive)
- Mobile
- Flexible
- Character
What Problems Do Agents Solve?
• Client/server network bandwidth problem
• In the design of a client/server architecture
• The problems created by intermittent
or unreliable network connections
• Attempts to get computers to do real thinking for us
The New Software Paradigm
• Unless special care has been taken in the design of the code,
two software programs cannot interoperate
• The promise of agent technology is to move
the burden of interoperability from software programmers
to programs themselves
This can happen if two conditions are met:
– A common language (Agent Communication Language – ACL)
– An appropriate architecture
• They draw on and integrate many diverse disciplines
of computer science and other areas
FIPA Specifications
• The Foundation for Intelligent Physical Agents (FIPA),
established in 1996 in Geneva
• FIPA specifications:
– Agent Management
– Agent Communication Language
– Agent/Software Integration
– Agent Management Support for Mobility
– Human-Agent Interaction
– Agent Security Management
– Agent Naming
– FIPA Architecture
– Agent Message Transport
etc.
Agent Management
• Provides the normative framework within which FIPA agents
exist and operate
• Establishes the logical reference model for the creation,
registration, location, communication, migration
and retirement of agents
- The entities contained in the reference model are logical
capability sets and do not imply any physical configuration
- Additionally, the implementation details of individual
Agent Platforms (APs) and agents are the design choices of the
individual agent system developers
Components of the Model
• Agent
- fundamental actor on an AP
- computational process
- as a physical software process has a life cycle
that has to be managed by the AP
• Directory Facilitator
- yellow pages services to other agents
- supported functions are: register, deregister, modify, search
• Agent Management System
- white pages services to other agents
- maintains a directory of AIDs which contain transport addresses
- supported functions are: register, deregister, modify, search,
get-description, operations for underlying AP
• Message Transport Service
- communication method between agents
• Agent Platform
- physical infrastructure in which agents can be deployed
• Software
- all non-agent, executable collections of instructions
accessible through an agent
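To illustrate the yellow pages role, here is a minimal in-memory sketch of a directory that offers the four functions named for the Directory Facilitator: register, deregister, modify and search. The class name, method signatures and the idea of describing an agent by a set of service names are assumptions made for illustration. A real FIPA platform exchanges these requests as ACL messages and uses much richer agent descriptions.

```python
from typing import Dict, Iterable, List, Set


class DirectoryFacilitator:
    """Toy yellow pages: maps agent names to the services they offer."""

    def __init__(self) -> None:
        self._entries: Dict[str, Set[str]] = {}

    def register(self, agent: str, services: Iterable[str]) -> None:
        """Add a new agent; registering the same name twice is an error."""
        if agent in self._entries:
            raise ValueError(f"{agent} is already registered")
        self._entries[agent] = set(services)

    def deregister(self, agent: str) -> None:
        """Remove an agent that was registered earlier."""
        if agent not in self._entries:
            raise KeyError(f"{agent} is not registered")
        del self._entries[agent]

    def modify(self, agent: str, services: Iterable[str]) -> None:
        """Replace the service list of a registered agent."""
        if agent not in self._entries:
            raise KeyError(f"{agent} is not registered")
        self._entries[agent] = set(services)

    def search(self, service: str) -> List[str]:
        """Return the names of all agents offering `service`, sorted."""
        return sorted(a for a, s in self._entries.items() if service in s)
```

A white pages service (the Agent Management System) would look similar, except that it is keyed by agent identifier and returns transport addresses rather than services.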
Agent Life Cycle
• FIPA agents exist physically on an AP and utilize the facilities
offered by the AP for realising their functionalities
• In this context, an agent, as a physical software process,
has a physical life cycle that has to be managed by the AP
The state transitions of agents can be described as:
- create
- invoke
- destroy
- quit
- suspend
- resume
- wait
- wake up
- move*
- execute*
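The transitions can be read as edges of a small state machine. The sketch below is an assumption-laden illustration: the slide lists only the transitions, so the state names (initiated, active, suspended, waiting, transit) are taken from the FIPA life cycle as it is commonly described, and the single "terminated" state used for both destroy and quit is a simplification introduced here. The class and table names are hypothetical.

```python
from typing import Dict, Tuple

# (current state, transition) -> next state. State names are assumptions.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("initiated", "invoke"): "active",
    ("active", "suspend"): "suspended",
    ("suspended", "resume"): "active",
    ("active", "wait"): "waiting",
    ("waiting", "wake up"): "active",
    ("active", "move"): "transit",      # marked * on the slide
    ("transit", "execute"): "active",   # marked * on the slide
}
ENDING = {"destroy", "quit"}  # allowed from any live state in this sketch


class ManagedAgent:
    """Agent whose life cycle is tracked by a (hypothetical) platform."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "initiated"  # the result of 'create'

    def apply(self, transition: str) -> str:
        """Apply one transition and return the new state."""
        if self.state == "terminated":
            raise ValueError("agent has already terminated")
        if transition in ENDING:
            self.state = "terminated"
            return self.state
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"'{transition}' not allowed in '{self.state}'")
        self.state = TRANSITIONS[key]
        return self.state
```

Keeping the table separate from the class makes it easy to check that every transition named on the slide has exactly one source state.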
Agent Communication Language
• The specification consists of a set of message types
and the description of their meanings
• Requirements:
– Implementing a subset of the pre-defined message types and protocols
– Sending and receiving the not-understood message
– Correct implementation of communicative acts
defined in the specification
– Freedom to use communicative acts with other names,
not defined in the specification
– Obligation of correctly generating messages in the transport form
– Language must be able to express propositions, objects and actions
– The use of Agent Management Content Language and ontology
• Pre-defined message parameters:
:sender, :receiver, :content, :reply-with, :in-reply-to,
:language, :ontology, :reply-by, :protocol
• Communicative acts:
confirm, disconfirm, inform, not-understood,
query-if, query-ref, refuse, etc.
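As a rough illustration of the message structure, the sketch below renders a communicative act and its parameters in the parenthesized textual form used in the tutorial's examples. The function name and the decision to reject parameter names outside the pre-defined list are assumptions made here. The act name is deliberately not validated, since the requirements leave agents free to use acts with other names.

```python
from typing import Dict

# Pre-defined message parameters, in the order they appear on the slide.
PARAMETERS = (
    "sender", "receiver", "content", "reply-with", "in-reply-to",
    "language", "ontology", "reply-by", "protocol",
)


def format_message(act: str, params: Dict[str, str]) -> str:
    """Render one ACL-style message as text, one parameter per line."""
    unknown = set(params) - set(PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    lines = [f"({act}"]
    for name in PARAMETERS:  # emit parameters in a fixed, readable order
        if name in params:
            lines.append(f"  :{name} {params[name]}")
    return "\n".join(lines) + ")"


# Example: agent i tries to change j's belief that a shark is a mammal.
EXAMPLE = format_message(
    "disconfirm",
    {"sender": "i", "receiver": "j", "content": "(mammal shark)"},
)
```

The content expression is passed through as an opaque string, which mirrors the separation between the message envelope and the content language.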
Communication Examples
- Agent i asks agent j for its available services:
    (query-ref
      :sender i
      :receiver j
      :content
        (iota ?x (available-services j ?x))
      …)
- Agent j replies that it can reserve trains, planes and automobiles:
    (inform
      :sender j
      :receiver i
      :content
        (= (iota ?x (available-services j ?x))
           ((reserve-ticket train)
            (reserve-ticket plane)
            (reserve automobile)))
      …)
- Agent j refuses to reserve a ticket for i, since there are
insufficient funds in i's account:
    (refuse
      :sender j
      :receiver i
      :content
        ((action j (reserve-ticket LHR, MUC, 27-sept-97))
         (insufficient-funds ac12345))
      :language sl)
- Agent i, believing that agent j thinks that a shark is a mammal,
attempts to change j's belief:
    (disconfirm
      :sender i
      :receiver j
      :content (mammal shark))
- Agent i asks agent j if j is registered with domain server d1:
    (query-if
      :sender i
      :receiver j
      :content
        (registered (server d1) (agent j))
      :reply-with r09
      …)
- Agent j replies that it is not:
    (inform
      :sender j
      :receiver i
      :content
        (not (registered (server d1) (agent j)))
      :in-reply-to r09)
- Agent i did not understand a query-if message
because it did not recognize the ontology:
    (not-understood
      :sender i
      :receiver j
      :content
        ((query-if :sender j :receiver i …)
         (unknown (ontology www)))
      :language sl)
- Agent i confirms to agent j that it is, in fact, true
that it is snowing today:
    (confirm
      :sender i
      :receiver j
      :content "weather( today, snowing )"
      :language Prolog)
- Auction bid:
    (inform
      :sender agent_X
      :receiver auction_server_Y
      :content
        (price (bid good02) 150)
      :in-reply-to round-4
      :reply-with bid04
      :language sl
      :ontology auction)
GoodNews
A system that automatically categorizes
news reports that reflect positively or negatively
on a company’s financial outlook
Introduction
• Correlation between news reports on a company’s financial outlook
and its attractiveness as an investment
• Text categorization – very difficult domain
for the use of machine learning
– Very large number of input features
– High level of noise (metaphors, irony,…)
– Large percentage of irrelevant features
• A new text classification algorithm – “Domain Experts”
• Two types of data
– (Human-)labeled
– Unlabeled
• The algorithm classifies financial news
into five predefined categories
• FCP (Frequently Co-located Phrase) – the building element
for the categorization algorithm
Categorization
• The algorithm categorizes each given news article
into the predefined categories
– GOOD – strong and explicit evidence of the company’s financial
status
• …shares of ABC company rose 2 percent…
– GOOD, UNCERTAIN – predictions and forecasts of future
profitability
• … ABC company predicts fourth-quarter earnings will be high…
– NEUTRAL – nothing is mentioned about the financial well-being
of the company
• … ABC announced plans to focus on products based on recycled
materials…
– BAD, UNCERTAIN – predictions of future losses
• … ABC announced today that fourth-quarter results could
fall short of expectations…
– BAD – explicitly bad evidence
• … shares of ABC fell $0.57 to $44.65 in early NY trading…
Co-located Phrase
• The proposed algorithm labels the “unlabeled” news articles
through a voting process among experts, which are FCPs
• Definition – a co-located phrase is a sequence of nearby,
but not necessarily consecutive words
– …shares of ABC rose 8.5%… (shares, rose): GOOD
– …ABC presented its new product… (present, product): NEUTRAL
class   selected FCP
+       “share & gains | rose”, “profit | revenue & rose”
+/?     “expect | forecasts & earnings”
+/-     “alliance & company”, “deal | present & product”
-/?     “short & expectation”
-       “share & down | lost”, “profit | sales & decrease”
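The idea of co-located phrases acting as voting experts can be sketched as follows. This is a simplified illustration, not the Domain Experts algorithm itself: it assumes a phrase matches when its words appear in order inside a fixed window of tokens, it uses a plain unweighted majority vote, and it does not parse the "&" and "|" operators of the FCP table. The window size, the expert table and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Phrase = Tuple[str, ...]


def matches(phrase: Phrase, words: List[str], window: int = 5) -> bool:
    """True if the phrase words occur in order within `window` tokens."""
    for start, word in enumerate(words):
        if word != phrase[0]:
            continue
        found = 1
        # Look for the remaining phrase words inside the window.
        for token in words[start + 1:start + window]:
            if found < len(phrase) and token == phrase[found]:
                found += 1
        if found == len(phrase):
            return True
    return False


def classify(text: str, experts: Dict[Phrase, str],
             window: int = 5) -> Optional[str]:
    """Label `text` by majority vote of the experts whose phrase matches."""
    words = text.lower().split()
    votes = Counter(
        label for phrase, label in experts.items()
        if matches(phrase, words, window)
    )
    if not votes:
        return None  # no expert recognised the text
    return votes.most_common(1)[0][0]


# Hypothetical experts modelled on the slide's two examples.
EXPERTS: Dict[Phrase, str] = {
    ("shares", "rose"): "GOOD",
    ("shares", "fell"): "BAD",
    ("presented", "product"): "NEUTRAL",
}
```

The window is what lets (shares, rose) fire on "shares of ABC rose 8.5%" even though the two words are not adjacent.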
Conclusion
• Problems with construction of the training (i.e. labeled)
data set – “inter-indexer inconsistency”
• Problems with small sets of labeled (training) data
– Very expensive labeled data,
while unlabeled data are cheaply available
• The accuracy is around 75% (total of 2000 news articles)
• Comparison of a few different methods
(figure: Naive-Bayes vs. Domain Experts)
iMatch
The vision of each MIT student
having a personal software agent,
which helps to manage its owner's academic life
Introduction
• The aim - bring together MIT students
and staff who may usefully collaborate with each other
– completing final projects
– studying for exams
– tutoring one another
• Facilitate students and faculty matching for:
– Research
– Teaching
– Internship
Ceteris Paribus Preference
• Ceteris paribus relations express a preference over sets of
possible outcomes
• All possible outcomes are considered to be describable by
some (large) set of binary features (true or false)
– The specified features are instantiated to either true or false
– Other features are ignored
I prefer train
I prefer ice cream
I prefer airplane
I prefer chocolate
I prefer cell phone
I prefer e-mail
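One possible reading of a ceteris paribus ("all else being equal") preference is sketched below. It assumes that an outcome is a dictionary of binary features, that a statement fixes the values of a few specified features for the better and for the worse outcome, and that two outcomes are comparable under the statement only when they agree on every feature the statement does not mention. The function name and feature names are hypothetical.

```python
from typing import Dict

Outcome = Dict[str, bool]


def cp_prefers(better: Outcome, worse: Outcome,
               a: Outcome, b: Outcome) -> bool:
    """True if the statement 'better over worse' ranks a above b.

    `better` and `worse` instantiate only the specified features;
    `a` and `b` are complete outcomes over the same feature set.
    """
    specified = set(better) | set(worse)
    a_matches = all(a[f] == v for f, v in better.items())
    b_matches = all(b[f] == v for f, v in worse.items())
    # "All else being equal": unspecified features must agree.
    others_equal = all(a[f] == b[f] for f in a if f not in specified)
    return a_matches and b_matches and others_equal


# Hypothetical statement: travelling by train is preferred to not doing so.
PREFER_TRAIN = ({"by_train": True}, {"by_train": False})

TRAIN_DRY = {"by_train": True, "raining": False}
PLANE_DRY = {"by_train": False, "raining": False}
PLANE_WET = {"by_train": False, "raining": True}
```

Under this reading the statement ranks TRAIN_DRY above PLANE_DRY but says nothing about TRAIN_DRY versus PLANE_WET, because those two outcomes also differ on a feature the statement does not mention.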
CPP Agent Configuration
• Specify a domain for preference
– Agent methods of communication and notification
– Different security settings of different servers
• Preference statements themselves
– How to get users to easily adjust C.P. rules (graphical interface)
– Pose hypothetical preference questions to user to help complete
the preferences of an ambivalent user
• People will enter their true profile only if they know that
the system is secure
Conclusion
• Benefit MIT students
by matching them to appropriate resources
• Static interest matching
– Group together similar users for specific context
– This enables viewing a human user as a resource
for dynamic resource discovery (locate experts, enthusiasts,...)
• Dynamic interest matching
– Location and/or temporal specific resource matching
As students and their agents move from one physical location to another,
iMatch services for matching the closest resources can be offered
• Help students manage their lives
The near future…
The focus of the research is on e-tourism
after the year 2005, but the applications
of the proposed infrastructure are multifold
Introduction
• The assumptions:
– after the year 2005, each tourist in Europe will be equipped with a
cell phone with processing power equal to or better than a Pentium IV
– whenever a tourism-based service or product is purchased, a
mobile agent is assigned to that cell phone PC, to monitor the
behaviour of the customer
– all tourist cell phone PCs create an ad hoc network
around tourist attractions, and link to a data mine
that collects all information of interest
How to accomplish it?
• The information of interest is not collected by asking the
customer to fill out the forms, but by monitoring the
behaviour of the customer
• The collected information, sorted in the data mine,
is made available to other tourists, as an on-line, owner-independent
source of information about the given services
and/or products
What can it do…
• If a tourist would like to know, at that very moment, what
restaurant has good food/atmosphere and happy customers,
he/she can access the data mine (via the Internet) and can
obtain the information that is linked to that very moment, and
is not created by the owner of the business, but by the
customers
• Accessing the given restaurant’s website has two drawbacks:
– the information is not fresh: it is only periodically updated
– the information is made by the owner of the restaurant,
and therefore not completely objective
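A toy sketch of the kind of query a tourist might pose to the data mine: pick the restaurant whose recent, customer-generated observations show the highest average satisfaction. Everything here is a hypothetical illustration: the tuple layout (restaurant, minute of observation, satisfaction between 0 and 1), the 60-minute freshness window and the function name are assumptions, since the tutorial does not specify how monitored behaviour is encoded.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Observation = Tuple[str, int, float]  # (restaurant, minute, satisfaction)


def best_restaurant(observations: Iterable[Observation], now: int,
                    max_age_minutes: int = 60) -> Optional[str]:
    """Restaurant with the highest mean satisfaction among fresh data."""
    fresh: Dict[str, List[float]] = defaultdict(list)
    for restaurant, minute, satisfaction in observations:
        # Keep only observations made within the freshness window.
        if 0 <= now - minute <= max_age_minutes:
            fresh[restaurant].append(satisfaction)
    if not fresh:
        return None
    return max(fresh, key=lambda r: sum(fresh[r]) / len(fresh[r]))


# Hypothetical observations gathered by mobile agents.
OBSERVATIONS: List[Observation] = [
    ("A", 100, 0.9),
    ("A", 110, 0.7),
    ("B", 115, 0.6),
    ("B", 10, 1.0),   # too old to count at minute 120
]
```

The freshness filter is what distinguishes this from a periodically updated website: an old, glowing observation for B is ignored at minute 120.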
Conclusion
• Consequently, the proposed approach works much better,
and represents a qualitative step forward
in the domain of maximization of customer satisfaction
• This may mean that the privacy of the customers is jeopardized;
however, if the monitored behaviour is non-personalized,
and if the customer obtains a discount in exchange for
accepting mobile agents, privacy ceases to be an issue,
and people will sign up voluntarily
THE END
Quatenus nobis denegatum diu vivere, relinquamus aliquid, quo nos vixisse testemur
References:
http://www.marconi.com
http://www.blueyed.com
http://www.fipa.org
http://www.rpi.edu
http://research.microsoft.com
http://imatch.lcs.mit.edu
………