VIRTUAL PRESENCE
Authors: Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović
http://galeb.etf.bg.ac.yu/~vm/tutorial

DEFINITION
Virtual presence is a term with various shades of meaning in different industries, but its essence remains constant: it is a new tool that enables a form of telecommunication in which individuals may substitute an alternate, typically electronic, presence for their physical presence.

SUMMARY
- Introduction to Virtual Presence
- Data Mining for Virtual Presence
- A New Software Paradigm
- Selected Case Studies

INTRODUCTION TO VP
- Definitions
- VP applications
- Psychological aspects

DATA MINING FOR VP
- Why Data Mining?
- What can Data Mining do?
- Growing popularity of Data Mining
- Algorithms

SOFTWARE AGENTS
- A new software paradigm
- Standardization
- FIPA specifications
- Agent management
- Agent Communication Language

GoodNews (CMU*)
- Categorization of financial news articles
- Co-located phrases
- Domain Experts
- Implementation and results
* Carnegie Mellon University, Pittsburgh, USA

iMatch (MIT*)
- The idea
  - associate MIT students and staff in order to ease their cooperation
  - help students find the resources they need
- Implementation
  - advanced, agent-based system architecture
- Tomorrow?
* Massachusetts Institute of Technology, USA

“Tourist city” (ETF*)
• A qualitative step forward in the domain of maximization of customer satisfaction
• Technologies:
  • Data Mining
  • Software Agents (mobile)
* Faculty of Electrical Engineering, University of Belgrade, Serbia and Montenegro

CONCLUSION
This tutorial will attempt to familiarize you with:
- The concept of VP (Virtual Presence) as a new technological challenge
- The new paradigms and technologies that will bring VP to everyday life:
  - Data Mining
  - Software Agents

INTRODUCTION
Virtual presence will arguably be one of the most important aspects of personal communication in the twenty-first century

Essence of VP
• The usefulness and reliability of virtual presence
• The ability to conduct everyday tasks by being virtually or electronically present

How to Accomplish It?
• The presence is accomplished through the Internet, video, or other communications, perhaps even psychically one day
• Technological advances will make virtual presence more sophisticated, altering the very meaning of the word “presence”

VP Applications
• VP in government
  – “Sunshine laws”
  – Voting
• VP in business
  – Online board meetings
  – Shareholder voting online
• VP in education
  – Interactive lectures and courses
• VP in medicine
  – Telemedicine
    • Diagnostics
    • Remote surgery
  – Risks
    • Privacy
• VP in everyday life
  – Telecommuting/Telework
  – Software agents as our virtual “shadows”

Psychological Aspects
• Cyberspace and Mind
• Presence in Virtual Space
• Communal Mind and Virtual Community

DATA MINING
Knowledge discovery is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

Many Definitions
• Data mining is also called data or knowledge discovery
• It is a process of inferring knowledge from large oceans of data
• Search for valuable information in large volumes of data
• Analyzing data from different perspectives and summarizing it into useful information

Why Data Mining?
• DM allows you to extract knowledge from historical data and predict outcomes of future situations
• Optimize business decisions and improve customers’ satisfaction with your services
• Analyze data from many different angles, categorize it, and summarize the relationships identified
• Reveal knowledge hidden in data and turn this knowledge into a crucial competitive advantage

What Can Data Mining Do?
• Identify your best prospects and then retain them as customers
• Predict cross-sell opportunities and make recommendations
• Learn parameters influencing trends in sales and margins
• Segment markets and personalize communications, etc.

The Power of Data Mining
• Having a database is one thing; making sense of it is quite another
• It does not rely on narrow human queries to produce results, but instead uses AI-related technology and algorithms
• Inductive reasoning
• Using more than one type of algorithm to search for patterns in data
• Data mining usually produces more general (i.e., more powerful) results than those obtained by traditional techniques
• Relational DB storage and management technology is adequate for data mining applications smaller than 50 gigabytes

Reasons for the Growing Popularity of Data Mining
• Growing Data Volume
• Low Cost of Machine Learning
• Limitations of Human Analysis
…

Tasks Solved by Data Mining
• Predicting
• Classification
• Detection of relations
• Explicit modeling
• Clustering
• Market basket analysis
• Deviation detection

Algorithms
• Generally, their complexity is around n log n (n is the number of records)
• Data mining includes three major components, with corresponding algorithms:
  – Clustering (Classification)
  – Association Rules
  – Sequential Analysis

Classification Algorithms
• The aim is to develop a description or model for each class in a database, based on the features present in a set of class-labeled “training data”
• Data Classification Methods:
  – Statistical algorithms
  – Neural networks
  – Genetic algorithms
  – Nearest neighbor method
  – Rule induction
  – Data visualization

Classification-rule Learning
• Data abstraction
• Classification-rule learning
  – finding rules or decision trees that partition given data into predefined classes
  – Hunt’s method
• Decision tree building algorithms:
  – ID3 / C4.5 algorithm
  – SLIQ / SPRINT algorithm (IBM)
• Other algorithms

Parallel Algorithms
• Basic idea: N training data items are randomly distributed to P processors. All the processors cooperate to expand the root node of the decision tree
• There are two approaches for expanding the remaining nodes:
  – Synchronous approach
  – Partitioned approach

Association Rule Algorithms
• An association rule implies a certain association relationship among a set of objects in a database
• These objects “occur together”, or “one implies the other”
• Formally: X ⇒ Y, where X and Y are sets of items (itemsets)
• Key terms
  – Confidence
  – Support
• The goal – to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
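The two key terms can be made concrete with a short sketch. It assumes the usual definitions, which the slides name but do not spell out: the support of a rule X ⇒ Y is the fraction of transactions containing every item of X ∪ Y, and its confidence is support(X ∪ Y) / support(X). The function names `support` and `confidence` are illustrative, and the four transactions are the example database used in the appendix.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule X => Y, i.e. support(X u Y) / support(X)."""
    support_x = support(x, transactions)
    if support_x == 0:
        raise ValueError("the antecedent X never occurs in the database")
    return support(set(x) | set(y), transactions) / support_x


# Example database from the appendix (TIDs 10, 20, 30, 40).
DB = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]

print(support({"B", "E"}, DB))       # B and E occur together in 3 of 4 records
print(confidence({"B"}, {"E"}, DB))  # every record containing B also contains E
print(confidence({"C"}, {"A"}, DB))  # 2 of the 3 records containing C contain A
```

With a minimum support of 50% and a minimum confidence of, say, 70% (an arbitrary threshold chosen for illustration), the rule B ⇒ E would be reported and C ⇒ A would not.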
Association Rule Algorithms
• Apriori algorithm and its variations
  – AprioriTid
  – AprioriHybrid
  – FT (Fault-tolerant) Apriori
• Distributed / Parallel algorithms (FDM, …)

Sequential Analysis
• Sequential Patterns
• The problem – finding all sequential patterns with user-specified minimum support
• Elements of a sequential pattern need not be:
  – consecutive
  – simple items
• Algorithms for finding sequential patterns
  – “count-all” algorithms
  – “count-some” algorithms

Conclusion
• Drawbacks of existing algorithms
  – Data size
  – Data noise
• There are two critical technological drivers:
  – Size of the database
  – Query complexity
• The infrastructure has to be significantly enhanced to support larger applications
• Solutions
  – Adding extensive indexing capabilities
  – Using new HW architectures to achieve improvements in query time

THE NEW SOFTWARE PARADIGM
All software agents are programs, but not all programs are agents

Many Definitions
• Computational systems that inhabit some dynamic environment, sense and act autonomously, and realize a set of goals or tasks for which they are designed
• Hardware or (more usually) software-based computer system that enjoys the following properties:
  - Reactive (sensing and acting)
  - Autonomous
  - Goal-oriented (pro-active, purposeful)
  - Temporally continuous
  - Communicative (socially able)
  - Learning (adaptive)
  - Mobile
  - Flexible
  - Character

Interesting Topic of Study
• They draw on and integrate many diverse disciplines of computer science and other areas:
  – objects and distributed object architectures
  – adaptive learning systems
  – artificial intelligence and expert systems
  – collaborative online social environments
  – security
  – knowledge based systems, databases
  – communications networks
  – cognitive science and psychology
  – …

What Problems Do Agents Solve?
• Client/server network bandwidth problem
• Problems in the design of a client/server architecture
• The problems created by intermittent or unreliable network connections
• Attempts to get computers to do real thinking for us

The New Software Paradigm
• Unless special care has been taken in the design of the code, two software programs cannot interoperate
• The promise of agent technology is to move the burden of interoperability from software programmers to the programs themselves
• This can happen if two conditions are met:
  – A common language (Agent Communication Language – ACL)
  – An appropriate architecture

The Need for Standards
• Anywhere, anytime consumer access to the universal bouquet of information and services is the new goal of the information revolution
• The scope of Internet standards makes the range of choices extreme
• The Foundation for Intelligent Physical Agents (FIPA), established in 1996 in Geneva
  • international non-profit association of companies and organizations
  • specifications of generic agent technologies

FIPA Specifications
• Agent Management
• Agent Communication Language
• Agent/Software Integration
• Agent Management Support for Mobility
• Human-Agent Interaction
• Agent Security Management
• Agent Naming
• FIPA Architecture
• Agent Message Transport
  etc.
Agent Management
• Provides the normative framework within which FIPA agents exist and operate
• Establishes the logical reference model for the creation, registration, location, communication, migration and retirement of agents
  - The entities contained in the reference model are logical capability sets and do not imply any physical configuration
  - Additionally, the implementation details of individual APs (Agent Platforms) and agents are the design choices of the individual agent system developers

Components of the Model
• Agent
  - fundamental actor on an AP
  - computational process
  - as a physical software process, has a life cycle that has to be managed by the AP
• Directory Facilitator
  - yellow pages to other agents
  - supported functions are: register, deregister, modify, search
• Agent Management System
  - white pages services to other agents
  - maintains a directory of AIDs which contain transport addresses
  - supported functions are: register, deregister, modify, search, get-description, operations for underlying AP
• Message Transport Service
  - communication method between agents
• Agent Platform
  - physical infrastructure in which agents can be deployed
• Software
  - all non-agent, executable collections of instructions accessible through an agent

Agent Life Cycle
• FIPA agents exist physically on an AP and utilize the facilities offered by the AP for realising their functionalities
• In this context, an agent, as a physical software process, has a physical life cycle that has to be managed by the AP
• The state transitions of agents can be described as:
  - create
  - invoke
  - destroy
  - quit
  - suspend
  - resume
  - wait
  - wake up
  - move*
  - execute*

Agent Communication Language
• The specification consists of a set of message types and the description of their meanings
• Requirements:
  – Implementing a subset of the pre-defined message types and protocols
  – Sending and receiving the not-understood message
  – Correct implementation of communicative acts defined in the specification
  – Freedom to use communicative acts with other names, not defined in the specification
  – Obligation of correctly generating messages in the transport form
  – Language must be able to express propositions, objects and actions
  – The use of Agent Management Content Language and ontology

ACL Syntax Elements
• Pre-defined message parameters:
  :sender, :receiver, :content, :reply-with, :in-reply-to, :envelope, :language, :ontology, :reply-by, :protocol, :conversation-id
• Communicative acts:
  accept-proposal, agree, cancel, cfp, confirm, disconfirm, failure, inform, inform-if, inform-ref, not-understood, propose, query-if, query-ref, refuse, reject-proposal, request, request-when, request-whenever, subscribe

Communication Examples
- Agent i confirms to agent j that it is, in fact, true that it is snowing today:
    (confirm
      :sender i
      :receiver j
      :content "weather( today, snowing )"
      :language Prolog)
- Agent i, believing that agent j thinks that a shark is a mammal, attempts to change j's belief:
    (disconfirm
      :sender i
      :receiver j
      :content (mammal shark))
- Agent i did not understand a query-if message because it did not recognize the ontology:
    (not-understood
      :sender i
      :receiver j
      :content ((query-if :sender j :receiver i …)
                (unknown (ontology www)))
      :language sl)
- Agent i asks agent j if j is registered with domain server d1:
    (query-if
      :sender i
      :receiver j
      :content (registered (server d1) (agent j))
      :reply-with r09)
  Agent j replies that it is not:
    (inform
      :sender j
      :receiver i
      :content (not (registered (server d1) (agent j)))
      :in-reply-to r09)
- Agent i asks agent j for its available services:
    (query-ref
      :sender i
      :receiver j
      :content (iota ?x (available-services j ?x))
      …)
  Agent j replies that it can reserve trains, planes and automobiles:
    (inform
      :sender j
      :receiver i
      :content (= (iota ?x (available-services j ?x))
                  ((reserve-ticket train)
                   (reserve-ticket plane)
                   (reserve automobile)))
      …)
- Agent j refuses to reserve a ticket for i, since there are insufficient funds in i's account:
    (refuse
      :sender j
      :receiver i
      :content ((action j (reserve-ticket LHR, MUC, 27-sept-97))
                (insufficient-funds ac12345))
      :language sl)
- Auction bid:
    (inform
      :sender agent_X
      :receiver auction_server_Y
      :content (price (bid good02) 150)
      :in-reply-to round-4
      :reply-with bid04
      :language sl
      :ontology auction)

Agent/Software Integration
• Integration of services provided by non-agent software into a multi-agent community
• Definition of the relationship between agents and software systems
• Allowing agents to describe, broker and negotiate over software systems
• Allowing new software services to be dynamically introduced into an agent community
• Defining how software resources can be described, shared and dynamically controlled in an agent community

New Agent Roles
• To support the specification, two new agent roles have been identified:
  – Agent Resource Broker (ARB)
  – WRAPPER Agent

GoodNews
A system that automatically categorizes news reports that reflect positively or negatively on a company’s financial outlook

Introduction
• Correlation between news reports on a company’s financial outlook and its attractiveness as an investment
• Volume of such reports is huge
• A new text classification algorithm
  – “Domain Experts” with “self-confident” sampling technique
• Two types of data
  – (Human-)labeled
  – Unlabeled
• The algorithm classifies financial news into five predefined categories
  – (good), (good, uncertain), (neutral), (bad, uncertain), (bad)
• Text categorization task
• FCP (Frequently Co-located Phrase) is the building element for the categorization algorithm
• Text categorization – a very difficult domain for the use of machine learning
  – Very large number of input features
  – High level of attribute and class noise
  – Large percentage of irrelevant features
• Labeled data are very expensive, while unlabeled data are cheaply available

Categorization
• The algorithm categorizes each given news article into the predefined categories in terms of the referred company’s financial well-being
• GOOD – strong and explicit evidence of the company’s financial status
  – …shares of ABC company rose 2 percent to $24-15/16…
• GOOD, UNCERTAIN – predictions and forecasts of future profitability
  – … ABC company predicts fourth-quarter earnings will be high…
• NEUTRAL – nothing is mentioned about the financial well-being of the company
  – … ABC announced plans to focus on products based on recycled materials…
• BAD, UNCERTAIN – predictions of future losses
  – … ABC announced today that fourth-quarter results could fall short of expectations…
• BAD – explicitly bad evidence
  – … shares of ABC fell $0.57 to $44.65 in early NY trading…
• Problems with construction of the training (i.e. labeled) data set
  – “inter-indexer inconsistency”

Co-located Phrase
• The proposed algorithm labels the “unlabeled” news articles through a voting process among experts, which are FCPs
• Definition – a co-located phrase is a sequence of nearby, but not necessarily consecutive, words
  – … shares of ABC rose 8.5%… (shares, rose): GOOD
  – …ABC presented its new product… (present, product): NEUTRAL
• Contextual information
• The use of heuristics to cope with the enormous “phrase space” (number of possible phrases)

Naive Bayes vs. Domain Experts
• Naive Bayes with EM (Expectation Maximization)
• Problems with small sets of labeled (training) data
• EM – a class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
• The Domain Experts algorithm is able to deal with inconsistent hypotheses
• Iterative building of the training set

Implementation and Results
• The experiment focused on two performance criteria:
  – Using unlabeled data for improving categorization accuracy
  – The categorization itself
• The accuracy is around 75% (total of 2000 news articles)
• Comparison of a few different methods (picture)

Conclusions
• Domain Experts with SC (self-confident) sampling outperform naive Bayes with EM
  – collocation property and vote entropy are appropriate to such a domain
• The accuracy of around 75% is the limit with the techniques used
• Better performance could be achieved by using some natural language processing techniques
• Such techniques are still fairly rudimentary today

iMatch
The vision of each MIT student having a personal software agent, which helps to manage its owner's academic life
Introduction
• The aim: bring together MIT students and staff who may usefully collaborate with each other
• This collaboration can have several goals:
  – completing final projects
  – studying for exams
  – tutoring one another
• iMatch agents are supposed to facilitate matching of students and faculty for:
  – Research
  – Teaching
  – Internship opportunities within and across campuses

iMatch Agent Architecture
• iMatch agents are situated within an environment
• Sensors of the agent convert environmental inputs into representations that can be manipulated within the agent
• Effectors translate actions planned by the agent into executable statements for the environment
• The action planner selects the action with the highest utility according to the owner’s preference specification

Impacts and Benefits
• MIT
  – Benefit MIT students by matching them to appropriate resources
  – Aid the recruitment of student researchers
  – Help students manage their lives
  – Use iMatch in Medical Computing
• GLOBAL
  – Facilitate Cross Community Collaboration

Research Topics
• Knowledge representation
  – preference specification
• Multi-agent systems
  – reputation management system
  – static interest matching
  – dynamic interest matching
• Infrastructure
  – distributed security infrastructure

Ceteris Paribus Preference
• Ceteris paribus relations express a preference over sets of possible outcomes
• All possible outcomes are considered to be describable by some (large) set of binary features (true or false)
  – The specified features are instantiated to either true or false
  – Other features are ignored
• Example preference statements: “I prefer train”, “I prefer ice cream”, “I prefer airplane”, “I prefer chocolate”, “I prefer cell phone”, “I prefer e-mail”

CPP Agent Configuration
• Specify a domain for preference
  – Agent methods of communication and notification
  – Different security settings of different servers
• Preference statements themselves
  – How to get users to easily adjust C.P. rules (graphical interface)
  – Pose hypothetical preference questions to the user to help complete the preferences of an ambivalent user
• People will only put down their true profile if they know that the system is secure

Static Interest Matching
• Group together similar users for a specific context
• This enables viewing a human user as a resource for dynamic resource discovery (locate experts, enthusiasts,...)
• The approach:
  – Keyword matching
  – Ontological matching using Kullback-Leibler (KL) distance

Dynamic Interest Matching
• Location and/or temporal specific resource matching
• As students and their agents move from one physical location to another, iMatch services for matching the closest resources can be offered
• The idea: anything worthwhile is locatable
• The approach:
  – Intentional naming scheme
  – Reputation based resource discovery

Technology
• Components
  – Distributed Multi-Agent Infrastructures
  – Ceteris Paribus preference-based Interest Matching
  – Reputation Management Infrastructure
• Technology
  – Microsoft .NET
  – Bluetooth
  – IEEE 802.11
  – Smartcards (PC/SC)
  – INS (Intentional Naming System)

Conclusion
• Benefit MIT students by matching them to appropriate resources
• Static interest matching
  – Group together similar users for a specific context
  – This enables viewing a human user as a resource for dynamic resource discovery (locate experts, enthusiasts,...)
• Dynamic interest matching
  – Location and/or temporal specific resource matching
  – As students and their agents move from one physical location to another, iMatch services for matching the closest resources can be offered
• Help students manage their lives

The near future…
The focus of the research is on e-tourism after the year 2005, but the applications of the proposed infrastructure are manifold

Introduction
• The assumptions:
  – after the year 2005, each tourist in Europe will be equipped with a cell phone whose computing power equals or exceeds that of a Pentium IV
  – whenever a tourism-based service or product is purchased, a mobile agent is assigned to that cell phone PC, to monitor the behaviour of the customer
  – all tourist cell phone PCs create an ad hoc network around the points of tourist attraction, and link to a data mine that collects all information of interest

How to Accomplish It?
• The information of interest is not collected by asking the customer to fill out the forms, but by monitoring the behaviour of the customer • The collected information, sorted in the data mine, is made available to other tourists, as an on-line ownerindependent source of information about the given services and/or products Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 72/99 What can be done… • If a tourist would like to know, at that very moment, what restaurant has good food/atmosphere and happy customers, he/she can access the data mine (via the Internet) and obtain the information that is linked to that very moment, and is not created by the owner of the business, but by the customers themselves • Accessing the given restaurant’s website has two drawbacks: – the information is not fresh - periodically updated – the information is made by the owner of the restaurant, and therefore not completely objective Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 73/99 Conclusion • Consequently, the proposed approach works much better , and represents a qualitative step forward in the domain of maximization of customer satisfaction • This may mean that the privacy of the person is jeopardized, however, if the monitored behaviour is non-personalized, and if the customer obtains a discount based on the fact that mobile agents are welcome, the privacy stops to be an issue, and people will sign up voluntarily Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 74/99 Appendix A Survey of the Data Mining Algorithms Apriori Algorithm • The task – mining association rules by finding large itemsets and translating them to the corresponding association rules; • A  B, or A1  A2 … Am  B1  B2 … Bn, where A  B =  • The terminology – – – – Confidence Support k-itemset – a set of k items; Large itemsets – the large itemset {A, B} corresponds to the following rules (implications): A  B and B  A; Voislav Galić, Dušan Zečević, 
Đorđe Đurđević, Veljko Milutinović 76/99 Apriori Algorithm • The  operator definition – n = 1: S2 = S1  S1 = {A}, {B}, {C}}  {{A}, {B}, {C}} = {{AB}, {AC}, {BC}} – n = k: Sk+1 = Sk  Sk = {X  Y| X, Y  Sk, |X  Y| = k-1} – X and Y must have the same number of elements, and must have exactly k-1 identical elements; – Every k-element subset of any resulting set element (an element is actually a k+1 element set) has to belong to the original set of itemsets; Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 77/99 Apriori Algorithm • Example: TID elements 10 A C D 20 B C E 30 A B C 40 B E Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović E 78/99 Apriori Algorithm • Step 1 – generate a candidate set of 1-itemsets C1 – Every possible 1-element set from the database is potentially a large itemset, because we don’t know the number of its appearances in the database in advance (á priori ); – The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set; – C1 = {{A}, {B}, {C}, {D}, {E}} – Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. oneelement sets); Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 79/99 Apriori Algorithm • Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. 
one-element sets); Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 80/99 Apriori Algorithm • Step 2 – generate a set of large 1-itemsets L1 – Each element in C1 with support that exceeds some adopted minimum support (for example 50%) becomes a member of L1; – L1 = {{A}, {B}, {C},{E}} and we can omit D in further steps (if D doesn’t have enough support alone, there is no way it could satisfy requested support in a combination with some other element(s)); Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 81/99 Apriori Algorithm • Step 3 – generate a candidate set of large 2-itemsets, C2 – C2 = L1  L1 ={{AB}, {AC}, {AE}, {BC}, {BE}, {CE}} – Count the corresponding appearances • Step 4 – generate a set of large 2-itemsets, L2; – Eliminate the candidates without minimum support; – L2 = {{AC}, {BC}, {BE}, {CE}} Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović {AB} 1 {AC} 2 {AE} 1 {BC} 2 {BE} 3 {CE} 2 82/99 Apriori Algorithm • Step 5 (C3) – C3 = L2  L2 = {{BCE}} – Why not {ABC} and {ACE} – because their 2-element subsets {AB} and {AE} are not the elements of large 2-itemset set L2 (calculation is made according to the operator  definition); • Step 6 (L3) – L3 = {{BCE}}, since {BCE} satisfies the required support of 50% (two appearances); • There can be no further steps in this particular case, because L3  L3 = ; • Answer = L1  L2  L3; Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 83/99 Apriori Algorithm L1 = {large 1-itemsets} for (k=2; Lk-1  ; k++) Ck = apriori-gen(Lk-1); forall transactions t  D do begin Ct = subset (Ck, t); forall candidates c  Ct do c.count++; end; Lk = {c  Ck | c.count  minsup} end; Answer = k Lk Voislav Galić, Dušan Zečević, Đorđe Đurđević, Veljko Milutinović 84/99 Apriori Algorithm • Enhancements to the basic algorithm • Scan-reduction – The most time consuming operation in Apriori algorithm is the database 
scan; it is originally performed after each candidate-set generation, to determine the frequency of each candidate in the database;
– Scan-number reduction: counting candidates of multiple sizes in one pass;
– Rather than counting only candidates of size k in the kth pass, we can also generate and count the candidates C'k+1, where C'k+1 is generated from Ck (instead of Lk) using the ⊗ operator;
– Compare:
C'k+1 = Ck ⊗ Ck
Ck+1 = Lk ⊗ Lk
– Note that C'k+1 ⊇ Ck+1 (because Lk ⊆ Ck);
– This variation can pay off in later passes, when the cost of counting and keeping in memory the additional C'k+1 - Ck+1 candidates becomes less than the cost of scanning the database;
– There has to be enough space in main memory for both Ck and C'k+1;
– Following this idea, we can reduce the number of scans further:
• C'k+1 is calculated from Ck for k > 1;
• There must be enough memory space for all Ck's (k > 1);
– Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk's);

• Abstraction levels
– Higher-level associations are stronger (more powerful), but also less certain;
– A good practice is to adopt different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction)

DHP Algorithm
• DHP = Direct Hashing and Pruning, another algorithm for mining association rules;
• Based on the Apriori algorithm (Ck/Lk generation in the kth step);
• Empirical analysis of the Apriori algorithm shows that the candidate sets (Ck) are much larger than the corresponding sets of large itemsets (Lk), especially in the first few iterations;
• DHP introduces a more efficient method of candidate-set generation;
• The idea is to insert into Ck only those candidate sets that are likely to become large
itemsets;
• An additional improvement is accomplished through a "two-dimensional" reduction of the search base: its "length" (number of records in the search base) and its "width" (number of relevant attributes in a record);
• Characteristics of large itemsets:
– Every non-empty subset of a large itemset is a large itemset as well; for example, {BCD} ∈ L3 ⇒ {{BC}, {CD}, {BD}} ⊆ L2;
– This implies that a record is relevant for discovering large (k+1)-itemsets only if it contains at least k+1 large k-itemsets;
– During the Ck → Lk phase we can count the large k-itemsets in each record; if their number in a particular record is less than k+1, we omit that record during the Ck+1 generation;
– Similarly, if a record contains one or more large (k+1)-itemsets, each element (item) of these itemsets appears in at least k candidates from Ck

• Hashing
– Hashing boosts the performance of the DHP algorithm;
– The algorithm does not prescribe any particular hash function; the choice depends on the application;
– Likewise, it does not prescribe the size of the hash table (number of groups/addresses);

• Application example
TID   elements
10    A C D
20    B C E
30    A B C E
40    B E

• Step 1 – generate the candidate set of 1-itemsets, C1
– C1 = {{A}, {B}, {C}, {D}, {E}}
– While the support of each element is being counted, a hash tree containing all the elements from the database is built, in order to improve the counting performance;
• For each new element, DHP checks whether the element is already in the tree;
• If it is, DHP increments the current number of appearances for that element; otherwise, the element is added to the hash tree, and the number of its appearances is set to 1;
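The counting step above is the same first pass that Apriori performs, so the level-wise loop that both algorithms share can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain dictionary stands in for the hash tree, the names `apriori_gen` and `apriori` are chosen only for this example, and TID 30 is taken to be A B C E, the reading implied by the support counts given in the slides.

```python
from itertools import combinations

# Example database from the slides. TID 30 is assumed to be A B C E,
# which is what the support counts on the slides imply.
DB = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}


def apriori_gen(large_prev, k):
    """Join large (k-1)-itemsets that share k-2 items, then prune every
    candidate that has a (k-1)-subset outside large_prev."""
    candidates = set()
    for x, y in combinations(large_prev, 2):
        union = x | y
        if len(union) == k and all(
            frozenset(s) in large_prev for s in combinations(union, k - 1)
        ):
            candidates.add(union)
    return candidates


def apriori(db, min_support):
    """Return {itemset: count} for all large itemsets.
    min_support is a fraction of the number of transactions."""
    min_count = min_support * len(db)
    # Pass 1: count single items (the dict stands in for the hash tree).
    counts = {}
    for items in db.values():
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = {c: 0 for c in candidates}
        for items in db.values():  # one database scan per level
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(large)
        k += 1
    return answer


result = apriori(DB, 0.5)
for itemset, count in sorted(result.items(),
                             key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), count)
```

With a minimum support of 50% the sketch keeps {A}, {B}, {C}, {E}, then {AC}, {BC}, {BE}, {CE}, and finally {BCE}, matching L1, L2 and L3 from the Apriori walkthrough. It scans the database once per level; the scan-reduction variant is not modelled here.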
• After the appearances of each C1 element have been counted, all possible 2-element subsets of every transaction are generated and inserted into the H2 hash table;
TID   2-element subsets
10    {AC}, {AD}, {CD}
20    {BC}, {BE}, {CE}
30    {AB}, {AC}, {AE}, {BC}, {BE}, {CE}
40    {BE}
– The address of a particular subset can be calculated from the positions of its elements in the C1 candidate set, using a chosen hash function h(x, y);
– For example, let us adopt the following hash function: h({x y}) = (posC1(x)*10 + posC1(y)) mod 7;
• The corresponding H2 hash table is shown below:
address   weight   subsets
0         3        {AD} {CE} {CE}
1         1        {AE}
2         2        {BC} {BC}
3         0
4         3        {BE} {BE} {BE}
5         1        {AB}
6         3        {AC} {CD} {AC}

• Whenever a new element is added to the hash table, the weight of the corresponding address is increased by one;
• C2 is generated from L1 (just as in the Apriori case);
• In addition, only those elements that map to addresses whose weight is greater than or equal to the specified minimum support (let the minimum support be 50%, i.e. a weight of at least 2) are taken into consideration during the C2 generation;
• C2 = {{AC}, {BC}, {BE}, {CE}};
• It contains two elements fewer (!)
than the C2 set generated by the Apriori algorithm for the same example database;

• In general, the Hk hash table is used for generating the Ck candidate set in the kth step of the algorithm; Hk is created in the previous, (k-1)th, step;
• Each address of the Hk hash table holds a number of k-element subsets; its weight is the number of those subsets;
• If an address does not satisfy the minimum support requirement, then no element (set) mapped to that address can satisfy the requirement on its own ⇒ all the elements (sets) at such Hk addresses are omitted from the Ck generation;
• During the kth step, Ck is generated starting from Lk-1, with the restrictions described above;

• Conclusions:
– DHP outperforms Apriori for the same input data;
– The time spent generating the hash tables (especially H2) is outweighed by the greatly reduced candidate sets (C2, …);
– The improvements applied to Apriori may be applied here as well (scan reduction, abstraction levels, …)

THE END
Quatenus nobis denegatum diu vivere, relinquamus aliquid, quo nos vixisse testemur
(Since we are denied a long life, let us leave something behind to testify that we have lived)
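The H2 filtering used in the DHP example can be reproduced with a short Python sketch. It is an illustration only, under these assumptions: positions in C1 are 1-based in the order A, B, C, D, E; TID 30 is taken to be A B C E, as its list of 2-element subsets implies; the names `h`, `build_h2` and `dhp_c2` are hypothetical; and the item counts and the bucket weights are computed in two separate loops for readability, whereas DHP would gather them in the same pass.

```python
from itertools import combinations

# Example database from the slides (TID 30 assumed to be A B C E).
DB = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}

# 1-based position of each item in the candidate set C1.
POS = {item: i for i, item in enumerate("ABCDE", start=1)}
N_BUCKETS = 7


def h(x, y):
    """Hash function from the example: (pos(x)*10 + pos(y)) mod 7."""
    return (POS[x] * 10 + POS[y]) % N_BUCKETS


def build_h2(db):
    """Weight of each H2 address: the number of 2-element subsets of the
    transactions that hash to it."""
    weights = [0] * N_BUCKETS
    for items in db.values():
        for x, y in combinations(sorted(items), 2):
            weights[h(x, y)] += 1
    return weights


def dhp_c2(db, min_support):
    """C2 = pairs of large items whose H2 address is heavy enough."""
    min_count = min_support * len(db)
    counts = {}
    for items in db.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    l1 = sorted(i for i, n in counts.items() if n >= min_count)
    weights = build_h2(db)
    return [x + y for x, y in combinations(l1, 2)
            if weights[h(x, y)] >= min_count]


weights = build_h2(DB)
c2 = dhp_c2(DB, 0.5)
print(weights)  # weight per address 0..6
print(c2)       # surviving 2-item candidates
```

For the example database the address weights come out as 3, 1, 2, 0, 3, 1, 3, and the surviving candidates are AC, BC, BE and CE, in agreement with the H2 table and the C2 set given in the slides. A bucket weight is only an upper bound on the support of each pair mapped to it, so the surviving candidates still have to be counted against the database.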
