Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Issues in Data Mining Infrastructure Authors: Nemanja Jovanovic, [email protected] Valentina Milenkovic, [email protected] Voislav Galic, [email protected] Dusan Zecevic, [email protected] Sonja Tica, [email protected] Prof. Dr. Dusan Tosic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected] Page Number: 1 Data Mining in the Nutshell Uncovering the hidden knowledge Huge n-p complete search space Multidimensional interface NOTICE: All trademarks and service marks mentioned in this document are marks of their respective owners. Furthermore CRISP-DM consortium (NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank Groep B.V (The Netherlands)) permitted presentation of their process model. Page Number: 2 A Problem … You are a marketing manager for a cellular phone company Problem: Churn is too high Turnover (after contract expires) is 40% Customers receive free phone (cost 125$) You pay a sales commission of 250$ per contract Giving a new telephone to everyone whose contract is expiring is expensive Bringing back a customer after quitting is both difficult and expensive Page Number: 3 … A Solution Three months before a contract expires, predict which customers will leave If you want to keep a customer that is predicted to churn, offer them a new phone The ones that are not predicted to churn need no attention If you don’t want to keep the customer, do nothing How can you predict future behavior? Tarot Cards? Magic Ball? Data Mining? Page Number: 4 Still Skeptical? Page Number: 5 The Definition The automated extraction of predictive information from (large) databases Automated Extraction Predictive Databases Page Number: 6 History of Data Mining Page Number: 7 Repetition in Solar Activity 1613 – Galileo Galilei 1859 – Heinrich Schwabe Page Number: 8 The Return of the Halley Comet Edmund Halley (1656 - 1742) 1531 1607 1682 239 BC 1910 1986 2061 ??? Page Number: 9 Data Mining is Not Data warehousing Ad-hoc query/reporting Online Analytical Processing (OLAP) Data visualization Page Number: 10 Data Mining is Automated extraction of predictive information from various data sources Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines Page Number: 11 Data Mining can Answer question that were too time consuming to resolve in the past Predict future trends and behaviors, allowing us to make proactive, knowledge driven decision Page Number: 12 Data Mining Models Page Number: 13 Neural Networks Characterizes processed data with single numeric value Efficient modeling of large and complex problems Based on biological structures - Neurons Network consists of neurons grouped into layers Page Number: 14 Neuron Functionality I1 W1 I2 W2 I3 W3 In f Output Wn Output = f (W1*I1, W2*I1, …, Wn*In) Page Number: 15 Training Neural Networks Page Number: 16 Neural Networks Once trained, Neural Networks can efficiently estimate the value of an output variable for given input Neurons and network topology are essentials Usually used for prediction or regression problem types Difficult to understand Data pre-processing often required Page Number: 17 Decision Trees A way of representing a series of rules that lead to a class or value Iterative splitting of data into discrete groups maximizing distance between them at each split Classification trees and regression trees Univariate splits and multivariate splits Unlimited growth and stopping rules CHAID, CHART, Quest, C5.0 Page Number: 18 Decision Trees Balance>10 Balance<=10 Age<=32 Married=NO Age>32 Married=YES Page Number: 19 Decision Trees Page Number: 20 Rule Induction Method of deriving a set of rules to classify cases Creates independent rules that are unlikely to form a tree Rules may not cover all possible situations Rules may sometimes conflict in a prediction Page Number: 21 Rule Induction If balance>100.000 then confidence=HIGH & weight=1.7 If balance>25.000 and status=married then confidence=HIGH & weight=2.3 If balance<40.000 then confidence=LOW & weight=1.9 Page Number: 22 K-nearest Neighbor and Memory-Based Reasoning (MBR) Usage of knowledge of previously solved similar problems in solving the new problem Assigning the class to the group where most of the k-”neighbors” belong First step – finding the suitable measure for distance between attributes in the data + Easy handling of non-standard data types - Huge models Page Number: 23 K-nearest Neighbor and Memory-Based Reasoning (MBR) Page Number: 24 Data Mining Algorithms Many other available models and algorithms Logistic regression Discriminant analysis Generalized Adaptive Models (GAM) Genetic algorithms The Apriori algorithm Etc… Many application specific variations of known models Final implementation usually involves several techniques Page Number: 25 The Apriori Algorithm The task – mining association rules by finding large itemsets and translating them to the corresponding association rules; A B, or A1 A2 … Am B1 B2 … Bn, where A B = The terminology – – – – Confidence Support k-itemset – a set of k items; Large itemsets – the large itemset {A, B} corresponds to the following rules (implications): A B and B A; Page Number: 26 The Apriori Algorithm The operator definition – – – – n = 1: S2 = S1 S1 = {A}, {B}, {C}} {{A}, {B}, {C}} = {{AB}, {AC}, {BC}} n = k: Sk+1 = Sk Sk = {X Y| X, Y Sk, |X Y| = k-1} X and Y must have the same number of elements, and must have exactly k-1 identical elements; Every k-element subset of any resulting set element (an element is actually a k+1 element set) has to belong to the original set of itemsets; Page Number: 27 The Apriori Algorithm Example: TID elements 10 A C D 20 B C E 30 A B C 40 B E Page Number: 28 E The Apriori Algorithm Step 1 – generate a candidate set of 1-itemsets C1 – – – – Every possible 1-element set from the database is potentially a large itemset, because we don’t know the number of its appearances in the database in advance (á priori ); The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set; C1 = {{A}, {B}, {C}, {D}, {E}} Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets); Page Number: 29 The Apriori Algorithm Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets); {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Page Number: 30 The Apriori Algorithm Step 2 – generate a set of large 1-itemsets L1 – Each element in C1 with support that exceeds some adopted minimum support (for example 50%) becomes a member of L1; – L1 = {{A}, {B}, {C},{E}} and we can omit D in further steps (if D doesn’t have enough support alone, there is no way it could satisfy requested support in a combination with some other element(s)); Page Number: 31 {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 The Apriori Algorithm Step 3 – generate a candidate set of large 2-itemsets, C2 – C2 = L1 L1 ={{AB}, {AC}, {AE}, {BC}, {BE}, {CE}} – Count the corresponding appearances Step 4 – generate a set of large 2-itemsets, L2; – – Eliminate the candidates without minimum support; L2 = {{AC}, {BC}, {BE}, {CE}} Page Number: 32 {AB} 1 {AC} 2 {AE} 1 {BC} 2 {BE} 3 {CE} 2 The Apriori Algorithm Step 5 (C3) – – Step 6 (L3) – C3 = L2 L2 = {{BCE}} Why not {ABC} and {ACE} – because their 2-element subsets {AB} and {AE} are not the elements of large 2-itemset set L2 (calculation is made according to the operator definition); L3 = {{BCE}}, since {BCE} satisfies the required support of 50% (two appearances); There can be no further steps in this particular case, because L3 L3 = ; Answer = L1 L2 L3; Page Number: 33 The Apriori Algorithm L1 = {large 1-itemsets} for (k=2; Lk-1 ; k++) Ck = apriori-gen(Lk-1); forall transactions t D do begin Ct = subset (Ck, t); forall candidates c Ct do c.count++; end; Lk = {c Ck | c.count minsup} end; Answer = k Lk Page Number: 34 The Apriori Algorithm Enhancements to the basic algorithm Scan-reduction – – – The most time consuming operation in Apriori algorithm is the database scan; it is originally performed after each candidate set generation, to determine the frequency of each candidate in the database; Scan number reduction – counting candidates of multiple sizes in one pass; Rather than counting only candidates of size k in the kth pass, we can also calculate the candidates C’k+1, where C’k+1 is generated from Ck (instead Lk), using the operator; Page Number: 35 The Apriori Algorithm – – – – – Compare: C’k+1 = Ck Ck Ck+1 = Lk Lk Note that C’k+1 Ck+1 This variation can pay off in later passes, when the cost of counting and keeping in memory additional C’k+1 - Ck+1 candidates becomes less than the cost of scanning the database; There has to be enough space in main memory for both Ck and C’k+1; Following this idea, we can make further scan reduction: • C’k+1 is calculated from Ck for k > 1; • There must be enough memory space for all Ck’s (k > 1); – Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk’s); Page Number: 36 The Apriori Algorithm Abstraction levels – – Higher level associations are stronger (more powerful), but also less certain; A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction) Page Number: 37 References Devedzic, V., “Inteligentni informacioni sistemi,” Digit, FON, Beograd, 2003. http://www.marconi.com http://www.blueyed.com http://www.fipa.org http://www.rpi.edu http://research.microsoft.com http://imatch.lcs.mit.edu Page Number: 38 DM Process Model 5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate) SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess) CRISP – tends to become a standard Page Number: 39 CRISP - DM CRoss-Industry Standard for DM Conceived in 1996 by three companies: Page Number: 40 CRISP – DM methodology Four level breakdown of the CRISP-DM methodology: Phases Generic Tasks Specialized Tasks Process Instances Page Number: 41 Mapping generic models to specialized models Analyze the specific context Remove any details not applicable to the context Add any details specific to the context Specialize generic context according to concrete characteristic of the context Possibly rename generic contents to provide more explicit meanings Page Number: 42 CRISP – DM model Business understanding Data understanding Data preparation Modeling Business understanding Deployment Evaluation Deployment Evaluation Page Number: 43 Data understanding Data preparation Modeling Business Understanding Determine business objectives Assess situation Determine data mining goals Produce project plan Page Number: 44 Data Understanding Collect initial data Describe data Explore data Verify data quality Page Number: 45 Data Preparation Select data Clean data Construct data Integrate data Format data Page Number: 46 Modeling Select modeling technique Generate test design Build model Assess model Page Number: 47 Evaluation results = models + findings Evaluate results Review process Determine next steps Page Number: 48 Deployment Plan deployment Plan monitoring and maintenance Produce final report Review project Page Number: 49 At Last… Page Number: 50 Evolution of Data Mining Evolutionary Step Business Question Enabling Technologies Product Providers Characteristics Data Collection (1960s) What was my average total revenue over the last 5 years? Computers, tapes, disks IBM, CDC Retrospective, static data delivery Data Access (1980s) What were unit sales in New England last March? RDBMS, SQL, ODBC Oracle, Sybase Informix, IBM, Microsoft Retrospective, dynamic data delivery at record level Data Navigation (1990s) What were unit sales in New England last March? Drill down to Boston. OLAP, Multidimensional databases, data warehouses Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies Retrospective, dynamic data delivery at multiple levels Data Mining (2000) What’s likely to happen to Boston unit sales next month? Why? Advanced algorithms, multiprocessors, massive databases Lockheed, IBM, SGI, numerous startups Prospective, proactive information delivery Page Number: 51 Examples of DM projects to stimulate your imagination Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment – Targeting a set of consumers who are most likely to respond to a direct mail campaign – Predicting the probability of default for consumer loan applications – Reducing fabrication flaws in VLSI chips – Predicting audience share for television programs – Predicting the probability that a cancer patient will respond to radiation therapy – Predicting the probability that an offshore oil well is actually going to produce oil Page Number: 52 Comparison of fourteen DM tools Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant Run under the MS Windows 95, MS Windows NT, Macintosh System 7.5 Use one of the four technologies: Decision Trees, Rule Inductions, Neural, or Polynomial Networks Solve two binary classification problems: multi-class classification and noiseless estimation problem Price from 75$ to 25.000$ Page Number: 53 Comparison of fourteen DM tools The Decision Tree products were - CART - Scenario - See5 - S-Plus The Rule Induction tools were - WizWhy - DataMind - DMSK Neural Networks were built from three programs - NeuroShell2 - PcOLPARS - PRW The Polynomial Network tools were - ModelQuest Expert - Gnosis - a module of NeuroShell2 - KnowledgeMiner Page Number: 54 Criteria for evaluating DM tools A list of 20 criteria for evaluating DM tools, put into 4 categories: Capability measures what a desktop tool can do, and how well it does it - Handles missing data - Considers misclassification costs - Allows data transformations - Includes quality of tesing options - Has a programming language - Provides useful output reports - Provides visualisation Page Number: 55 - Visualisation + excellent capability good capability - some capability “blank” no capability Page Number: 56 Criteria for evaluating DM tools Learnability/Usability shows how easy a tool is to learn and use - Tutorials Wizards Easy to learn User’s manual Online help Interface Page Number: 57 Criteria for evaluating DM tools Interoperability shows a tool’s ability to interface with other computer applications - Importing data - Exporting data - Links to other applications Flexibility - Model adjustment flexibility - Customizable work enviroment - Ability to write or change code Page Number: 58 Data Input & Output Model + excellent capability good capability - some capability “blank” no capability Page Number: 59 A classification of data sets Pima Indians Diabetes data set – – Wisconsin Breast Cancer data set – – 699 instances of breast tumors some of which are malignant, most of which are benign 10 attributes plus the binary malignancy variable per case The Forensic Glass Identification data set – – 768 cases of Native American women from the Pima tribe some of whom are diabetic, most of whom are not 8 attributes plus the binary class variable for diabetes per instance 214 instances of glass collected during crime investigations 10 attributes plus the multi-class output variable per instance Moon Cannon data set – – 300 solutions to the equation: x = 2v 2 sin(g)cos(g)/g the data were generated without adding noise Page Number: 60 Evaluation of forteen DM tools Page Number: 61 Potentials of R&D in Cooperation with U. of Belgrade An Overview of Advanced Datamining Projects for High-Tech Computer Industry in the USA and EU VLSI Detection for Internet/Telephony Interfaces Goran Davidović, Miljan Vuletić, Veljko Milutinović, Tom Chen, and Tom Brunett * eT Page Number: 63 USERS... ... Superposition/DETECTION Superposition/DETECTION SPECIALIZED INTERNET REMOTE SITE SERVICE PROVIDER HOME/OFFICE/FACTORY AUTOMATION ON THE INTERNET Page Number: 64 Reconfigurable FPGA for EBI Božidar Radunović, Predrag Knežević, Veljko Milutinović, Steve Casselman, and John Schewel* * Virtual Page Number: 65 USERS ... SPECIALIZED INTERNET SERVICE PROVIDER VCC VCC CUSTOMER SATISFACTION vs CUSTOMER PROFILE Page Number: 66 BioPoP Veljko Milutinovic, Vladimir Jovicic, Milan Simic, Bratislav Milic, Milan Savic, Veljko Jovanovic, Stevo Ilic, Djordje Veljkovic, Stojan Omorac, Nebojsa Uskokovic, and Fred Darnell •isItWorking.com Page Number: 67 Testing the Infrastructure for EBI Phones Faxes Email Web links Servers Routers Software • Statistics • Correlation • Innovation Page Number: 68 CNUCE Integration and Datamining on Ad-Hoc Networks and the Internet Veljko Milutinović, Luca Simoncini, and Enrico Gregory *University of Pisa, Santanna, CNUCE Page Number: 69 GSM DM Ad-Hoc Page Number: 70 Internet Genetic Search with Spatial/Temporal Mutations Jelena Mirković, Dragana Cvetković, and Veljko Milutinović *Comshare Page Number: 71 Drawbacks of INDEX-BASED: Time to index + ranking Advantages of LINKS-BASED: Mission critical applications + customer tuned ranking Well organized markets: Best first search If elements of disorder: G w DB mutations Chaotic markets: G w S/T mutations Provider Page Number: 72 e-Banking on the Internet Miloš Kovačević, Bratislav Milic, Veljko Milutinović, Marco Gori, and Roberto Giorgi *University of Siena Page Number: 73 Bottleneck#1: Searching for Clients and Investments 1472++ *University of Siena + Banco di Monte dei Paschi Page Number: 74 WaterMarking for e-Banking on the Internet Darko Jovic, Ivana Vujovic, Veljko Milutinovic Fraunhofer, IPSI, Darmstadt, Germany Page Number: 75 Bottleneck#1: SpeedUp Page Number: 76 SSGRR Organizing Conferences via the Internet Zoran Horvat, Nataša Kukulj, Vlada Stojanović, Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović, Veljko Milutinović, and Frederic Patricelli *SSGRR, L’Aquila Page Number: 77 http://www.ssgrr.it 2000: Arno Penzias 2001: Bob Richardson 2002: Jerry Friedman 2003: Harry Kroto Page Number: 78 Summary Books with Nobel Laureates: Kenneth Wilson, Ohio (North-Holland) Leon Cooper, Brown (Prentice-Hall) Robert Richardson, Cornell (Kluwer-Academics) Herb Simon (Kluwer-Academics) Jerome Friedman, MIT (IOS Press) Harold Kroto (IOS Press) Arno Penzias (IOS Press) Page Number: 79 http://galeb.etf.bg.ac.yu/~vm/ e-mail: [email protected] Page Number: 80