D2K – Data To Knowledge
March 19, 2004
Duke University

Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
[email protected]

Outline
• Overview of Data Mining
• Overview of D2K Functionality
• D2K Toolkit
  • MAIDS – Mining Streaming Data
• D2K Driven Application
  • ThemeWeaver – Mining Text Data
  • MAEViz – Visualizing Earthquake Damage Analysis
• D2K Streamline (SL)
  • EMO – Finding Optimal Decisions
• D2K Web Service
  • Phylomat – Finding Motifs in Sequences

alg | Automated Learning Group

ALG Mission
The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and
• To transfer the resulting software technology into real-world applications

ALG Research, Development, & Technology Transfer Model

Overview of Knowledge Discovery
What Is It?
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• The understandable patterns are used to:
  • Make predictions about or classifications of new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Create graphical data visualization to aid humans in discovering complex patterns

Overview of Knowledge Discovery
Why Do We Need Data Mining?
• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)
  How do you explore millions of records, tens or hundreds or thousands of fields, and find patterns?
• As databases grow, using traditional query languages for the decision support process becomes infeasible
• Many queries of interest are difficult to state in a query language (query formulation problem)
  • “Find all cases of fraud”
  • “Find all individuals likely to buy a Ford Explorer”
  • “Find all documents that are similar to this customer’s problem”

Overview of Knowledge Discovery
Knowledge Discovery Process

Overview of Knowledge Discovery
Required Effort for Each KDD Step
[Bar chart: Effort (%), on a scale of 0 to 60, for Objectives Determination, Data Preparation, Data Mining, and Interpretation/Evaluation. Arrows indicate the direction we want the effort to go.]

Overview of Knowledge Discovery
Three Primary Paradigms
• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired
  • Classification is the prediction of predefined classes – Naive Bayesian, Decision Trees, and Neural Networks
  • Regression is the prediction of continuous data – Neural Networks and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data analysis
  • Association Rules and Link Analysis
  • Clustering and Self Organizing Maps
• Deviation Detection – identifying outliers in the data
  • Visualization

Importance of Data Mining Framework
• Provides capability to build custom applications
• Provides access to data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides access to distributed computing capabilities

D2K Overview
D2K - Data To Knowledge
D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.

D2K Overview
D2K and Its Many Components
• D2K Infrastructure: D2K API, data flow environment, distributed computing framework, and runtime system
• D2K Modules: Computational units written in Java that follow the D2K API
• D2K Itineraries: Modules that are connected to form an application
• D2K Toolkit: User interface for specifying and executing itineraries that provides the rapid application development environment
• D2K-Driven Applications: Applications that use D2K modules with a custom user interface
• D2K Streamline (SL): Task-driven system that uses D2K modules
• D2K Web/Grid Services: Enables web deployment

D2K Overview
D2K Toolkit
Major features that D2K provides to an application developer include:
• Visual programming system employing a data flow paradigm
• Scalable distributed computing capabilities
• Flexible and extensible software development environment
• Multi-layered learning strategies
• Integrated environment for models and visualization
• Web service capabilities for deployment

D2K Overview
D2K Modules
Input Module: Loads data from the outside world
• Flat files, databases, etc.
Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.
Compute Module: Performs main algorithmic computations
• Naïve Bayesian, Decision Tree, Apriori, etc.
User Input Module: Requires interaction with the user
• Data Selection, Input and Output selection, etc.
Output Module: Saves data to the outside world
• Flat files, databases, etc.
Visualization Module: Provides visual feedback to the user
• Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot

D2K Overview
D2K Module Icon Description
• Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.
• Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
• Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
• Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Current ALG Projects
MAIDS: Mining Alarming Incidents in Data Streams
Stream Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing, requiring a fast, real-time response
• Data streams nicely capture our data processing needs of today
• Random access is expensive: single linear scan algorithm (can only have one look)
• Store only the summary of the data seen thus far
• Most stream data are low-level or multidimensional in nature and need multi-level and multidimensional processing

Using D2K Toolkit
MAIDS

Current ALG Projects
Text Mining
• Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
  • Extraction of partial knowledge in the text
• Web Mining
  • Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
  • Predict a class for each text document
• Clustering
  • Generating collections of similar text documents

Using D2K Driven Application
Text Mining: Views from T2K and ThemeWeaver

Using D2K Driven Application
MAEViz: Damage Synthesis Visualization
• Displays terrain map
• Loads hazard, inventory, and fragility data
• Shows contour map of ground acceleration (hazard)
• Displays cones/bars to indicate level of damage
• Overlays shapefiles of different information
• Uses VTK for 3D
• Uses CUBE at BI

D2K SL
D2K Streamline (D2K SL)
• Provides a step-by-step interface to guide the user in data analysis
• Supports returning to earlier steps to run with different parameters
• Uses the D2K infrastructure transparently
• Uses the same D2K modules
• Provides a way to capture different experiments

Using D2K SL
EMO – Evolutionary Multiobjective Optimization
• Identify tradeoffs among complex objectives
• Apply a genetic algorithm (GA) optimization in a general framework
• Guide the user through discrete steps for defining decision variables, fitness functions, and constraints, and for setting up GA parameters

D2K Web Service Architecture
• Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
• Job results are also stored in the web service tier.
• Results are returned to clients upon request.
• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
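The job flow described in the web service architecture bullets (a client submits a job, a remote server loads the itinerary and processes it, the result is stored in the web service tier and returned on request) can be sketched as a toy in-memory model. This is only an illustration in Python, not the D2K Web Service API: `WebServiceTier`, `Job`, `submit`, `fetch_result`, and `process_next` are hypothetical names, an itinerary is stood in for by a plain callable, and SOAP transport, accounts, and the relational database are left out.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, Optional

# Hypothetical stand-in for an itinerary: any callable mapping input to output.
Itinerary = Callable[[object], object]


@dataclass
class Job:
    job_id: int
    itinerary_name: str
    payload: object
    status: str = "queued"  # becomes "done" once a server has processed it
    result: Optional[object] = None


@dataclass
class WebServiceTier:
    """Toy web service tier: holds itineraries, jobs, and job results."""
    itineraries: Dict[str, Itinerary] = field(default_factory=dict)
    jobs: Dict[int, Job] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def submit(self, itinerary_name: str, payload: object) -> int:
        """Client side: register a job and return its id."""
        if itinerary_name not in self.itineraries:
            raise KeyError(f"unknown itinerary: {itinerary_name}")
        job = Job(next(self._ids), itinerary_name, payload)
        self.jobs[job.job_id] = job
        return job.job_id

    def fetch_result(self, job_id: int) -> Optional[object]:
        """Client side: results are returned only upon request."""
        job = self.jobs[job_id]
        return job.result if job.status == "done" else None


def process_next(tier: WebServiceTier) -> bool:
    """Toy remote server: load one queued job's itinerary from the tier and run it."""
    for job in tier.jobs.values():
        if job.status == "queued":
            itinerary = tier.itineraries[job.itinerary_name]
            job.result = itinerary(job.payload)  # result is stored back in the tier
            job.status = "done"
            return True
    return False
```

In this sketch a client would call `submit`, a server loop would call `process_next`, and the client would later call `fetch_result`. Keeping results in the tier rather than on the server mirrors the bullet stating that job results are stored in the web service tier.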
Using D2K Web Service
Phylomat (Motif Analysis Tool for Phylogenomics)

The ALG Team
Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students: Ritesh Agrawal, Tyler Alumbaugh, John Cassel, Sang-Chul Lee, Xiaolei Li, Jeff Ng, Scott Ramon, Martin Urban, Bei Yu, Hwanjo Yu

Licensing D2K
• Faculty, staff, and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu
• Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free
• Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee

Contact
John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
[email protected]