Introduction to Biometrics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #3: Information Management and Data Mining
August 29, 2005

Objective of the Unit
- This unit gives an overview of various information management technologies. In addition, some details of data mining will also be given.

Outline of the Unit
- What is Information Management?
- Some Information Management Technologies
- Information Management Applications
- Data Mining

Revisiting the DM/IM/KM Framework
[Figure: layered framework of Data Management Technologies, Information Management Technologies and Knowledge Management Technologies. Each layer builds on the technologies of the lower layers.]

What is Information Management?
- Information management essentially analyzes the data and makes sense of it
- Several technologies have to work together for effective information management
  - Data Warehousing: extracting relevant data and putting it into a repository for analysis
  - Data Mining: extracting previously unknown information from the data
  - Multimedia: managing different media including text, images, video and audio
  - Web: managing the databases and libraries on the web

Data Warehouse
[Figure: users query the warehouse. The data warehouse holds data correlating employees with medical benefits and projects. Sources: Oracle DBMS for employees, Sybase DBMS for projects, Informix DBMS for medical. The warehouse could be any DBMS; it is usually based on the relational data model.]

Data Mining
Related terms: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data, often previously unknown, using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham 1998)

Multimedia Information Management
[Figure: Broadcast News Editor (BNE) and Broadcast News Navigator (BNN) pipeline. A video source is segregated into video streams: imagery, audio and closed caption text. Processing steps shown: scene change detection, frame classifier, silence detection, speaker change detection, closed caption preprocess, token detection, correlation, broadcast detection, commercial detection, story segmentation, key frame selection, named entity tagging, story gist and theme. Video and metadata are analyzed and stored in a multimedia database management system, which supports web-based search/browse by program, person, location, ...]
Semantic Web
[Figure: Semantic Web layers, adapted from Tim Berners-Lee's description of the Semantic Web. Layers listed from top to bottom: Logic, Proof and Trust; Rules/Query; RDF, Ontologies; XML, XML Schemas; URI, UNICODE. Also labeled: Other Services, Trust, Privacy.]
- Some challenges: security and privacy cut across all layers; integration of services; composability

Semantic Web Technologies
- Web Database/Information Management
  - Information retrieval and digital libraries
- XML, RDF and Ontologies
  - Representing information
- Information Interoperability
  - Integrating heterogeneous data and information sources
- Intelligent agents
  - Agents for locating resources, managing resources, querying resources and understanding web pages
- Semantic Grids
  - Integrating semantic web with grid computing technologies

Secure Data Sharing Across Coalitions
[Figure: Data/Policy for Coalition, three Export Data/Policy blocks, and Component Data/Policy for Agency A, Agency B and Agency C.]

Some Emerging Information Management Technologies
- Visualization
  - Visualization tools enable the user to better understand the information
- Peer-to-Peer Information Management
  - Peers communicate with each other, share resources and carry out tasks
- Sensor and Wireless Information Management
  - Autonomous sensors cooperating with one another, gathering data, fusing data and analyzing the data
  - Integrating wireless technologies with semantic web technologies

Information Management for Applications: Examples
- Decision Support
- E-Commerce
- Collaboration
- Training
- Knowledge Management
- Virtual Organizations and Dynamic Coalitions

Outline of Data Mining
- What is Data Mining
- Steps to Data Mining
- Need for Data Mining
- Example Applications
- Technologies for Data Mining
- Why Data Mining Now?
- Preparation for Data Mining
- Data Mining Tasks, Methodology, Techniques
- Commercial Developments
- Status, Challenges, and Directions
- Example Data Mining Technique

Data Mining
Related terms: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data, often previously unknown, using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham 1998)

Steps to Data Mining
[Figure: Data sources -> Integrate data sources -> Clean/modify data sources -> Mine the data -> Examine results/prune results -> Report final results]

Need for Data Mining
- Large amounts of current and historical data are being stored
- As databases grow larger, making decisions directly from the raw data becomes impractical; we need knowledge derived from the stored data
- Data from multiple data sources and multiple domains
  - Medical, financial, military, etc.
- Need to analyze the data
  - Support for planning (historical supply and demand trends)
  - Yield management (scanning airline seat reservation data to maximize yield per seat)
  - System performance (detect abnormal behavior in a system)
  - Mature database analysis (clean up the data sources)

Example Applications
- A medical supplies company increases sales by targeting its advertising at physicians who are likely to buy its products
- A credit bureau limits losses by selecting candidates who are unlikely to default on their payments
- An intelligence agency determines abnormal behavior of its employees
- An investigation agency finds fraudulent behavior of some people

Integration of Multiple Technologies
[Figure: Data Mining at the center, drawing on Artificial Intelligence, Machine Learning, Database Management, Parallel Processing, Statistics and Visualization.]

Why Data Mining Now?
- Large amounts of data are being produced
- Data is being organized
- Technologies are developing for database management, data warehousing, parallel processing, machine intelligence, etc.
- It is now possible to mine the data and get patterns and trends
- Interesting applications exist

Preparation for Data Mining
- Getting the data into the right format
- Data warehousing
- Scrubbing and cleaning the data
- Some idea of the application domain
- Determining the types of outcomes
  - e.g., clustering, classification
- Evaluation of tools
- Getting the staff trained in data mining

Some Types of Data Mining (Data Mining Tasks)
- Classification: grouping records into meaningful subclasses
  - e.g., a marketing organization has a list of people living in Manhattan, all owning cars costing over 20K
- Sequence detection
  - e.g., John always buys groceries after going to the bank
- Data dependency analysis: identifying potentially interesting dependencies or relationships among data items
  - e.g., if John, James, and Jane meet, Bill is also present
- Deviation detection: discovery of significant differences between an observation and some reference
  - Anomalous instances
  - Discrepancies between observed and expected values

Data Mining Methodology (or Approach)
- Top-down
  - Hypothesis testing
    - Validate beliefs
- Bottom-up
  - Discover patterns
  - Directed
    - Some idea of what you want to get
  - Undirected
    - Start from fresh

Some Data Mining Techniques
- Market basket analysis
- Decision trees
- Neural networks
- Link analysis
- Genetic algorithms
- Automatic cluster detection
- Inductive logic programming

Commercial Developments in Data Mining: Some Products
- WizSoft - WizWhy
- Hugin - Hugin
- IBM - Intelligent Miner
- Red Brick - DataMind
- Neo Vista - Decision Series
- Reduct Systems - Datalogic/R
- IDIS - Information Discovery
- Lockheed Martin - Recon
- Nicesoft - Nicel
- SAS - Enterprise Miner

Current Status, Challenges and Directions
- Status
  - Data mining is now a technology
  - Several prototypes and tools exist; many or almost all of them work on relational databases
- Challenges
  - Mining large quantities of data; dealing with noise and uncertainty; reasoning with incomplete data
- Directions
  - Mining multimedia and text databases, web mining (structure, usage and content), mining metadata, real-time data mining

Example Data Mining Technique: What is Market Basket Analysis?
- Market basket analysis is a collection of techniques that discover rules such as which items are purchased together
- It has roots in point-of-sale transactions, but has gone beyond this application
  - e.g., who travels together, who is seen with whom, etc.
- Market basket analysis is used as a starting point when transaction data is available and we are not sure of the patterns we are looking for
  - Find items that are purchased together
- Essentially, market basket analysis produces association rules

Example

Person     Countries Visited
John       England, France
James      Germany, England, Switzerland
William    England, Austria
Mary       England, Austria, France
Jane       Switzerland, France

Co-Occurrence Table

              England  Switzerland  Germany  France  Austria
England          4          1          1       2        2
Switzerland      1          2          1       1        0
Germany          1          1          1       0        0
France           2          1          0       3        1
Austria          2          0          0       1        2

Example (Concluded)
- England and France, and England and Austria, are more likely to be traveled together than any other two countries
- Austria is never traveled together with Germany or Switzerland
- Germany is never traveled together with Austria or France
- Rule: if a person travels to France, then he/she also travels to England
  - Support for this rule is 40%, since 2 trips out of 5 contain both France and England
  - Confidence for this rule is about 67%, since 2 out of the 3 trips that contain France also contain England
  - That is, the rule "if France then England" has support 40% and confidence about 67%
- Challenge: how to automatically generate the rules

Basic Process
- Choosing the right set of items
  - Need to gather the right set of transaction data at the right level of detail, ensuring data quality
- Generating rules from the data
  - Generate the co-occurrence matrix for single items
  - Generate the co-occurrence matrix with 2 items and use this to find rules with 2 items
  - Generate the co-occurrence matrix with 3 items and use this to find rules with 3 items; etc.
- Overcoming practical limits imposed by thousands of items
  - Avoid combinatorial explosions

Association Rules
- Rules that find associations in data
- An example of an association rule is {x1, x2, x3} → x4, meaning that if x1, x2, and x3 are purchased, x4 is also purchased
- Association rules have confidence values
  - Strong rules are rules with a confidence value above a threshold
- The challenge is to improve the algorithm
  - e.g., partition-based approach, sampling

Challenges and Directions
- Performance improvements
- Applying techniques for web mining, including web content mining, web structure mining and web usage mining
- Finding associations in text
  - Associations between words in a document or multiple documents
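
Worked example: the travel example from the market basket analysis slides can be reproduced with a short script. This is a minimal sketch in Python, not the method used in the lecture. The names `trips`, `cooccurrence`, `support` and `confidence` are illustrative assumptions. It rebuilds the co-occurrence counts for single items and pairs, then computes the support and confidence of the rule "if France then England".

```python
from collections import Counter
from itertools import combinations

# Transactions from the travel example: person -> set of countries visited.
trips = {
    "John": {"England", "France"},
    "James": {"Germany", "England", "Switzerland"},
    "William": {"England", "Austria"},
    "Mary": {"England", "Austria", "France"},
    "Jane": {"Switzerland", "France"},
}


def cooccurrence(transactions):
    """Count how often each item, and each pair of items, appears together.

    The diagonal entry (a, a) holds the number of transactions containing a.
    """
    counts = Counter()
    for items in transactions:
        for item in items:
            counts[(item, item)] += 1
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the table symmetric
    return counts


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    transactions = list(transactions)
    hits = sum(1 for items in transactions if itemset <= items)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    transactions = list(transactions)
    both = support(transactions, antecedent | consequent)
    return both / support(transactions, antecedent)


table = cooccurrence(trips.values())
rule_support = support(trips.values(), {"France", "England"})
rule_confidence = confidence(trips.values(), {"France"}, {"England"})

print(table[("England", "France")])                 # 2
print(f"{rule_support:.0%} {rule_confidence:.1%}")  # 40% 66.7%
```

Counting every pair like this is only practical for a small number of distinct items. With thousands of items the number of candidate item sets grows combinatorially, which is the limit the Basic Process slide refers to.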