Introduction to Biometrics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #3
Information Management and Data Mining
August 29, 2005
Objective of the Unit
 This unit gives an overview of various information management
technologies. In addition, some details of data mining are given.
Outline of the Unit
 What is Information Management?
 Some Information Management Technologies
 Information Management Applications
 Data Mining
Revisiting the DM/IM/KM Framework
[Figure: layered framework of Data Management Technologies, Information Management Technologies, and Knowledge Management Technologies. Recoverable box labels include relational, distributed, heterogeneous, object, and multimedia databases; data warehousing; data mining; information retrieval; digital libraries; sensor and peer-to-peer information management; the semantic web; expert systems and reasoning under uncertainty; and knowledge management. Several labels refer to security and privacy, including the inference problem, biometrics, and digital forensics.]
Each layer builds on the technologies of the lower layers
What is Information Management?
 Information management essentially analyzes the data and makes
sense out of the data
 Several technologies have to work together for effective information
management
- Data Warehousing: Extracting relevant data and putting this data
into a repository for analysis
- Data Mining: Extracting previously unknown information from the
data
- Multimedia: managing different media including text, images,
video and audio
- Web: managing the databases and libraries on the web
Data Warehouse
[Figure: Users query the warehouse. The data warehouse holds data correlating employees with medical benefits and projects. It draws on three source systems: an Oracle DBMS for employees, a Sybase DBMS for projects, and an Informix DBMS for medical data. The warehouse could be any DBMS; it is usually based on the relational data model.]
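As a minimal sketch of the correlation a warehouse like this performs, the following uses Python's built-in sqlite3 module with an in-memory database standing in for the warehouse. The table names, columns, and rows are hypothetical and are not taken from the lecture; in the figure the three sources are separate Oracle, Sybase, and Informix systems.

```python
import sqlite3


def build_warehouse():
    """Load three hypothetical source extracts into one in-memory warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE projects (emp_id INTEGER, project TEXT);
        CREATE TABLE medical (emp_id INTEGER, plan TEXT);
        """
    )
    # Hypothetical rows; a real warehouse would be fed by extraction jobs.
    conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    conn.executemany("INSERT INTO projects VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])
    conn.executemany("INSERT INTO medical VALUES (?, ?)", [(1, "Plan A"), (2, "Plan B")])
    return conn


def employees_with_benefits_and_projects(conn):
    """Correlate employees with their projects and medical benefits."""
    return conn.execute(
        """
        SELECT e.name, p.project, m.plan
        FROM employees AS e
        JOIN projects AS p ON p.emp_id = e.emp_id
        JOIN medical AS m ON m.emp_id = e.emp_id
        ORDER BY e.emp_id
        """
    ).fetchall()
```

Users would then query the single warehouse rather than each source system separately.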
Data Mining
Related terms: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware
The process of discovering meaningful new correlations, patterns, and trends,
often previously unknown, by sifting through large amounts of data using pattern
recognition technologies and statistical and mathematical techniques
(Thuraisingham 1998)
Multimedia Information Management
[Figure: Broadcast News Editor (BNE) and Broadcast News Navigator (BNN). A video source is segregated into video streams: imagery, audio, and closed caption text. Processing components shown include scene change detection, frame classifier, silence detection, speaker change detection, closed caption preprocess, commercial detection, key frame selection, token detection, named entity tagging, broadcast detection, correlation, story segmentation, and story GIST/theme. Video and metadata are analyzed and stored in a multimedia database management system, which supports web-based search/browse by program, person, location, ...]
Semantic Web
 Adapted from Tim Berners-Lee's description of the Semantic Web
[Figure: layer stack, listed in the order it appears: Logic, Proof and Trust; Rules/Query; RDF, Ontologies; XML, XML Schemas; URI, UNICODE. "Other Services" appears beside the stack, and TRUST and PRIVACY appear as vertical labels.]
 Some Challenges: Security and Privacy cut across all layers;
Integration of Services; Composability
Semantic Web Technologies
 Web Database/Information Management
- Information retrieval and Digital Libraries
 XML, RDF and Ontologies
- Representing information
 Information Interoperability
- Integrating heterogeneous data and information sources
 Intelligent agents
- Agents for locating resources, managing resources, querying
resources and understanding web pages
 Semantic Grids
- Integrating semantic web with grid computing technologies
Secure Data Sharing Across Coalitions
[Figure: Agency A, Agency B, and Agency C each hold component data/policy. Each agency exports data/policy, and the exports feed the data/policy for the coalition.]
Some Emerging Information Management
Technologies
 Visualization
- Visualization tools enable the user to better understand the
information
 Peer-to-Peer Information Management
- Peers communicate with each other, share resources and carry
out tasks
 Sensor and Wireless Information Management
- Autonomous sensors cooperating with one another, gathering
data, fusing data and analyzing the data
- Integrating wireless technologies with semantic web
technologies
Information Management for Applications:
Examples
 Decision Support
 E-Commerce
 Collaboration
 Training
 Knowledge Management
 Virtual Organizations and Dynamic Coalitions
Outline of Data Mining
 What is Data Mining
 Steps to Data Mining
 Need for Data Mining
 Example Applications
 Technologies for Data Mining
 Why Data Mining Now?
 Preparation for Data Mining
 Data Mining Tasks, Methodology, Techniques
 Commercial Developments
 Status, Challenges, and Directions
 Example Data Mining Technique
Steps to Data Mining
[Figure: Data Sources -> Integrate data sources -> Clean/modify data sources -> Mine the data -> Examine results/Prune results -> Report final results]
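The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout (customer, item), the counting step used as a stand-in for mining, and the `min_count` parameter are all hypothetical and not from the lecture.

```python
def integrate(sources):
    """Integrate data sources: merge records from several sources into one list."""
    return [record for source in sources for record in source]


def clean(records):
    """Clean/modify: drop records with missing fields and normalize item names."""
    cleaned = []
    for customer, item in records:
        if customer and item:
            cleaned.append((customer.strip(), item.strip().lower()))
    return cleaned


def mine(records):
    """Mine the data: a stand-in mining step that counts purchases per item."""
    counts = {}
    for _, item in records:
        counts[item] = counts.get(item, 0) + 1
    return counts


def prune(counts, min_count):
    """Examine/prune results: keep only items seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}


def report(results):
    """Report final results: most frequent first, ties broken by name."""
    ordered = sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{item}: {n}" for item, n in ordered]


def run_pipeline(sources, min_count=2):
    """Run the steps in the order given in the figure."""
    return report(prune(mine(clean(integrate(sources))), min_count))
```

In practice each stage is far more involved; the point of the sketch is only the order of the stages and the fact that results are examined and pruned before they are reported.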
Need for Data Mining
 Large amounts of current and historical data being stored
 As databases grow larger, making decisions directly from the raw data
becomes impractical; knowledge must be derived from the stored data
 Data from multiple data sources and multiple domains
- Medical, Financial, Military, etc.
 Need to analyze the data
- Support for planning (historical supply and demand trends)
- Yield management (scanning airline seat reservation data to maximize
yield per seat)
- System performance (detect abnormal behavior in a system)
- Mature database analysis (clean up the data sources)
Example Applications
 A medical supplies company increases sales by targeting its
advertising at physicians who are likely to buy its products
 A credit bureau limits losses by selecting candidates who are
unlikely to default on their payments
 An intelligence agency detects abnormal behavior among its
employees
 An investigation agency finds fraudulent behavior by some people
Integration of Multiple Technologies
[Figure: Data Mining integrates Artificial Intelligence, Machine Learning, Database Management, Parallel Processing, Statistics, and Visualization.]
Why Data Mining Now?
 Large amounts of data are being produced
 Data is being organized
 Technologies are developing for database management, data
warehousing, parallel processing, machine intelligence, etc.
 It is now possible to mine the data and get patterns and trends
 Interesting applications exist
Preparation for Data Mining
 Getting the data into the right format
 Data warehousing
 Scrubbing and cleaning the data
 Some idea of application domain
 Determining the types of outcomes
- e.g., Clustering, classification
 Evaluation of tools
 Getting the staff trained in data mining
Some Types of Data Mining (Data Mining Tasks)
 Classification – grouping records into meaningful subclasses
- e.g., a marketing organization has a list of people living in
Manhattan who all own cars costing over 20K
 Sequence Detection
- e.g., John always buys groceries after going to the bank
 Data dependency analysis – identifying potentially interesting
dependencies or relationships among data items
- e.g., if John, James, and Jane meet, Bill is also present
 Deviation detection – discovery of significant differences between an
observation and some reference
- Anomalous instances
- Discrepancies between observed and expected values
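Deviation detection, the last task above, can be illustrated with a minimal sketch that compares observed values against expected reference values. The function name, the relative `tolerance` parameter, and the sample figures are assumptions made for illustration; the lecture does not prescribe a particular test for significance.

```python
def find_deviations(observed, expected, tolerance=0.2):
    """Return keys whose observed value differs from the expected value
    by more than `tolerance`, measured as a fraction of the expected value."""
    flagged = []
    for key, expected_value in expected.items():
        observed_value = observed.get(key)
        if observed_value is None:
            continue  # nothing was observed for this key
        if expected_value == 0:
            deviates = observed_value != 0
        else:
            relative_gap = abs(observed_value - expected_value) / abs(expected_value)
            deviates = relative_gap > tolerance
        if deviates:
            flagged.append(key)
    return flagged
```

For example, with an expected value of 100 per day and observations of 105, 160, and 70, only the second and third days exceed a 20% tolerance and would be reported as anomalous instances.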
Data Mining Methodology (or Approach)
 Top-down
- Hypothesis testing: validate beliefs
 Bottom-up
- Discover patterns
- Directed: some idea of what you want to get
- Undirected: start from fresh
Some Data Mining Techniques
 Market Basket analysis
 Decision Trees
 Neural networks
 Link Analysis
 Genetic Algorithms
 Automatic Cluster Detection
 Inductive logic programming
Commercial Developments in Data Mining: Some
Products
 WizSoft - WizWhy
 Hugin - Hugin
 IBM - Intelligent Miner
 Red Brick - DataMind
 Neo Vista - Decision Series
 Reduct Systems - Datalogic/R
 IDIS - Information Discovery
 Lockheed Martin - Recon
 Nicesoft - Nicel
 SAS - Enterprise Miner
Current Status, Challenges and Directions
 Status
- Data Mining is now a technology
- Several prototypes and tools exist; most of them work on
relational databases
 Challenges
- Mining large quantities of data; dealing with noise and
uncertainty; reasoning with incomplete data
 Directions
- Mining multimedia and text databases; Web mining
(structure, usage and content); mining metadata; real-time data mining
Example Data Mining Technique:
What is Market Basket Analysis?
 Market basket analysis is a collection of techniques that
discover rules such as which items are purchased together
 It has roots in point-of-sale transactions, but has gone beyond this
application
- E.g., who travels together, who is seen with whom, etc.
 Market basket analysis is used as a starting point when transaction
data is available and we are not sure of the patterns we are looking
for
- Find items that are purchased together
 Essentially, market basket analysis produces association rules
Example
Person     Countries Visited
John       England, France
James      Germany, England, Switzerland
William    England, Austria
Mary       England, Austria, France
Jane       Switzerland, France
Co-Occurrence Table
              England  Switzerland  Germany  France  Austria
England          4          1          1       2        2
Switzerland      1          2          1       1        0
Germany          1          1          1       0        0
France           2          1          0       3        1
Austria          2          0          0       1        2
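The co-occurrence counts above can be reproduced with a short sketch. The trip data comes from the example; the names `TRIPS` and `co_occurrence` are chosen here for illustration.

```python
from itertools import combinations

# Trip data from the example: person -> countries visited.
TRIPS = {
    "John": ["England", "France"],
    "James": ["Germany", "England", "Switzerland"],
    "William": ["England", "Austria"],
    "Mary": ["England", "Austria", "France"],
    "Jane": ["Switzerland", "France"],
}


def co_occurrence(transactions):
    """Count, for every pair of items, how many transactions contain both.

    The diagonal entry (item, item) counts transactions containing the item.
    Pairs that never occur together are simply absent from the result.
    """
    counts = {}
    for items in transactions:
        unique = sorted(set(items))
        for item in unique:
            counts[(item, item)] = counts.get((item, item), 0) + 1
        for a, b in combinations(unique, 2):
            # Store both orders so the table is symmetric.
            counts[(a, b)] = counts.get((a, b), 0) + 1
            counts[(b, a)] = counts.get((b, a), 0) + 1
    return counts
```

Calling `co_occurrence(TRIPS.values())` and looking up a pair such as `("England", "France")` gives the corresponding cell of the table; use `.get(pair, 0)` for pairs that may never co-occur.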
Example (Concluded)
 England and France, and England and Austria, are more likely to be
traveled together than any other pair of countries
 Austria is never traveled together with Germany or Switzerland
 Germany is never traveled together with Austria or France
 Rule:
- If a person travels to France then he/she also travels to England
- Support for this rule is 40%, since 2 trips out of 5 contain both
France and England
- Confidence for this rule is about 67%, since 2 of the 3 trips
that contain France also contain England
- That is, the rule "if France then England" has support 40% and
confidence about 67%
 Challenge: How to automatically generate the rules
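The support and confidence figures for the France-to-England rule can be checked with a minimal sketch. The trip lists are taken from the example; the function names are chosen here for illustration.

```python
# One list of countries per person, from the example.
TRIPS = [
    ["England", "France"],                  # John
    ["Germany", "England", "Switzerland"],  # James
    ["England", "Austria"],                 # William
    ["England", "Austria", "France"],       # Mary
    ["Switzerland", "France"],              # Jane
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base
```

Here `support(TRIPS, ["France", "England"])` is 2/5 and `confidence(TRIPS, ["France"], ["England"])` is 2/3, matching the figures on the slide.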
Basic Process
 Choosing the right set of items
- Need to gather the right set of transaction data and the right
level of detail, ensuring data quality
 Generating rules from the data
- Generate co-occurrence matrix for single items
- Generate co-occurrence matrix with 2 items and use this to find
rules with 2 items
- Generate co-occurrence matrix with 3 items and use this to find
rules with 3 items; etc.
 Overcoming practical limits imposed by thousands of items
- Avoid combinatorial explosions
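One way to sketch the level-by-level generation described above, while limiting the combinatorial explosion, is to count only those larger item combinations whose smaller sub-combinations were all frequent at the previous level. This pruning idea is commonly associated with the Apriori algorithm, which the slides do not name, so treat it as a swapped-in illustration. The `min_support` and `max_size` parameters are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets of up to max_size items whose
    support (count / number of transactions) is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)
    # Level 1 candidates: every single item that appears anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates and size <= max_size:
        level = {}
        for candidate in candidates:
            count = sum(1 for t in transactions if candidate <= t)
            if count / total >= min_support:
                level[candidate] = count
        frequent.update(level)
        # Next level: combine surviving items, keeping a candidate only if
        # all of its subsets one item smaller were frequent at this level.
        survivors = sorted({item for itemset in level for item in itemset})
        size += 1
        candidates = {
            frozenset(combo)
            for combo in combinations(survivors, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        }
    return frequent
```

On the travel example with a minimum support of 40%, Germany is dropped at the first level, so no pair or triple involving Germany is ever counted.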
Association Rules
 Rules that find associations in data
 An example of an association rule is {x1, x2, x3} -> x4, meaning
that if x1, x2, and x3 are purchased, x4 is also purchased
 Association rules have confidence values
- Strong rules are rules with a confidence value above a
threshold
 Challenge is to improve the algorithm
- E.g., Partition-based approach, sampling
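Selecting strong rules by a confidence threshold can be sketched as follows. The basket data, the function name `strong_rules`, and the threshold values are hypothetical and serve only to illustrate rules of the form {x1, x2, x3} -> x4; this is not the partition-based or sampling approach mentioned above.

```python
from itertools import combinations


def strong_rules(transactions, itemset, min_confidence):
    """Split `itemset` into every antecedent/consequent pair and keep the
    rules whose confidence is at least min_confidence.

    Returns a list of (antecedent, consequent, confidence) tuples, with
    antecedent and consequent given as sorted lists.
    """
    transactions = [set(t) for t in transactions]
    itemset = set(itemset)
    # Number of transactions containing the whole itemset.
    both = sum(1 for t in transactions if itemset <= t)
    rules = []
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            antecedent = set(combo)
            matches = sum(1 for t in transactions if antecedent <= t)
            if matches == 0:
                continue  # the antecedent never occurs, so no rule
            conf = both / matches
            if conf >= min_confidence:
                rules.append((sorted(antecedent), sorted(itemset - antecedent), conf))
    return rules
```

Raising the threshold shrinks the set of strong rules: a rule that holds in two of three matching baskets passes a 0.6 threshold but not a 0.9 threshold.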
Challenges and Directions
 Performance improvements
 Applying techniques for web mining including web content mining,
web structure mining and web usage mining
 Finding associations in text
- Associations between words in a document or multiple
documents