Guest Lecture
Introduction to Data Mining
Dr. Bhavani Thuraisingham
September 17, 2010
5/22/2017 15:27
4-2
Objective of the Unit
* This unit provides an introduction to data mining
Outline of Data Mining
* What is Data Mining?
* Data warehousing vs data mining
* Steps to Data Mining
* Need for Data Mining
* Example Applications
* Technologies for Data Mining
* Why Data Mining Now?
* Preparation for Data Mining
* Data Mining Tasks, Methodology, Techniques
* Commercial Developments
* Status, Challenges, and Directions
What is Data Mining?
Data Mining is also called: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware
The process of discovering meaningful new correlations, patterns, and trends, often previously unknown, by sifting through large amounts of data using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham 1998)
Data Warehouses vs Data Mining
* Goal: Improved business efficiency
  - Improve marketing (advertise to the most likely buyers)
  - Inventory reduction (stock only needed quantities)
* Information source: Historical business data
  - Example: Supermarket sales records
    Date/Time/Register   Fish   Turkey   Cranberries   Wine   ...
    12/6 13:15 2         N      Y        Y             N      ...
    12/6 13:16 3         Y      N        N             Y      ...
  - Size ranges from 50k records (research studies) to terabytes (years of data from chains)
  - Data is already being warehoused
* Sample question: what products are generally purchased together?
The answers are in the data; we need to MINE the data
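The sample question can be made concrete with a small sketch. The Python below is only an illustration, not a method from the lecture: it counts how often each pair of items appears in the same basket and reports the fraction of baskets containing both. The first two baskets are modeled loosely on the sample rows of the sales table; the third basket and the helper name pair_support are assumptions added for the example.

```python
from collections import Counter
from itertools import combinations

# Toy transactions. The first two are modeled loosely on the sample rows in
# the slide; the third is invented so that not every pair has the same count.
baskets = [
    {"Turkey", "Cranberries"},          # 12/6 13:15, register 2
    {"Fish", "Wine"},                   # 12/6 13:16, register 3
    {"Turkey", "Cranberries", "Wine"},  # hypothetical extra basket
]

def pair_support(transactions):
    """Return {(item_a, item_b): fraction of transactions containing both}."""
    counts = Counter()
    for items in transactions:
        # sorted() gives every pair a deterministic (alphabetical) key order
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, round(support[best_pair], 2))
```

On real warehouse data the same idea needs pruning of rare items before pairs are counted, since the number of candidate pairs grows quadratically with the number of products.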
What Does Warehousing do for Data Mining?
* Difficult to mine disparate data sources
* Data warehouse integrates the disparate data sources into a single logical entity
* Maintains integrity of the data
  - Scrubbing and Cleaning
* Formats the data for querying and mining
  - Multidimensional data
Is it Necessary to Have a Data Warehouse for Data Mining?
* The key to successful data mining is having good data
* Data warehousing integrates heterogeneous data sources, formats the data, and facilitates interactive query processing
* Having a data warehouse is good for data mining, but perhaps not essential
* Data mining tools could be used directly on good/clean databases
What’s going on in data mining?
* What are the technologies for data mining?
  - Database management, data warehousing, machine learning, statistics, pattern recognition, visualization, parallel processing
* What can data mining do for you?
  - Data mining outcomes: Classification, Clustering, Association, Anomaly detection, Prediction, Estimation, . . .
* How do you carry out data mining?
  - Data mining techniques: Decision trees, Neural networks, Market-basket analysis, Link analysis, Genetic algorithms, . . .
* What is the current status?
  - Many commercial products mine relational databases
* What are some of the challenges?
  - Mining unstructured data, extracting useful patterns, web mining, and data mining in relation to national security and privacy
Steps to Data Mining
[Diagram: Data Sources -> Integrate data sources -> Clean/modify data sources -> Mine the data -> Examine results/Prune results -> Report final results -> Take actions]
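The stages named in this slide can be read as a chain of functions, each consuming the output of the previous one. The Python sketch below is a toy illustration under stated assumptions: the record layout, the function names, and the "mining" step (a simple per-item count) are all invented for the example and are not taken from the lecture. The final "take actions" stage is left to the people reading the report.

```python
def integrate(sources):
    """Integrate data sources: merge records from several sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Clean/modify: drop records with missing fields and exact duplicates."""
    seen, result = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if None in rec.values() or key in seen:
            continue
        seen.add(key)
        result.append(rec)
    return result

def mine(records):
    """Mine the data: a toy pattern, the number of purchases per item."""
    counts = {}
    for rec in records:
        counts[rec["item"]] = counts.get(rec["item"], 0) + 1
    return counts

def prune(patterns, min_count=2):
    """Examine/prune results: keep patterns seen at least min_count times."""
    return {k: v for k, v in patterns.items() if v >= min_count}

def report(patterns):
    """Report final results as human-readable lines."""
    return [f"{item}: {count}" for item, count in sorted(patterns.items())]

# Hypothetical records from two stores (the "data sources").
store_a = [{"customer": 1, "item": "wine"}, {"customer": 2, "item": "fish"}]
store_b = [{"customer": 3, "item": "wine"}, {"customer": 4, "item": None},
           {"customer": 3, "item": "wine"}]

results = report(prune(mine(clean(integrate([store_a, store_b])))))
print(results)
```

Keeping each stage as a separate function mirrors the diagram: any stage can be replaced (for example, a different mining technique) without touching the others.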
Knowledge Directed to Data Mining
[Diagram: the same stages as in Steps to Data Mining (Data Sources, Integrate data sources, Clean/modify data sources, Mine the data, Examine results/Prune results, Report final results, Take actions), with an Expert System added]
Need for Data Mining
* Large amounts of current and historical data are being stored
* As databases grow larger, decisions can no longer be made directly from the raw data; knowledge must be derived from the stored data
* Data comes from multiple data sources and multiple domains
  - Medical, Financial, Military, etc.
* Need to analyze the data
  - Support for planning (historical supply and demand trends)
  - Yield management (scanning airline seat reservation data to maximize yield per seat)
  - System performance (detect abnormal behavior in a system)
  - Mature database analysis (clean up the data sources)
Example Applications
* A medical supplies company increases sales by targeting its advertising at physicians who are likely to buy its products
* A credit bureau limits losses by selecting candidates who are unlikely to default on their payments
* An intelligence agency detects abnormal behavior among its employees
* An investigative agency uncovers fraudulent behavior by individuals
Integration of Multiple Technologies
[Diagram: Data Mining at the center, drawing on Data Warehousing, Machine Learning, Database Management, Parallel Processing, Statistics, and Visualization]
Why Data Mining Now?
* Large amounts of data are being produced
* Data is being organized
* Technologies are developing for database management, data warehousing, parallel processing, machine intelligence, etc.
* It is now possible to mine the data and get patterns and trends
* Interesting applications exist
Preparation for Data Mining
* Getting the data into the right format
* Data warehousing
* Scrubbing and cleaning the data
* Having some idea of the application domain
* Determining the types of outcomes
  - e.g., Clustering, classification
* Evaluation of tools
* Getting the staff trained in data mining
Some Data Mining Tasks/Outcomes
* Data Mining Tasks
  - Classification
  - Estimation
  - Prediction
  - Affinity Grouping
  - Clustering
  - Description
  - Other
    = Deviation detection, Anomaly detection, Association
* Note: Different texts and papers use different terms for the same task
  - e.g., Association and Affinity Grouping have been used interchangeably
Some Types of Data Mining (Data Mining Tasks/Outcomes)
* Classification: grouping records into meaningful subclasses
  - e.g., a marketing organization has a list of people living in Manhattan who all own cars costing over 20K
* Sequence detection: finding events that regularly follow one another
  - e.g., John always buys groceries after going to the bank
* Data dependency analysis: identifying potentially interesting dependencies or relationships among data items
  - e.g., if John, James, and Jane meet, Bill is also present
* Deviation detection: discovery of significant differences between an observation and some reference
  - Anomalous instances
  - Discrepancies between observed and expected values
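Deviation detection, as described in the last bullet, compares each observation with a reference. A minimal Python sketch follows, assuming the reference is simply the sample mean and that "significant" means more than a chosen number of standard deviations away. Both choices, the function name find_deviations, and the login counts are assumptions for illustration; the slide does not prescribe a particular test.

```python
from statistics import mean, stdev

def find_deviations(observed, threshold=2.0):
    """Return (index, value, z_score) for values far from the sample mean."""
    mu = mean(observed)
    sigma = stdev(observed)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    flagged = []
    for i, x in enumerate(observed):
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, x, z))
    return flagged

# Hypothetical daily login counts for one employee; the last day is unusual.
daily_logins = [10, 12, 11, 9, 10, 11, 10, 12, 9, 60]
outliers = find_deviations(daily_logins)
for index, value, z in outliers:
    print(f"day {index}: {value} logins (z = {z:.2f})")
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a robust reference such as the median is often a safer choice.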
Data Mining Methodology (or Approach)
* Top-down
  - Hypothesis testing
    = Validate beliefs
* Bottom-up
  - Discover patterns
  - Directed
    = You have some idea of what you want to find
  - Undirected
    = Start fresh, with no target in mind
Outline of Data Mining Techniques
* Data Mining Techniques
  - Market Basket Analysis
  - Memory-based Detection
  - Automatic Cluster Detection
  - Link Analysis
  - Decision Trees and Rule Induction
  - Neural Networks
  - Inductive Logic Programming
  - Other techniques
* Some Observations
Commercial Developments in Data Mining: Some Early Products
* Information Discovery - IDIS
* WizSoft - WizWhy
* Hugin - Hugin
* IBM - Intelligent Miner
* Red Brick - DataMind (became part of Informix and now part of IBM)
* Neo Vista - Decision Series
* Reduct Systems - Datalogic/R
* Lockheed Martin - Recon
* Nicesoft - Nicel
* SAS - Enterprise Miner
* Recent products will be discussed in Unit #9
Current Status, Challenges and Directions
* Status
  - Data mining is now an established technology
  - Several prototypes and tools exist; most of them work on relational databases
* Challenges
  - Mining large quantities of data; dealing with noise and uncertainty; false positives and negatives
* Directions
  - Mining multimedia and text databases; Web mining (structure, usage and content); data mining in relation to national security and privacy