Data Mining
UMUC CSMN 667
Lecture #2
By Dr. Borne 2005
UMUC Data Mining Lecture 2
1
Term Paper - Data Mining Case Analysis
• Refer to Project Descriptions section of WebTycho course
Syllabus for detailed information.
• 1-page Summary (Abstract+Outline) due: April 4, 2005
• Final Paper Due Date: 12 midnight, April 18, 2005
• Submit both in your WebTycho Assignments Folder
• Term Paper Page Restrictions: 5-8 pages
• I will submit your paper to TurnItIn.com for verification of
originality – per UMUC Graduate School policies.
• Format/Style: Use the SPIE Conference Proceedings Style,
which is available at:
http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs
[ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS]
Case Analysis Instructions (1)
The goal of the paper assignment is to complete an in-depth study of
a data mining application. Examples of applications include
financial, scientific, medical, intrusion detection, and web mining.
Describe data types, data volumes, technical challenges, end-goals,
who is the user community, which data mining algorithms are most
relevant, why data mining, how is it used, what is the current status
of data mining usage in this field? --- Possible case topics include:
• A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick).
• A bank determining the credit worthiness of a potential customer (e.g., American Express, Bank
of America).
• A medical insurer looking to detect medical fraud.
• Gene detection in BioInformatics (e.g., Celera).
• Glitch or anomaly detection in scientific time series data.
• Abnormal network access behavior for detection of computer system intrusion and security
violation.
Case Analysis Instructions (2)
• You may choose to go in depth in either one of
these two areas:
– A data mining application domain: Evaluate the application area in
detail, as explained on the previous slide, including a review and analysis of
the different data mining techniques employed there.
Or
– A data mining technique: Research in depth the different application
domains where this technique has been used. Answer the questions on the
previous slide when evaluating this technique’s different application areas.
Case Analysis Paper - Instructions (3)
• Please e-mail me your suggested topic (application area to
be researched) so that I may verify that it is okay.
Case Analysis Paper - Instructions (4)
• Submit your completed paper in WebTycho.
• You may submit your paper in any of these
formats: PDF, Microsoft Word, or PostScript
(PS).
• You must submit it no later than midnight on
April 18. WebTycho will not allow submissions
after that time.
• Submit the paper in your "Assignments
Folder" (on the left menu bar within the
WebTycho course website).
Lecture 2:
“Data Mining Roots”
(Chapter 2 of Dunham textbook)
Lecture 2 Outline
• Summary of “What is Data Mining?” Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as “Rule Induction”
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover
“What is Data Mining?”
From online reading assignment: Data Mining Tutorial at:
http://www.megaputer.com/dm/dm101.php3
Summary of “What is Data Mining?” Tutorial
• What is data mining?
• Why use data mining?
• What can Data Mining do for you?
• Reasons for the growing popularity of Data Mining
• Tasks Solved by Data Mining
• Different DM Technologies and Systems
– Subject-oriented analytical systems
– Statistical packages
– Neural Networks
– Evolutionary Programming
– Memory Based Reasoning
– Decision Trees
– Genetic Algorithms
– Nonlinear Regression Methods
What can Data Mining do for you?
(business-focused list)
• Identify your best prospects and then
retain them as customers.
• Predict cross-sell opportunities and make
recommendations.
• Learn parameters influencing trends in
sales and margins.
• Segment markets and personalize
communications.
Reasons for the Growing Popularity of Data Mining
• Growing Data Volumes
• Limitations of Human Analysis
• Low Cost of Machine Learning
Tasks Solved by Data Mining
• Prediction
• Explicit Modeling
• Classification
• Detection of Relations
• Deviation Detection
• Clustering
• Market Basket Analysis
Foundations of Data Mining
Foundations of Data Mining: Databases,
Statistics, and Machine Learning
• David Hand (1998. “Data Mining: Statistics and
More?”, The American Statistician, 52, pp. 112–
118) used the following definition.
– "Data mining is a new discipline lying at the interface of
statistics, database technology, pattern recognition, machine
learning, and other areas. It is concerned with the secondary
analysis of large databases in order to find previously
unsuspected relationships which are of interest or value to
the database owners.”
– Why “secondary”? … Because the data were typically
collected for other purposes (such as billing, accounting,
customer addresses, etc.). Primary analysis of large
databases is generally the domain of STATISTICS.
Slide from Lecture 1
Evolution of Data Mining
<http://www.thearling.com/text/dmwhite/dmwhite.htm>
• Data Collection (1960s)
– Business Question: "What was my total revenue in the last five years?"
– Enabling Technologies: Computers, tapes, disks
– Characteristics: Retrospective, static data delivery
• Data Access (1980s)
– Business Question: "What were unit sales in New England last March?"
– Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Characteristics: Retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
– Business Question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
– Characteristics: Retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
– Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
– Characteristics: Prospective, proactive information delivery
Foundation for Data Mining Techniques
• 1960s:
– Data collection, database creation, IMS, and hierarchical DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific,
engineering, financial, manufacturing, sales, etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases, and
Web databases
History of Data Mining
• Dates for specific events were imprecise in the
preceding slides. This might be a little better :
Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]
Data Mining Stepping Stones
http://www.cs.sfu.ca/~han/DM_Book.html
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]
Layers, from top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Database Systems
Database Systems
• DBMS joins “AI and statistics” to become Data Mining
• Data mining usually asks complex statistical questions
that are difficult to answer via traditional SQL queries
• Data mining relies on special algorithms outside of the
standard DBMS/SQL family of tools
• Data mining is used to extract knowledge from DBMS,
not just the data bits (i.e., KDD)
• Data mining applies familiar statistical concepts to
large DBMS (e.g., outlier detection; cluster analysis;
data modeling; evolutionary analysis; prediction)
Data Mining is a core database function
• Data Mining has many names / aliases :
– Knowledge Discovery in Databases (KDD)
– Machine Learning (ML)
– Exploratory Data Analysis (EDA)
– Intelligent Data Analysis (IDA)
– On-Line Analytical Processing (OLAP)
– Business Intelligence (BI)
– Customer Relationship Management (CRM)
– Business Analytics
– Target Marketing
– Cross-Selling
– Market Basket Analysis
– Credit Scoring
– Case-Based Reasoning (CBR)
– Connecting the Dots
– Intrusion Detection Systems (IDS)
– Recommendation / Personalization Systems!
Database Systems and Data Mining
• Data mining brings novel non-traditional concepts to
large DBMS (e.g., association mining; neural nets;
decision trees; link analysis; pattern recognition;
classification; regression; SOMs). For example:
– Clustering Analysis = group together similar items and
separate dissimilar items
– Classification Prediction = predict the class label
– Regression = predict a numeric attribute value
– Association Analysis = detect attribute-value conditions that
occur frequently together (e.g., Beer & Diapers example)
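As a rough illustration of association analysis, the sketch below counts how often items occur together in a handful of made-up market baskets. The basket contents are hypothetical, and "support" and "confidence" are the standard association-rule measures, which this slide does not define; treat the code as a minimal sketch rather than a full association-mining algorithm.

```python
# Hypothetical market baskets (each basket is the set of items in one purchase).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


beer_and_diapers = support({"beer", "diapers"}, baskets)
diapers_then_beer = confidence({"diapers"}, {"beer"}, baskets)
print(f"support = {beer_and_diapers:.2f}, confidence = {diapers_then_beer:.2f}")
```

Here two of the five baskets contain both items, and two of the three baskets with diapers also contain beer.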
Types of Databases to be Mined
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories:
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW, and eventually the Semantic Web
Data Warehousing and OLAP
Data Warehousing
• Data warehouse = Materialized view
• Integrated view of data from distributed sources
• If transformation process can be represented via SQL,
then data warehouse can be seen as a DB view:
– CREATE VIEW warehouse_table AS
SELECT …
FROM source_table1, source_table2, …
WHERE …
– except that the view is materialized = result is stored
and needs to be maintained when source data change
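The difference between an ordinary view and a materialized one can be sketched with Python's built-in sqlite3 module. SQLite has no native materialized views, so this sketch emulates one by storing the query result in a regular table and refreshing it by hand. The table names, columns, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source table standing in for an operational database.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Boston", 100.0), ("Boston", 50.0), ("Hartford", 70.0)],
)

SUMMARY_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

# Ordinary view: only the query is stored; it is re-evaluated on every read.
cur.execute(f"CREATE VIEW sales_view AS {SUMMARY_QUERY}")


def refresh_warehouse(cur):
    """Emulate a materialized view: store the query result as a table."""
    cur.execute("DROP TABLE IF EXISTS warehouse_table")
    cur.execute(f"CREATE TABLE warehouse_table AS {SUMMARY_QUERY}")


def boston_total(cur, relation):
    """Read the Boston total from the given view or table."""
    row = cur.execute(
        f"SELECT total FROM {relation} WHERE region = 'Boston'"
    ).fetchone()
    return row[0]


refresh_warehouse(cur)

# The source changes after the warehouse was loaded.
cur.execute("INSERT INTO sales VALUES ('Boston', 25.0)")

view_total = boston_total(cur, "sales_view")        # sees the new row
stale_total = boston_total(cur, "warehouse_table")  # stored result is now stale
refresh_warehouse(cur)                              # maintenance step
fresh_total = boston_total(cur, "warehouse_table")
print(view_total, stale_total, fresh_total)
```

The stored copy answers queries without touching the source, but it only reflects the new sale after the refresh step runs.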
Order of Database Operations (1)
• When building a DW, pay attention to the
order of operations in the SQL command
– particularly if large data need to be selected,
grouped, and ordered
– perhaps build intermediate views to cull data
down to manageable size
• Order of operations . . .
Order of Database Operations (2)
Clauses in the order written, with their operational order in parentheses:
select ..... (4) specifies attributes and computations to appear in answer
from .... (1) indicates Cartesian product of source tables
where ..... (2) provides boolean to filter Cartesian product
groupby .... (3) specifies attributes necessary to cluster the results of the where-filter
orderby .... (5) indicates attributes on which to order any visual display or sequential tuple returns
into .... (6) specifies a temporary table to hold the answer
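The operational order can be traced by hand in plain Python. The sketch below follows the six steps literally on two tiny, hypothetical tables; a real DBMS would optimize the plan and would not build the full Cartesian product, so this only illustrates the logical order.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source tables.
stores = [{"store": "S1", "city": "Boston"}, {"store": "S2", "city": "Hartford"}]
sales = [
    {"store": "S1", "units": 5},
    {"store": "S1", "units": 7},
    {"store": "S2", "units": 3},
    {"store": "S2", "units": 0},
]

# (1) from: Cartesian product of the source tables (2 x 4 = 8 rows).
rows = [{"st": st, "sa": sa} for st, sa in product(stores, sales)]

# (2) where: boolean filter on the product (join condition plus a predicate).
rows = [
    r for r in rows
    if r["st"]["store"] == r["sa"]["store"] and r["sa"]["units"] > 0
]

# (3) groupby: cluster the surviving rows by the grouping attribute.
groups = defaultdict(list)
for r in rows:
    groups[r["st"]["city"]].append(r["sa"]["units"])

# (4) select: compute the attributes that appear in the answer.
answer = [(city, sum(units)) for city, units in groups.items()]

# (5) orderby: sort the answer for display.
answer.sort(key=lambda pair: pair[1], reverse=True)

# (6) into: hold the answer in a temporary table (here, a plain list).
temp_table = list(answer)
print(temp_table)
```

Filtering in step (2) before grouping in step (3) is what keeps the intermediate result small, which is the point of paying attention to the order when the data are large.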
Maintaining the Data Warehouse
The key concept is ETL :
– Extraction: extract relevant
data and/or changes from the
DB sources
– Transformation: transform
the data to match the
warehouse schema
– Loading: integrate data (and
subsequent changes to data)
into the warehouse
Data Warehousing “features”
• Data are integrated into the DW in advance,
prior to queries being formulated
– Caution: Query results could therefore be stale
• Data are copied from distributed sources
– Care must be exercised to maintain consistency
– Query processing is local to the DW:
• faster
• can operate even when data sources are unavailable
Selecting views to materialize
• Factors that affect what to materialize:
– Storage cost
– Update cost
– Which queries will benefit from it
– How much will those queries benefit from it
• Examples:
– GROUP BY A1 is small, but not useful for most
queries
– GROUP BY A1, B2, C3 is useful for most
queries, but too large to be of much benefit
Data Warehousing and OLAP
(On-Line Analytical Processing)
• OLAP as Data Mining:
– Read data from integrated view of data sources
– Complex queries of DW for Data Analysis
– Data Analysis for Knowledge Discovery
(KDD = Data Mining)
– Knowledge Discovery for Decision Making
– Goal: optimize reads and data warehouse
queries for data exploration, mining, analysis
OLTP versus OLAP
(On-Line Transaction Processing vs. On-Line Analytical Processing)
• OLTP
– Mostly updates
– Short, simple transactions
– DBA, clerical users
– Goal: transaction throughput
– Local sources: heterogeneous DBs
• OLAP
– Mostly reads
– Long, complex queries
– Analysts, decision makers
– Goal: fast queries
– Distributed sources: single integrated view (data warehouse)
OLAP Operations in the Warehouse
• Slice (select one dimensional view)
• Dice (select multi-dimensional view;
aids in the search for trends and
patterns)
• Roll-up (consolidation; dimension
reduction; aggregation; using simple
or complex expressions)
• Drill-down (querying specific items)
• Visualize (“see” the results; allows
for intuitive data understanding)
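The slice, dice, roll-up, and drill-down operations can be sketched on a tiny in-memory fact table. The data, the dimension names, and the function names are hypothetical, and the sketch follows one common reading of each operation rather than any particular OLAP product's API.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, month, units).
facts = [
    ("New England", "Boston", "March", 120),
    ("New England", "Boston", "April", 90),
    ("New England", "Hartford", "March", 60),
    ("Mid-Atlantic", "Baltimore", "March", 80),
]


def slice_(rows, month):
    """Slice: fix one dimension (month) to a single value."""
    return [r for r in rows if r[2] == month]


def dice(rows, regions, months):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return [r for r in rows if r[0] in regions and r[2] in months]


def roll_up(rows):
    """Roll-up: aggregate city-level units up to the region level."""
    totals = defaultdict(int)
    for region, _city, _month, units in rows:
        totals[region] += units
    return dict(totals)


def drill_down(rows, region):
    """Drill-down: break one region's total back out by city."""
    totals = defaultdict(int)
    for reg, city, _month, units in rows:
        if reg == region:
            totals[city] += units
    return dict(totals)


march = slice_(facts, "March")
by_region = roll_up(march)
new_england = drill_down(march, "New England")
sub_cube = dice(facts, {"New England"}, {"March", "April"})
print(by_region, new_england, len(sub_cube))
```

This mirrors the business question from the earlier slide: unit sales in New England last March, then a drill-down to Boston.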
From Lecture #1
The Data Warehouse as the Source
for the Mining Process
From “DataMines for DataWarehouses” article
(available in Webliography)
Data Mining external
to the Data Warehouse
Data Mining within
the Data Warehouse
Statistics and Data Mining
Data Mining = Statistical Analysis?
• "Data mining … is the exploration and analysis, by automatic and
semi-automatic means, of large quantities of data in order to
discover meaningful patterns and rules." (Berry, J. A. & Linoff, G.
[1997]. Data mining Techniques For Marketing, Sales and Customer
Support, John Wiley & Sons, Inc. New York, p.5, http://www.dataminers.com/books/order.html )
• "Data mining is the process of selecting, exploring, and modeling
large amounts of data to uncover previously unknown patterns of
data for business advantage." (SAS Institute Inc.,
http://www.sas.com/technologies/analytics/datamining/index.html )
• "Data mining simply means finding patterns in your business data
which you can use to do your business better" (SPSS Inc.,
http://www.statistical.com.au/dm.htm )
• "Data mining is the use of statistical analysis and machine learning
techniques, in a semiautomatic fashion, on large collections of
data." (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance
11, 34–42.)
Statistics and Data Mining
• Data mining initially got a bad name because it was
viewed as “statistical dredging” or a “fishing
expedition”.
• Data mining became an acceptable practice because
its users exercised statistical rigor in their analyses.
• Challenges and concerns:
– Data volumes are huge. Techniques often don’t scale.
– Contaminated or corrupt data values (6-sigma effect)
– Selection bias; non-independent observations
– Fishing expedition = if you look hard enough, you will
find something. But, is it really useful or not? … …
this is the “Interestingness” Problem …
• Are the data mining results interesting to anyone?
Quality Management and Data Mining
• The focus of TQM (Total Quality Management) is total customer
satisfaction.
• This can be realized through CRM (Customer Relationship
Management) systems = a data mining technology :
– Gather data
– Analyze data
– Make decisions based upon results
• Related to this are 6-Sigma quality control processes : customer
satisfaction maximized through minimizing defects in products
and services delivered.
• Some references:
– http://www.sbaer.uca.edu/newsletter/2002/012202.pdf
– http://www.qualitydigest.com/apr99/html/body_spcguide.html
Information Retrieval
Information Retrieval (IR)
• IR is a combination of data discovery and
data mining in digital libraries or other
information repositories.
• An IR system operates on a collection of
documents (e.g., the WWW)
• IR is sometimes called Text Mining or Web
Mining
• Effectiveness of an IR project is measured by
precision and recall
Information Retrieval Metrics
Precision = (relevant & retrieved) / (retrieved)
– “Am I interested in the documents retrieved?”
– High Precision means most of the retrieved
documents are relevant to my query
Recall = (relevant & retrieved) / (relevant)
– “Have all relevant documents been retrieved?”
– High Recall means that most of the relevant
documents have been retrieved.
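The two ratios translate directly into set operations. The sketch below uses hypothetical document identifiers; returning 0.0 for an empty denominator is a convention chosen here, not something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 5 documents actually relevant.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5", "d6", "d7"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Two of the four retrieved documents are relevant, and two of the five relevant documents were retrieved, so neither measure is high for this query.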
IR and Text/Web Mining
• Semantic markup of Web or other text documents using
XML (eXtensible Markup Language)
• XML enables metadata / keyword harvesting from
document collections (e.g., Web screen-scraping)
• Harvested metadata can be stored in a Data Warehouse for
mining -- this is clearly an example of a materialized view
of distributed data sources
• Other metrics: “similarity” to other documents
(e.g., common keywords, common keyphrases)
• Application area: Automated Recommendation System
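One simple way to turn "common keywords" into a similarity score is the Jaccard coefficient: shared keywords divided by all distinct keywords. The slide does not name a specific metric, so Jaccard is an assumption here, and the keyword sets are hypothetical.

```python
def keyword_similarity(keywords_a, keywords_b):
    """Jaccard similarity: |A intersect B| / |A union B|, in the range 0.0 to 1.0."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty keyword sets
    return len(a & b) / len(a | b)


# Hypothetical harvested keywords for two documents.
doc1 = {"data", "mining", "warehouse", "olap"}
doc2 = {"data", "mining", "statistics"}

similarity = keyword_similarity(doc1, doc2)
print(f"similarity = {similarity:.2f}")
```

A recommendation system could rank candidate documents by such a score against the documents a user has already read.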
Information Retrieval Issues
• Semantic content of documents
• Unstructured versus structured content
• Multi-modal content (image, text, numeric)
• Reliability of sources
• Quality of sources
• Indexing for efficient & effective access
• Similarity metrics (e.g., how do you do a
Groupby or a Roll-up ?)
• Privacy, Copyright, Intellectual Property
IR and Image Mining
• Image Mining is a form of IR and data mining
• Techniques:
– Wavelet analysis and summarization
– Pixel value (color) histograms and vectorization
– Scene pattern recognition and indexing
– Event/anomaly detection and cataloguing
(e.g., forest fires seen in satellite photos)
– Edge detection (unsharp masking) and graphs
• The data to be mined are the information databases
extracted from the images (not the raw image data
themselves)
Data Mining as “Rule Induction”
From Lecture #1
Decision Tree Classification:
based on rules at each node of the tree
Should I play
tennis today?
Intelligent actions (decision support) are
often represented by a set of rules…
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
(example of Decision Tree rules)
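The five rules on this slide can be stored as data and applied by a small matcher. This is a minimal sketch: the names RULES and classify are hypothetical, and the age band “31…40” is written with three ASCII dots as "31...40".

```python
# Each rule pairs a set of attribute conditions with the predicted class label.
RULES = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31...40"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "yes"),
    ({"age": ">40", "credit_rating": "fair"}, "no"),
]


def classify(record):
    """Return buys_computer for the first rule whose conditions all match, else None."""
    for conditions, label in RULES:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule covers this record


young_student = classify({"age": "<=30", "student": "yes"})
older_fair = classify({"age": ">40", "credit_rating": "fair"})
print(young_student, older_fair)
```

Because the conditions of these rules are mutually exclusive, the order of the list does not change the answer.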
Rule-Based Algorithms (RBA)
• RBA = Decision Support via “if-then rules”
• Can generate the rules from a Decision Tree (DT).
• But, rules do not need to be derived from a DT.
• Rules have no order, unlike Decision Trees.
• Trees are built by examining all cases; whereas
rules are generated one case at a time.
• Rule Induction is the method for deriving rules.
• Case-Based Reasoning (CBR) is a related
application of rule-based algorithms.
Sometimes the rules are fuzzy…
(example of Fuzzy Rule Induction)
Fuzzy Sets and Logic
Fuzzy Sets and Logic
• Data mining does not always yield absolute answers, but
statistical answers that indicate the probability (frequency
of occurrence) of patterns or classes, or the likelihood that
an object in the database belongs to a given class.
• In predictive data mining, the result is fuzzy (e.g.,
predicting loan default through bank account analysis
does not guarantee that the customer will indeed default
on their loan).
• Fuzzy Logic is a method for handling uncertainty in
data, in decision-making, and in control systems.
Sets and Logic - Classical (Boolean)
Sets and Logic - Fuzzy
Classical versus Fuzzy
Fuzzy Logic, Control Systems, and Data Mining
• Suppose you have a R/T (real-time) data monitoring
(data mining) control system attached to machinery in a
large manufacturing plant.
• Temperature sensor on a machine says that it is running
very hot (... what is “hot”? -- that’s fuzzy).
• Motion sensor within machine says that it is running at
high RPM, very fast (… what is “fast”? -- that’s fuzzy).
• The machine is not technically over-heating, which you
know because of past experience and common sense.
• Control System responds to data and knowledge-base by
invoking a rule to slow down the motor speed a little bit.
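The machinery example can be sketched with two fuzzy membership functions and one rule. All thresholds below (60 to 90 degrees C for “hot”, 2000 to 4000 RPM for “fast”, a maximum reduction of 500 RPM) are hypothetical values chosen for illustration, and taking the minimum as the fuzzy AND is one common convention, not the only one.

```python
def ramp(x, low, high):
    """Degree of membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def hot(temp_c):
    """How 'hot' the machine is, from 0.0 (not at all) to 1.0 (fully hot)."""
    return ramp(temp_c, 60.0, 90.0)


def fast(rpm):
    """How 'fast' the machine is running, from 0.0 to 1.0."""
    return ramp(rpm, 2000.0, 4000.0)


def speed_reduction(temp_c, rpm, max_reduction_rpm=500.0):
    """Rule: IF hot AND fast THEN slow down, scaled by how strongly the rule fires."""
    strength = min(hot(temp_c), fast(rpm))  # fuzzy AND as the minimum
    return strength * max_reduction_rpm


reduction = speed_reduction(75.0, 3500.0)
print(f"hot={hot(75.0):.2f} fast={fast(3500.0):.2f} reduce by {reduction:.0f} RPM")
```

A machine that is only partly “hot” gets a partial slowdown, which is the “a little bit” response described above, instead of the all-or-nothing reaction of a crisp threshold.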
Application of Fuzzy Logic to Data Mining - 1
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Direct Mailing System
• The problem is to identify customers from a customer database who can be
targeted for a sale, under the assumption that these customers responded
positively to advertisements mailed to them. The additional constraint is that
the mailing list budget is limited, so the number of advertisements mailed
must be controlled to increase profit. The first step involves analyzing the
database for attributes like "frequency of visits to the store", "sum of
purchases", etc. Analysis and plots of the data then determine the cluster of
good customers. Next, one has to find the attribute relationships to define a
query condition, which is represented by a pair of attributes and a fuzzy
linguistic value. One then verifies and refines the query condition by using
another customer database. The customer database is then ranked and sorted
by degree values based on the given fuzzy query condition. The customers
retrieved by the query form the list of potential good customers.
Application of Fuzzy Logic to Data Mining - 2
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Vibration Sensor
• A product which was used to sense vibrations and predict the causes of
these vibrations (i.e., earthquakes, etc.) was improved by utilizing fuzzy
rules. The original sensor was based on simple threshold rule. The error rate
for this sensor was around 12%. The fuzzy rules were created by analyzing
the actual data in specified cases of earthquakes, automobiles etc. A feature
extraction was done on the data set to identify each kind of cause.
Relationships between the feature parameters and the kind of vibration were
discovered to develop the fuzzy rules. These rules were then tested and
refined. The accuracy of the sensor’s prediction improved dramatically, with
the error rate falling to within 1%.
Non-Fuzzy Logic System
Adaptive Fuzzy Logic System
This example is related
to air conditioner settings
in a warm room, but the
adaptive fuzzy logic system
may be applied to activate
other “thinking machines”.
Machine Learning – a tool for
Data Mining and Intelligent
Decision Support
Machine Learning
• What is Machine Learning? -- “ML is the application of
computer algorithms that improve automatically
through experience.”
• Why is ML applicable to Data Mining?
– Refer to earlier slide “Reasons for the growing popularity of
data mining”:
• Growing Data Volume -- ML enables the intelligent analysis of
overwhelmingly large data/knowledge repositories
• Limitations of Human Analysis -- ML enables automated searches for
complex multifactor dependencies in data
• Low Cost of Machine Learning -- machines and software are cheaper
than people; the ML process is repeatable, consistent, and robust in
handling very large data analysis tasks; adaptive ML algorithms can
scale with the problem.
Machine Learning and Data Mining
• ML Techniques for DM (to be covered later):
– Decision Trees
– Rule Mining and Rule Learning
– Case-Based Reasoning (CBR)
– Neural Nets (NN)
– Supervised and Unsupervised Learning
– Support Vector Machines (SVM)
– Bayesian Networks
– Genetic Algorithms (GA)
Neural Nets
• “Neural networks are the second best way of
doing just about anything.” (John Denker)
[Figure: Data, Neural Network, Fuzzy Rules]
• The best way “is to apply all available domain
knowledge and spend a considerable amount
of time, money and effort in building a rule
system that will give the right answer. The
second best way of doing anything is to learn
from experience.” (Burbidge & Buxton)
Supervised vs. Unsupervised Learning
• In Supervised Learning algorithms, a training
set is provided (data with correct answers),
which is used to mine for known patterns.
• In Unsupervised Learning algorithms, data are
provided with no a priori knowledge of the
hidden patterns (knowledge) that they contain.
The goal is to discover (learn) these patterns.
• A class known as Semi-Supervised Learning
also exists, where knowledge is known and
applied from one data collection in order to
mine, analyze, classify, and interpret a related
data collection.
Machine Learning, Data Mining, and
Support Vector Machines (SVM)
• SVM is the tool of choice for the application of
ML to the data mining classification problem.
• So what are they? … “a statistical learning
system for predictive data mining -- for
estimating regression functions.”
• Loads of information available here:
http://www.cs.rpi.edu/~bij2/svm.html
http://www.kernel-machines.org/tutorial.html
SVM Process Overview
[Figure: SVM process flow, with elements labeled Initial Classification, Data, SVM Training, SVM Weights, Classification, Elements In Classification, and Elements Out of Classification]
SVM Classification
• SVM attempts to find an optimal separating
hyperplane between members of the two
initial classifications.
[Figure: a separating hyperplane between Class “A” and Class “B”]
SVM Class Separation Problem
• An optimal hyperplane partitions the initial
classification correctly and maximizes distance
from the plane to elements on either ‘side’:
positive and negative examples.
• When the training examples (initial classification)
consist of very diverse expression patterns, then
finding an optimal hyperplane can be impossible.
SVM Kernel Construction
The expression data can be transformed to a higher
dimensional space (feature space) by applying a
kernel function. This transformation can have the
effect of allowing a separating hyperplane to be
found.
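A toy case shows why the transformation helps. In the sketch below, no single threshold on the raw one-dimensional values separates the two classes, but after mapping each value x to the pair (x, x*x) a straight line does. The data, the explicit feature map phi (standing in for a kernel function, which would compute inner products in feature space without building the map), and the hand-picked hyperplane are all hypothetical; nothing here is trained.

```python
# Hypothetical 1-D data: class A lies on both sides of class B.
class_a = [-3.0, -2.0, 2.0, 3.0]
class_b = [-1.0, 0.0, 1.0]


def separable_by_threshold(first, second):
    """True if a single threshold on the raw value splits the two classes."""
    return max(first) < min(second) or max(second) < min(first)


def phi(x):
    """Explicit feature map into a 2-D feature space: x -> (x, x squared)."""
    return (x, x * x)


def side(point, w=(0.0, 1.0), b=-2.5):
    """Which side of the hand-picked hyperplane w . point + b = 0 the point falls on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1


raw_separable = separable_by_threshold(class_a, class_b)
sides_a = [side(phi(x)) for x in class_a]
sides_b = [side(phi(x)) for x in class_b]
print(raw_separable, sides_a, sides_b)
```

In the feature space, every class A point has a squared value above 2.5 and every class B point has one below it, so the horizontal line at 2.5 separates them.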
Practical SVM Issues
• Results depend heavily on the input
parameters.
• Using a high degree kernel function risks
artificial separation of the data.
• An iterative approach to increasing the
kernel power is advisable.
SVM Results
• Two classes are produced:
– Positive Class: contains elements with expression
patterns similar to those in the positive examples in the
training set.
– Negative Class: contains all other members of the input
set.
• Each of these classes has elements that fall in two groups:
– Those initially in the class (true positives and true
negatives)
– Those recruited into the class (false positives and false
negatives)
Machine Learning Resources
• 1. Massive compilation of ML resources at :
http://home.earthlink.net/~dwaha/research/machine-learning.html
• 2. Excellent Reference Book: Tom Mitchell’s
“Machine Learning” (1997; McGraw-Hill) :
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
• 3. Machine Learning & Data Mining Resources :
http://www.mlnet.org/
My favorite ML site …
Click on Software
… a site dedicated to “machine learning,
knowledge discovery, case-based reasoning,
knowledge acquisition, and data mining.”
Recap of ML and DM
• DM requires machine assistance in the search and analysis of very
large (often distributed, heterogeneous) databases
• Intelligent analysis of complex multi-dimensional multiple-dependency data also demands machine assistance
• Algorithms for DM are most efficient when they are adaptable to
the type and content of the data (i.e., the system “learns”)
• Machines are less expensive than humans
• Machines are usually scalable as the problem size grows
• Actionable data (the end-goal of DM) depends in many cases on an
embedded ML algorithm to take appropriate action (in control
systems; decision-support systems; robotics; autonomous systems)
• ML and DM are historically, technically, and functionally
intertwined (e.g., some data mining research groups call themselves
Machine Learning Groups)
Steps in the Data Mining Process
Steps in the Data Mining Process
http://www.cs.sfu.ca/~han/DM_Book.html
• Learning the application domain:
– relevant prior knowledge and goals of DM application
• Creating a target data set: Data selection
• Data cleaning and preprocessing: (may take 40-60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing data mining functions
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining & KDD: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Using the discovered knowledge = Actionable Data!
Steps in the Data Mining Process - Pictorial View
Cleaning the “Dirty Data”
• Excellent reference: Dorian Pyle’s book “Data Preparation
for Data Mining” (1999, Morgan Kaufmann; 540pp)
• Frequent problem: missing (NULL) values
• Empty value ≠ Missing value (must treat each case
differently)
• Various options for NULLs (may introduce bias):
– use “fill value” (e.g., -999)
– use estimated value (prediction from data model)
– use interpolated value (from surrounding entries)
– ignore any records with nulls
• November 2003 Workshop on Data Cleaning:
http://dimacs.rutgers.edu/Workshops/DataCleaning/
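The four NULL-handling options listed above can be sketched in a few lines. This is an illustrative sketch only: the function names and sample readings are hypothetical, `None` stands in for NULL, and the "data model" used for the estimated value is simply the mean of the known values.

```python
from statistics import mean

FILL = -999  # sentinel "fill value", as in the example on the slide

def use_fill_value(values, fill=FILL):
    """Replace each None with a sentinel; later code must know to skip it."""
    return [fill if v is None else v for v in values]

def use_estimated_value(values):
    """Replace each None with an estimate (assumption: the mean of known values)."""
    estimate = mean(v for v in values if v is not None)
    return [estimate if v is None else v for v in values]

def use_interpolated_value(values):
    """Linearly interpolate Nones that lie between two known entries."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out  # leading or trailing Nones are left unchanged

def ignore_nulls(records):
    """Drop every record (tuple) that contains a None."""
    return [r for r in records if None not in r]

readings = [10.0, None, 14.0, None, None, 20.0]  # hypothetical sensor values
filled = use_fill_value(readings)
estimated = use_estimated_value(readings)
interpolated = use_interpolated_value(readings)
```

Each option can bias later statistics in its own way. For instance, averaging `filled` without first excluding the -999 sentinel gives a badly distorted mean, while dropping records shrinks the sample.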
Data Preprocessing (Laundering the Data)
(may take 40-80% of the total data mining project effort!)
(Reference: “Data Scrubbing” article in Computerworld 2003)
“Data Scrubbing by the Numbers”
(http://www.computerworld.com/printthis/2003/0,4814,78260,00.html)
Here are some of the findings:
• Data cleansing accounts for up to 70% of the cost and effort of
implementing most data warehouse projects, according to analysts.
• In 2001, The Data Warehousing Institute estimated that dirty data costs
U.S. businesses $600 billion per year.
• Data cleanliness and quality was the No. 2 problem -- right behind
budget cuts -- cited in a 2003 IDC survey of 1,648 companies
implementing business analytics software enterprise-wide.
• Only 23% of 130 companies surveyed by Cutter Consortium on their
data warehousing and business-intelligence practices use specialized
data cleansing tools.
• Of those companies in the Cutter Consortium study using specialized
data scrubbing software, 31% are using tools that were built in-house.
Major Issues in Data Mining
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling of noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Handling very large data volumes (the “data flood”)
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, and incremental mining methods
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
– Protection of data security, integrity, and privacy
• Dirty data (60% of the effort, or more)
– Preparing the data for mining (transformation, cleaning, processing)
Case Study - The Mars Rover
http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html
Data Mining in Action
• Data Mining facilitates
Intelligent Data
Understanding
• Data Mining enables
Decision Support and
Active Control Systems
What is Intelligent Data Understanding?
• IDU refers to the application of techniques for
transforming data into understanding.
… (sound familiar?)
Data → Information → Knowledge → Understanding / Wisdom!
• Web reference: http://is.arc.nasa.gov/IDU/index.html
• IDU specifically refers to automating the following
techniques for machine-assisted data analysis:
– Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html)
– Knowledge Discovery
– Machine Learning
Intelligent Data System Applications (1)
• Rove around the surface of Mars and take samples of
rocks (mass spectroscopy = a data histogram)
• Supervised Learning (search for rocks with known
compositions)
• Unsupervised Learning (discover what types of rocks
are present, without preconceived biases)
• Association Mining (find unusual associations)
• Clustering (find the set of unique classes of rocks)
• Classification (assign rocks to known classes)
• Deviation/Outlier Detection (one-of-a-kind; interesting?)
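Two of the tasks above, classification against known rock classes and outlier detection, can be illustrated with a toy nearest-centroid sketch. Everything here is an assumption made for illustration: the three-bin "spectra", the class names, and the distance threshold are invented and are not taken from any rover system. Clustering and association mining are not shown.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(spectra):
    """Bin-by-bin average of a list of histograms."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

# Hypothetical labeled training spectra (3-bin histograms)
training = {
    "basalt": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "hematite": [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],
}
centroids = {label: centroid(spectra) for label, spectra in training.items()}

def classify(spectrum, max_distance=0.3):
    """Return (nearest known class, is_outlier).

    A sample is flagged as an outlier when it lies farther than
    max_distance from even its nearest class centroid.
    """
    label = min(centroids, key=lambda k: distance(spectrum, centroids[k]))
    return label, distance(spectrum, centroids[label]) > max_distance

typical = classify([0.68, 0.22, 0.10])  # close to the basalt examples
unusual = classify([0.33, 0.34, 0.33])  # flat spectrum, unlike either class
```

The supervised part is the lookup against labeled centroids; the outlier flag marks samples that fit no known class well, which is where a one-of-a-kind rock would show up.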
Intelligent Data System Applications (2)
• On-board Intelligent Data Understanding & Decision
Support Systems (Fuzzy Logic & Decision Trees &
Case-Based Reasoning) – Science Goal Monitoring:
– “stay here and do more”; or else “move on to another rock”
– “send results to Earth immediately”; or “send results later”
• Learn as it goes (Machine Learning & Neural Nets)
• Relate the results to other factors, such as dust storms
(XML & Information Retrieval & Information Fusion
with other data from orbiting satellite “mother ship”)
• Predict where to go in order to find interesting rocks
(Logistic Regression & Case-Based Reasoning)
Mars Rover as an
Adaptive Fuzzy Logic System
• Decisions are based on data mined, prior
experience, new knowledge, and fuzzy logic
• Rover acts autonomously, without human
intervention, in Deep Space environment
• Actions are driven by mining actionable
data from all sensors
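A fuzzy-logic decision of the "stay here and do more" versus "move on to another rock" kind might look roughly like the sketch below. It is a hedged illustration only: the inputs (`novelty`, `battery`), the membership breakpoints, and the two rules are hypothetical, and the adaptive part (updating the rules from experience) is not shown.

```python
def ramp(x, low, high):
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def decide(novelty, battery):
    """Choose an action from two sensor-derived scores, each in [0, 1]."""
    interesting = ramp(novelty, 0.2, 0.8)  # degree to which the rock is interesting
    power_ok = ramp(battery, 0.1, 0.5)     # degree to which power is sufficient
    # Rule 1: stay IF interesting AND power is OK (fuzzy AND = min)
    stay = min(interesting, power_ok)
    # Rule 2: move on IF NOT interesting OR NOT power OK (OR = max, NOT = 1 - x)
    move = max(1 - interesting, 1 - power_ok)
    return "stay" if stay > move else "move on"

print(decide(0.9, 0.9))  # interesting rock, plenty of power
print(decide(0.9, 0.2))  # interesting rock, but power is marginal
```

Unlike a hard threshold, each input contributes a degree of truth between 0 and 1, so a marginal battery reading can outweigh a very interesting rock.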
Summary
Summary of Topics Covered
• Summary of “What is Data Mining?” Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as “Rule Induction”
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover