Another Look at Data Mining
Why do we mine?
What do we mine?
How do we mine?
What is Data Mining
- Data mining discovers meaningful new correlations, hidden patterns and relationships in your data
- Conceptual descendant of statistics
- Combines machine learning, statistics, and databases
- Knowledge discovery: the process of building and implementing a data mining solution

CS753
Dr. Mary Ann Robbert
Data Mining Overview


- Knowledge Discovery in Databases (KDD)
- No one data mining approach
  - Each tool viewed logically as an application of a client
  - Can reside on a separate machine or in a separate process and access the data warehouse
  - RDBMS or proprietary OLAP embed data mining capabilities deeply within engines to improve efficiency and add extensions
- Requires a good foundation in terms of a data warehouse
Data Mining Overview (cont'd)

- Common algorithmic approaches
  - association, affinity grouping
  - predicting, sequence-based analysis
  - clustering
  - classification
  - estimation
- Steps are: data selection, data transformation, data mining, result interpretation.
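
As a rough illustration of association (affinity grouping), the sketch below counts how often pairs of items occur together in a handful of baskets and reports support and confidence for each pair. The basket contents and the name `pair_rules` are hypothetical, invented for this example; real tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets):
    """Return {(a, b): (support, confidence of a -> b)} for every item pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    n = len(baskets)
    rules = {}
    for (a, b), together in pair_counts.items():
        # support: share of baskets holding both items
        # confidence: share of baskets with the left item that also hold the right
        rules[(a, b)] = (together / n, together / item_counts[a])
        rules[(b, a)] = (together / n, together / item_counts[b])
    return rules


# Hypothetical baskets, invented for illustration only.
BASKETS = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
RULES = pair_rules(BASKETS)
```

In this toy data every basket containing butter also contains bread, so the rule butter -> bread has confidence 1.0 with support 0.5.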
Strategic Benefit of Data Mining
- Direct Marketing
- Trend Analysis
- Fraud detection
- Forecasting in Financial Markets

Why Data Mining Now?

- Economics
  - Unprecedented affordability of MIPS and MB
- Parallel computing
  - Enormous amounts of data can be processed
- Popularity of data warehouses, data marts
  - Relatively clean data available
Data Mining compared to Traditional Analysis

- Traditional Analysis
  - Did sales of product X increase in Nov.?
  - Do sales of product X decrease when there is a promotion on product Y?
- Data mining is result oriented
  - What are the factors that determine sales of product X?
Data Mining compared to Traditional Analysis (cont'd)

- Traditional analysis is incremental
  - Does billing level affect turnover?
  - Does location affect turnover?
  - Analyst builds model step by step
- Data mining is result oriented
  - Identify the factors and predict turnover
Steps in Data Mining

- Data Manipulation - can be 70-80% of data mining effort
  - data cleaning
  - missing values
  - data derivation
  - merging data
- Defining a study
  - Supervised - articulating goal, choosing dependent variable or output and specifying data fields
  - Unsupervised - group similar types of data or identify exceptions
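
The four data manipulation tasks listed above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the customer and sales records, the field names, the choice of mean imputation for missing ages, and the age cut-off used for the derived field.

```python
from statistics import mean

# Hypothetical customer and sales records, invented for illustration.
CUSTOMERS = [
    {"id": 1, "name": " Ann ", "age": 34},
    {"id": 2, "name": "BOB", "age": None},
    {"id": 3, "name": "cy", "age": 50},
]
SALES = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 3, "amount": 10.0},
]


def prepare(customers, sales):
    """Clean, fill, derive and merge into one row per customer."""
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    fill_age = mean(known_ages)  # missing values: impute with the mean (one option of many)

    totals = {}
    for sale in sales:
        key = sale["customer_id"]
        totals[key] = totals.get(key, 0.0) + sale["amount"]

    rows = []
    for c in customers:
        age = c["age"] if c["age"] is not None else fill_age
        rows.append({
            "id": c["id"],
            "name": c["name"].strip().title(),         # data cleaning
            "age": age,
            "age_was_missing": c["age"] is None,       # keep track of imputed values
            "total_spent": totals.get(c["id"], 0.0),   # merging data on the customer id
            "is_senior": age >= 45,                    # data derivation (arbitrary cut-off)
        })
    return rows


ROWS = prepare(CUSTOMERS, SALES)
```

Keeping an `age_was_missing` flag next to the imputed value is one way to let a later model tell real values from filled ones.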
Steps in Data Mining (cont'd)

- Reading the data and building the model
  - Model summarizes large amounts of data by accumulating indicators (frequencies, weight, conjunctions, differentiation)
- Understanding the model
  - Know the particular model
- Prediction
  - Choose the best outcome based on historical data
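
One very simple reading of "accumulating indicators" is a frequency table: for each field value, count how often each outcome occurred, then predict by picking the outcome with the most accumulated counts. The sketch below assumes that reading; the turnover history, field names and helper names are hypothetical and this is not a specific algorithm from the course.

```python
from collections import Counter, defaultdict


def build_model(records, target):
    """Accumulate, for each (field, value) pair, how often each outcome occurred."""
    model = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            if field != target:
                model[(field, value)][rec[target]] += 1
    return model


def predict(model, record):
    """Add up the outcome frequencies that match the record; return the most frequent."""
    votes = Counter()
    for field, value in record.items():
        votes.update(model.get((field, value), {}))
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical historical data, invented for illustration.
HISTORY = [
    {"plan": "basic", "region": "north", "churned": "yes"},
    {"plan": "basic", "region": "south", "churned": "yes"},
    {"plan": "premium", "region": "north", "churned": "no"},
    {"plan": "premium", "region": "south", "churned": "no"},
    {"plan": "basic", "region": "north", "churned": "no"},
]
MODEL = build_model(HISTORY, "churned")
```

The model is easy to understand because it is only a set of counts: `MODEL[("plan", "basic")]` shows directly how often basic-plan customers left or stayed.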
Models
- Genetic Algorithms
- Neural Nets
- Agents
- Statistics
- Visualization

Genetic Algorithms

Artificial intelligence system that mimics the
evolutionary, survival-of-the-fittest processes to
generate increasingly better solutions to a problem.

Genetic algorithms produce several generations of
solutions, choosing the best of the current set for each
new generation.

Examples
- Generating human faces based on a few known features.
- Generating solutions to routing problems.
- Generating stock portfolios.
EVOLUTION IN GENETIC ALGORITHMS

- SELECTION - or survival of the fittest. The key is to give preference to better outcomes.
- CROSSOVER - combining portions of good outcomes in the hope of creating an even better outcome.
- MUTATION - randomly trying combinations and evaluating the success (or failure) of the outcome.
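
Selection, crossover and mutation can be seen working together in a minimal sketch. It assumes a toy problem (maximize the number of 1 bits in a string), tournament selection, single-point crossover and a small per-bit mutation rate; these choices and all parameter values are illustrative assumptions, not the method used in the course.

```python
import random


def fitness(bits):
    """Toy objective: the more 1 bits, the better the solution."""
    return sum(bits)


def select(population, rng, k=3):
    """SELECTION: pick k random candidates and keep the fittest (tournament)."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """CROSSOVER: join the head of one parent to the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate=0.02):
    """MUTATION: flip each bit with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)  # seeded so a run can be repeated
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Carry the current best forward unchanged so quality never drops.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(select(population, rng), select(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
        history.append(max(map(fitness, population)))
    return max(population, key=fitness), history


best, history = evolve()
```

`history` records the best fitness of each generation. Because the best solution is always copied into the next generation, the sequence never decreases, which mirrors the idea of "increasingly better solutions".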
Neural Nets



- Mathematical Model of the Way a Brain Functions
- Machine learning approach by which historical data can be examined for pattern recognition
- A neural network simulates the human ability to classify things based on the experience of seeing many examples.
- Pros - Numerical Data
- Cons - Opaque, Art or Science
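
Learning to classify from examples can be sketched with a single artificial neuron (a perceptron), the simplest building block of a neural net. This is a deliberately tiny illustration with a hypothetical four-row training set; real networks stack many such units and use more refined training rules.

```python
def classify(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, lr=1.0):
    """Nudge the weights toward the right answer after every mistake."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - classify(weights, bias, inputs)  # -1, 0 or +1
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Hypothetical training examples: output 1 only when both inputs are 1.
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
WEIGHTS, BIAS = train_perceptron(EXAMPLES)
```

The learned weights are plain numbers with no obvious meaning to a business user, which is a small-scale view of the "opaque" drawback noted above.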
Example
- Distinguishing different chemical compounds
- Detecting anomalies in human tissue that may signify disease
- Reading handwriting
- Detecting fraud in credit card use
Intelligent Agents


- Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy, and employ some knowledge or representation of the user’s goals and desires.
- Some common characteristics
  - ability to communicate, cooperate and coordinate with other agents
  - ability to act autonomously to achieve the collective goal of the system
Intelligent Agents (cont'd)

- Tasks
  - automating repetitive tasks
  - finding and filtering information
  - summarizing complex data
- Capability to learn and make recommendations
- Black box approach hides complexity and allows for design of a scalable system

Comparison

AI System           Problem Type                                Based On                   Starting Information
------------------  ------------------------------------------  -------------------------  -------------------------
Expert Systems      Diagnostic or prescriptive                  Strategies of experts      Expert’s know-how
Neural Networks     Identification, classification, prediction  The human brain            Acceptable patterns
Genetic Algorithms  Optimal solution                            Biological evolution       Set of possible solutions
Intelligent Agents  Specific and repetitive tasks               One or more AI techniques  Your preferences

Statistics
- SAS, SPSS
- Pros - Established technology
- Cons - Needs assumptions, nominal variable handling, management acceptance?
Visualization
- Data visualization refers to technologies that support visualization of information
- Includes: digital images, GIS, multidimensions, 3-D presentations, animations
- http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html

Data Mining is Not a Silver Bullet

It does not:
- Find answers to questions you don’t ask
- Eliminate the need for domain experience
- Remove the need for data analysis skills
Data Mining Software
- http://www.kdnuggets.com/software/
- http://www.attar.com/ (download)
- http://www.cs.bham.ac.uk/~anp/software.html (software listing)

Six Rules of Data Quality
by Ken Orr
1. Data that is not used cannot be correct for very long.
2. Data quality in an information system is a function of its use, not its collection.
3. Data quality will ultimately be no better than its most stringent use.
4. Data quality problems tend to become worse with the age of the system.
5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change.
6. Information overload affects data quality.
Data Quality Software

http://www.rulequest.com/gritbot-info.html
General DW Data transformation
- Resolve inconsistent legacy formats
- Strip out unwanted fields
- Interpret codes into text
- Combine data from multiple sources under a common key
- Find fields used for multiple purposes and interpret each field’s value based on context
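
Several of these warehouse transformations fit in a short sketch: resolving two legacy date formats, dropping an unwanted field, turning a status code into text, and combining two sources under a common key. The source layouts, the `cust_no` key, the code table and the date formats are all hypothetical, chosen only to make the example concrete.

```python
from datetime import datetime

# Hypothetical code table and legacy date formats, invented for illustration.
STATUS_CODES = {"A": "active", "C": "closed"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Resolve inconsistent legacy formats into one ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format: leave it for manual review


def transform(billing_rows, crm_rows):
    """Build one clean row per billing record, enriched from the CRM source."""
    crm_by_key = {row["cust_no"]: row for row in crm_rows}
    out = []
    for row in billing_rows:
        crm = crm_by_key.get(row["cust_no"], {})
        out.append({                                   # "filler" is not copied: unwanted field
            "cust_no": row["cust_no"],                 # common key across both sources
            "opened": parse_date(row["opened"]),
            "status": STATUS_CODES.get(row["status"], "unknown"),  # code -> text
            "city": crm.get("city"),
        })
    return out


BILLING = [
    {"cust_no": 7, "opened": "03/15/1999", "status": "A", "filler": "xx"},
    {"cust_no": 8, "opened": "1998-11-02", "status": "C", "filler": ""},
]
CRM = [{"cust_no": 7, "city": "Boston"}]
CLEAN = transform(BILLING, CRM)
```

Customer 8 has no CRM record, so its `city` stays `None` rather than being guessed.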

Data transformation for Data Mining
- Flag normal, abnormal, out of bounds or impossible facts
- Recognize random or noise values from context and mask out
- Apply uniform treatment to NULL values
- Flag fact records with changed status
- Classify individual record by one of its aggregates
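
Three of these mining-specific transformations are sketched below: flagging each fact, treating every NULL marker the same way, and classifying a record by an aggregate of its customer. The NULL markers, the bounds, the spending threshold and the record layout are assumptions made for the example.

```python
# Hypothetical set of values that different source systems use to mean "no value".
NULL_MARKERS = {None, "", "N/A", "NULL"}


def flag_amount(amount, high=1000.0):
    """Label a fact as missing, impossible, out of bounds or normal."""
    if amount is None:
        return "missing"
    if amount < 0:
        return "impossible"      # a sale cannot be negative in this toy example
    if amount > high:
        return "out_of_bounds"
    return "normal"


def prepare_for_mining(records, big_spender=500.0):
    rows = []
    totals = {}
    for rec in records:
        raw = rec["amount"]
        amount = None if raw in NULL_MARKERS else float(raw)  # uniform NULL treatment
        flag = flag_amount(amount)
        if flag == "normal":     # only trusted facts feed the aggregate
            totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + amount
        rows.append({"customer": rec["customer"], "amount": amount, "flag": flag})
    for row in rows:
        # Classify each individual record by its customer's aggregate spending.
        total = totals.get(row["customer"], 0.0)
        row["segment"] = "big_spender" if total >= big_spender else "regular"
    return rows


# Hypothetical fact records, invented for illustration.
RECORDS = [
    {"customer": "c1", "amount": "300"},
    {"customer": "c1", "amount": "250.5"},
    {"customer": "c1", "amount": "N/A"},
    {"customer": "c2", "amount": "-5"},
    {"customer": "c2", "amount": "99999"},
    {"customer": "c2", "amount": "40"},
]
PREPARED = prepare_for_mining(RECORDS)
```

Flagged rows are kept rather than deleted, so a later step can decide whether to mask them out or study them as exceptions.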

Conclusion

- For successful data mining:
  - data analysis and mining goals must be identified and formulated
  - appropriate data must be selected, cleaned and prepared for queries and business analysis
- http://www.rulequest.com/cubistexamples.html#BOSTON
- http://www.almaden.ibm.com/cs/quest/
