Introduction to Data Mining
Group Members:
Karim C. El-Khazen
Pascal Suria
Lin Gui
Philsou Lee
Xiaoting Niu
Definition
General Concept
Foundations
Evolution
Applications
Challenges
Algorithms
Classical
Next Generations
What is Data Mining?
Data mining is the non-trivial extraction of implicit,
previously unknown, and potentially useful information
from data stored in repositories, using pattern
recognition technologies as well as statistical and
mathematical methods.
Foundations
Massive data collection
Powerful multiprocessor computers
Data mining algorithms
Evolution
Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), SQL, ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: OLAP, multidimensional databases, data warehouses
Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups
Characteristics: Prospective, proactive information delivery
Applications
Industry
Retail
Health maintenance group
Telecommunications
Credit card
Web mining
Sports and entertainment solutions
Challenges
Ability to handle different types of data
Graceful degradation of data mining algorithms
Valuable data mining results
Representation of data mining requests and results
Mining at different abstraction levels
Mining information from different sources of data
Protection of privacy and data security
Hierarchy of Choices and Decisions
Business goal
Collecting, cleaning and preparing data
Prediction
Model type and algorithms
Data Description
Descriptions of data characteristics in
elementary and aggregated form
Summarization
Visualization
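As a minimal sketch of summarization, the snippet below describes a small dataset in elementary form (per-record statistics) and in aggregated form (totals per group). The sales figures, region names, and function names are made up for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (region, unit sales). The numbers are invented.
records = [
    ("Boston", 120), ("Boston", 150), ("Boston", 135),
    ("Hartford", 110), ("Hartford", 160), ("Hartford", 145),
]

def summarize(values):
    """Elementary description: basic statistics over the raw values."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def total_by_group(rows):
    """Aggregated description: total unit sales per region."""
    totals = defaultdict(int)
    for region, units in rows:
        totals[region] += units
    return dict(totals)

summary = summarize([units for _, units in records])
totals = total_by_group(records)
```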
Predictive Data Mining
Predictive modeling is a term used to describe the
process of mathematically or mentally representing a
phenomenon or occurrence with a series of equations or
relationships.
Prediction: Classification
Classification predicts class membership
Pre-classify (using classification
algorithms)
Test to determine the quality of the model
Predict (using effective classifier)
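The three steps above can be sketched with a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific algorithm. The applicant data, feature meanings, and function names are assumptions for illustration.

```python
def train_centroids(rows):
    """Step 1: learn from pre-classified rows of (features, label).

    Returns a dict mapping each label to the mean feature vector of its rows.
    """
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Step 3: assign the label whose centroid is closest to the new case."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def accuracy(centroids, rows):
    """Step 2: test model quality on held-out, pre-classified rows."""
    hits = sum(predict(centroids, f) == label for f, label in rows)
    return hits / len(rows)

# Invented data: (income, debt) in thousands, with a known class.
train = [((60, 5), "good"), ((75, 10), "good"), ((80, 8), "good"),
         ((20, 15), "bad"), ((25, 30), "bad"), ((30, 25), "bad")]
test = [((70, 6), "good"), ((22, 20), "bad")]

model = train_centroids(train)
test_accuracy = accuracy(model, test)
new_prediction = predict(model, (65, 12))
```

Keeping the test rows out of training is what makes the accuracy figure a fair estimate of model quality.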
Prediction: Regression
Regression takes a numerical dataset and develops a
mathematical formula that fits the data.
When you're ready to use the results to predict future
behavior, you simply take your new data, plug it into
the developed formula and you get a prediction!
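A minimal sketch of this idea is an ordinary least-squares line fit: develop the formula from past data, then plug a new value into it. The monthly sales numbers and names below are made up, and a straight line is only one of many possible formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data that lies exactly on y = 2x + 10.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)

# Prediction: plug the new month into the developed formula.
forecast_month_6 = slope * 6 + intercept
```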
Algorithms
Classical Techniques
Statistics
Neighborhoods
Clustering
Next Generations
Decision Tree
Neural Network
Rule Induction
Statistics
Classical Statistics:
Related to the collection and description of data
Assumption: the data follow an underlying distribution
pattern
Objective: find the best guess
Data Mining:
Employs statistical methods
Needs to analyze huge amounts of data
Beyond traditional statistics
Neighborhoods
Basic idea:
For a new problem, look for similar problems
(neighbors) that have already been solved
Key point: find the neighborhood
Calculate the distance: how close must a case be to
count as a neighbor?
Which class does the new problem belong to?
Large computational load:
New calculation for each new case
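The neighborhood idea corresponds to what is commonly called k-nearest neighbors. A minimal sketch, assuming Euclidean distance and a majority vote among the `k` closest solved cases (the data and the choice `k=3` are illustrative):

```python
import math
from collections import Counter

def knn_predict(solved_cases, new_case, k=3):
    """Classify new_case by majority vote among its k nearest solved cases.

    solved_cases is a list of (features, label) pairs. Every call measures
    the distance to every solved case, which is the computational load the
    slide refers to.
    """
    by_distance = sorted(
        solved_cases, key=lambda case: math.dist(case[0], new_case)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented, already-solved cases in a two-dimensional feature space.
solved = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

label_near_a = knn_predict(solved, (1.8, 1.5))
label_near_b = knn_predict(solved, (6.2, 6.1))
```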
Clustering
Elements are grouped together according to their
characteristics
Members of each cluster share similar values (homogeneous)
Problem: controlling the number of clusters
Hierarchical clustering: flexible, the number of clusters
can be chosen after the hierarchy is built
Non-hierarchical clustering: the number of clusters is given
by the user
Used most frequently for:
Consolidating data into a high-level view
Grouping records by likely behavior
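As one example of non-hierarchical clustering, the sketch below uses k-means, where the user fixes the number of clusters by supplying the initial centers. The slides do not name k-means specifically; the points, the starting centers, and the iteration count are assumptions.

```python
import math

def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, initial_centers, iterations=10):
    """Alternate between assigning points and moving centers to cluster means."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = assign(points, centers)
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old  # keep an empty cluster's center where it was
            for old, members in zip(centers, clusters)
        ]
    return centers, assign(points, centers)

# Invented points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

# Two initial centers means the user asks for two clusters.
centers, clusters = k_means(points, initial_centers=[(1, 1), (8, 8)])
```

The result depends on the starting centers, which is one reason the number and placement of clusters is listed as a problem.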
Decision Tree
A way of representing a series of rules that lead to a
class or value
Structure:
Decision node, branches, leaves
Example: A loan officer wants to determine the
creditworthiness of applicants
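The structure can be sketched as nested data: a decision node tests one attribute, each branch is one outcome of that test, and a leaf holds the class. The loan attributes, values, and outcomes below are hypothetical and hand-written, not induced from data.

```python
# A decision node is a dict with an attribute to test and one branch per
# attribute value. A leaf is simply a class label (a string).
loan_tree = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "has_collateral",
            "branches": {True: "approve", False: "reject"},
        },
    },
}

def classify(tree, applicant):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        value = applicant[tree["attribute"]]
        tree = tree["branches"][value]
    return tree

decision = classify(loan_tree, {"income": "low", "has_collateral": False})
```

Each root-to-leaf path reads as one rule, for example "if income is low and there is no collateral, reject".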
Decision Tree (continued)
Algorithms induce the tree and its rules from data in order to make predictions
Neural Networks
Efficiently model large and complex
problems with hundreds of predictor variables
Structure:
Input layer, hidden layer, output layer
Activation function between nodes
Requires training and testing of relations
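The layered structure can be sketched as a forward pass through one hidden layer, assuming a sigmoid activation function. The weights below are arbitrary placeholders; in practice they would be learned during training, which this sketch does not cover.

```python
import math

def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs plus a bias,
    then applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Placeholder weights: 2 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.1, -0.2]
output_w = [[1.2, -0.7]]
output_b = [0.05]

output = forward([1.0, 0.5], hidden_w, hidden_b, output_w, output_b)
```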
Rule Induction
A method to derive a set of rules to classify
cases
For example, rule induction can be used to
discover patterns relating to decisions (e.g.,
approving a credit card application)
Rules may not cover all possible situations
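One very simple rule-induction scheme, often called OneR, can illustrate the points above: for each attribute it builds one rule per value (predict the most common class) and keeps the attribute whose rules make the fewest errors. The slides do not name this method; the credit card data and attribute names are invented.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the best single-attribute rule set."""
    best = None
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        class_counts = defaultdict(Counter)
        for row in rows:
            class_counts[row[attr]][row[target]] += 1
        # One rule per value: predict the most common class for that value.
        rules = {
            value: counts.most_common(1)[0][0]
            for value, counts in class_counts.items()
        }
        errors = sum(row[target] != rules[row[attr]] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Invented credit card applications with known decisions.
applications = [
    {"income": "high", "owns_home": "yes", "decision": "approve"},
    {"income": "high", "owns_home": "no", "decision": "approve"},
    {"income": "low", "owns_home": "yes", "decision": "reject"},
    {"income": "low", "owns_home": "no", "decision": "reject"},
]

attribute, rules, errors = one_r(applications, ["income", "owns_home"], "decision")

# A value never seen in the data has no rule, so the lookup returns None.
uncovered = rules.get("medium")
```

The `None` result for an unseen value shows how an induced rule set can leave some situations uncovered.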