Download The Scope of Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Forecasting wikipedia, lookup

Data assimilation wikipedia, lookup

Transcript
DataMining
By
Guan Hang Su
CS157A section 2 fall 2005
Outline

Overview
---- Define Data Mining
---- Foundation of Data Mining
---- Scope of Data Mining
---- Techniques in data mining
----Applications
What is DataMining?
Discovering “hidden value” in your data
warehouse
Define Data Mining


The automated extraction of hidden
predictive information from (large)
databases
Three key words:




Automated
Hidden
Predictive
Implicit is a statistical methodology


Data mining lets you be proactive
Prospective rather than Retrospective
The Foundations of Data Mining

Data mining techniques are the result of a long
process of research and product development. This
evolution began when business data was first
stored on computers, continued with improvements
in data access, and more recently, generated
technologies that allow users to navigate through
their data in real time. Data mining takes this
evolutionary process beyond retrospective data
access and navigation to prospective and proactive
information delivery.
The Foundations of Data Mining (continue)
Data mining is ready for application in
the business community because it is
supported by three technologies that
are now sufficiently mature:



Massive data collection
Powerful multiprocessor computers
Data mining algorithms
The Scope of Data Mining
Data mining derives its name from the similarities
between searching for valuable business
information in a large database
Example — finding linked products in gigabytes of
store scanner data and mining a mountain for a
vein of valuable ore.
Both processes require either sifting through an
immense amount of material, or intelligently probing
it to find exactly where the value resides.
The Scope of Data Mining (cont..)
Given databases of sufficient size and
quality, data mining technology can
generate new business opportunities by
providing these capabilities:


Automated prediction of trends and behaviors
Automated discovery of previously unknown
patterns.
The Scope of Data Mining (cont..)

Automated prediction of trends and behaviors
--- Data mining automates the process of finding predictive
information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered
directly from the data.
Typical example of a predictive problem:
1)targeted marketing.
2) forecasting bankruptcy
The Scope of Data Mining (cont..)

Automated discovery of previously unknown patterns
---- Data mining tools sweep through databases and identify
previously hidden patterns in one step.
Example of pattern discovery: The analysis of retail sales data
to identify seemingly unrelated products that are often
purchased together
Other pattern discovery problems include detecting fraudulent
credit card transactions and identifying anomalous data that
could represent data entry keying errors.
Techniques in data mining

The most commonly used techniques
in data mining:
Artificial neural networks
Decision trees
Genetic algorithms
Nearest neighbor method
Rule induction

Artificial neural networks: Non-linear predictive
models that learn through training and resemble
biological neural networks in structure.


Decision trees: Tree-shaped structures that
represent sets of decisions. These decisions
generate rules for the classification of a dataset.
Specific decision tree methods include
Classification and Regression Trees (CART) and
Chi Square Automatic Interaction Detection
(CHAID)


Genetic algorithms: Optimization
techniques that use processes such as
genetic combination, mutation, and natural
selection in a design based on the concepts
of evolution.
Nearest Neighbor. A data mining technique
that performs prediction by finding the
prediction value of records (near neighbors)
similar to the record to be predicted.

Rule induction: The extraction of useful
if-then rules from data based on statistical
significance

Other Techniques :
Bayesian networks
----- Naïve Bayes
Support vector machines
Many more…..

Decision Trees

Nearest Neighbor classification

Neural Networks

Rule Induction

K-means Clustering
Example of Neural Network

Input layer
Hidden layer
Output




Difficult
interpretation
Tends to ‘overfit’
the data
Extensive amount
of training time
A lot of data
preparation
Works with all data
types
Example of Rule of induction

Description

Produces decision trees:

income < $40K



income > $40K



job > 5 yrs then good risk
job < 5 yrs then bad risk
high debt then bad risk
low debt then good risk
Or Rule Sets:

Rule #1 for good risk:



if income > $40K
if low debt
Rule #2 for good risk:


if income < $40K
if job > 5 years
K-Nearest-Neighbor (kNN) Models


Use entire training database as the model
Find nearest data point and do the same thing as you did for that record
100
Age
0
Doses
1000
Very easy to implement. More difficult to use in production.
Disadvantage: Huge Models
Example of Decision Trees
How Data Mining Works

How exactly is data mining able to tell you
important things that you didn't know or
what is going to happen next? The
technique that is used to perform these feats
in data mining is called modeling.

Modeling is simply the act of building a
model in one situation where you know the
answer and then applying it to another
situation that you don't.

Computers are loaded up with lots of
information about a variety of situations
where an answer is known and then the
data mining software on the computer must
run through that data and distill the
characteristics of the data that should go
into the model

Once the model is built it can then be used
in similar situations where you don't know
the answer
Some results of Data Mining




Forecasting what may happen in the
future.
Classifying people or things into
groups by recognizing patterns.
Clustering people or things into groups
based on their attributes.
Sequencing what events are likely to
lead to later events
Example
For example, say that you are the director of
marketing for a telecommunications
company and you'd like to acquire some
new long distance phone customers.
1)randomly mail out the coupon to general
population.
2) or use your business experience stored in
your database to build a model , then
choose the right target.
Cont..

As the marketing director you have access
to a lot of information about all of your
customers: their age, sex, credit history and
long distance calling usage.

The problem is that you don't know the long
distance calling usage of these prospects
(since they are most likely now customers of
your competition).


We 'd like to
concentrate on
those prospects
who have large
amounts of long
distance usage .We
can accomplish this
by building a model
Cust
Pros
General information (e.g.
demographic data)
Known
Known
Proprietary information (e.g.
customer transactions)
Known
Target


For instance, a simple model for a
telecommunications company might be:
98% of my customers who make more than
$60,000/year spend more than $80/month
on long distance.
With this model in hand new customers can
be selectively targeted
Architecture for Data Mining

To best apply these advanced techniques, they
must be fully integrated with a data warehouse as
well as flexible interactive business analysis tools.

Many data mining tools currently operate outside of
the warehouse, requiring extra steps for extracting,
importing, and analyzing the data. Furthermore,
when new insights require operational
implementation, integration with the warehouse
simplifies the application of results from data
mining.

illustrates an architecture for advanced
analysis in a large data warehouse
Data Mining Applications

The US Drug Enforcement Agency needed to be
more effective in their drug “busts”.
Analyzed suspects’ cell phone usage to focus
investigations.

HSBC need to cross-sell more effectively
by identifying profiles that would be
interested in higher yielding investments.
Reduced direct mail costs by 30% while
garnering 95% of the campaign’s
revenue.
Bibliography

http://www.thearling.com/dmintro/dmintro_frame.ht
m

http://www.thearling.com/text/dwhite/dmwhite.htm

http://www.cs.sjsu.edu/faculty/lee/cs157/25SpL22D
ataMining.ppt

http://www.oracle.com/technology/products/bi/odm/i
ndex.html