Data Mining
UNIT-III (BIA)
What Is Data Mining?
• Data mining
– Extraction of interesting (previously unknown and potentially
useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence etc.
Data Mining
• Data Mining is the process of extracting information from the
company's various databases and re-organizing it for purposes
other than what the databases were originally intended for.
• It provides a means of extracting previously unknown,
predictive information from the base of accessible data in data
warehouses.
• The data mining process differs from one organization to another,
depending on the nature of the data and of the organization.
• Data mining tools use sophisticated, automated algorithms to
discover hidden patterns, correlations, and relationships among
organizational data.
• Data mining tools are used to predict future trends and
behaviors, allowing businesses to make proactive, knowledge
driven decisions.
• For example, in targeted marketing, data mining can use data on past
promotional mailings to identify the targets most likely to
maximize the return on the company’s investment in future
mailings.
Functions
• Classification: Infers the defining characteristics of a certain group
• Clustering: Identifies groups of items that share
a particular characteristic
• Association: Identifies relationships between
events that occur at the same time
• Sequencing: Similar to association, except that
the relationship exists over a period of time
• Forecasting: Estimates future values based on
patterns within large sets of data
Why Not Traditional Data Analysis?
• Tremendous amount of data
• High-dimensionality of data
• High complexity of data
• Different types of data:
– Time-series data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations, etc.
Characteristics
• Data mining tools are needed to extract the buried information.
• The “miner” is often an end user, empowered by “data drills” and
other power query tools to ask ad hoc questions and get answers
quickly, with little or no programming skill.
• The data mining environment usually has a client/server
architecture.
• Because of the large amounts of data, it is sometimes necessary to
use parallel processing for data mining.
• Data mining tools are easily combined with spreadsheets and
other end user software development tools, enabling the mined data
to be analyzed and processed quickly and easily.
• Data mining yields five types of information: associations,
sequences, classifications, clusters and forecasting.
Common data mining applications
APPLICATION | DESCRIPTION
Market segmentation | Identifies the common characteristics of customers who buy the same products from the company
Customer churn | Predicts which customers are likely to leave your company and go to a competitor
Fraud detection | Identifies which transactions are most likely to be fraudulent
Direct marketing | Identifies which prospects should be included in a mailing list to obtain the highest response rate
Market basket analysis | Identifies which products or services are commonly purchased together
Trend analysis | Reveals the difference between a typical customer this month versus last month
Science | Simulates nuclear explosions; visualizes quantum physics
Entertainment | Models customer flows in theme parks; analyzes safety of amusement park rides
Insurance and health care | Predicts which customers will buy new policies; identifies behavior patterns that increase insurance risk; spots fraudulent claims
Manufacturing | Optimizes product design, balancing manufacturability and safety; improves shop-floor scheduling and machine utilization
Medicine | Ranks successful therapies for different illnesses; predicts drug efficacy; discovers new drugs and treatments
Oil and gas | Analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery
Retailing | Discerns buying-behavior patterns; predicts how customers will respond to marketing campaigns
Three-Step Data Mining Process
• Stage 1: Exploration. This stage usually starts with data
preparation, which may involve cleaning data, transforming
data, and selecting subsets of records.
• Stage 2: Model building and validation. This stage involves
considering various models and choosing the best one based
on their predictive performance.
• Stage 3: Deployment. This final stage involves taking the
model selected as best in the previous stage and applying it to
new data in order to generate predictions or estimates of the
expected outcome.
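The three stages can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than part of the slides: the toy records, the `spend` feature, and the idea of using a simple spend threshold as the "model".

```python
# Hypothetical toy records: (monthly_spend, responded_to_mailing).
# None marks a missing value that has to be cleaned out.
raw = [(120, 1), (30, 0), (None, 1), (200, 1), (45, 0),
       (80, 1), (20, 0), (150, 1), (60, 0), (95, 1)]

# Stage 1: exploration / preparation -- drop records with missing values,
# then hold out the last three clean records for validation.
clean = [(x, y) for x, y in raw if x is not None]
train, valid = clean[:-3], clean[-3:]

def accuracy(threshold, rows):
    """Share of rows where 'spend >= threshold' matches the response label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)

# Stage 2: model building and validation -- candidate models are thresholds
# halfway between neighbouring spend values seen in the training data;
# the one with the best validation accuracy is kept.
spends = sorted(x for x, _ in train)
candidates = [(a + b) / 2 for a, b in zip(spends, spends[1:])]
best = max(candidates, key=lambda t: accuracy(t, valid))

# Stage 3: deployment -- apply the chosen model to new data.
def predict(spend):
    return int(spend >= best)

print(best, predict(70), predict(50))
```

Real projects would use far richer models and a proper validation scheme; the point is only the order of the stages.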
Some of the tools used for data mining are:
• Artificial neural networks - Non-linear predictive models that learn
through training and resemble biological neural networks in
structure.
• Decision trees - Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
dataset.
• Rule induction - The extraction of useful if-then rules from data
based on statistical significance.
• Genetic algorithms - Optimization techniques based on the
concepts of genetic combination, mutation, and natural selection.
• Nearest neighbor - A classification technique that classifies each
record based on the records most similar to it in an historical
database.
Reasons for the growing popularity of
Data Mining
• Growing Data Volume
• Limitations of Human Analysis
• Low Cost of Machine Learning
ADVANTAGES /APPLICATIONS OF
DATA MINING
• Marketing/Retailing: Data mining can aid direct
marketers by providing them with useful and
accurate trends about their customers’
purchasing behavior.
• Banking/Crediting: Data mining can assist
financial institutions in areas such as credit
reporting and loan information.
ADVANTAGES OF DATA MINING Cont…
• Law enforcement: Data mining can aid law
enforcers in identifying criminal suspects as well
as apprehending these criminals by examining
trends in location, crime type, habit, and other
patterns of behaviors.
• Researchers: Data mining can assist
researchers by speeding up their data analysis,
thus allowing them more time to work
on other projects.
DISADVANTAGES OF DATA MINING
• Although a data mining application is a very powerful
tool, it cannot stand alone. To be successful,
data mining needs a skilled user who will supply the
correct data and a specialist who can draw objective
conclusions from the output. If the
user supplies incorrect or too little
information, the output will be affected and the forecast will
not be credible.
• Furthermore, while data mining helps the user discover
patterns and relationships in data, it cannot promise
perfect results, cannot explain why an outcome
occurs, and cannot correct problems in your data.
DISADVANTAGES OF DATA MINING Cont…
• Privacy issues / Security issues: Although
companies have a lot of personal information about us
available online, they do not have sufficient security
systems in place to protect that information. For
example, American Express sold its customers’
credit card purchase data to another company.
• Misuse of information: Some companies
prioritize your phone call based on your purchase history. If
you have spent a lot of money or bought
a lot of products from a company, your call will be
answered sooner. So you should not assume that
your call is being answered in the order in which it
was received.
Data Mining Techniques
• Classification
• Clustering
• Regression
• Association Rules
Difference between Classification and
clustering
• Classification: supervised learning
• Clustering: unsupervised learning
Supervised Classification
• The input data, also called the training set, consists of multiple records, each
having multiple attributes or features.
• Each record is tagged with a class label.
• The objective of classification is to analyze the input data and to develop an
accurate description or model for each class using the features present in the
data.
• This model is used to classify test data for which the class descriptions are
not known. (1)
Supervised learning
• Suppose you have a basket filled with fresh fruit, and your
task is to arrange fruits of the same type in one place.
• Suppose the fruits are apples, bananas, cherries and grapes.
• From your previous work you already know the shape of
each fruit, so it is easy to arrange fruits of the same type
in one place.
• Here, your previous work plays the role of the training data in data mining.
• You have already learned from the training data because it includes
a response variable, which tells you that a fruit with certain
features is a grape, and likewise for every other fruit.
• This kind of information comes from the training data.
• This type of learning is called supervised learning.
• This type of problem falls under classification.
• Because you have already learned the patterns, you can do the job confidently.
Unsupervised learning:
• Suppose again that you have a basket filled with fresh
fruit and your task is to arrange fruits of the same type in one place.
• This time you know nothing about the fruits; you are
seeing them for the first time. How will you arrange
fruits of the same type?
• First, you pick up one fruit and select a
physical characteristic of that fruit, say
color.
• Then you arrange the fruits based on color, so
the groups will be something like this:
• RED COLOR GROUP: apples and cherries.
• GREEN COLOR GROUP: bananas and grapes.
Unsupervised learning:
• Now you take another physical characteristic, size, so
the groups will be something like this:
• RED COLOR AND BIG SIZE: apples.
• RED COLOR AND SMALL SIZE: cherries.
• GREEN COLOR AND BIG SIZE: bananas.
• GREEN COLOR AND SMALL SIZE: grapes.
• Job done.
• Here you did not learn anything beforehand: there is no training
data and no response variable.
• This type of learning is known as unsupervised learning.
• Clustering comes under unsupervised learning.
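The fruit example can be mimicked with a small Python sketch. It is only an illustration: the records and the `group_by` helper are hypothetical, and grouping on hand-picked features is a simplification of what real clustering algorithms do. The `name` field is used only to display the result, never to form the groups.

```python
from collections import defaultdict

# Hypothetical observations: measurable features only, no class label.
fruits = [
    {"name": "apple", "color": "red", "size": "big"},
    {"name": "cherry", "color": "red", "size": "small"},
    {"name": "banana", "color": "green", "size": "big"},
    {"name": "grape", "color": "green", "size": "small"},
]

def group_by(items, features):
    """Group items whose values agree on every listed feature."""
    groups = defaultdict(list)
    for item in items:
        key = tuple(item[f] for f in features)
        groups[key].append(item["name"])
    return dict(groups)

# First pass: color only. Second pass: color and size together.
by_color = group_by(fruits, ["color"])
by_color_and_size = group_by(fruits, ["color", "size"])

print(by_color)
print(by_color_and_size)
```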
Classification
• Classification: Given a set of items that have several classes, and
given the past instances (training instances) with their associated
class, Classification is the process of predicting the class of a new
item.
• The goal is therefore to classify the new item, that is, to identify
which class it belongs to.
• Example: A bank wants to classify its Home Loan Customers into
groups according to their response to bank advertisements. The
bank might use the classifications “Responds Rarely, Responds
Sometimes, Responds Frequently”.
• The bank will then attempt to find rules about the customers that
respond Frequently and Sometimes.
• The rules could be used to predict needs of potential customers.
Technique for Classification
• Decision-Tree Classifiers
[Figure: decision tree predicting the credit risk of a person with the jobs specified. Nodes shown: Job (Engineer, Carpenter, Doctor) and Income (thresholds <30K, <40K, <50K, >50K, >90K, >100K); leaves: Good, Bad.]
Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
The classifier is applied to the testing data and then to unseen data.

Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4). Tenured?
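As a rough sketch, the rule and the records from these two slides can be run in Python. The function names `classify` and `accuracy` are illustrative choices; the data and the rule come from the slides.

```python
# Records from the slides: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    """Slide rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

def accuracy(records):
    """Fraction of records whose predicted label equals the recorded label."""
    correct = sum(classify(rank, years) == label
                  for _, rank, years, label in records)
    return correct / len(records)

train_acc = accuracy(training)   # how well the rule fits the training data
test_acc = accuracy(testing)     # how well it generalizes to the testing data
jeff = classify("Professor", 4)  # prediction for the unseen record

print(train_acc, test_acc, jeff)
```

The rule fits every training record, but it misclassifies Merlisa (Associate Prof, 7 years, not tenured) in the testing data, which is why a model is checked on data it was not built from before it is used on unseen cases such as Jeff.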
Clustering
– “Clustering algorithms find groups of items that are similar.
… It divides a data set so that records with similar content
are in the same group, and groups are as different as
possible from each other.”
– Example: An insurance company could use clustering to
group clients by their age, location and types of insurance
purchased.
– The categories are unspecified, and this is referred to as
‘unsupervised learning’.
Clustering
• Group Data into Clusters
– Similar data is grouped in the same cluster
– Dissimilar data is grouped in different clusters
• How is this achieved ?
– K-Nearest Neighbor
• A classification method that classifies a point by calculating the
distances between the point and points in the training data set.
Then it assigns the point to the class that is most common among
its k-nearest neighbors (where k is an integer).
– Hierarchical
• Group data into trees of nested clusters
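The k-nearest-neighbor description above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the 2-D points and the labels "A" and "B" are made up.

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """Assign `point` the most common label among its k nearest training points.

    `training` is a list of (features, label) pairs.
    """
    by_distance = sorted(training, key=lambda rec: math.dist(point, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points belonging to two classes.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

label_near_a = knn_classify((1.8, 1.6), training)
label_near_b = knn_classify((6.2, 6.4), training)

print(label_near_a, label_near_b)
```

Note that the method needs labeled training points, so it is a supervised technique, as the bullet above says.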
What Is Good Clustering?
• A good clustering method will produce high quality
clusters in which:
• the intra-class (that is, intra-cluster) similarity is high.
• the inter-class similarity is low.
• The quality of a clustering result also depends on both
the similarity measure used by the method and its
implementation.
• The quality of a clustering method is also measured by
its ability to discover some or all of the hidden
patterns.
• However, objective evaluation is problematic: usually
done by human / expert inspection
Difference between Classification and
clustering
• CLASSIFICATION
• We have a training set containing data that have been previously
categorized
• Based on this training set, the algorithm finds the category that the new
data points belong to
• Since a training set exists, we describe this technique as supervised
learning
• CLUSTERING
• We do not know the characteristics of similarity of data in advance
• Using statistical concepts, we split the dataset into sub-datasets such that
each sub-dataset has “similar” data
• Since a training set is not used, we describe this technique as unsupervised
learning
Regression
• Regression is a data mining (machine learning)
technique used to fit an equation to a dataset.
The simplest form of regression, linear
regression, uses the formula of a straight line (y =
mx + b) and determines the appropriate values
for m and b to predict the value of y based upon
a given value of x.
• Advanced techniques, such as multiple
regression, allow the use of more than one input
variable and allow for the fitting of more
complex models, such as a quadratic equation.
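One common way to determine m and b is ordinary least squares. The slides do not say which fitting method is meant, so treat the choice below, the `fit_line` helper, and the toy data as assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
prediction = m * 6 + b   # predicted y for a new x = 6

print(m, b, prediction)
```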
Regression
• “Regression deals with the prediction of a value, rather than a
class.”
• Example: Find out if there is a relationship between smoking
among patients and cancer-related illness.
• Given values: X1, X2, ..., Xn
• Objective: predict variable Y
• One way is to estimate coefficients a0, a1, ..., an such that
– Y = a0 + a1X1 + a2X2 + … + anXn
– Linear Regression
Association Rules
• “An association algorithm creates rules that describe how often
events have occurred together.”
• Example: When a customer buys a hammer, then 90% of the
time they will buy nails.
Association Rules
• Support: “is a measure of what fraction of the population
satisfies both the antecedent and the consequent of the
rule”
• Example:
– People who buy hotdog buns also buy hotdog sausages in 99% of
cases. = High Support
– People who buy hotdog buns buy hangers in 0.005% of cases. =
Low support
• Situations where there is high support for the antecedent
are worth careful attention
– E.g. Hotdog sausages should be placed near hotdog buns in
supermarkets if there is also high confidence.
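Support and confidence can be computed directly from a list of transactions. The sketch below uses made-up baskets. Support follows the definition quoted above; confidence is taken in its usual sense, the fraction of transactions containing the antecedent that also contain the consequent, which the slides mention but do not define.

```python
# Hypothetical shopping baskets.
transactions = [
    {"buns", "sausages", "ketchup"},
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"milk", "bread"},
    {"buns", "hangers"},
]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, fraction also containing the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    return sum(consequent <= b for b in with_antecedent) / len(with_antecedent)

# Rule: buns -> sausages
sup = support({"buns"}, {"sausages"}, transactions)
conf = confidence({"buns"}, {"sausages"}, transactions)

print(sup, conf)
```

In this toy data, 3 of the 5 baskets contain both items (support) and 3 of the 4 baskets with buns also contain sausages (confidence).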
Data Mining Applications
Here is the list of areas where data mining is
widely used:
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
FINANCIAL DATA ANALYSIS
• Financial data in the banking and financial industry is
generally reliable and of high quality, which facilitates
systematic data analysis and data mining. Here are
a few typical cases:
• Design and construction of data warehouses for
multidimensional data analysis and data mining.
• Loan payment prediction and customer credit policy
analysis.
• Classification and clustering of customers for targeted
marketing.
• Detection of money laundering and other financial
crimes.
RETAIL INDUSTRY
• Data mining has great application in the retail industry because retailers collect
large amounts of data on sales, customer purchasing history, goods
transportation, consumption and services. The quantity of
data collected will naturally continue to expand rapidly because of the increasing ease,
availability and popularity of the web.
• Data mining in the retail industry helps in identifying customer buying
patterns and trends. That leads to improved quality of customer service
and better customer retention and satisfaction. Here is a list of examples
of data mining in the retail industry:
• Design and Construction of data warehouses based on benefits of data
mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
TELECOMMUNICATION INDUSTRY
• Today the telecommunication industry is one of the fastest-emerging
industries, providing various services such as fax, pager, cellular
phone, Internet messenger, images, e-mail, web data transmission,
etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding.
This is why data mining has become very important for
understanding and supporting the business.
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.
BIOLOGICAL DATA ANALYSIS
• Fields of biology such as genomics, proteomics, functional genomics
and biomedical research are growing rapidly. Biological data mining is a very
important part of bioinformatics. Data mining contributes to
biological data analysis in the following ways:
• Semantic integration of heterogeneous, distributed
genomic and proteomic databases.
• Alignment, indexing, similarity search and comparative
analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic
networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Queries?