Data Mining - University of Technology (الجامعة التكنولوجية)

An Introduction to Data Mining
By Rand Ali
Computer Engineering & Information Technology Department
What is Data Mining?
Extraction of interesting patterns or knowledge
from huge amounts of data.
Why Data Mining

The progress of computer hardware technology
has led to large supplies of powerful and
affordable computers, data collection
equipment and storage media.

The last decade has experienced a revolution
in information availability and exchange via
the Internet.
Why Data Mining

The fast-growing, tremendous amount of data,
collected and stored in large and numerous
data repositories, has far exceeded our
human ability to understand it without
powerful tools.

As a result, data collected in large data
repositories become “data tombs”—data
archives that are seldom visited.
We are data rich but information poor
Data Mining objective

Data mining tools perform data analysis
and may uncover important data patterns,
contributing greatly to business strategies
and scientific and medical research.

Data mining turns data tombs into “golden
nuggets” of knowledge.
Data mining—searching for knowledge (interesting
patterns) in your data.
Data Mining is a step of the Knowledge Discovery process
Knowledge discovery as a process is an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data).
2. Data integration (where multiple data sources may be
combined).
3. Data selection (where data relevant to the analysis task are
retrieved from the database).
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations, for instance).
5. Data mining (an essential process where intelligent methods
are applied in order to extract data patterns).
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness
measures).
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present the
mined knowledge to the user).
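
The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions: the two store sources, the purchase records, and every function name are hypothetical, and "mining" is reduced to counting item pairs that share a basket.

```python
from collections import Counter

# Hypothetical toy sources: (customer_id, item) purchase records.
store_a = [(1, "computer"), (1, "printer"), (2, "computer"), (2, None)]
store_b = [(3, "computer"), (3, "printer"), (4, "camera")]

def clean(records):
    """Step 1: drop records with a missing item (treated here as noise)."""
    return [r for r in records if r[1] is not None]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, items_of_interest):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r[1] in items_of_interest]

def transform(records):
    """Step 4: aggregate records into one basket (item set) per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how often each item pair occurs in the same basket."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs

def evaluate(pairs, min_count):
    """Step 6: keep only the pairs that meet a minimum count."""
    return {pair: n for pair, n in pairs.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns as readable text lines."""
    return [f"{a} + {b}: {n} baskets" for (a, b), n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
records = select(records, {"computer", "printer", "camera"})
patterns = evaluate(mine(transform(records)), min_count=2)
report = present(patterns)
```

In practice the process is iterative: a weak result at step 6 usually sends the analyst back to an earlier step with different selections or transformations.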
What makes a pattern interesting?
A pattern is interesting if it is:
1. easily understood by humans.
2. valid on new or test data with some degree of certainty.
3. potentially useful.
4. novel.
Origins of Data Mining
Primary Data Mining Tasks
In general, data mining tasks can be classified into
two categories: descriptive and predictive.
Predictive methods use some variables to
predict unknown or future values of other
variables.
Ex: Classification, Regression, Deviation
Detection.

Descriptive methods characterize the general
properties of the data in the database.
Ex: Association Rule Discovery, Clustering,
Sequential Pattern Discovery.

1-Association Rule Discovery

Given a set of records, each of which
contains some number of items from a
given collection.

Association Rule Discovery produces
dependency rules which predict the
occurrence of an item based on the
occurrences of other items.
2-Sequential Pattern Discovery

Sequential pattern mining is the discovery
of frequently occurring ordered events or
subsequences as patterns.
An example of a sequential pattern is
“Customers who buy a Canon digital
camera are likely to buy an HP color
printer within a month.”
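
As a rough illustration of the camera-then-printer pattern, the sketch below measures what fraction of camera buyers go on to buy a printer within 30 days. The purchase histories, item names, and function names are all hypothetical, and real sequential pattern miners search for such patterns rather than checking one given in advance.

```python
from datetime import date

# Hypothetical purchase histories: customer -> list of (date, item).
histories = {
    "c1": [(date(2024, 1, 5), "digital camera"), (date(2024, 1, 20), "color printer")],
    "c2": [(date(2024, 2, 1), "digital camera"), (date(2024, 4, 15), "color printer")],
    "c3": [(date(2024, 3, 3), "digital camera")],
    "c4": [(date(2024, 3, 9), "color printer")],
}

def follows_within(history, first, second, max_days):
    """True if `second` is bought after `first`, at most max_days later."""
    first_dates = [d for d, item in history if item == first]
    second_dates = [d for d, item in history if item == second]
    return any(0 < (s - f).days <= max_days
               for f in first_dates for s in second_dates)

def sequence_confidence(histories, first, second, max_days=30):
    """Fraction of buyers of `first` who later buy `second` in time."""
    buyers = [h for h in histories.values()
              if any(item == first for _, item in h)]
    if not buyers:
        return 0.0
    hits = sum(follows_within(h, first, second, max_days) for h in buyers)
    return hits / len(buyers)

camera_then_printer = sequence_confidence(
    histories, "digital camera", "color printer")
```

Here only customer c1 buys the printer inside the 30-day window, so one of the three camera buyers supports the pattern.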
3-Classification

Classification is the process of finding a model
(or function) that describes and distinguishes
data classes or concepts, for the purpose of
being able to use the model to predict the class
of objects whose class label is unknown.

The derived model is based on the analysis of a
set of training data (i.e., data objects whose class
label is known).
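
One very simple way to build such a model is a nearest-neighbor classifier, shown here only as a sketch. The slides do not prescribe a particular method; the training objects (age, income in thousands) and the "yes"/"no" labels are made up for illustration.

```python
import math

# Hypothetical training data: (age, income in thousands) -> known class label.
training = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def predict(training, point):
    """Return the label of the closest training object (1-nearest neighbor)."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], point))
    return nearest[1]

# Objects whose class label is unknown.
label_a = predict(training, (27, 31))
label_b = predict(training, (48, 85))
```

The "model" here is just the stored training set plus a distance measure; decision trees, rules, or neural networks are other common choices.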
Classification Example
4-Regression

Whereas classification predicts categorical
(discrete, unordered) labels, regression
analysis is used to predict missing or
unavailable numerical data values rather
than class labels.
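
A minimal sketch of regression is a least-squares straight line fitted to known numeric values and then used to estimate an unavailable one. The years-of-experience and salary figures below are invented for illustration, and a straight line is only one possible regression model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience -> salary in thousands.
years = [1, 2, 3, 4]
salary = [32.0, 34.0, 36.0, 38.0]

slope, intercept = fit_line(years, salary)
predicted_salary = slope * 5 + intercept  # estimate for 5 years
```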
5-Clustering

Clustering analyzes data objects without
consulting a known class label. In general, the
class labels are not present in the training data
simply because they are not known to begin with.
Clustering can be used to generate such labels.

Clusters of objects are formed so that objects
within a cluster have high similarity in
comparison to one another, but are very dissimilar
to objects in other clusters. Each cluster that is
formed can be viewed as a class of objects, from
which rules can be derived.
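
The idea of grouping similar objects without labels can be sketched with a tiny one-dimensional k-means loop. k-means is chosen here purely as an example (the slides name no algorithm), and the values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for numbers: assign to nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 10.0])
```

Each resulting cluster can then be treated as a generated class label for the objects it contains.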

6-Outlier Analysis
A database may contain data objects that do
not comply with the general behavior or
model of the data. These data objects are
outliers.
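
One common, simple way to flag such objects is a z-score test: a value is treated as an outlier if it lies more than a chosen number of standard deviations from the mean. The sketch below assumes that rule, a threshold of 2.0, and made-up purchase amounts; none of these come from the slides.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical purchase amounts; one is far from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = find_outliers(amounts)
```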
Application 1
Market basket analysis

Market basket analysis studies customer buying
habits by finding associations between the
different items that customers place in their
“shopping baskets”.

The discovery of such associations can help
to develop marketing strategies by gaining
insight into which items are frequently
purchased together by customers.
Possible Marketing Strategies

In one strategy, items that are frequently
purchased together can be placed in
proximity in order to further encourage the
sale of such items together.

Market basket analysis can also help retailers
plan which items to put on sale at reduced
prices. If customers tend to purchase
computers and printers together, then having
a sale on printers may encourage the sale of
printers as well as computers.

If we think of the universe as the set of items
available at the store, then each item has a
Boolean variable representing the presence or
absence of that item. Each basket can then be
represented by a Boolean vector of values
assigned to these variables.

The Boolean vectors can be analyzed for
buying patterns that reflect items that are
frequently associated or purchased together.
These patterns can be represented in the form
of association rules.
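
The Boolean-vector view of a basket can be sketched as follows. The item universe and the three baskets are hypothetical, as are the function names.

```python
# Hypothetical universe of items available at the store.
items = ["computer", "printer", "antivirus_software", "camera"]

# Hypothetical shopping baskets.
baskets = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"camera"},
]

def to_boolean_vector(basket, items):
    """One Boolean per item: True if the item is present in the basket."""
    return [item in basket for item in items]

vectors = [to_boolean_vector(b, items) for b in baskets]

def bought_together(vectors, items, a, b):
    """Count baskets whose vectors have both item a and item b set."""
    i, j = items.index(a), items.index(b)
    return sum(1 for v in vectors if v[i] and v[j])

together = bought_together(vectors, items, "computer", "printer")
```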
For example, the information that customers who purchase
computers also tend to buy antivirus software at the
same time is represented in the association rule below:

computer => antivirus_software [support = 2%, confidence = 60%]   (1)

Rule support and confidence are two measures of rule
interestingness. They respectively reflect the usefulness
and certainty of discovered rules.
A support of 2% for Association Rule (1) means that 2%
of all the transactions under analysis show that computer
and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who
purchased a computer also bought antivirus software.
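
To make the two measures concrete, the sketch below computes them for a hypothetical set of 150 transactions constructed so that the numbers match the 2% and 60% quoted above. The transaction data and function names are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support

# Hypothetical data: 150 transactions, 5 with a computer, 3 of those with antivirus.
transactions = (
    [{"computer", "antivirus_software"}] * 3
    + [{"computer"}] * 2
    + [{"printer"}] * 145
)

rule_support = support(transactions, {"computer", "antivirus_software"})
rule_confidence = confidence(transactions, {"computer"}, {"antivirus_software"})
```

Support is measured against all transactions (3 of 150), while confidence is measured only against those that contain a computer (3 of 5).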



Typically, association rules are
considered interesting if they satisfy
both a minimum support threshold and
a minimum confidence threshold.

Such thresholds can be set by users or
domain experts.
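
Applying both thresholds can be sketched as a brute-force filter over single-item rules. The five transactions and the threshold values (30% support, 60% confidence) are hypothetical, and practical algorithms such as Apriori avoid enumerating every candidate the way this toy does.

```python
from itertools import permutations

# Hypothetical transactions.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "antivirus_software", "printer"},
    {"computer", "printer"},
    {"printer"},
    {"camera"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def interesting_rules(transactions, min_support, min_confidence):
    """Return rules a => b that satisfy both minimum thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        rule_support = support(transactions, {a, b})
        a_support = support(transactions, {a})
        rule_confidence = rule_support / a_support if a_support else 0.0
        if rule_support >= min_support and rule_confidence >= min_confidence:
            rules.append((a, b, rule_support, rule_confidence))
    return rules

rules = interesting_rules(transactions, min_support=0.3, min_confidence=0.6)
rule_names = [f"{a} => {b}" for a, b, _, _ in rules]
```

Raising either threshold shrinks the list; lowering them lets weaker or rarer rules through.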
Application 2
Data Mining & DNA Data Analysis

A great deal of biomedical research has
focused on DNA data analysis.

Recent research in DNA analysis has led
to the discovery of genetic causes for
many diseases and disabilities, as well as
the discovery of new medicines and
approaches for disease diagnosis,
prevention, and treatment.
An important focus in genome research is
the study of DNA sequences since such
sequences form the foundation of the
genetic codes of all living organisms.

All DNA sequences comprise four basic
building blocks (called nucleotides):
adenine (A), cytosine (C), guanine (G), and
thymine (T).

These four nucleotides are combined to
form long sequences or chains that
resemble a twisted ladder.

DNA structure
Human beings have around 20,000
protein-coding genes (earlier estimates
put the number near 100,000).
Most diseases are not triggered by a
single gene but by a combination of
genes acting together.
Association analysis methods can be
used to help determine the kinds of
genes that are likely to co-occur in
target samples.
Such analysis would facilitate the
discovery of groups of genes and the
study of interactions and relationships
between them.
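
A rough sketch of such co-occurrence analysis is to count how often each pair of genes appears together across samples and keep the pairs seen in at least a chosen fraction of them. The gene names, samples, and the 50% cutoff are hypothetical placeholders, not real biological data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical target samples: the set of genes observed as active in each.
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
]

def cooccurring_pairs(samples, min_fraction):
    """Gene pairs that appear together in at least min_fraction of samples."""
    counts = Counter()
    for genes in samples:
        counts.update(combinations(sorted(genes), 2))
    total = len(samples)
    return {pair: n / total for pair, n in counts.items()
            if n / total >= min_fraction}

frequent_pairs = cooccurring_pairs(samples, min_fraction=0.5)
```

This is the same support idea used in market basket analysis, with samples in place of baskets and genes in place of items.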
Thank you