Lecture 6
Themes in this session
Data mining
Reading Directions
[Komp, article 3] Elmasri and Navathe, Fundamentals of
Database Systems, Chapter 26.2 Data Mining
What is data mining?
“Data Mining is data analysis in order to discover
hidden correlations (patterns, rules) in huge data
sets”
“Data Mining is the process of extracting valid,
previously unknown, comprehensible, and actionable
information from large databases and using it to
make crucial business decisions.”
Data Mining versus KDD
• Knowledge Discovery in Databases involves the
extraction of implicit, previously unknown and
potentially useful information from data.
• Data Mining is the use of algorithms to extract the
information and patterns derived by the KDD
process.
The KDD Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
• Selection: This first step obtains the data from various
databases, files, and nonelectronic sources.
• Preprocessing: Incorrect data is corrected or removed, and
missing data is supplied or predicted.
• Transformation: Data from different sources is converted
into a common format for processing. Some data is encoded or
transformed into more usable formats. Data reduction might be
applied to shrink the data to be analysed.
• Data Mining: Applying algorithms to the transformed data to
generate the desired results.
• Interpretation/Evaluation: Visualising the results by using
different GUI strategies and interpreting them.
Enabling factors for data mining
Data availability
• Increased amount of electronically stored data
• Increased processing power
• Increased data storage ability
• Increased data gathering ability (networks,
extraction tools)
• Increased number of data warehouses
Business conditions
• Increased need to compete effectively
• Increased awareness of need to know customers
Data mining uses in enterprises
• Predict customer behaviour patterns, e.g. buying
patterns
• Discover market developments driven by
demographic changes
• Discover shifts in consumption
• Identification of new customers
• Anticipation of demands on inventory
Data Mining Models and Tasks
Data Mining
  Predictive: Classification, Regression, Time Series Analysis, Prediction
  Descriptive: Clustering, Summarisation, Association Rules, Sequence Discovery
Classification
• Classification maps data into predefined groups or
classes. Classification algorithms require the
classes to be defined based on data attribute
values. They often describe these classes by
looking at the characteristics of data already
known to belong to the classes.
• Pattern recognition is a type of classification where
an input pattern is classified into one of several
classes based on its similarity to these predefined
classes.
• Example: Determining whether to approve a bank
loan application.
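A minimal sketch of the idea above, assuming a nearest-centroid rule: each predefined class is described by the mean of the records already known to belong to it, and a new record is assigned to the most similar class. The applicant data, the attributes (income, debt) and the function names are hypothetical and not taken from the lecture.

```python
from math import dist

# Hypothetical training data: (income, debt) in thousands, with a known class.
training = [
    ((60.0, 5.0), "approve"),
    ((80.0, 10.0), "approve"),
    ((70.0, 8.0), "approve"),
    ((20.0, 15.0), "reject"),
    ((25.0, 30.0), "reject"),
    ((30.0, 24.0), "reject"),
]


def class_centroids(samples):
    """Describe each predefined class by the mean of its known members."""
    sums = {}
    for (income, debt), label in samples:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += income
        total[1] += debt
        total[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def classify(point, centroids):
    """Assign the point to the class whose centroid is closest (most similar)."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))


centroids = class_centroids(training)
decision = classify((65.0, 7.0), centroids)  # a new loan application
```

Other classifiers (decision trees, neural networks, and so on) follow the same pattern: learn a description of the classes from labelled data, then apply it to new records.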
Regression
• Regression is used to map a data item to a real
valued prediction variable. In actuality, regression
involves the learning of the function that does this
mapping.
• Regression assumes that the target data fit into
some known type of function (e.g., linear) and then
determines the best function of this type that
models the given data.
• Example: Eva wishes to reach a certain level of savings before her
retirement. Periodically, she predicts what her retirement savings will be
based on their current value and several past values. She uses a simple linear
regression formula to predict this value by fitting past behaviour to a
linear function and then using this function to predict the values at points
in the future. Based on these values, she then alters her investment
portfolio.
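Eva's approach can be sketched with an ordinary least-squares fit of a straight line. The savings figures below are made up for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past observations: year index and savings (in thousands).
years = [0, 1, 2, 3, 4]
savings = [10.0, 12.0, 14.5, 16.0, 18.5]

slope, intercept = fit_line(years, savings)
# Use the fitted function to predict the value at a future point (year 10).
predicted = intercept + slope * 10
```

The assumption that the data follow a linear function is built into fit_line; a different function type would need a different fitting procedure.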
Time Series Analysis
• With time series analysis, the value of an attribute
is examined as it varies over time. The values are
obtained at evenly spaced time points (hourly, daily,
weekly, etc.).
• A time series plot is used to visualise the time
series.
• Example: Eva is trying to determine whether to purchase stocks from
Companies X, Y or Z. For a period of one month she charts the daily stock
price for these companies. Using this information she decides to purchase
stocks from X, because it is less volatile while overall showing a slightly
larger relative amount of growth than either of the other stocks.
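The comparison Eva makes by eye can also be computed. The sketch below assumes volatility is measured as the standard deviation of daily returns, which is one common choice and not something the lecture specifies. The price series are invented and shortened to six days.

```python
from statistics import pstdev


def daily_returns(prices):
    """Relative change from each day to the next."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def relative_growth(prices):
    """Growth over the whole period, relative to the first price."""
    return (prices[-1] - prices[0]) / prices[0]


def volatility(prices):
    """Standard deviation of the daily returns (assumed volatility measure)."""
    return pstdev(daily_returns(prices))


# Hypothetical daily prices for the three companies.
prices = {
    "X": [100.0, 101.0, 102.0, 103.0, 104.0, 106.0],
    "Y": [100.0, 110.0, 95.0, 108.0, 97.0, 104.0],
    "Z": [100.0, 90.0, 105.0, 92.0, 108.0, 103.0],
}

summary = {name: (relative_growth(p), volatility(p)) for name, p in prices.items()}
least_volatile = min(summary, key=lambda name: summary[name][1])
best_growth = max(summary, key=lambda name: summary[name][0])
```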
Prediction
• Many real-world data mining applications can be seen as
predicting future data states based on past and
current data. Prediction can be viewed as a type of
classification, with the difference that it classifies
a future state rather than a current state.
• Although future values may be predicted using time
series analysis or regression techniques, other
approaches may be used as well.
• Example: Predicting flooding is a difficult problem. One approach uses
monitors placed at various points in the river. These monitors collect data
relevant to flood prediction: water level, rain amount, time, humidity, and so on.
Then the water level at a potential flooding point in the river can be predicted
based on data collected by the sensors upriver from this point. The prediction
must be made with respect to the time the data were collected.
Clustering
• Clustering is similar to classification except that
the groups are not predefined, but rather defined
by the data alone. It can be thought of as
partitioning the data into groups that might or
might not be disjoint.
• The clustering is usually accomplished by
determining the similarity among the data on
predefined attributes.
• Since the clusters are not predefined, a domain
expert is often required to interpret the meaning
of the created clusters.
• Example: [Figure: scatter plot of customers (X marks) with axes labelled Income and Dept; one group of points is labelled "Profitable customers!"]
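The slides do not name a clustering algorithm. As an illustration only, the sketch below uses k-means, one common method that groups records by their similarity (here, Euclidean distance) on predefined attributes. The customer points and the starting centroids are hypothetical.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical customers described by two numeric attributes.
points = [(20, 30), (22, 35), (25, 32), (80, 5), (85, 8), (90, 6)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
```

The algorithm only produces the groups. Deciding that one of them represents, say, profitable customers is the interpretation step left to the domain expert.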
Summarisation
• Summarisation maps data into subsets with
associated simple descriptions.
• Summarisation is also called characterisation or
generalisation. It extracts or derives
representative information about the data set.
• This may be accomplished by actually retrieving
portions of the data. Alternatively, summary type
information (such as mean of some numeric
attribute) can be derived from the data.
• Example: One of the many criteria used to compare universities
by the U.S. News and World Report is the average score.
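Deriving summary information such as a mean takes only a few lines. The universities and scores below are invented for illustration.

```python
from statistics import mean

# Hypothetical scores of admitted students, grouped by university.
scores = {
    "University A": [1200, 1300, 1250],
    "University B": [1100, 1150, 1180],
}

# Summarise each subset of the data by a single derived value.
average_score = {name: mean(values) for name, values in scores.items()}
```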
Association Rules
• Link analysis, alternatively referred to as affinity
analysis or association, refers to the data mining task
of uncovering relationships among data.
• An association rule is a model that identifies specific
types of data associations. These associations are
often used in the retail sales community to identify
items that are frequently purchased together.
• Example: A grocery store is trying to decide whether to put bread on
sale. To help determine the impact of this decision, the retailer generates
association rules that show what other products are frequently purchased with
bread. He finds that 70% of the time bread is sold, jelly is also sold. Based on
this, he decides to place some jelly at the end of the aisle where the bread is
placed and decides not to have the jelly on sale at the same time.
Sequence Discovery
• Sequential analysis or sequence discovery is used to
determine sequential patterns in data.
• These patterns are similar to associations that are
found in the data, but they are based on time.
• Unlike a market basket analysis, which requires the
items to be purchased at the same time, in
sequence discovery the items are purchased over
time in some order.
• Example: The webmaster at XYZ Corp. periodically analyses the Web
log data to determine how users of XYZ’s Web pages access them. He
is interested in determining which pages are most frequently accessed and
in what sequence they are accessed. He determines that 70% of the users
of page A follow one of the following patterns of behaviour: <A,B,C> or
<A,D,B,C> or <A,E,B,C>. He then decides to add a link directly from page A
to page C.
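The webmaster's analysis can be sketched as counting ordered page sequences in the log. The sessions below are a made-up log, constructed so that the three patterns from the example cover 70% of the sessions that start at page A. paths_from is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical Web log: each session is the ordered list of pages visited.
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["A", "F"],
    ["B", "C"],
    ["A", "D", "B", "C"],
    ["A", "B", "C"],
    ["A", "G", "H"],
    ["A", "E", "B", "C"],
    ["A", "C"],
]


def paths_from(page, sessions):
    """Count the ordered page sequences of all sessions that start at `page`."""
    return Counter(tuple(s) for s in sessions if s and s[0] == page)


paths = paths_from("A", sessions)
total = sum(paths.values())

# Share of page-A users following one of the three patterns from the example.
patterns = [("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C")]
share = sum(paths[p] for p in patterns) / total
```

Unlike the item sets used in market basket analysis, the sequences here are tuples, so the order of the pages matters.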
Association Rules
Ex. If a customer buys X, (s)he is also likely to buy Y

Transaction-id   Time   Items-Bought
101              6:35   milk, bread, juice
792              7:38   milk, juice
1130             8:05   milk, eggs
1730             8:40   bread, cookies, coffee

X ⇒ Y
where X = {x1, x2,…,xn} and Y = {y1, y2,…,ym}
are sets of items, with xi ≠ yj for each i and j

Support (prevalence) = (nr. of trans. containing X ∪ Y) / (nr. of trans.)
{Milk, Juice} = 2/4 = 50%
{Bread, Juice} = 1/4 = 25%

Confidence (strength) = (nr. of trans. containing X ∪ Y) / (nr. of trans. containing X)
Milk ⇒ Juice 2/3 = 66.7%
Bread ⇒ Juice 1/2 = 50%
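The two measures can be checked against the four transactions above with a short sketch. The function names support and confidence are chosen for illustration.

```python
# The four transactions from the table, keyed by transaction id.
transactions = {
    101: {"milk", "bread", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1730: {"bread", "cookies", "coffee"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of X ∪ Y divided by the support of X, for the rule X ⇒ Y."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


milk_juice_support = support({"milk", "juice"}, transactions)
milk_juice_confidence = confidence({"milk"}, {"juice"}, transactions)
```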
Mining Association Rules
1. Generate all item sets that have a
support that exceeds a threshold
defined by the user
2. For each such item set generate all
the rules that have confidence above
a threshold defined by the user
Example:
nr of trans = 10
support ≥ 30%
conf ≥ 70%

Transaction   Items
1             milk, bread, eggs, juice
2             milk, juice
3             milk, eggs
4             bread, cookies, coffee
5             milk, bread, eggs, fruits
6             milk, bread, eggs, coffee
7             cookies, coffee
8             coffee, milk
9             fruits, milk
10            eggs, milk

1.
support {milk, bread, eggs} = 30%
support {cookies, juice} = 0%
support {cookies, coffee} = 20%
support {milk, eggs} = 50%
…
nr. of sets to be checked is 2^7
(in general 2^(nr of items))

2.
conf (milk, bread ⇒ eggs) = 3/3 = 100%
conf (milk, eggs ⇒ bread) = 3/5 = 60%
conf (eggs, bread ⇒ milk) = 3/3 = 100%
conf (milk ⇒ bread, eggs) = 3/8 = 38%
conf (bread ⇒ milk, eggs) = 3/4 = 75%
conf (eggs ⇒ bread, milk) = 3/5 = 60%
conf (milk ⇒ eggs) = 5/8 = 63%
conf (eggs ⇒ milk) = 5/5 = 100%
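The two steps can be sketched as a brute-force search over the ten example transactions. This checks every non-empty item set, which is only feasible because there are just seven distinct items. The function names are hypothetical.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
    {"milk", "bread", "eggs", "fruits"},
    {"milk", "bread", "eggs", "coffee"},
    {"cookies", "coffee"},
    {"coffee", "milk"},
    {"fruits", "milk"},
    {"eggs", "milk"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Step 1: all item sets whose support reaches the threshold."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = count(frozenset(candidate), transactions)
            if c / len(transactions) >= min_support:
                frequent[frozenset(candidate)] = c
    return frequent


def rules(frequent, transactions, min_confidence):
    """Step 2: split each frequent item set into X ⇒ Y and keep confident rules."""
    found = []
    for itemset, c in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = c / count(lhs, transactions)
                if conf >= min_confidence:
                    found.append((lhs, itemset - lhs, conf))
    return found


frequent = frequent_itemsets(transactions, min_support=0.3)
strong_rules = rules(frequent, transactions, min_confidence=0.7)
```

With these thresholds, bread ⇒ milk, eggs (75%) is kept, while milk ⇒ eggs (63%) is dropped, matching the confidences listed above.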
Association Rules - Basic Algorithm
• Test the support for item sets of length 1 (1-itemsets)
by scanning the database. Discard those that
do not meet the minimum required support
• Extend the large 1-itemsets into 2-itemsets by
appending one item each time, to generate all
candidate item sets of length two. Test the
support for all candidate item sets and eliminate
those that do not meet the minimum support
• Repeat the above steps; at step k, the previously
found (k-1) item sets are extended into k-itemsets
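The level-wise procedure above can be sketched as follows. This is a simplified version, assuming that candidates are formed by appending one large single item to each large item set of the previous level. The transaction data are an invented ten-row example, and the function name apriori is a label chosen here.

```python
transactions = [
    {"milk", "bread", "eggs", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
    {"milk", "bread", "eggs", "fruits"},
    {"milk", "bread", "eggs", "coffee"},
    {"cookies", "coffee"},
    {"coffee", "milk"},
    {"fruits", "milk"},
    {"eggs", "milk"},
]


def apriori(transactions, min_support):
    """Level-wise search for all item sets meeting the minimum support."""
    n = len(transactions)

    def is_large(itemset):
        # One scan of the database per candidate item set.
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / n >= min_support

    # Step 1: test the 1-itemsets and discard those below the threshold.
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if is_large(frozenset([i]))]
    large_items = [next(iter(s)) for s in level]
    large = list(level)

    # Step k: extend the large (k-1)-itemsets by one item and test again.
    while level:
        candidates = {s | {i} for s in level for i in large_items if i not in s}
        level = [c for c in candidates if is_large(c)]
        large.extend(level)
    return large


large_itemsets = apriori(transactions, min_support=0.3)
```

Because an item set can only be large if all of its subsets are large, each level needs to extend only the survivors of the previous one, which avoids testing all 2^(nr of items) sets.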
Association Rules among Hierarchies
Beverages
  Carbonated: Colas, Clear drinks, Mixed drinks
  Non-Carbonated: Bottled juices (Orange, Apple), Bottled water, Wine coolers
Desserts
  Ice Cream
  Baked
  Frozen Yoghurt: Regular, Low fat
Beverage ⇒ Desserts
Desserts ⇒ Beverage
Ice cream ⇒ Wine coolers
Low fat frozen yoghurt ⇒ Bottled water
Association Rules - Negative Associations
“60% of customers who buy potato chips do not buy
bottled water”
The problem:
In a DB with 10000 items there are 2^10000 possible
combinations of items, a majority of which do not
appear even once in the DB.
How to find only the interesting negative associations?
[Figure: two item hierarchies, Soft Drinks (Joke, Wakeup, Topsy) and Chips (Days, Nightos, Partyos)]
Data visualisation -
A picture tells more than a thousand words
• Five hundred people, all from the same section of
London, England, died of cholera within a 10-day period
in September 1854. Dr. John Snow, a local physician, had
been studying this spread of cholera for some time.
One of the earliest known examples of data
visualisation is Dr. Snow’s use of maps to prove his
long-held theory that cholera was a waterborne
infection.
Applications of Data Mining
Marketing
• analysis of customer behaviour based on buying patterns
• determination of marketing strategies including advertising,
store location, and targeted mailing
• segmentation of customers, stores, or products
• design of catalogs, store layouts, and advertising campaigns
Finance
• analysis of creditworthiness of clients
• segmentation of accounts receivable
• performance analysis of financial investments like stocks,
bonds and mutual funds
• evaluation of financing options
• fraud detection
Applications of Data Mining 2
Manufacturing
• optimisation of resources like machines, manpower and
materials
• optimal design of manufacturing processes, shop-floor
layouts and product design, such as for products tailored
according to customer requirements
Health Care
• analysis of effectiveness of certain treatments
• optimisation of processes within a hospital, relating patient
wellness data to doctor qualifications
• analysis of side effects of drugs