Data Mining and KDD
[Knowledge Discovery in Databases]
INTRODUCTION
Week II
Monday, 27 Feb 2017 - 3KA13 (E311)
E. Syahrul
Data mining:
 The process of employing one or more machine learning techniques to analyze data and extract knowledge from it automatically.
 An iterative and interactive process of discovering new, useful, and understandable patterns or models in very large (massive) databases.
 A series of processes for extracting added value, in the form of previously unknown knowledge, from a collection of data.
 The use of various data-analysis tools to discover patterns and relationships in data that can be used to make accurate predictions.
 Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Ref: Tan, Steinbach, Kumar, Introduction to Data Mining, 2004
Area             | DBMS                                               | OLAP                                                                | Data Mining
Task             | Extraction of detailed and summary data            | Summaries, trends and forecasts                                     | Knowledge discovery of hidden patterns and insights
Type of result   | Information                                        | Analysis                                                            | Insight and Prediction
Method           | Deduction (ask the question, verify with the data) | Multidimensional data modeling, aggregation, statistics             | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years?    | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months, and why?
 Data Mining
  is a part of KDD
  is an important part of KDD
http://www.dwreview.com/Data_mining/Closed_loop_DM.html
the process of searching for hidden knowledge in the massive
amounts of data that we are technically capable of generating
and storing. Data, in its raw form, is simply a collection of
elements, from which little knowledge can be gleaned. With
the development of data discovery techniques the value of the
data is significantly improved.
https://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/KDD3.htm
Example of DBMS, OLAP and Data Mining: Weather data
Assume we have made a record of the weather conditions
during a two-week period, along with the decisions of a tennis
player whether or not to play tennis on each particular day.
Thus we have generated tuples (or examples, instances)
consisting of values of four independent variables (outlook,
temperature, humidity, windy) and one dependent variable
(play). See the textbook for a detailed description.
DBMS
Consider our data stored in a relational table as follows:
By querying a DBMS containing the above table we can answer questions like:
- What was the temperature on the sunny days? {85, 80, 72, 69, 75}
- On which days was the humidity less than 75? {6, 7, 9, 11}
- On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
- On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the two sets above: {11}
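These lookups can be reproduced with simple filters. A minimal sketch in Python, assuming the standard 14-day weather dataset described in the textbook (the table itself does not appear in this transcript):

```python
# The assumed standard 14-day weather dataset:
# (day, outlook, temperature, humidity, windy, play).
DATA = [
    (1,  "sunny",    85, 85, False, "no"),
    (2,  "sunny",    80, 90, True,  "no"),
    (3,  "overcast", 83, 86, False, "yes"),
    (4,  "rainy",    70, 96, False, "yes"),
    (5,  "rainy",    68, 80, False, "yes"),
    (6,  "rainy",    65, 70, True,  "no"),
    (7,  "overcast", 64, 65, True,  "yes"),
    (8,  "sunny",    72, 95, False, "no"),
    (9,  "sunny",    69, 70, False, "yes"),
    (10, "rainy",    75, 80, False, "yes"),
    (11, "sunny",    75, 70, True,  "yes"),
    (12, "overcast", 72, 90, True,  "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy",    71, 91, True,  "no"),
]

# "What was the temperature on the sunny days?"
sunny_temps = [t for _, o, t, h, w, p in DATA if o == "sunny"]

# "On which days was the humidity less than 75?"
low_humidity_days = [d for d, o, t, h, w, p in DATA if h < 75]

# "On which days was the temperature greater than 70?"
warm_days = [d for d, o, t, h, w, p in DATA if t > 70]

# The conjunction of the last two conditions.
both = [d for d in warm_days if d in low_humidity_days]

print(sunny_temps)        # [85, 80, 72, 69, 75]
print(low_humidity_days)  # [6, 7, 9, 11]
print(warm_days)          # [1, 2, 3, 8, 10, 11, 12, 13, 14]
print(both)               # [11]
```

Each query is a selection over the tuples, exactly what a DBMS does with a WHERE clause.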
Decision Tree for PlayTennis
 Attributes and their values:
 Outlook: Sunny, Overcast, Rain
 Humidity: High, Normal
 Wind: Strong, Weak
 Temperature: Hot, Mild, Cool
 Target concept - Play Tennis: Yes, No
OLAP
Using OLAP we can create a multidimensional model of our data (a data cube). For example, we can build such a model using the dimensions time, outlook and play, where time represents the days grouped into weeks.
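The aggregation behind such a cube can be sketched in a few lines. A minimal sketch, assuming the standard 14-day dataset with days 1-7 taken as week 1 and days 8-14 as week 2:

```python
from collections import Counter

# (day, outlook, play) for the assumed 14-day weather dataset;
# days 1-7 form week 1 and days 8-14 form week 2.
DATA = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

# Each cube cell (week, outlook, play) holds the count of matching days.
cube = Counter(
    (1 if day <= 7 else 2, outlook, play) for day, outlook, play in DATA
)

# Slicing the cube along outlook=overcast exposes the regularity that
# later reappears as an association rule: overcast always means play=yes.
overcast_yes = cube[(1, "overcast", "yes")] + cube[(2, "overcast", "yes")]
overcast_no = cube[(1, "overcast", "no")] + cube[(2, "overcast", "no")]
print(overcast_yes, overcast_no)  # 4 0
```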
Data Mining
By applying various Data Mining techniques we can find
associations and regularities in our data, extract knowledge in
the forms of rules, decision trees etc., or just predict the value
of the dependent variable (play) in new situations (tuples).
Here are some examples (all produced by Weka):
Mining Association Rules
To find associations in our data we first discretize the numeric attributes (a part of the data pre-processing stage in data mining). Thus we group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the values in the data with the corresponding names. Then we apply the Apriori algorithm and get the following association rules:
Rules:
1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)
Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
These rules show some attribute-value sets (the so-called item sets) that appear frequently in the data. The numbers after each rule show the support (the number of occurrences of the item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the same as the one we produced by observing the data cube.
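The support and confidence figures above can be checked by direct counting. A minimal sketch, assuming the standard nominal (discretized) version of the 14-day dataset; support_confidence is a hypothetical helper for illustration, not part of Weka:

```python
# Discretized weather data (outlook, temperature, humidity, windy, play),
# assuming the standard nominal version of the 14-day dataset.
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "normal", True, "no"),
]
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")

def support_confidence(antecedent, consequent):
    """Support: rows matching antecedent and consequent together;
    confidence: that count divided by the antecedent count."""
    rows = [dict(zip(FIELDS, r)) for r in DATA]
    ante = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    both = [r for r in ante if all(r[k] == v for k, v in consequent.items())]
    return len(both), len(both) / len(ante)

# Rule 3: outlook=overcast ==> play=yes (4, 1)
print(support_confidence({"outlook": "overcast"}, {"play": "yes"}))
# Rule 1: humidity=normal windy=false ==> play=yes (4, 1)
print(support_confidence({"humidity": "normal", "windy": False},
                         {"play": "yes"}))
```

Both calls return (4, 1.0), matching the (support, confidence) pairs Weka reports.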
Classification by Decision Trees and Rules
Using the ID3 algorithm we can produce the following decision tree (shown as a horizontal tree):
 outlook = sunny
    humidity = high: no
    humidity = normal: yes
 outlook = overcast: yes
 outlook = rainy
    windy = true: no
    windy = false: yes
The decision tree consists of decision nodes that test the values of their corresponding attributes. Each value of the attribute leads to a subtree, and so on, until the leaves of the tree are reached; the leaves determine the value of the dependent variable. Using a decision tree we can classify new tuples (not used to generate the tree). For example, according to the above tree the tuple {sunny, mild, normal, false} will be classified under play=yes.
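The classification step can be sketched as the tree written out by hand as nested conditionals (a manual translation of the tree above, not generated code):

```python
def classify(outlook, temperature, humidity, windy):
    """The ID3 tree as nested conditionals; temperature is accepted
    but never tested, exactly as in the tree."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

# The new tuple from the text: {sunny, mild, normal, false}
print(classify("sunny", "mild", "normal", False))  # yes
```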
Decision Tree for PlayTennis (the same tree drawn graphically):
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
A decision tree can be represented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other data mining techniques can produce rules directly. For example, the Prism algorithm available in Weka generates the following rules:
If outlook = overcast then yes
If humidity = normal and windy = false then yes
If temperature = mild and humidity = normal then yes
If outlook = rainy and windy = false then yes
If outlook = sunny and humidity = high then no
If outlook = rainy and windy = true then no
Prediction methods
Data Mining offers techniques to predict the value of the dependent variable directly, without first generating a model. One of the most popular approaches for this purpose is based on statistical methods. It uses Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables. For example, applying Bayes' rule to the new tuple discussed above we get:
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
Then obviously the predicted value must be "yes".
Decision Tree for PlayTennis: classifying a new tuple
New tuple: Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak
Following the tree: Outlook=Sunny -> Humidity=High -> PlayTennis = No
Decision Tree for the Conjunction Outlook=Sunny AND Wind=Weak
Outlook
  Sunny -> Wind
    Strong -> No
    Weak -> Yes
  Overcast -> No
  Rain -> No
Good Luck!
Additional Info