Data Mining and KDD (Knowledge Discovery in Databases)
INTRODUCTION
Week II, Monday, 27 Feb 2017 - 3KA13 (E311)
E. Syahrul

Data mining:
- The process of employing one or more machine-learning techniques to analyze data and extract knowledge from it automatically.
- An iterative and interactive process of discovering new, useful, and understandable patterns or models in massive databases.
- A series of processes for extracting added value, in the form of previously unknown knowledge, from a collection of data.
- The use of various data-analysis software tools to find patterns and relationships in data so that accurate predictions can be made.
- "Non-trivial extraction of implicit, previously unknown and potentially useful information from data."
- "Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns."
Ref: Tan, Steinbach, Kumar, Introduction to Data Mining, 2004

Comparison of DBMS, OLAP, and Data Mining:

Area             | DBMS                                     | OLAP                                                  | Data Mining
Task             | Extraction of detailed and summary data  | Summaries, trends and forecasts                       | Knowledge discovery of hidden patterns and insights
Type of result   | Information                              | Analysis                                              | Insight and Prediction
Method           | Deduction (ask the question, verify with the data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers, by region, by year? | Who will buy a mutual fund in the next 6 months, and why?

Data mining is part of KDD, and an important part of the KDD process.
http://www.dwreview.com/Data_mining/Closed_loop_DM.html

Data mining is the process of searching for hidden knowledge in the massive amounts of data that we are technically capable of generating and storing. Data, in its raw form, is simply a collection of elements, from which little knowledge can be gleaned.
With the development of data discovery techniques, the value of the data is significantly improved.
https://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/KDD3.htm

Example of DBMS, OLAP and Data Mining: weather data

Assume we have recorded the weather conditions during a two-week period, along with the decisions of a tennis player on whether or not to play tennis on each particular day. Thus we have generated tuples (also called examples or instances) consisting of values of four independent variables (outlook, temperature, humidity, windy) and one dependent variable (play). See the textbook for a detailed description.

DBMS

Consider our data stored in a relational table. By querying a DBMS containing this table we can answer questions like:
- What was the temperature on the sunny days? {85, 80, 72, 69, 75}
- On which days was the humidity less than 75? {6, 7, 9, 11}
- On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
- On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the two sets above: {11}

Attributes and their values (nominal version of the data):
- Outlook: Sunny, Overcast, Rain
- Temperature: Hot, Mild, Cool
- Humidity: High, Normal
- Wind: Strong, Weak
- Target concept, PlayTennis: Yes, No

OLAP

Using OLAP we can create a multidimensional model of our data (a data cube), for example using the dimensions time, outlook, and play, where time represents the days grouped into weeks.

Data Mining

By applying various data mining techniques we can find associations and regularities in our data, extract knowledge in the form of rules, decision trees, etc., or simply predict the value of the dependent variable (play) in new situations (tuples). Here are some examples (all produced by Weka).

Mining Association Rules

To find associations in our data we first discretize the numeric attributes (a part of the data pre-processing stage in data mining).
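The DBMS-style queries above, and the discretization step just mentioned, can be sketched in plain Python. The table literal below follows the standard Weka weather (numeric) dataset; the day numbering and the discretization cut points are illustrative assumptions chosen to match the answer sets quoted above.

```python
# The classic 14-day weather data: (day, outlook, temperature, humidity, windy, play).
DATA = [
    (1,  "sunny",    85, 85, False, "no"),
    (2,  "sunny",    80, 90, True,  "no"),
    (3,  "overcast", 83, 86, False, "yes"),
    (4,  "rainy",    70, 96, False, "yes"),
    (5,  "rainy",    68, 80, False, "yes"),
    (6,  "rainy",    65, 70, True,  "no"),
    (7,  "overcast", 64, 65, True,  "yes"),
    (8,  "sunny",    72, 95, False, "no"),
    (9,  "sunny",    69, 70, False, "yes"),
    (10, "rainy",    75, 80, False, "yes"),
    (11, "sunny",    75, 70, True,  "yes"),
    (12, "overcast", 72, 90, True,  "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy",    71, 91, True,  "no"),
]

# DBMS-style queries: simple selection and projection over the table.
sunny_temps = [t for d, o, t, h, w, p in DATA if o == "sunny"]  # temperatures on sunny days
dry_days    = {d for d, o, t, h, w, p in DATA if h < 75}        # humidity < 75
warm_days   = {d for d, o, t, h, w, p in DATA if t > 70}        # temperature > 70
both        = dry_days & warm_days                              # intersection of the two sets

# Discretization (pre-processing for association mining). The cut points
# (hot >= 80, cool < 70, high humidity >= 85) are assumptions for illustration.
def discretize(temp, humidity):
    t = "hot" if temp >= 80 else ("cool" if temp < 70 else "mild")
    h = "high" if humidity >= 85 else "normal"
    return t, h

print(sunny_temps, sorted(dry_days), sorted(warm_days), both)
```

Running the queries reproduces the answer sets listed above, and `discretize` maps each numeric tuple to the nominal (hot/mild/cool, high/normal) values used in the next step.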
Thus we group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the values in the data with the corresponding names. Then we apply the Apriori algorithm and get the following association rules:

1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)

These rules show some sets of attribute values (the so-called item sets) that appear frequently in the data. The numbers after each rule show the support (the number of occurrences of the item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the same as the one that we produced by observing the data cube.

Classification by Decision Trees and Rules

Using the ID3 algorithm we can produce the following decision tree (shown as a horizontal, indented tree):

outlook = sunny
|   humidity = high: no
|   humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|   windy = true: no
|   windy = false: yes

The decision tree consists of decision nodes that test the values of their corresponding attribute. Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached; the leaves determine the value of the dependent variable. Using a decision tree we can classify new tuples (tuples not used to generate the tree). For example, according to the above tree the tuple {sunny, mild, normal, false} will be classified under play=yes.
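The classification step just described can be sketched by transcribing the ID3 tree directly into nested conditionals (a minimal sketch; the attribute order and value names follow the tree above):

```python
# Classify a tuple (outlook, temperature, humidity, windy) with the ID3 tree above.
# Note that temperature never appears in the tree, so it is ignored here too.
def classify(outlook, temperature, humidity, windy):
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy": the decision depends only on windy
    return "no" if windy else "yes"

# The new tuple {sunny, mild, normal, false} from the text:
print(classify("sunny", "mild", "normal", False))  # -> yes
```

Each `if` corresponds to one decision node; the returned strings are the leaves.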
In a decision tree, each internal node tests an attribute, each branch corresponds to a value of that attribute, and each leaf node assigns a classification.

A decision tree can be represented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other data mining techniques can produce rules directly. For example, the Prism algorithm available in Weka generates the following rules:

If outlook = overcast then yes
If humidity = normal and windy = false then yes
If temperature = mild and humidity = normal then yes
If outlook = rainy and windy = false then yes
If outlook = sunny and humidity = high then no
If outlook = rainy and windy = true then no

Prediction methods

Data mining also offers techniques to predict the value of the dependent variable directly, without first generating a model. One of the most popular approaches for this purpose is based on statistical methods. It uses the Bayes rule to predict the probability of each value of the dependent variable given the values of the independent variables. For example, applying Bayes to the new tuple discussed above we get:

P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2

Then, obviously, the predicted value must be "yes".

[Slide: classifying the new instance (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak) with the PlayTennis decision tree gives PlayTennis=No.]
[Slide: a decision tree for the conjunction Outlook=Sunny AND Wind=Weak.]

Good Luck!

Additional Info
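As additional info: the Bayes prediction quoted above can be reproduced with a short naive Bayes sketch over the nominal weather data. The dataset literal follows the standard Weka weather (nominal) data; for simplicity this sketch uses plain relative frequencies with no smoothing, which is an assumption (Weka's estimator differs slightly), yet it still yields approximately the 0.8 / 0.2 split.

```python
# Standard nominal weather data: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]

def naive_bayes(x):
    """Posterior P(play | x) for a tuple x = (outlook, temperature, humidity, windy)."""
    scores = {}
    for cls in ("yes", "no"):
        rows = [r for r in DATA if r[-1] == cls]
        p = len(rows) / len(DATA)                       # prior P(play=cls)
        for i, value in enumerate(x):                   # likelihoods P(x_i | play=cls)
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[cls] = p
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}  # normalize to probabilities

post = naive_bayes(("sunny", "mild", "normal", False))
print(post)  # roughly {'yes': 0.80, 'no': 0.20}
```

The "naive" part is the product over the likelihoods, which assumes the four attributes are conditionally independent given the class.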