DATA MINING CONCEPTS: INTRODUCTION
Week I, Monday, 27 Feb 2017 - 3KA13 (E311)
Elfitrin Syahrul

Outline
• Background of data mining (DM)
• Definition of data mining
• The need for DM
• Fields related to DM
• Databases & DM
• Applications of DM
• Tools used

Data mining?
• Data is being collected and warehoused in many databases
• Web data, e-commerce
• Purchases at supermarkets (department/grocery stores)
• Bank transactions such as credit card payments
• Computers: affordable prices and good performance
• Increasingly strong competitive pressure
• Improved service (customized services for an edge, e.g. in customer relationship management)

Data is stored at rates of GB/hour:
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene expression data
• scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data.
Data mining may help scientists in classifying and segmenting data, and in hypothesis formation.

• Information is hidden in the data ("hidden" information)
• Manual analysis takes time
• Much of the data is never processed

Definition of data mining
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Ref: Tan, Steinbach, Kumar, Introduction to Data Mining, 2004

1. Data mining is the search for, and the techniques for analyzing, large amounts of data in order to find meaningful patterns and rules (Berry & Linoff, 2004: 7).
2. Data mining is a technique for analyzing large collections of data to find unexpected relationships that are useful to the data owner (Hand, 2001: 1).
3. Data mining is the process of finding patterns and relationships in data (Hornick, 2007: 6).
4. Data mining is an automatic or semi-automatic process for finding new and potentially useful information (knowledge) in a collection of data (Tang & Jamie, 2005: 2).

How do you explain Data Mining to non Computer Science people?
Mango Shopping

Suppose you go shopping for mangoes one day. The vendor has laid out a cart full of mangoes. You can handpick the mangoes, the vendor will weigh them, and you pay according to a fixed Rs per kg rate (typical story in India).

Obviously, you want to pick the sweetest, ripest mangoes for yourself (since you are paying by weight and not by quality). How do you choose the mangoes?

You remember your grandmother saying that bright yellow mangoes are sweeter than pale yellow ones. So you make a simple rule: pick only from the bright yellow mangoes. You check the color of the mangoes, pick the bright yellow ones, pay up, and return home. Happy ending? Not quite.

Life is complicated

Suppose you go home and taste the mangoes. Some of them are not as sweet as you'd like. You are worried. Apparently, your grandmother's wisdom is insufficient. There is more to mangoes than just color.

After a lot of pondering (and tasting different types of mangoes), you conclude that the bigger, bright yellow mangoes are guaranteed to be sweet, while the smaller, bright yellow mangoes are sweet only half the time (i.e. if you buy 100 bright yellow mangoes, of which 50 are big and 50 are small, then the 50 big mangoes will all be sweet, while of the 50 small ones, on average only 25 will turn out to be sweet).

You are happy with your findings, and you keep them in mind the next time you go mango shopping. But next time at the market, you see that your favorite vendor has gone out of town. You decide to buy from a different vendor, who supplies mangoes grown in a different part of the country. Now you realize that the rule you had learnt (that big, bright yellow mangoes are the sweetest) no longer applies. You have to learn from scratch. You taste a mango of each kind from this vendor, and realize that the small, pale yellow ones are in fact the sweetest of all.
Now, a distant cousin visits you from another city. You decide to treat her to mangoes. But she mentions that she doesn't care about the sweetness of a mango; she only wants the juiciest ones. Once again, you run your experiments, tasting all kinds of mangoes, and realize that the softer ones are juicier.

Now, you move to a different part of the world. Here, mangoes taste surprisingly different from those in your home country. You realize that the green mangoes are in fact tastier than the yellow ones.

You marry someone who hates mangoes. She loves apples instead. You go apple shopping. Now, all your accumulated knowledge about mangoes is worthless. You have to learn everything about the correlation between the physical characteristics and the taste of apples, by the same method of experimentation. You do it, because you love her.

Enter computer programs

Now, imagine that all this while, you were writing a computer program to help you choose your mangoes (or apples). You would write rules of the following kind:

if (color is bright yellow and size is big and sold by favorite vendor): mango is sweet.
if (soft): mango is juicy.
etc.

You would use these rules to choose the mangoes. You could even send your younger brother with this list of rules to buy the mangoes, and you would be assured that he will pick only the mangoes of your choice.

But every time you make a new observation from your experiments, you have to modify the list of rules by hand. You have to understand the intricate details of all the factors affecting the quality of mangoes. If the problem gets complicated enough, it can get really difficult to write accurate rules by hand that cover all possible types of mangoes. Your research could earn you a PhD in Mango Science (if there is one). But not everyone has that kind of time.

Enter Machine Learning algorithms

ML algorithms are an evolution over normal algorithms.
They make your programs "smarter", by allowing them to automatically learn from the data you provide.

You take a randomly selected sample of mangoes from the market (training data) and make a table of all the physical characteristics of each mango, such as color, size, shape, the part of the country it was grown in, the vendor who sold it, etc. (features), along with the sweetness, juiciness and ripeness of that mango (output variables). You feed this data to the machine learning algorithm (classification/regression), and it learns a model of the correlation between an average mango's physical characteristics and its quality.

Next time you go to the market, you measure the characteristics of the mangoes on sale (test data), and feed them to the ML algorithm. It will use the model computed earlier to predict which mangoes are sweet, ripe and/or juicy. The algorithm may internally use rules similar to the ones you wrote by hand earlier (e.g. a decision tree), or it may use something more involved, but to a large extent you don't need to worry about that.

Voila, you can now shop for mangoes with great confidence, without worrying about the details of how to choose the best mangoes. And what's more, you can make your algorithm improve over time (reinforcement learning), so that it improves its accuracy as it reads more training data, and modifies itself when it makes a wrong prediction. But the best part is, you can use the same algorithm to train different models, one each for predicting the quality of apples, oranges, bananas, grapes, cherries and watermelons, and keep all your loved ones happy :)

And that is Machine Learning for you. Tell me if it isn't cool.

Machine Learning: Making your algorithms smart, so that you don't need to be. ;)

Data Mining

Data mining is about explaining the past and predicting the future by means of data analysis.

"Drowning in Data yet Starving for Knowledge"
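Returning to the mango story, the contrast between a hand-written rule and a rule learned from a table of examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two small tables are invented, the names `sweet_by_hand`, `train` and `predict` are hypothetical, and the "learner" is a deliberately simple majority vote per feature combination, not a decision tree or any particular library algorithm.

```python
from collections import Counter, defaultdict

def sweet_by_hand(mango):
    """Hand-written rule from the story: big, bright yellow mangoes are sweet."""
    return mango["color"] == "bright yellow" and mango["size"] == "big"

# Invented training table for the favorite vendor (features plus observed outcome).
FAVORITE_VENDOR = [
    {"color": "bright yellow", "size": "big", "sweet": True},
    {"color": "bright yellow", "size": "big", "sweet": True},
    {"color": "bright yellow", "size": "small", "sweet": True},
    {"color": "bright yellow", "size": "small", "sweet": False},
    {"color": "bright yellow", "size": "small", "sweet": False},
    {"color": "pale yellow", "size": "big", "sweet": False},
    {"color": "pale yellow", "size": "small", "sweet": False},
]

# Invented table for the second vendor, where small pale yellow mangoes are sweetest.
OTHER_VENDOR = [
    {"color": "pale yellow", "size": "small", "sweet": True},
    {"color": "pale yellow", "size": "small", "sweet": True},
    {"color": "bright yellow", "size": "big", "sweet": False},
    {"color": "bright yellow", "size": "small", "sweet": False},
]

FEATURES = ("color", "size")

def train(rows, features, target):
    """Learn the majority outcome for every combination of feature values seen."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        votes[key][row[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def predict(model, features, item, default=False):
    """Look up the learned outcome; fall back to `default` for unseen combinations."""
    return model.get(tuple(item[f] for f in features), default)

model_a = train(FAVORITE_VENDOR, FEATURES, "sweet")
model_b = train(OTHER_VENDOR, FEATURES, "sweet")

big_bright = {"color": "bright yellow", "size": "big"}
small_pale = {"color": "pale yellow", "size": "small"}
print(sweet_by_hand(big_bright), predict(model_a, FEATURES, big_bright))
print(predict(model_a, FEATURES, small_pale), predict(model_b, FEATURES, small_pale))
```

The point of the sketch is that switching vendors only requires calling `train` on a new table, whereas `sweet_by_hand` would have to be rewritten.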
"Computers have promised us a fountain of wisdom but delivered a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus

"Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?"
T. S. Eliot

What is NOT data mining?

Data Mining, noun: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything"
Jeff Jonas, IBM

"An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results"
W.S. Brown, "Introducing Econometrics"

"A buzz word for what used to be known as DBMS reports"
An Anonymous Data Mining Skeptic

• Data is abundant, and information (knowledge) is needed to support decision-making when building business solutions
• Transaction data is available in large volumes
• The importance of information has given rise to data warehouses, which integrate information from distributed systems to support decision-making
• Information technology is affordable and can be adopted widely
Related fields
• Database
• Information science
• High performance computing
• Visualization
• Machine learning
• Statistics
• Artificial neural networks
• Mathematical modeling
• Information retrieval
• Pattern recognition

Applications of Data Mining

Market analysis and management
- Identifying target markets
- Tracking customers' buying patterns over time
- Cross-market analysis
- Customer profiling
- Identifying customer needs
- Assessing customer loyalty
- Summary information

Corporate analysis and risk management
- Financial planning and asset evaluation
- Resource planning
- Monitoring competition

Telecommunications
- Examining millions of incoming transactions with the aim of adding automated services

Finance
- Detecting suspicious financial transactions that would be hard to find with standard analysis

Insurance
- Used by the Australian Health Insurance Commission to identify health services, saving one million dollars per year

Sports
- Used by IBM Advanced Scout to analyze NBA game statistics in order to give the New York Knicks a competitive advantage

Internet Web Surf-Aid
- Used by IBM Surf-Aid to record Web page accesses, particularly in connection with marketing through the web

Area: DBMS, OLAP, Data Mining

Task
- DBMS: Extraction of detailed and summary data
- OLAP: Summaries, trends and forecasts
- Data Mining: Knowledge discovery of hidden patterns and insights

Type of result
- DBMS: Information
- OLAP: Analysis
- Data Mining: Insight and prediction

Method
- DBMS: Deduction (ask the question, verify with data)
- OLAP: Multidimensional data modeling, aggregation, statistics
- Data Mining: Induction (build the model, apply it to new data, get the result)

Example question
- DBMS: Who purchased mutual funds in the last 3 years?
- OLAP: What is the average income of mutual fund buyers by region by year?
- Data Mining: Who will buy a mutual fund in the next 6 months and why?
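The first two example questions in the comparison above can be sketched on a toy data set. Everything here is an assumption made for illustration: the records in `PURCHASES` are invented, and the function names are hypothetical. The third question (who will buy next, and why) needs a model induced from the data, which this sketch does not attempt.

```python
from collections import defaultdict

# Invented purchase records; none of these values come from a real data set.
PURCHASES = [
    {"customer": "A", "region": "North", "year": 2015, "income": 50000, "product": "mutual fund"},
    {"customer": "B", "region": "North", "year": 2015, "income": 70000, "product": "mutual fund"},
    {"customer": "C", "region": "South", "year": 2016, "income": 40000, "product": "mutual fund"},
    {"customer": "D", "region": "South", "year": 2016, "income": 90000, "product": "savings account"},
    {"customer": "E", "region": "North", "year": 2013, "income": 60000, "product": "mutual fund"},
]

def buyers_since(records, product, first_year):
    """DBMS-style question: who purchased `product` in or after `first_year`?"""
    return sorted(
        {r["customer"] for r in records
         if r["product"] == product and r["year"] >= first_year}
    )

def average_income_by_region_year(records, product):
    """OLAP-style question: average buyer income grouped by (region, year)."""
    totals = defaultdict(lambda: [0, 0])  # (region, year) -> [income sum, buyer count]
    for r in records:
        if r["product"] == product:
            cell = totals[(r["region"], r["year"])]
            cell[0] += r["income"]
            cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

print(buyers_since(PURCHASES, "mutual fund", 2014))
print(average_income_by_region_year(PURCHASES, "mutual fund"))
```

The first function simply filters stored facts, while the second aggregates along two dimensions; neither says anything about customers who have not bought yet, which is where data mining comes in.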
Example of DBMS, OLAP and Data Mining: Weather data

Assume we have made a record of the weather conditions during a two-week period, along with the decisions of a tennis player whether or not to play tennis on each particular day. Thus we have generated tuples (or examples, instances) consisting of values of four independent variables (outlook, temperature, humidity, windy) and one dependent variable (play). See the textbook for a detailed description.

Decision Tree for PlayTennis

Attributes and their values:
• Outlook: Sunny, Overcast, Rain
• Humidity: High, Normal
• Wind: Strong, Weak
• Temperature: Hot, Mild, Cool
Target concept - Play Tennis: Yes, No

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

Classifying a new instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak. PlayTennis? Following Outlook = Sunny and then Humidity = High, the tree answers No.

Good Luck!
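The PlayTennis tree above can be written down and walked in a few lines of Python. This is a minimal sketch: the nested-dictionary layout and the names `TREE` and `classify` are choices made here for illustration, and the tree is typed in by hand from the slide rather than learned from the weather data.

```python
# The PlayTennis tree from the slide as nested dictionaries.
# Internal nodes name the attribute they test; leaves are plain "Yes"/"No" strings.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}

def classify(tree, instance):
    """Follow the branch matching the instance's attribute value until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node

# The instance from the slide; Temperature is present but never tested by this tree.
query = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(TREE, query))  # No
```

Note that the tree ignores Temperature entirely: only the attributes that appear in internal nodes affect the prediction.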