DATA MINING CONCEPTS
INTRODUCTION
Week 1
Monday, 27 Feb 2017 - 3KA13 (E311)
Elfitrin Syahrul
• Background of data mining (DM)
• Definition of data mining
• The need for DM
• Fields related to DM
• Databases and DM
• Applications of DM
• Tools used
Data mining?
• Data is being collected and warehoused in many databases
• Web data, e-commerce
• Purchases at supermarkets (department/grocery stores)
• Bank transactions such as credit card payments
• Computers: affordable prices and good performance
• Increasingly strong competitive pressure
• Better service (customized services for an edge, e.g. in customer relationship management)
• Data is stored at rates of GB/hour
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
- in classifying and segmenting data
- in hypothesis formation
• Hidden information
• Manual analysis takes time
• Much of the data is never processed
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Ref: Tan, Steinbach, Kumar, Introduction to Data Mining, 2004
1. Data mining is the exploration and analysis of large amounts of data to discover meaningful patterns and rules (Berry & Linoff, 2004: 7).
2. Data mining is a technique for analyzing large data sets to find unsuspected relationships that are useful to the data owner (Hand, 2001: 1).
3. Data mining is the process of discovering patterns and relationships in data (Hornick, 2007: 6).
4. Data mining is an automatic or semi-automatic process for discovering new and potentially useful information (knowledge) from a collection of data (Tang & Jamie, 2005: 2).
How do you explain Data Mining to non-Computer Science people?
Mango Shopping
Suppose you go shopping for mangoes one day. The vendor has laid out a cart full of mangoes. You can handpick the mangoes, the vendor will
weigh them, and you pay according to a fixed Rs per kg rate (typical story in India).
Obviously, you want to pick the sweetest, most ripe mangoes for yourself (since you are paying by weight and not by quality). How do you
choose the mangoes?
You remember your grandmother saying that bright yellow mangoes are sweeter than pale yellow ones. So you make a simple rule: pick only from
the bright yellow mangoes. You check the color of the mangoes, pick the bright yellow ones, pay up, and return home. Happy ending?
Not quite.
Life is complicated
Suppose you go home and taste the mangoes. Some of them are not as sweet as you'd like. You are worried. Apparently, your grandmother's wisdom
is insufficient. There is more to mangoes than just color.
After a lot of pondering (and tasting different types of mangoes), you conclude that the bigger, bright yellow mangoes are guaranteed to be sweet,
while the smaller, bright yellow mangoes are sweet only half the time (i.e. if you buy 100 bright yellow mangoes, out of which 50 are big in size
and 50 are small, then the 50 big mangoes will all be sweet, while out of the 50 small ones, on average only 25 mangoes will turn out to be
sweet).
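The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the numbers quoted above; the variable names are illustrative.

```python
# Numbers taken from the example: 100 bright yellow mangoes,
# 50 big (all sweet) and 50 small (sweet half the time on average).
n_big, n_small = 50, 50
p_sweet_given_big = 1.0
p_sweet_given_small = 0.5

# Expected number of sweet mangoes in the whole purchase.
expected_sweet = n_big * p_sweet_given_big + n_small * p_sweet_given_small

# Overall chance that a bright yellow mango picked at random is sweet.
p_sweet = expected_sweet / (n_big + n_small)

print(expected_sweet)  # 75.0
print(p_sweet)         # 0.75
```

So the colour rule alone is right about three times out of four, and adding size to the rule is what separates the sure cases from the coin flips.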
You are happy with your findings, and you keep them in mind the next time you go mango shopping. But next time at the market, you see that
your favorite vendor has gone out of town. You decide to buy from a different vendor, who supplies mangoes grown in a different part of the
country. Now, you realize that the rule which you had learnt (that big, bright yellow mangoes are the sweetest) is no longer applicable. You have to
learn from scratch. You taste a mango of each kind from this vendor, and realize that the small, pale yellow ones are in fact the sweetest of all.
Now, a distant cousin visits you from another city. You decide to treat her to mangoes. But she mentions that she doesn't care about the
sweetness of a mango, she only wants the juiciest ones. Once again, you run your experiments, tasting all kinds of mangoes, and realize that
the softer ones are juicier.
Now, you move to a different part of the world. Here, mangoes taste surprisingly different from those in your home country. You realize that the green
mangoes are in fact tastier than the yellow ones.
You marry someone who hates mangoes. She loves apples instead. You go apple shopping. Now, all your accumulated knowledge about mangoes is
worthless. You have to learn everything about the correlation between the physical characteristics and the taste of apples, by the same method of
experimentation. You do it, because you love her.
Enter computer programs
Now, imagine that all this while, you were writing a computer program to help you choose
your mangoes (or apples). You would write rules of the following kind:
if (color is bright yellow and size is big and sold by favorite vendor): mango is sweet.
if (soft): mango is juicy.
etc.
You would use these rules to choose the mangoes. You could even send your younger brother
with this list of rules to buy the mangoes, and you would be assured that he will pick only the
mangoes of your choice.
But every time you make a new observation from your experiments, you have to manually
modify the list of rules. You have to understand the intricate details of all the factors affecting
the quality of mangoes. If the problem gets complicated enough, it can get really difficult to
make accurate rules by hand that cover all possible types of mangoes. Your research could earn
you a PhD in Mango Science (if there is one).
But not everyone has that kind of time.
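As a rough illustration, the hand-written rules above might look like this in Python. This is only a sketch: the dictionary keys ("color", "size", "vendor", "firmness") and the sample cart are assumptions made up for the example, not part of the original story.

```python
# Hand-written rules from the mango example.
# The field names and the sample cart are illustrative assumptions.

def is_sweet(mango):
    """Rule: bright yellow, big, and sold by the favorite vendor."""
    return (
        mango["color"] == "bright yellow"
        and mango["size"] == "big"
        and mango["vendor"] == "favorite"
    )

def is_juicy(mango):
    """Rule: soft mangoes are juicy."""
    return mango["firmness"] == "soft"

cart = [
    {"color": "bright yellow", "size": "big", "vendor": "favorite", "firmness": "soft"},
    {"color": "pale yellow", "size": "small", "vendor": "other", "firmness": "firm"},
]

# Keep only the mangoes that satisfy the sweetness rule.
picked = [mango for mango in cart if is_sweet(mango)]
print(len(picked))  # 1
```

Every new observation (a new vendor, a new country, apples instead of mangoes) means editing these functions by hand, which is the maintenance burden the story describes.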

Enter Machine Learning algorithms
ML algorithms are an evolution over normal algorithms. They make your programs "smarter", by allowing them to
automatically learn from the data you provide.
You take a randomly selected sample of mangoes from the market (training data), make a table of all the physical
characteristics of each mango, like color, size, shape, grown in which part of the country, sold by which vendor, etc.
(features), along with the sweetness, juiciness, ripeness of that mango (output variables). You feed this data to the
machine learning algorithm (classification/regression), and it learns a model of the correlation between an
average mango's physical characteristics, and its quality.
Next time you go to the market, you measure the characteristics of the mangoes on sale (test data), and feed it to the
ML algorithm. It will use the model computed earlier to predict which mangoes are sweet, ripe and/or juicy. The
algorithm may internally use rules similar to the rules you manually wrote earlier (e.g. a decision tree), or it may
use something more involved, but you don't need to worry about that, to a large extent.
Voila, you can now shop for mangoes with great confidence, without worrying about the details of how to choose the
best mangoes. And what's more, you can make your algorithm improve over time (reinforcement learning), so
that it will improve its accuracy as it reads more training data, and modifies itself when it makes a wrong prediction.
But the best part is, you can use the same algorithm to train different models, one each for predicting the quality of
apples, oranges, bananas, grapes, cherries and watermelons, and keep all your loved ones happy :)
And that is Machine Learning for you. Tell me if it isn't cool.
Machine Learning: Making your algorithms smart, so that you don't need to be. ;)
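The train-then-predict workflow described above can be sketched in plain Python. The learner below is deliberately naive: it just remembers the majority label seen for each combination of feature values, as a stand-in for a real classification algorithm such as a decision tree. The feature tuples and labels are made-up training data, not measurements from the story.

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn the majority label for each combination of feature values."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    return {features: counts.most_common(1)[0][0] for features, counts in votes.items()}

def predict(model, features, default="unknown"):
    """Look up the learned label; fall back to a default for unseen combinations."""
    return model.get(features, default)

# Hypothetical training data: (color, size) -> tasted sweetness.
training_data = [
    (("bright yellow", "big"), "sweet"),
    (("bright yellow", "big"), "sweet"),
    (("bright yellow", "small"), "sweet"),
    (("bright yellow", "small"), "not sweet"),
    (("bright yellow", "small"), "not sweet"),
    (("pale yellow", "big"), "not sweet"),
]

model = train(training_data)
print(predict(model, ("bright yellow", "big")))    # sweet
print(predict(model, ("bright yellow", "small")))  # not sweet
print(predict(model, ("green", "small")))          # unknown
```

The same `train` function could be fed a table of apples instead of mangoes to obtain a different model, which is the point of the story: the rules come from the data, not from the programmer.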
Data Mining
Data mining is about explaining the
past and predicting the future by
means of data analysis.
"Drowning in Data yet Starving for Knowledge"
???
"Computers have promised us a fountain of wisdom but delivered a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
“Where is the wisdom we have lost in knowledge? Where is the knowledge we have
lost in information?”
T. S. Eliot
What is NOT data mining?
Data Mining, noun: "Torturing data until it confesses ... and if you torture it
enough, it will confess to anything"
Jeff Jonas, IBM
"An Unethical Econometric practice of massaging and manipulating the data to
obtain the desired results"
W.S. Brown “Introducing Econometrics”
"A buzz word for what used to be known as DBMS reports"
An Anonymous Data Mining Skeptic
The need for data mining
• Data is abundant, and information (knowledge) is needed to support decision making and to build business solutions
• Transaction data is available in large volumes
• The importance of information has given rise to data warehouses that integrate information from distributed systems to support decision making
• Information technology has become affordable and can be adopted widely
Related fields
• Database
• Information science
• High performance computing
• Visualization
• Machine learning
• Statistics
• Artificial neural networks
• Mathematical modeling
• Information retrieval
• Pattern recognition
Applications of Data Mining
• Market analysis and management
- Identifying target markets
- Tracking customers' purchasing patterns over time
- Cross-market analysis
- Customer profiling
- Identifying customer needs
- Assessing customer loyalty
- Summary information
• Corporate analysis and risk management
- Financial planning and asset evaluation
- Resource planning
- Monitoring the competition
• Telecommunications
- Examining millions of incoming transactions with the aim of adding automated services
• Finance
- Detecting suspicious financial transactions that would be hard to find with standard analysis
• Insurance
- Used by the Australian Health Insurance Commission to identify health services, saving one million dollars per year
• Sports
- Used by IBM Advanced Scout to analyze NBA game statistics to give the New York Knicks a competitive advantage
• Internet (Web Surf-Aid)
- Used by IBM Surf-Aid to record web page accesses, particularly in relation to marketing through the web

Area | DBMS | OLAP | Data Mining
Task | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights
Type of result | Information | Analysis | Insight and Prediction
Method | Deduction (ask the question, verify with data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why?
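The first two example questions in the comparison above can be sketched with Python's built-in sqlite3 module. Everything about the data is an assumption for illustration: the `purchases` table, its columns, the four rows, and the choice of 2017 as the current year.

```python
import sqlite3

# Hypothetical table: one row per mutual fund purchase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, region TEXT, year INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("Ana", "East", 2015, 40000.0),
        ("Budi", "East", 2015, 60000.0),
        ("Citra", "West", 2016, 50000.0),
        ("Dewi", "West", 2012, 70000.0),
    ],
)

# DBMS-style question (deduction): who purchased mutual funds in the last 3 years?
current_year = 2017  # assumed for the example
recent_buyers = [
    row[0]
    for row in conn.execute(
        "SELECT customer FROM purchases WHERE year > ? ORDER BY customer",
        (current_year - 3,),
    )
]

# OLAP-style question (aggregation): average income of buyers by region by year.
avg_income = conn.execute(
    "SELECT region, year, AVG(income) FROM purchases "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(recent_buyers)  # ['Ana', 'Budi', 'Citra']
print(avg_income)     # [('East', 2015, 50000.0), ('West', 2012, 70000.0), ('West', 2016, 50000.0)]
```

The data mining question (who will buy in the next 6 months, and why) cannot be answered by a query alone: it requires building a model from past purchases and applying it to new customers, which is the induction step in the comparison.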
Example of DBMS, OLAP and Data Mining: Weather data
Assume we have made a record of the weather conditions
during a two-week period, along with the decisions of a tennis
player whether or not to play tennis on each particular day.
Thus we have generated tuples (or examples, instances)
consisting of values of four independent variables (outlook,
temperature, humidity, windy) and one dependent variable
(play). See the textbook for a detailed description.
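The slide defers to the textbook for the actual records, so the 14 rows below are an assumption: they are the standard PlayTennis table commonly used with this example (as in Mitchell's Machine Learning textbook), with wind recorded as Strong/Weak. The sketch stores the tuples and counts the play decisions per outlook value.

```python
from collections import Counter

# (outlook, temperature, humidity, wind, play)
# Assumed data: the standard 14-day PlayTennis table.
weather = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

# Count (outlook, play) pairs, an OLAP-style summary of the records.
counts = Counter((outlook, play) for outlook, _, _, _, play in weather)

for outlook in ("Sunny", "Overcast", "Rain"):
    print(outlook, "Yes:", counts[(outlook, "Yes")], "No:", counts[(outlook, "No")])
```

In this table every Overcast day is a Yes, while Sunny and Rain days are mixed, so those two outlooks need a further attribute to be separated.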
Decision Tree for PlayTennis
• Attributes and their values:
- Outlook: Sunny, Overcast, Rain
- Humidity: High, Normal
- Wind: Strong, Weak
- Temperature: Hot, Mild, Cool
• Target concept - Play Tennis: Yes, No
Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast
  Rain
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
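Decision Tree for PlayTennis
Instance to classify:
Outlook | Temperature | Humidity | Wind | PlayTennis
Sunny | Hot | High | Weak | ?
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
PlayTennis = No (path: Outlook = Sunny, then Humidity = High)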
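Classifying an instance with this tree can be sketched in a few lines of Python. The nested tuple/dict representation is just one convenient encoding chosen for the example; the tree itself and the instance are the ones from the slide.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

instance = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, instance))  # No
```

Temperature is part of the instance but never tested, because no node in this tree uses it.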
Good luck!