Data Mining (Veri Madenciliği)

Introduction

Instructor: Cengiz Örencik

E-mail: [email protected]


Course materials:
myweb.sabanciuniv.edu/cengizo/courses

Reference Books
◦ Veri Madenciliği: Kavram ve Algoritmaları, Doç. Dr. Gökhan Silahtaroğlu, 2013
◦ Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, 2010
1
midterm
 2 inclass quiz
 1 final
 HW
?
%30
%20
%50



Fundamental data mining tools / concepts
Classification, clustering, association, and correlation algorithms
Real life examples and implementations


Data preprocessing
Data warehouses
◦ Data from different sources and with different structures → unified schema, residing at a single site
◦ Periodic data summaries

Associations and correlations
◦ Market basket analysis, etc.

Classification and prediction
◦ E.g., is an applicant trustworthy enough for a credit application?

Cluster Analysis
◦ People with similar spending patterns


Text and Web mining
Privacy preserving data mining
◦ Protect personal information

“Necessity is the mother of invention”
Plato

Petabytes of new data are produced continuously
◦ 90% of the world's data was generated over the last two years
◦ Twitter, Facebook, online shopping, MOBESE cameras, etc.

Easy to access and store data
 E.g., customer voice records
 Web crawlers
 E.g., tweets that contain the terms “election” and “party”
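
The tweet example above amounts to a keyword filter over collected text. A minimal sketch, assuming the tweets have already been gathered into a list (the sample tweets and the helper name `contains_all_terms` are made up for illustration):

```python
import re

# Hypothetical sample tweets; in practice these would come from a crawler.
tweets = [
    "The election results surprised every party leader",
    "Great party last night!",
    "Election day is tomorrow",
    "Which party will win the election?",
]

def contains_all_terms(text, terms):
    """Return True if every term occurs as a whole word in text (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term.lower() in words for term in terms)

# Keep only tweets mentioning both terms.
matches = [t for t in tweets if contains_all_terms(t, ["election", "party"])]
```

Collecting such tweets is the easy part; the result is still raw data, not knowledge.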

The hard part is getting knowledge from the data
Data mining is extracting non-trivial (previously unknown) and valid knowledge from large amounts of data that can be used in decision making
 Non-trivial

◦ Huge cost just to obtain predictable information
◦ The goal is not to prove something you already know
 Diaper – beer correlation

Large data

Decision making
◦ Validity

Databases
 Query
◦ Suitable
◦ SQL – relational DB
 Data
 Output
◦ Known
◦ Subset of data

Data Mining
 Query
◦ Not suitable
◦ No common language
 Data
◦ Static
◦ Dynamic
 Output
◦ Not known
◦ Not subset of data

Database queries
◦ List the people who keep a boat at Kalamış Marina and are named “Ahmet”
◦ Credit card owners under 30 who spend more than 5000 TL/month
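
A database query like the second one is a precise filter whose answer is a subset of the stored records. A minimal sketch in Python rather than SQL, assuming a small in-memory table (the `CardOwner` fields and the sample rows are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class CardOwner:
    name: str
    age: int
    monthly_spending_tl: float

# Hypothetical records standing in for a relational table.
owners = [
    CardOwner("Ahmet", 27, 6200.0),
    CardOwner("Elif", 34, 7500.0),
    CardOwner("Deniz", 25, 3100.0),
    CardOwner("Mert", 29, 5400.0),
]

def young_high_spenders(records, max_age=30, min_spending=5000.0):
    """Return owners younger than max_age who spend more than min_spending per month."""
    return [r for r in records
            if r.age < max_age and r.monthly_spending_tl > min_spending]

result = young_high_spenders(owners)
names = [r.name for r in result]
```

Every returned row already exists in the table, which is what distinguishes this kind of query from a data mining query.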

Data mining queries
◦ Credit applications with low risk (classification)
◦ Card owners with similar buying patterns (clustering)
◦ Products purchased together with PS4 games (association rules)
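
The association-rule query can be sketched with the two standard rule measures, support and confidence, which the slides do not name. The baskets below are invented, and `rules_from_item` is a hypothetical helper that only considers rules with a single item on each side:

```python
from collections import Counter

# Hypothetical shopping baskets (one set of products per purchase).
baskets = [
    {"PS4 game", "controller", "gift card"},
    {"PS4 game", "controller"},
    {"PS4 game", "headset"},
    {"controller", "headset"},
    {"PS4 game", "controller", "headset"},
]

def rules_from_item(baskets, antecedent):
    """For each item bought together with `antecedent`, return (support, confidence).

    support    = share of all baskets containing both items
    confidence = share of baskets with `antecedent` that also contain the item
    """
    with_item = [b for b in baskets if antecedent in b]
    if not with_item:
        return {}
    counts = Counter(item for b in with_item for item in b if item != antecedent)
    return {
        item: (together / len(baskets), together / len(with_item))
        for item, together in counts.items()
    }

rules = rules_from_item(baskets, "PS4 game")
```

Unlike a database query, the output (pairs of products with their scores) is not a subset of the stored rows.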
[Figure: the knowledge discovery process. Databases → Cleaning → Data Warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Presentation → Knowledge]
[Figure: layered pyramid; the potential to support business decisions increases from bottom to top]
◦ Decision Making (End User)
◦ Data Presentation: Visualization Techniques (Business Analyst)
◦ Data Mining: Information Discovery (Data Analyst)
◦ Data Exploration: Statistical Summary, Querying, and Reporting
◦ Data Preprocessing/Integration, Data Warehouses
◦ Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

Market analysis
◦ Target audience, customer relations

Risk analysis
◦ Resource management, monitoring competing enterprises

Fraud detection
◦ Insurance, banking
◦ Modeling using historical data

Document similarity
◦ Plagiarism detection
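
One common way to score document similarity for plagiarism checks is Jaccard similarity over word shingles. The slides do not say which measure the course uses, so this is only one possible sketch, with made-up sample texts:

```python
import re

def word_shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical documents.
original = "data mining extracts non-trivial knowledge from large amounts of data"
suspect = "data mining extracts non-trivial knowledge from huge databases"
unrelated = "the weather in Istanbul was sunny for most of the week"

score_suspect = jaccard(word_shingles(original), word_shingles(suspect))
score_unrelated = jaccard(word_shingles(original), word_shingles(unrelated))
```

A higher score means more shared phrasing; what threshold counts as plagiarism is a judgment call and is not fixed by the measure itself.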


Goal: fit the data to a model
Predictive mining
◦ Classify people who may not make their mortgage payments
◦ Predict people who will leave your company for another
◦ Predict the stock exchange (borsa)
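
Predictive mining of the mortgage kind can be sketched with a k-nearest-neighbours classifier, chosen here only because it is short; the slides do not prescribe an algorithm. The features, labels, and records are invented, and the features are left unscaled for brevity, so income dominates the distance:

```python
import math
from collections import Counter

# Hypothetical history: (monthly income in thousand TL, debt-to-income ratio) -> outcome.
history = [
    ((12.0, 0.20), "pays"),
    ((15.0, 0.25), "pays"),
    ((10.0, 0.30), "pays"),
    ((4.0, 0.70), "defaults"),
    ((5.0, 0.65), "defaults"),
    ((3.5, 0.80), "defaults"),
]

def knn_predict(history, point, k=3):
    """Predict the label of `point` by majority vote among its k nearest records."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

safe_applicant = knn_predict(history, (11.0, 0.28))
risky_applicant = knn_predict(history, (4.5, 0.75))
```

The model is built from labelled historical data and then applied to new, unlabelled cases, which is what makes the task predictive.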

Descriptive mining
◦ Shows hidden information
◦ Shows your best customers
◦ Which products sell together
◦ Which customers have similar shopping trends
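
Finding customers with similar shopping trends is a clustering task. A minimal k-means sketch, assuming two invented features per customer and hand-picked starting centroids (real use would scale the features and choose the starting points more carefully):

```python
import math

def kmeans(points, centroids, max_iter=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute
    centroids as cluster means, until the assignments stop changing."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical customers: (store visits per month, average basket in TL).
customers = [(2, 100), (3, 120), (2, 90), (10, 800), (12, 850), (11, 780)]
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
```

No labels are given in advance; the groups emerge from the data, which is what makes the task descriptive.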

Classification [Predictive]

Clustering [Descriptive]

Association Rules [Descriptive]