Download What Is Data Mining?

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining MTAT.03.183
(4AP = 6EAP)
(4AP
6EAP)
Introduction
Jaak Vilo
Jaak
Vilo
2009 Fall
Lecturer
•
•
•
•
•
1986‐1991 U Tartu
1991‐1999
1991
1999 U Helsinki U Helsinki (sequence pattern discovery)
1999‐2002 EMBL‐EBI, UK (bioinformatics)
2002‐ EGeen ‐> Quretec (Biobank and Data Mgmnt)
U Tartu professor (Bioinformatics) 2007
U Tartu, professor (Bioinformatics) 2007
– EXCS – Center of Excellence
– STACC – Software Technologies and Applications
Competence Center (Tarkvara TAK)
– research projects
Jaak Vilo and other authors
UT: Data Mining 2009
2
Students
•
•
•
•
•
•
>80 registered
Estonian vs Foreign
Estonian vs
MSc 1st y / 2nd y ? BSc , PhD ? Non IT/CS ? IT/CS ?
Why this class? Expectations? (ESSCaSS’08,09…)
Jaak Vilo and other authors
UT: Data Mining 2009
3
Course
•
•
•
•
•
http://courses.cs.ut.ee/2009/dm/
List: [email protected]
List: [email protected]
Lectures: 10:15, Liivi 2‐403 Seminars: 12:15, Liivi 2‐403 Prof Jaak Vilo vilo@ut ee
Prof. Jaak Vilo [email protected]
• http://www.quicktopic.com/43/H/eWqhydvFpUN
• Other? Skype
O h ? Sk
?
? Jaak Vilo and other authors
UT: Data Mining 2009
4
Seminars
• Three types:
1. Homework: presentations/discussions
p
/
2. Guest lectures, visitors
3 Practical labs/training (no concrete
3.
(no concrete plans yet)
• Participation is obligatory (>75%)
Jaak Vilo and other authors
UT: Data Mining 2009
5
Grading requirements
•
•
•
•
•
•
Participation! >75% of
Participation!
>75% of seminars
Homeworks (30%) (min 50% of assignments)
Projects/essays (30%) Exam (40%)
Total: 100% + thresholds
All deadlines are stringent.
Jaak Vilo and other authors
UT: Data Mining 2009
6
Homework
• Tasks/assignments
– 5 tasks/week + possibly bonuses
– About in every 2 weeks (irregular)
• Report/mark
p /
all completed
p
tasks
– written reports on tasks
– ready to present fully
present fully to class
– there will be some uploading system
– and/or paper sheets in class
• Deadline always before class start (Thu, 12:15)
start (Thu, 12:15)
Jaak Vilo and other authors
UT: Data Mining 2009
7
4AP = 6EAP
4AP = 6EAP
• 4 weeks (4x40h=160h) of intensive work
– assumingg basic knowledge
g of BSc material
• 1/3 in
1/3 in class
• 1/3 reading, homeworks
• 1/3 projects, writing, … Jaak Vilo and other authors
UT: Data Mining 2009
8
What is Data Mining?
• Data ‐> Information, Knowledge, Insight
– new, interesting, nontrivial, useful
,
g,
,
…
• Data size ‐> Algorithmic challenge
• Predictive, useful
Predictive, useful ‐> theoretical
theoretical and and
economical challenge
• Why? By
y yp
practical demand and need…
Jaak Vilo and other authors
UT: Data Mining 2009
9
Textbooks
•
•
•
•
•
Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The b
d
h
d d
( h
Morgan Kaufmann Series in Data Management Systems) Google Books
web
Chakrabarti et al. Data Mining: know it all. Morgan Kaufmann 2008 @ELsevier @AMazon @Google
Bramer: Principles of Data Mining (Springer 2007) @Amazon @Springer
Bramer: Principles of Data Mining (Springer, 2007) @Amazon
@Google
David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining (MIT Press, 2001) @MIT Press @Google
Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining Inference and Prediction (Springer
Statistical Learning: Data Mining, Inference, and Prediction. (Springer 2009) @Tibshirani @Amazon
Jaak Vilo and other authors
UT: Data Mining 2009
10
• Han, Kamber: Data Mining: Concepts and T h i
Techniques, Second
S
d Edition
Edi i (The
(Th Morgan
M
Kaufmann Series in Data Management
Systems)
• TOC: TOC: http://www.cs.uiuc.edu/homes/hanj/bk2/toc.pdf
http://www cs uiuc edu/homes/hanj/bk2/toc pdf
Jaak Vilo and other authors
UT: Data Mining 2009
11
Jaak Vilo and other authors
UT: Data Mining 2009
12
What’ss it all about?
What
it all about?
Data
DB
Jaak Vilo and other authors
UT: Data Mining 2009
13
•
•
•
•
•
•
•
•
Statistics
Patterns in data
Learning
Classification
Knowledge / Information
/ Information / /
Algorithms
Prediction
…
Jaak Vilo and other authors
UT: Data Mining 2009
14
Sources of data (growth)
Sources of data (growth)
•
•
•
•
•
•
•
•
•
devices
net/web
logs
transactional db
consumer
multimedia(!)
science
cheaper storage, compute power
…
Jaak Vilo and other authors
UT: Data Mining 2009
15
Why
y Data Mining?
g

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools,, database systems,
y
, Web,,
computerized society

Major
j sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data,
data but starving for knowledge!

“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
16
Evolution of Sciences

Before 1600, empirical science

1600-1950s theoretical science
1600-1950s,


1950s 1990s computational
1950s-1990s,
comp tational science



Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now, data science

The flood of data from new scientific instruments and simulations

The ability to economically store and manage petabytes of data online

The Internet and computing
p
g Grid that makes all these archives universallyy accessible


Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
17
Evolution of Database Technology
gy

1960s:


1970s:



Relational data model,
model relational DBMS implementation
1980s:

RDBMS, advanced data models ((extended-relational, OO, deductive, etc.))

Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:


Data collection,
ll i
database
d b
creation,
i
IMS
S and
d networkk DBMS
S
Data mining, data warehousing, multimedia databases, and Web
databases
2000s

Stream data management and mining

Data mining and its applications

Web technology (XML, data integration) and global information systems
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
18
examples from Machine Learning
• 1950’ies – checkers (Arthur Samuels 1959)
• 1960
1960’ies
ies – NN
NN – perceptron and it
and it’ss limitations
• 1970’ies – expert systems, decision trees
(ID3), …
(ID3)
• 1980’ies – Neural Networks, PAC learning, …
,
g,
• 1990’ies – Data mining, ILP, Ensembles
• 2000’ – SVM, Kernels, Graphical Models, …
Jaak Vilo and other authors
UT: Data Mining 2009
19
Chapter 1. Introduction

Why Data Mining?

What Is Data Mining?

A Multi-Dimensional View of Data Mining

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

Data Mining: On What Kind of data?

Time and
Ti
d Ordering:
Od i
Sequential
S
ti l Pattern,
P tt
Trend
T d and
d Evolution
E l ti
Analysis

Structure and Network Analysis

Evaluation of Knowledge

Applications
pp
of Data Mining
g

Major Challenges in Data Mining

A Brief History of Data Mining and Data Mining Society

Summary
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
20
What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously
u
unknown
o
and
a
d po
potentially
a y useful)
u u ) pa
patterns or
o knowledge
o dg from
o
huge amount of data


Alternative names


Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
W t h out:
Watch
t IIs everything
thi “d
“data
t mining”?
i i ”?

Simple search and query processing

(D d ti ) expertt systems
(Deductive)
t
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
21
Knowledge
g Discovery
y ((KDD)) Process


This is a view from typical
database systems and data
Pattern Evaluation
warehousing communities
Data mining plays an essential
l in
i the
th knowledge
k
l d discovery
di
role
Data Mining
process
T k l
Task-relevant
t Data
D t
Selection
D t W
h
Data
Warehouse
Data Cleaning
Data Integration
Databases
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
22
Example: A Web Mining Framework

Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining
g

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into
knowledge-base
knowledge
base
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
23
Data Mining
g in Business Intelligence
g
Increasing potential
to support
business decisions
Decision
M ki
Making
Data Presentation
Visualization Techniques
End User
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
y
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
Jiawei Han, Micheline
Kamber, and Jian Pei
Data Mining: Concepts and Techniques
DBA
24
Collaborative filtering
Collaborative filtering
– Amazon, Netflicks
• Collaborative filtering systems usually take two steps:
– Look for users who share the same rating patterns with the active user (the user whom the prediction is for).
– Use the ratings from those like‐minded users found in step 1 to calculate a prediction for the active user
Jaak Vilo and other authors
UT: Data Mining 2009
25
Jaak Vilo and other authors
UT: Data Mining 2009
26
Netflix prize
http://www.netflixprize.com/
• http://en.wikipedia.org/wiki/Netflix_Prize
18K movies
i
480K
customers
~ 100M ratings
????
Jaak Vilo and other authors
UT: Data Mining 2009
Test on 2.8M witheld ratings
27
Social network
Social network
• Graph of connections
• Social network
mining
Jaak Vilo and other authors
UT: Data Mining 2009
28
Web
• Interlinked web sites and pages
• Directed Graph of links
• Information Retrieval, PageRank
Retrieval PageRank
• Web mining
Jaak Vilo and other authors
UT: Data Mining 2009
29
Web usage mining
Web usage mining
• Software and web usage logs
• Typical use patterns
• User groups, their
groups their preferences, behavior
preferences behavior
• Can you predict their goals and help to achieve
them?
– distributed online transactions, queries, … (Google, etc)
Jaak Vilo and other authors
UT: Data Mining 2009
30
Biomedical data mining
Biomedical data mining
• Analyse:
– DNA, ,
– Genotype information
– disease histories
– find associated genes
– predict and classify diseases and outcomes
– discover “how biology
gy works”
–…
Jaak Vilo and other authors
UT: Data Mining 2009
31
Combinatorial Data Mining Algorithms
Combinatorial
Data Mining Algorithms
(research seminar, Sven Laur, PhD)
Basics ideas and techniques
– How to find frequent sets in databases
H t fi d f
t t i d t b
– How to find frequent motifs in sequences
Algorithmic problems
– Depth‐first vs breath first search
– How to avoid combinatorial explosion
p
Interpretation of results
– Which patterns are important enough?
Combinatorial Data Mining Algorithms
(research seminar, Sven Laur, PhD.)
Other important aspects
– How to handle noisy data
– Random sampling vs linear scan
Applications and extensions
Applications and extensions
– Association rules in practice – Log analysis. Episode rules and usability
Log analysis Episode rules and usability
– Graph mining and biochemistry
Combinatorial Data Mining Algorithms
(research seminar, Sven Laur, PhD) Administrative details
•
•
•
•
•
•
•
Combinatorial Data Mining Algorithms
C
bi t i l D t Mi i Al ith
Gives 3 EAP (2 old AP)
Takes place on Wednesdays in L122
First seminar is on 16th of September
Each participant has to give a presentation Project work is combined with DM course
http://courses.cs.ut.ee/2009/fast‐counting/
Research at U Tartu
Research at U Tartu • BIIT – http://biit.cs.ut.ee/
• STACC – Software Technologies and A li i
Applications Competence Center
C
C
– companies and universities
– Skype, Regio, Delfi, Quretec, …
– Research problems, topics, scholarships Jaak Vilo and other authors
UT: Data Mining 2009
35
Research topics
Research topics
• Publications =>> Projects, funding
Projects, funding
• Relevant to STACC, companies
• Can lead to job offers 
Jaak Vilo and other authors
UT: Data Mining 2009
36
UT CS department
UT CS department
• Job offers:
• courses.cs.ut.ee ‐ web site development
– UT CS department courses web development
– Other sysdamin
y
and Department
p
development
p
tasks
–…
Jaak Vilo and other authors
UT: Data Mining 2009
37