Download Data Mining lecture - LIACS Data Mining Group

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
LIACS Data Mining course
an introduction
Course Information

Course website:
http://datamining.liacs.nl/DaMi/
(will be updated periodically)

Practical exercises
Course Schedule

Start Sept 12
Last lecture Dec 12 (exam preparation)

Next four weeks: 2 hours later, 15:45 – 17:30


more news later

Leidens Ontzet Oct 3, no course

Exam: Jan 9, 14:00 – 17:00
Course Textbook
Data Mining
Practical Machine Learning Tools and Techniques
third edition, Morgan Kaufmann, ISBN 978-0-12-374856-0
by Ian Witten and Eibe Frank
(EUR 45,95 at Amazon.de)
Introduction Data Mining
an overview and some examples
Data Mining definitions
Data Mining:
the concept of extracting previously unknown and
potentially useful, interesting knowledge from large
sets of data
secondary statistics: analyzing data that wasn’t
originally collected for analysis
Data Mining, the big idea

Organizations collect large amounts of data
Often for administrative purposes
Large body of experience
Learning from experience

Goals







Prediction/forecasting
Diagnostics
Optimization
…
2 Streams

Mining for insight





Understanding a domain
Finding regularities between variables
Interpretable models
Examples: medicine, production, maintenance
‘Black-box’ Mining



Don’t care how you do it, just do it well
Optimization
Examples: marketing, forecasting (financial, weather)
example: Direct Mail
Optimize the response to a mailing, by targeting only
those that are likely to respond:


more response
fewer letters
test mailing
final
mailing
remainder
Customer information
response
3%
Data Mining
Customer information
customer model
response
30%
example: Bioinformatics

Find genes involved in disease (Parkinson’s, Celiac,
Neuroblastoma)
Measurements from patients (1) and controls (0)
Gene expression: measurements of 20k genes
dataset 20,001 x 100

Challenges






many variables
few examples (patients), testing is expensive
interactions between genes
Data Mining paradigms

Classification




Clustering


divide dataset into groups of similar cases
Regression


(binary) class variable
predict class of future cases
most popular paradigm
numeric target variable
Association


find dependencies between variables
basket analysis, …
Classification (decision tree)
Predict the class (often 0/1) of an object on the
basis of examples of other objects (with a class
given).
0.4
Rent
0.2
0.1
Buy
0.07
Other
No
0.64
Age < 35
Yes
0.25
Age ≥ 35
No
0.51
Price < 200K Yes
0.01
Price ≥ 200K No
Applying a classifier (decision tree)
New customer: (House = Rent, Age = 32, …)
prediction = Yes
Age < 35
Yes
Age ≥ 35
No
Rent
Price < 200K Yes
Buy
Price ≥ 200K
Other
No
No
Classification
Age < 35
Yes
Age ≥ 35
No
Rent
Price < 200K Yes
Buy
Price ≥ 200K
Other



Tree makes attribute dependencies explicit
 Class depends on House
The influence of other attributes is less
Dependencies are (often) fuzzy


No
multiple attributes are needed
Perfect predictions are rare
No
Graphical interpretation


dataset with two attributes + 1 class (+/-)
graphical interpretation of decision tree
y
+
+
+
+
+
+
+
+
0
-
+
+
+
-
-
+
-
-
x
x<t
y < t’
xt
y  t’
Graphical interpretation


dataset with two attributes + 1 class (+/-)
other classifiers
Support Vector Machine
y
+
+
+
+
+
+
+
+
0
-
+
+
+
-
-
+
-
-
x
Neural Network
Applications of DM

Marketing







outgoing
incoming
Bioinformatics & Medicine
Fraud detection
Risk management
Insurance
Enterprise resource planning
Break
Data Mining Applications
Training Data Speed Skating


Speed skating team LottoNL-Jumbo
Detailed historic data

training details
• duration
• intensity



competition results
Finding patterns of effective training
Visualise data
Kjeld Nuis




178 races
On average 2.89% above track record
Specialises on 1000 m (2.1%)
2015-2016




Dutch champion 1000 m, 1500 m
WC Distances: bronze 1000 m, silver 1500 m
WC Sprint: ‘silver’
ISU World Cup: gold 1000 m, silver 1500 m
Total sum of load over last 5 days, morning sessions
undesired result
due to over-training
advised upper
limit
InfraWatch: monitoring of infrastructure
Continuous monitoring of a large bridge ‘Hollandse Brug’
 145 sensors
 time-dependent, at frequencies up to 100Hz
 multi-modal (sensor, video, different freq.)
 managing large data quantities, >5 Gb per day
InfraWatch: monitoring of infrastructure





34 geo-phones (vibration sensors)
44 embedded strain-gauges, 47 gauges outside
20 thermometers
video camera
weather station
sensor mining
Maintenance planning at KLM






Routine checks of aircraft
Maintenance requires up to 10k different parts
Ordering parts incurs delay (costs)…
… but so does keeping stock
In theory 10k individual predictions
Input



maintenance history
flight history, Sahara/North Pole
Only few parts predictable
Cashflow Online




Online personal finance overview
All bank transactions are loaded into the application
transactions are classified into different categories
Data Mining predicts category
67 Categories
Gas Water Licht
Onderhoud huis en tuin
Telefoon + Internet + TV
Contributie (sport-)verenigingen
Levensverzekering / Lijfrente
Rente ontvangen
Boodschappen
Hypotheekrente
Naar spaarrekening
Geldopname/chipknip
Verzekeringen overig
Loterijen
Cadeau's
Interne boeking
Vakantie & Recreatie
Uitgaan, hobby's en sport
Creditcard
Ziektekostenverzekering
Brandstof
Woonhuis / Opstalverzekering
Huishouden overig
School- en Studiekosten
Inkomsten overig
Kleding & Schoenen
Lenen
Openbaar vervoer/Taxi
…
Fragmented results:
Boodschappen (groceries)
Contributie
Decision Tree over all categories
false
true
Data Mining at LIACS

Applications






bioinformatics (LUMC)
Sports Analytics (LottoNL-Jumbo, PSV)
Hollandse Brug (Strukton, TU Delft, RWS)
fraud detection at Achmea health insurance and NZa
ChartEx, medieval documents (English, Latin)
Complex data




graphical data (molecules)
relational data (criminal careers)
stream data (sensor-data, click-streams)
…
Related documents