Data Mining
Allan Tucker
School of Information Systems Computing and Mathematics
Brunel University, London. UB8 3PH. UK
The talk
• The Data Explosion
• Data Mining techniques & Application
• Data Mining in the Media
• Some of our work on Biomedical Data
Mining
• Some Caveats
Data historically...
• Preserve of scientists:
Darwin, 1800s
Newton, 1600s
Galton, 1800s
Pearson, 1900s
Database Technology Timeline
1960s:
• Data collection, database creation
1970s:
• Relational data model
• Relational DBMS implementation
1980s:
• Advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s:
• Data Warehousing
• Multimedia and Web databases
• Distributed DW: The Cloud
Data Generation examples
• Data collected from online forms
• Amazon purchases
• Google searches
• Loan requests
• Massively parallel sequencing of biological
data (gene expression)
• Telescopes scanning the skies
The Data Explosion
“We are drowning in information, but starving
for knowledge” John Naisbitt (Futurologist)
Due to the advance of IT and the Internet
• Massive increase in ability to:
• Record: Electronic records and forms, the Internet
• Store: Data Warehouses, the Cloud
• Risk of Information Overload
The Data Explosion
Need to Analyse:
Data Mining,
Machine Learning,
Intelligent Data Analysis,
Knowledge Discovery in Databases,
Bioinformatics
Knowledge
Overlap with Statistics
“Statistics is the science of the collection, organization, and
interpretation of data. It deals with all aspects of this,
including the planning of data collection in terms of the
design of surveys and experiments”, OED
“... statistics, that is, the mathematical treatment of reality ...”
Hannah Arendt
“He uses statistics as a drunken man uses lampposts - for
support rather than for illumination.” Andrew Lang
“There are lies, damned lies, and statistics.”
Benjamin Disraeli
Overlap with Statistics
“DM is the process of extracting patterns from large data sets
by combining methods from statistics and artificial intelligence
with database management.”, Jason Frand, UCLA
• More explorative
• Not always a hypothesis
• Works with Historical Data
• Rarely any experimental design!
• Makes fewer assumptions about the data
Data Mining Process
(or Knowledge Discovery)
Knowledge Discovery in Databases (KDD)
The Process (from Advances in KDD and Data
mining):
Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge
Typical Tasks
Descriptive:
• Clustering (customer profiling)
• Association Rule Mining (basket analysis)
Predictive:
• Classification (medical diagnosis)
• Forecasting (stock forecasts)
• Regression (interpolation / extrapolation)
Clustering (unsupervised learning)
• Looking for data points that are similar
• Depends on how you measure difference
or similarity!
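To make the point about distance measures concrete, here is a minimal sketch in which the same query point has a different nearest neighbour under two common distance functions. The points and names are invented for illustration and are not taken from the talk.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(query, points, distance):
    """Name of the point closest to `query` under the given distance."""
    return min(points, key=lambda name: distance(query, points[name]))

# Hypothetical data: two candidate neighbours for one query point.
points = {"a": (3, 3), "b": (0, 5)}
query = (0, 0)

nearest_euclid = nearest(query, points, euclidean)     # compares sqrt(18) with 5
nearest_manhattan = nearest(query, points, manhattan)  # compares 6 with 5
```

Under the Euclidean distance the query is closest to "a", while under the Manhattan distance it is closest to "b", so a clustering built on one measure can group the points differently from one built on the other.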
Clustering (unsupervised learning)
• Customer Relationship Management
Clustering (unsupervised learning)
• Patients with similar symptoms
Classification (supervised learning)
• Separate Classes with:
A simple model:
Generalisable but biased
or ...
... a complex model:
Good fit but risks overfitting
[Figure: plots illustrating the simple and the complex model]
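The trade-off between a simple and a complex model can be sketched with two toy classifiers on one-dimensional data. Everything here is an assumption made for illustration: the data (including one deliberately mislabelled training point), the threshold rule standing in for the "simple" model, and 1-nearest-neighbour standing in for the "complex" one.

```python
# Hypothetical training data: (x, label). The point (2.5, 1) is label noise.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (2.5, 1), (5.0, 1), (6.0, 1), (7.0, 1)]
test = [(1.5, 0), (2.4, 0), (2.6, 0), (5.5, 1), (6.5, 1)]

def fit_threshold(data):
    """Simple model: midpoint between the two class means.
    Assumes class 0 lies below class 1 on the x axis."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def predict_1nn(data, x):
    """Complex model: copy the label of the closest training point."""
    _, label = min(data, key=lambda pair: abs(pair[0] - x))
    return label

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in data) / len(data)

threshold = fit_threshold(train)
simple = lambda x: 1 if x > threshold else 0
complex_model = lambda x: predict_1nn(train, x)

simple_train, simple_test = accuracy(simple, train), accuracy(simple, test)
complex_train, complex_test = accuracy(complex_model, train), accuracy(complex_model, test)
```

The nearest-neighbour model fits the training data perfectly, noise included, and then misclassifies the test points that fall near the noisy one. The threshold model gets the noisy training point wrong but classifies all of the test points correctly.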
Classification (supervised learning)
• For example,
• Purchases bought online
• Patient data – healthy vs disease
• Agreeing loans
Decision Trees
Credit Rating / Pregnancy Screening examples
Feature Selection
• How data clusters / classifies depends very much
upon selected variables (features)
• What if we select customers based upon their
purchases only? Or also include demographics ?
• We get very different results.
• Feature selection involves automatically
identifying the important variables
Feature Selection
• Clusters plotted with different features
[Figure: three scatter plots of the clusters (Series1, Series2, Series3), each drawn with different features]
• Filter methods score each variable independently
(e.g. chi squared)
• Wrapper approaches model the interactions
between variables
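A filter method can be sketched in a few lines: score each feature against the class label on its own, then rank. The example below computes the chi-squared statistic (without continuity correction) from a contingency table. The feature names and data are hypothetical.

```python
from collections import Counter

def chi_squared(xs, ys):
    """Chi-squared statistic of independence between two categorical columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    x_totals, y_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for x in x_totals:
        for y in y_totals:
            expected = x_totals[x] * y_totals[y] / n
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat

# Hypothetical binary features for eight customers and a binary class label.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "purchases": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the label exactly
    "postcode":  [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the label
}

scores = {name: chi_squared(column, target) for name, column in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because each feature is scored independently, a filter like this cannot detect variables that are only informative in combination, which is the gap wrapper approaches aim to cover.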
Association Rules
• Based upon Basket Analysis
• Supermarkets use this all the time
• Given a large amount of basket data, generate
rules:
<Set of items>
<Set of items>
<Set of items>
If <items> then <items>
(confidence / support)
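Support and confidence for a rule "If <items> then <items>" can be computed directly from basket counts. The sketch below uses an invented set of five baskets; the item names are placeholders.

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets)

def support(baskets, antecedent, consequent):
    """Fraction of all baskets containing both sides of the rule."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / count_containing(baskets, antecedent)

# Hypothetical basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "jam"},
]

# Rule: if {bread} then {butter}
rule_support = support(baskets, {"bread"}, {"butter"})        # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of 4 bread baskets
```

A rule generator would enumerate candidate item sets and keep only rules whose support and confidence exceed chosen thresholds; that search is not shown here.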
Association Rules
• Why do they find this knowledge useful?
• Loyalty Cards
• Shop layout
• Special offers
• Think of amazon ...
Time-Series Models
• Statistical & “AI” Models (Neural Networks)
• Temporal Abstractions
• For example, EEG & ECG in ICUs, Stock markets
Bayesian Networks
• A probabilistic method to model data
• Easily interpreted by non-statisticians
• Can be used to combine existing
knowledge with data
• Essentially use independence assumptions
to model the joint distribution of a domain
Bayesian Networks
• Simple 2 variable Joint Distribution
P(Gene, Disease)
             Gene    ¬ Gene
Disease      0.89    0.01
¬ Disease    0.03    0.07
• Can use it to ask many useful questions
• But requires k^N probabilities (N variables, each with k states)
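As a small worked example, the joint table on this slide can answer a conditional query such as P(Disease | Gene), and a one-line function shows how quickly a full joint table grows. The variable and function names are illustrative only.

```python
# Joint distribution P(Gene, Disease) from the slide, keyed by (gene, disease).
joint = {
    (True, True): 0.89,
    (False, True): 0.01,
    (True, False): 0.03,
    (False, False): 0.07,
}

def p_gene(gene):
    """Marginal P(Gene = gene), summing out Disease."""
    return sum(p for (g, _), p in joint.items() if g == gene)

def p_disease(disease):
    """Marginal P(Disease = disease), summing out Gene."""
    return sum(p for (_, d), p in joint.items() if d == disease)

# Conditional query: P(Disease | Gene) = P(Gene, Disease) / P(Gene)
p_disease_given_gene = joint[(True, True)] / p_gene(True)

def joint_table_size(k, n):
    """Number of entries in a full joint table over n variables with k states each."""
    return k ** n
```

With two binary variables the table has only 4 entries, but `joint_table_size(2, 20)` is already over a million, which is the motivation for exploiting independence assumptions.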
Bayesian Network for Toy Domain
Structure: Gene A → Gene C ← Gene B; Gene C → Gene D; Gene C → Gene E
P(A) = .001
P(B) = .002
A  B  P(C)
T  T  .95
T  F  .94
F  T  .29
F  F  .001
C  P(D)
T  .70
F  .01
C  P(E)
T  .90
F  .05
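The toy network can be queried by brute-force enumeration. The sketch below takes the probability tables from this slide and assumes the structure they imply: Gene C depends on Genes A and B, and Genes D and E each depend on Gene C. The function names are illustrative, and enumeration is only practical for networks this small.

```python
from itertools import product

# Probability of each node being True, read from the slide's tables.
P_A = 0.001
P_B = 0.002
P_C = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # keyed by (A, B)
P_D = {True: 0.70, False: 0.01}                     # keyed by C
P_E = {True: 0.90, False: 0.05}                     # keyed by C

def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Joint probability as the product of the local conditional tables."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b) * bernoulli(P_C[(a, b)], c)
            * bernoulli(P_D[c], d) * bernoulli(P_E[c], e))

def posterior_a(d, e):
    """P(A = True | D = d, E = e), summing out the unobserved B and C."""
    weights = {True: 0.0, False: 0.0}
    for a, b, c in product([True, False], repeat=3):
        weights[a] += joint(a, b, c, d, e)
    return weights[True] / (weights[True] + weights[False])

p_a_given_d_e = posterior_a(True, True)
total = sum(joint(*values) for values in product([True, False], repeat=5))
```

Observing both D and E raises the probability of A from its prior of .001 to roughly 0.28. The network stores 10 numbers, compared with the 31 needed for a full joint table over five binary variables.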
Bayesian Networks for Classification &
Feature Selection & forecasting
• Nodes that can represent class labels or variables at
“points in time”
• Also latent variables via EM
[Figure: example networks: a general network with P(X1), P(X2), P(X3 | X1, X2), P(X4 | X3) and P(X5 | X3); a classifier with class node C over X1 ... XN; a dynamic network over X1 ... XN at times t-1 and t; a network with hidden node H]
Bayesian Networks for Classification &
Feature Selection & forecasting
• Diagnosing Aircraft Failure
Data Mining - Successes
Some successful examples of its use:
• Search Engines – Bayesian networks
• Pharmaceutical companies – Drug Discovery
• Credit card companies – Fraud Detection
• Transportation companies – Routing
• Large consumer package goods companies
(to improve the sales process to retailers)
• Hospital Organisation – Decision Analysis
• Online businesses – Market Research
On Business Intelligence & OLAP
Application of DM & DW to Business Data
On-Line Analytic Processing:
• Overlap with Data Mining
• More focussed on interactive ad-hoc analysis
• Exploits multidimensional modelling
Concepts of:
• Drill-down
• Consolidation
• Slicing & Dicing
• Visualisation & Dashboards
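The OLAP concepts listed above can be illustrated on a tiny fact table with three dimensions. This is a plain-Python sketch with invented sales figures, not how an OLAP engine is implemented; the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product, quarter) with a sales measure.
facts = [
    {"region": "North", "product": "Tea",    "quarter": "Q1", "sales": 10},
    {"region": "North", "product": "Coffee", "quarter": "Q1", "sales": 20},
    {"region": "South", "product": "Tea",    "quarter": "Q1", "sales": 5},
    {"region": "North", "product": "Tea",    "quarter": "Q2", "sales": 15},
    {"region": "South", "product": "Coffee", "quarter": "Q2", "sales": 25},
]

def consolidate(rows, dims):
    """Roll sales up to the given dimensions, summing over all the others."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)

def slice_cube(rows, **fixed):
    """Keep only rows matching the fixed dimension values.
    One fixed dimension is a slice; several is a dice."""
    return [row for row in rows if all(row[d] == v for d, v in fixed.items())]

by_region = consolidate(facts, ["region"])                        # consolidation
by_region_product = consolidate(facts, ["region", "product"])     # drill-down
q1_slice = slice_cube(facts, quarter="Q1")                        # slicing
north_q1_dice = slice_cube(facts, region="North", quarter="Q1")   # dicing
```

Drill-down and consolidation are the same operation run in opposite directions: adding a dimension to `dims` gives finer detail, removing one gives a coarser summary.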
On Social Media / Market Research
LinkedIn (professional contacts)
Skype (voice / video)
iPods (location)
Flickr (images)
Facebook (personal contacts)
Data Mining – In the media – part 1
Most positive news stories relate to other names for DM
Some of our Data Mining work in
Biomedical / Eco Informatics
• Building Gene Regulatory Networks
• Building Trajectories of Disease from
Medical Data
• Building Dynamic Models of Ecological
Data
Microarray Data & Bioinformatics
• Major source of data for gene expression activity
• Technology takes measurements over 1000s of
genes simultaneously
• Gene Regulatory Networks (GRNs) model how
genes interact
• Eliciting reliable GRNs from data key to
understanding biological mechanisms
Yeast
The Importance of Independent
Test Data
• Prediction – Train a network on one dataset
• Test it on the other sets (Independent Data)
• As opposed to Cross Validation (testing on the
same dataset)
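The difference between the two evaluation schemes can be sketched with a deliberately trivial model (predict the training mean) scored by the sum of squared errors. The datasets, the model and the function names are all assumptions for illustration, not the networks or data used in the work described here.

```python
def fit_mean(values):
    """Trivial stand-in model: always predict the training mean."""
    return sum(values) / len(values)

def sse(prediction, values):
    """Sum of squared errors of a constant prediction."""
    return sum((v - prediction) ** 2 for v in values)

def cross_validation_sse(values, k):
    """k-fold CV within ONE dataset: fold i holds out items i, i+k, i+2k, ..."""
    errors = []
    for i in range(k):
        held_out = values[i::k]
        training = [v for j, v in enumerate(values) if j % k != i]
        errors.append(sse(fit_mean(training), held_out))
    return errors

def independent_sse(train_values, test_sets):
    """Train once on one dataset, then test on separately collected datasets."""
    model = fit_mean(train_values)
    return [sse(model, test) for test in test_sets]

# Hypothetical measurements from two separate studies.
dataset_a = [1.0, 2.0, 3.0, 4.0]
dataset_b = [3.0, 5.0]

cv_errors = cross_validation_sse(dataset_a, k=2)
independent_errors = independent_sse(dataset_a, [dataset_b])
```

In cross validation the held-out points come from the same study as the training points, so they share its quirks. An independent dataset does not, which is why it gives a harder and more honest test of whether a model generalises.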
Models of Increasing Complexity
(MIC)
• Extending the Consensus across platforms
• Select one dataset for training, others
become test sets
• Score mean and variance of the sum of squared errors (SSE)
using CV and independent test sets
• Use these to rank genes (this is feature
selection)
Anvar, S.Y., ’t Hoen, P.A.C. and Tucker, A. (2010), The Identification of Informative Genes from Multiple Datasets with
Increasing Complexity, BMC Bioinformatics 11 : 32
Mechanisms Between Species?
• Dandelion Algorithm – extension of MIC
(submitted) Anvar, Y., Tucker, A., Venema, A., van Ommen, G.J.B., van der Maarel, S.M., Raz, V. and ’t Hoen, P.A.C.,
“Interspecies translation of gene disease networks increase robustness and predictive accuracy”, PLOS
Computational Biology
Inter-species Mechanisms
Modelling Clinical Data
• Biomedical studies often involve data sampled
from a cross-section of a population
• Collecting medical information on patients suffering from a
particular disease and controls
• These studies show a “snapshot” of the disease
process but disease is inherently temporal:
• Previously healthy people can develop a disease over time
going through different stages of severity
• If we want to model the development of such
processes, we usually require longitudinal data
(expensive)
Models of Disease:
Visual Field and Retinal Image Data
• Progressive loss of the
field of vision is
characteristic of many eye
diseases
• Glaucoma is a leading
cause of irreversible
blindness in the world.
• VF Data: sensitivity of field of
vision
• HRT Data: anatomical info of
retina
Pseudo Time-Series for Cross-Sectional (CS) Data
Tucker, A. and Garway-Heath, D., The Pseudo Temporal Bootstrap for Predicting Glaucoma from
Cross-Sectional Visual Field Data, IEEE Transactions on IT in Biomedicine 14 (1) : 79-85 , 2010
Fisheries Population Modelling
Cod Collapse in G Bank, N Sea & ESS
[Figure: three time-series panels of Catch and Biomass, 1970 to 2005]
• George’s Bank: Functional Collapse in late ’80s
• North Sea: No Functional Collapse
• East Scotian Shelf: Functional Collapse in early ’90s
Dynamic Functional Models
• Predicting ESS event &
Cod biomass from G Bank
[Figure: model linking ESS and G Bank variables: Th Skate, Cod, Cusk, Cod Catch]
Summary
• What is Data Mining
• Potential (& Successful) Applications
• Business Intelligence
• Medical Informatics
• Bio Informatics
• Ecological Data
• Engineering
• What about some of the downsides…
Caveats to Data Mining
Data Quality ✓
Spurious Correlations ✓
Over-fitting ✓
“Black Box” Modelling ✓
Over-reliance – slave to the data ?
“Can’t see the wood for the trees” ?
Data Mining – in the media – part 2
• Data mining government /commercial data sets for
national security or law enforcement purposes has
raised privacy concerns
• EU – The “right to be forgotten” e.g. Facebook
• Patenting Genetic information
Data Mining – in the media – part 2
Data Mining Video 2
http://www.time.com/time/video/player/0,32068,821500876001_2058396,00.html
Data Mining in the Future …
• Maybe a rebranding is needed?
• Medical Informatics & Business
Intelligence
• Data to Knowledge
• Knowledge Discovery in Databases &
KDnuggets
• In the cloud: “Cloud Analytics”
Thanks for listening
Emma Steele, Yahya Anvar & Peter-Bram ’t Hoen for their work on the
microarray research
Daniel Duplisea for the work on the
fish biomass research
Stefano Ceccon, Yuanxi Li & David
Garway-Heath for work on Glaucoma
research