
Big Data Analytics:
Techniques, Applications & Risks
Dr Allan Tucker
Centre for Intelligent Data Analysis,
Department of Computer Science,
Brunel University, London.
The Talk
• The Data Explosion in Science
• Definitions of Big Data
• Techniques and Case Studies
• Potential & Caveats: A Gold Rush?
Data historically...
• Preserve of a handful of scientists:
Newton, 1600s
Darwin, 1800s
Pearson, 1900s
Feng, web.colby.edu
Porter, Princeton University Press
The Complete Work of Charles Darwin Online
Modern Data Generation
• New devices:
• Sequencing
• Clinical tests
• Publications
• Sensor Data
• Data collected online
• Twitter discussions
• Search
Definitions of Big Data
• “Data greater than x records”:
• NHS genomics project will collect data from 100k patients
• 294 billion emails / day, over 1 billion Google searches / day
• Trillions of sensors monitoring environment / individuals
• “High Volume, Velocity and Variety”, Gartner
• “Data where analysis is non-trivial in terms of
computing power”, Professor Hand, Imperial
(Depends what you want to do to it)
Big Data Analysis
• Attempts to deal with the data explosion to
discover patterns and knowledge from data
• Rebranding of Data Mining?
• Includes focus on the user / analyser (less
automated, more interactive)
Overlap with Statistics
• “Statistics is the science of the collection, organization, and
interpretation of data. It deals with all aspects of this, including the
planning of data collection in terms of the design of surveys and
experiments”, Oxford English Dictionary
• Data Mining is: more exploratory, not always driven by a hypothesis,
works with historical data, rarely has any experimental
design! Makes fewer assumptions about the data
• But same caveats!:
“He uses statistics as a drunken man uses lampposts - for support rather
than for illumination.” Andrew Lang
“There are lies, damned lies, and statistics.”
Benjamin Disraeli
1) Visualisation (& Integration)
Visualisation
GIS:
Hyper local weather
Water usage
Application of fertilizer, and pesticides
Crop yield
Crop weakness due to disease, pests etc.
2) Selecting & Clustering Data
• Identify important variables
[Figure: three example plots, each showing Series1, Series2 and Series3]
• e.g. Biomarkers, clinical indicators, loan decisions
• Determining clusters – patient / customer profiles
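One way to make the clustering bullet concrete is a minimal k-means sketch. This is an illustration only, not the method used in the talk: the function name `kmeans`, the `profiles` data and the choice of two clusters are all assumptions made for the example, and the values are invented.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical patient profiles: (age, systolic blood pressure). Values are made up.
profiles = [(25, 118), (30, 121), (28, 115), (67, 150), (72, 158), (70, 149)]
centroids, labels = kmeans(profiles, k=2)
print(labels)
```

On data this well separated the two groups fall out directly; on real patient or customer data the variables would normally need scaling first, and the number of clusters is itself a modelling decision.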
Networks and Multiple Datasets
Biotic stress
Heat stress and photosynthesis
Bo et al. 2014
3) Network Models
• Correlations
• Partial Correlations
• Bayesian / Boolean Networks
• False Discovery Rates
Thum et al. 2008
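The difference between the first, second and fourth bullets can be sketched in a few lines. The example below is a rough illustration under stated assumptions, not the analysis from the cited work: the helper names (`pearson`, `partial_corr`, `benjamini_hochberg`), the toy series and the p-values are all invented. It shows two variables that look strongly correlated only because both follow a third, and the Benjamini-Hochberg procedure as one standard way to control the false discovery rate when many links are tested.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])

# Toy data (made up): z drives both x and y; their own noise is unrelated.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [zi + e for zi, e in zip(z, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [zi + e for zi, e in zip(z, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]

print(pearson(x, y))          # high: x and y both track z
print(partial_corr(x, y, z))  # near zero once z is accounted for

# Hypothetical p-values from testing eight candidate links.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p))
```

A network built from raw correlations would draw an edge between x and y here; one built from partial correlations would not, which is the motivation for moving down the list of bullets.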
Case Study:
Modelling Progression
Modelling Progression - Time
Saunders et al. IOVS 2014
Modelling Progression - Time
Ceccon et al. 2012
Modelling Progression - Time
Tucker et al. 2010
Case Study: Fisheries
• Neda’s work
Modelling Space & Time - Fisheries
Trifonova et al. 2014
Case Study: Text Mining for Botanists
Text Mining for Botanists
• Kew Gardens Flora Collection
Habitat Cluster Size
[Bar chart: cluster size per habitat, axis from 0 to 2500]
Habitat 0 - WETLANDS
Habitat 1 - BUSHLAND
Habitat 2 - RAINFOREST
Habitat 3 - MONTANE
Habitat 4 - DISTURBED
Habitat 5 - OPEN
Habitat 6 - WOODLAND
Habitat 7 - SCRUB
Habitat 8 - SCRUB2
Tucker et al. 2014
Text Mining Tweets for Finance
• Tweets for Finance work
Nasseri et al. 2014
Caveats to Big Data: A Gold Rush?
“Big Data promise tremendous advances. But the
media hype ignores the difficulties and risks
associated with this promise”, Professor David
Hand, Imperial College, IDA 2013
Caveats to Big Data
• People do not really want data. They want answers
• Big Data Analysis requires expertise (domain /
statistical)
• Data alone cannot answer a question
• Small, targeted, quality-controlled data are often
better
Caveats to Big Data
“Big Data promise tremendous advances. But the
media hype ignores the difficulties and risks
associated with this promise”, Professor David
Hand, Imperial College, IDA 2013
Some free copies at the front!
A Lesson from Crowd Sourcing
Social media for epidemics
– Google flu trends
Potential of Big Data Analysis
• Already great successes:
– Medicine
– Farming
– Finance
– Biology / Ecology
• Discovery of new associations (merging data)
• Interaction between domain expert and data
Definitions of Big Data
• “High Volume, Velocity and Variety”, Gartner
• “Data where analysis is non-trivial in terms of
computing power”, Professor Hand
Requires:
• Heterogeneous data (need for integration)
• Humans in the loop
• Need for new algorithms (significance, FDRs,
feedback in social media)
Thanks to …
Fisheries and Oceans Canada (Pêches et Océans Canada)