Big Data Analytics: Techniques, Applications & Risks
Dr Allan Tucker
Centre for Intelligent Data Analysis, Department of Computer Science, Brunel University, London

The Talk
• The Data Explosion in Science
• Definitions of Big Data
• Techniques and Case Studies
• Potential & Caveats: A Gold Rush?

Data historically...
• Preserve of a handful of scientists: Newton, 1600s; Darwin, 1800s; Pearson, 1900s
Sources: Feng, web.colby.edu; Porter, Princeton University Press; The Complete Work of Charles Darwin Online

Modern Data Generation
• New devices:
  – Sequencing
  – Clinical tests
  – Publications
  – Sensor Data
• Data collected online
  – Twitter discussions
  – Search

Definitions of Big Data
• “Data greater than x records”:
  – NHS genomics project will collect data from 100k patients
  – 294 billion emails / day, over 1 billion Google searches / day
  – Trillions of sensors monitoring environment / individuals
• “High Volume, Velocity and Variety”, Gartner
• “Data where analysis is non-trivial in terms of computing power”, Professor Hand, Imperial (depends what you want to do to it)

Big Data Analysis
• Attempts to deal with data explosion to discover patterns and knowledge from data
• Rebranding of Data Mining?
• Includes focus on the user / analyser (less automated, more interactive)

Overlap with Statistics
• “Statistics is the science of the collection, organization, and interpretation of data.
It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments”, Oxford English Dictionary
• Data Mining is:
  – More explorative
  – Not always driven by a hypothesis
  – Works with historical data
  – Rarely any experimental design!
  – Makes fewer assumptions about the data
• But the same caveats apply:
  – “He uses statistics as a drunken man uses lampposts - for support rather than for illumination.” Andrew Lang
  – “There are lies, damned lies, and statistics.” Benjamin Disraeli

1) Visualisation (& Integration)

Visualisation
• GIS:
  – Hyper local weather
  – Water usage
  – Application of fertilizer and pesticides
  – Crop yield
  – Crop weakness due to disease, pests etc.

2) Selecting & Clustering Data
• Identify important variables, e.g. biomarkers, clinical indicators, loan decisions
[Figure: charts plotting Series1, Series2 and Series3]
• Determining clusters – patient / customer profiles

Networks and Multiple Datasets
[Figure panels: Biotic stress; Heat stress and photosynthesis]
Bo et al. 2014

3) Network Models
• Correlations
• Partial Correlations
• Bayesian / Boolean Networks
• False Discovery Rates
Thum et al. 2008

Case Study: Modelling Progression

Modelling Progression - Time
• Saunders et al. IOVS 2014
• Ceccon et al. 2012
• Tucker et al. 2010

Case Study: Fisheries
• Neda’s work

Modelling Space & Time - Fisheries
Trifonova et al.
2014

Case Study: Text Mining for Botanists

Text Mining for Botanists
• Kew Gardens Flora Collection
[Figure: Habitat Cluster Size, with cluster sizes on an axis from 0 to 2500 for Habitat 0 - WETLANDS, Habitat 1 - BUSHLAND, Habitat 2 - RAINFOREST, Habitat 3 - MONTANE, Habitat 4 - DISTURBED, Habitat 5 - OPEN, Habitat 6 - WOODLAND, Habitat 7 - SCRUB, Habitat 8 - SCRUB2]
Tucker et al. 2014

Text Mining Tweets for Finance
• Tweets for Finance work
Nasseri et al. 2014

Caveats to Big Data: A Gold Rush?
“Big Data promise tremendous advances. But the media hype ignores the difficulties and risks associated with this promise”, Professor David Hand, Imperial College, IDA 2013

Caveats to Big Data
• People do not really want data. They want answers
• Big Data Analysis requires expertise (domain / statistical)
• Data alone cannot answer a question
• Small, targeted, quality-controlled data is often better

Some free copies at the front!
A Lesson from Crowd Sourcing
• Social media for epidemics – Google Flu Trends

Potential of Big Data Analysis
• Already great successes:
  – Medicine
  – Farming
  – Finance
  – Biology / Ecology
• Discovery of new associations (merging data)
• Interaction between domain expert and data

Definitions of Big Data
• “High Volume, Velocity and Variety”, Gartner
• “Data where analysis is non-trivial in terms of computing power”, Professor Hand
Requires:
• Heterogeneous data (need for integration)
• Humans in the loop
• Need for new algorithms (significance, FDRs, feedback in social media)

Thanks to …
Pêches et Océans Canada / Fisheries and Oceans Canada
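
Illustrative note on the "significance, FDRs" requirement listed in the slides: when many associations are tested at once (for example, every pairwise correlation in a network model), some will look significant by chance. The sketch below shows one standard way to control the false discovery rate, the Benjamini-Hochberg step-up procedure. The talk does not say which FDR method was used, so this choice is an assumption, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    while controlling the false discovery rate at level alpha.

    Assumption: the tests are independent (or positively dependent),
    which is the setting in which this procedure is usually applied.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k (1-based) such that
    # the k-th smallest p-value is <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. from ten pairwise correlation tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals, alpha=0.05))
```

Note that five of these ten p-values fall below 0.05 on their own, but fewer survive the correction, which is the kind of caution the caveats slides argue for. In practice an established statistics library would normally be used rather than a hand-rolled function.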