Big Data
What is it? How will it affect business?
Copyright © 2014-2016 Curt Hill

What is big data?
• Large – more data than will fit in a single spreadsheet
• Complex – many different formats
  – Binary
  – JSON
  – Tab delimited
• Unstructured or semi-structured
  – Not easy for a machine to make sense of
  – Consider medical notes
  – Little metadata

Where does it come from?
• In the last two years we have generated 90% of the world's data
• Most of this is machine-generated data
  – Sensors
  – Mobile devices
  – Smart devices
  – Web data

Why is big data important?
• It has the potential to revolutionize:
  – Science
  – Business
  – Government
• Let's consider some examples

First
• The first big data center was built in 1965 by the US government
• The goal was to store federal data
  – 742 million tax returns
  – 175 million sets of fingerprints
• This is not large by today's standards but was very impressive then

Science
• Astronomy
  – The Sloan Digital Sky Survey has altered the jobs of astronomers
  – They used to spend a significant amount of time taking pictures of the sky
  – The data is now in a database
• Biology
  – A single Next Generation Sequencing machine can produce terabytes of data per day
• The Large Hadron Collider produces 11 TB per second
  – Only 2 TB can be retained

Business
• Web-focused companies use the vast quantity of data to customize advertising and content recommendations for their users
• Health care companies monitor and analyze data from hospital and home care devices
• Energy companies use consumption data to make their production scheduling more efficient
• Many other examples

Government
• Census data was perhaps the first big data application
• Los Angeles is using sensors and big data analysis to make decisions concerning traffic
• Sensors are used to model and analyze seismic activity, weather data, etc.

4 Vs Define Big Data
• Volume
• Velocity
• Variety
• Variability or Veracity

Volume
• What is the most data that can be stored in a single disk drive?
  – There are physical boundaries
• We can push these boundaries by distributing the data over many boxes
  – Scale up – increase the box
  – Scale out – distribute over many boxes
• Then we have the well-known communication problems

Velocity
• The data is coming in fast
  – Faster than it can be structured by a person
• 1 terabyte of data for each New York Stock Exchange day
• 15 billion network connections in a day
• 100 sensors per car
• Storage first
  – Then figure out if you want it

Variety
• Formats of data:
  – JSON
  – Tab delimited / CSV
  – Tweets and Facebook updates
• The formats of the future are yet to be revealed
• We need to be able to handle these

Variability or Veracity
• Making sense of the data on its own is problematic
• The meaning may change over the course of time
• Reliability issues must be considered
  – E.g. a false signal from a sensor about to malfunction

Challenges
• If we had only one of these Vs to deal with, we could handle it successfully
• The problem is we often have to deal with two or more at a time
  – This may make traditional relational databases too slow to respond

Big Data Life Cycle
• Several phases:
  – Acquisition
  – Extraction and cleaning
  – Integration, aggregation and representation
  – Modeling and analysis
  – Interpretation

Acquisition
• The data is usually recorded by sensors and transmitted to a computer
  – In the case of computer data the sensor is software: an application, a web server, a web browser or part of the network
• Often the data is so large that real-time filtering/compression is required to reduce it
  – E.g. the Large Hadron Collider (11 TB/sec) or the upcoming square-kilometer telescope (100 million TB/day)

Extraction and Cleaning
• Data arrives in a variety of formats
• Consider health care data
  – Raw data from sensors, each in its own format
  – Admission data from an existing database
  – Physician comments
• These diverse sources must be formatted to be usable

Cleaning
• Sensor data may have errors from transmission or interference
• Transcription of handwritten information may reflect the bias of the author
• Part of the extraction process is to attempt to clean up the data
• There is no predefined way to do so
  – It depends on the source of the data, which determines the types of errors possible

Integration, Aggregation and Representation
• How is the data stored for use?
• The use of data from multiple sources complicates this
• Since the data represents many different components, it needs to be stored in a form useful for modeling and analysis

Provenance
• How and where the data was collected
  – Context information
• Multiple-use data may need to record its provenance
  – Web data originally collected to inform customized advertising may be used to study traffic patterns
• Single-use data, such as health records, is easier
  – Even health records might be anonymized for statistical studies

Modeling and Analysis
• Once the data is established in a data warehouse, data mining can proceed
• Data mining extracts previously unknown and potentially useful information
• Big data presents issues not present in normal data mining
• Let us digress to traditional data mining

A Local Retailer
• Sales tickets are collected at the point-of-sale terminal
• These live in the store's transactional database to guide low- or mid-level management with the following information:
  – Income and expenses
  – Products selling poorly or well
• After a short time these are purged from this database

The Data Warehouse
• At a corporate data warehouse the information no longer needed for day-to-day operations is accumulated
  – Every ticket from every store
• Once the data arrives it is retained for years
• Data mining is used to give insight into:
  – Sales trends
  – Types of shoppers
  – How product arrangement affects sales

Contrast
• Although the previous is a big data application, it also differs from many others
• The data in this example is well formed, accurate and from well-controlled sources
• Big data tends to be noisy, dynamic, inter-related and not always trustworthy
  – Fortunately, the magnitude of the data allows statistical methods to ignore the noise

Interpretation
• What does the data tell us?
  – This is the really hard part and must be done by a person
• Part of the problem is understanding the pipeline from acquisition to model
• This involves understanding the types of errors:
  – Data errors
  – Programming errors
  – Assumptions made in the models

Challenges of Big Data
• Heterogeneity
  – People are better than machines at processing different types of data
  – Also preserving the provenance throughout the pipeline
• Inconsistency and incompleteness
  – Sensors are notoriously unreliable
  – Statistics can help, but the modeling must account for the possibility

Challenges 2
• Scale
  – Data volume has been increasing much faster than Moore's Law
  – Storing the data is a problem
  – Processing the data in reasonable time requires substantial computer power
• Timeliness
  – We generally cannot store all the data
  – Filtering and compressing the data becomes important
  – This must be done without loss of usefulness

Challenges 3
• Privacy and data ownership
  – Health care has the most legislation concerning what may be shared and how it may be shared
  – After Edward Snowden revealed what the NSA was saving, there have been considerable privacy concerns
  – For example, most smart phones have some location awareness
  – This is good for finding a restaurant nearby and even better for tracking you

Up and Out
• Scaling up
  – Specialized hardware – expensive
  – Simplified software
  – Single point of failure
• Scaling out
  – Commoditized hardware
  – Specialized software for replication, querying and communications
  – This is the hard part
  – Any point may fail, but with no apparent loss of availability

Summary
• Big data is characterized by 4 Vs
  – Volume
  – Velocity
  – Variety
  – Veracity
• Hadoop is an open source system
  – Replicates and distributes the data
  – Uses map and reduce scripts to process it
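The "Variety" slides list JSON and tab-delimited/CSV among the formats a pipeline must handle. A minimal sketch of normalizing both into one record shape, using Python's standard library (the sensor/temperature field names are illustrative, not from the slides):

```python
import csv
import io
import json

# Hypothetical sample input: the same kind of reading arriving once as
# JSON and once as tab-delimited text.
json_line = '{"sensor": "s1", "temp_c": 21.5}'
tsv_text = "sensor\ttemp_c\ns2\t19.0\n"

def from_json(line):
    # json.loads parses one JSON object into a dict
    return json.loads(line)

def from_tsv(text):
    # csv.DictReader handles tab-delimited input via delimiter="\t"
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

records = [from_json(json_line)] + from_tsv(tsv_text)
# Normalize value types so downstream analysis sees one shape
for r in records:
    r["temp_c"] = float(r["temp_c"])

print(records)
```

The point of the normalization loop is that the tab-delimited reader yields strings while JSON yields numbers; a shared schema has to reconcile such differences before modeling.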
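The "Cleaning" slide notes that sensor data may carry transmission errors, and "Veracity" mentions false signals from a failing sensor. One common cleaning step is a plausibility range check; here is a minimal sketch (the temperature range and sample values are assumptions for illustration):

```python
# Assumed valid range for an outdoor temperature sensor, in Celsius
PLAUSIBLE_C = (-40.0, 60.0)

# Sample readings containing two obviously corrupted values
readings = [21.5, 19.0, 999.0, -273.0, 22.1]

def clean(values, low, high):
    """Keep only values inside the plausible range [low, high]."""
    return [v for v in values if low <= v <= high]

print(clean(readings, *PLAUSIBLE_C))  # the 999.0 and -273.0 are dropped
```

As the slide says, there is no predefined way to clean: the right check depends on the source and on which error types it can produce.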
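The "Provenance" slide suggests recording how and where data was collected so that reuse (e.g. advertising data later used for traffic studies) can be traced. A sketch of wrapping each record with such context; the field names are hypothetical:

```python
import datetime

def with_provenance(record, source, collected_for):
    # Attach context information alongside the data itself
    return {
        "data": record,
        "provenance": {
            "source": source,
            "collected_for": collected_for,
            "collected_at": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
        },
    }

rec = with_provenance({"url": "/search", "clicks": 3},
                      source="web-logs", collected_for="advertising")
print(rec["provenance"]["collected_for"])
```

Carrying this envelope through the pipeline is what the "Heterogeneity" challenge calls preserving provenance.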
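The Summary says Hadoop processes data with map and reduce scripts. A minimal single-process sketch of that flow, in the style of a streaming word count (the real system runs the two phases as separate scripts over distributed, shuffled data; a dict stands in for the shuffle/sort step here):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) for every word, as a streaming mapper would
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # A reducer sums the values seen for each key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big deal", "data at scale"]
print(reduce_phase(map_phase(lines)))
```

Because each mapper sees only its own lines and each reducer only its own keys, both phases parallelize naturally, which is what makes the scale-out approach on commodity hardware work.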