Big Data: What Characteristics Constitute Big Data?
Copyright © 2014-2017 Curt Hill

What is big data?
• Large – overwhelms hand analysis by people
• Complex – many different formats:
  – Binary
  – JSON
  – Tab delimited
• Unstructured or semi-structured
  – Not easy for a machine to make sense of
  – Consider medical notes
  – Little metadata

Where does it come from?
• 90% of the world's data has been generated in the last two years
  – How can this keep up?
• Most of it is machine-generated data:
  – Sensors
  – Mobile devices
  – Smart devices
  – Web data

Why is big data important?
• It has the potential to revolutionize:
  – Science
  – Business
  – Government
• The hope is that using big data and data mining we will find knowledge that could not be found any other way
• Let's consider some examples

First
• The first big data center was built in 1965 by the US government
• The goal was to store federal data:
  – 742 million tax returns
  – 175 million sets of fingerprints
• This is not large by today's standards, but it was very impressive then

Data Mining
• Fast claims to have saved millions of dollars for a state by scanning tax returns for duplicates
• If two returns are essentially identical except for a few details, such as:
  – SSN
  – Address
  then there is a good chance that they are fraudulent
• Such comparisons are nearly impossible for people to do by hand

Science
• Astronomy
  – The Sloan Digital Sky Survey has altered the jobs of astronomers
  – They used to spend a significant amount of time taking pictures of the sky
  – That data is now in a database
• Biology
  – A single Next Generation Sequencing machine can produce terabytes of data per day
• The Large Hadron Collider produces 11 TB per second
  – Only 2 TB can be retained

Business
• Web-focused companies use the vast quantity of data to customize advertising and content recommendations for their users
• Health care companies monitor and analyze data from hospital and home care devices
• Energy companies are using consumption data to make their production scheduling more efficient
• Many other examples

Government
• Census data was perhaps the first big data application
• Los Angeles is using sensors and big data analysis to make decisions concerning traffic
• Sensors are used to model and analyze seismic activity, weather data, etc.

4 Vs
• Volume
• Velocity
• Variety
• Variability or Veracity

Volume
• What is the most data that can be stored in a single box?
  – There are physical boundaries
• We can push past these boundaries by distributing the data over many boxes
  – Scale up – increase the box
  – Scale out – distribute over many boxes
• With scale-out we have the well-known communication problems
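To make the scale-out idea above concrete, here is a minimal Python sketch (not from the original slides) that routes records to machines by hashing a key; the node names are hypothetical. Real systems add replication and rebalancing on top, and usually prefer consistent hashing, since plain modulo sharding reshuffles most keys whenever a node is added.

```python
import hashlib

# Hypothetical node names: any set of commodity hosts would do.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for_key(key: str, nodes: list[str]) -> str:
    """Route a record to a node by hashing its key (plain modulo sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

if __name__ == "__main__":
    for key in ["ticket-1001", "ticket-1002", "ticket-1003"]:
        print(key, "->", node_for_key(key, NODES))
```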
Velocity
• The data is coming in fast
  – Faster than it can be structured by a person
• 1 TB of data for each New York Stock Exchange trading day
• 15 billion network connections on the Internet
• 100 sensors per car
  – How many cars?
• Store first – then figure out if you want it

Variety
• Formats of data:
  – JSON, XML
  – Tab delimited / CSV
  – Tweets and Facebook updates
• The formats of the future are yet to be revealed
• We need to be able to handle any of these
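As a minimal sketch of the "handle any of these" point, the snippet below parses the same hypothetical sensor reading from JSON, CSV, and XML and normalizes each into one common record shape; the field names are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical inputs: the same reading expressed in three formats.
JSON_SRC = '{"sensor": "t1", "value": 21.5}'
CSV_SRC = "sensor,value\nt1,21.5\n"
XML_SRC = "<reading><sensor>t1</sensor><value>21.5</value></reading>"

def from_json(text: str) -> dict:
    return json.loads(text)

def from_csv(text: str) -> dict:
    return next(csv.DictReader(io.StringIO(text)))

def from_xml(text: str) -> dict:
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

if __name__ == "__main__":
    for parse, src in [(from_json, JSON_SRC), (from_csv, CSV_SRC), (from_xml, XML_SRC)]:
        raw = parse(src)
        # Normalization: every source ends up in the same record shape.
        print({"sensor": raw["sensor"], "value": float(raw["value"])})
```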
Variability or Veracity
• Making sense of the data on its own is problematic
• The meaning may change over the course of time
• The reliability issues cannot be dismissed

Challenges
• If we had only one of these Vs to deal with, we could handle it successfully
• The problem is that we often have to deal with two or more at a time
  – This may make traditional relational databases too slow to respond
  – This has led to the rise of NoSQL databases

Big Data Life Cycle
• Several phases:
  – Acquisition
  – Extraction and cleaning
  – Integration, aggregation and representation
  – Modeling and analysis
  – Interpretation
• Let us look at each of these

Acquisition
• The data is usually recorded by sensors and transmitted to a computer
  – In the case of computer data the sensor is software: an application, a web server, a web browser, or part of the network
• Often the data is so large that real-time filtering/compression is required to reduce it
  – E.g. the Large Hadron Collider (11 TB/sec) or the upcoming Square Kilometre Array telescope (100 million TB/day)

Extraction and Cleaning
• Data arrives in a variety of formats
• Consider health care data:
  – Raw data from sensors, each in its own format
  – Admission data from an existing database
  – Physician comments
• These diverse sources must be reformatted to be usable

Cleaning
• Sensor data may have errors from transmission or interference
• Transcription of handwritten information may reflect the bias of the author
• Part of the extraction process is an attempt to clean up the data
• There is no predefined way to do so
  – It depends on the source of the data, which determines the types of errors possible
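Since there is no predefined cleaning recipe, any code is necessarily source-specific; the sketch below shows one common pattern for sensor feeds, assuming hypothetical readings and thresholds: a range check drops physically impossible values, and a median replaces remaining outliers.

```python
from statistics import median

# Hypothetical raw feed: a dropout (-999.0) and a spike (35.0) in a room-temperature stream.
RAW = [21.4, 21.5, 21.5, -999.0, 21.6, 35.0, 21.7, 21.6]

def clean(readings, lo=-40.0, hi=60.0, tolerance=5.0):
    """Drop physically impossible values, then replace remaining
    outliers with the median of the plausible readings."""
    plausible = [r for r in readings if lo <= r <= hi]
    mid = median(plausible)
    return [r if abs(r - mid) <= tolerance else mid for r in plausible]

if __name__ == "__main__":
    print(clean(RAW))  # -999.0 dropped, 35.0 replaced by the median 21.6
```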
Integration, Aggregation and Representation
• How is the data stored for use?
• The use of data from multiple sources complicates this
• Since the data represents many different components, it needs to be stored in a form useful for modeling and analysis

Provenance
• How and where the data was collected
  – Context information
• Multiple-usage data may need to record its provenance
  – Web data originally collected to inform customized advertising may be used to study traffic patterns
• Single-usage data, such as health records, is easier
  – Even health records might be anonymized for statistical studies

Modeling and Analysis
• Once the data is established in a data warehouse, data mining can proceed
• Data mining extracts previously unknown and potentially useful information
• Big data raises issues not present in normal data mining
• Let us digress to a traditional data mining example

A Local Retailer
• Sales tickets are collected at the point-of-sale terminal
• These live in the store database to guide mid-level management with the following information:
  – Income and expenses
  – Products selling poorly or well
• After a short time these are purged from this database

The Data Warehouse
• At a corporate data warehouse, the information no longer needed for day-to-day operations is accumulated
  – Every ticket from every store
• Once the data arrives it is retained for years or decades
• Data mining is used to give insight into:
  – Sales trends
  – Types of shoppers
  – How product arrangement affects sales
  (a toy aggregation sketch appears at the end of this deck)

Contrast
• Although the previous is a big data application, it also differs from many others
• The data in this example is well formed, accurate, and from well-controlled sources
• Big data tends to be noisy, dynamic, inter-related, and not always trustworthy
  – Fortunately, the magnitude of the data allows statistical methods to ignore the noise

Interpretation
• What does the data tell us?
  – This is the really hard part and must be done by a knowledge worker
  – Usually armed with several programs
• Part of the problem is understanding the pipeline from acquisition to model
• This involves understanding the types of errors:
  – Data errors
  – Programming errors
  – Assumptions made in the models

Challenges of Big Data
• Heterogeneity
  – People are better than machines at processing many different types of data
  – Preserving provenance throughout the pipeline is also hard
• Inconsistency and incompleteness
  – Sensors are notorious for giving an occasional false reading
  – Statistics can help, but the modeling must account for the possibility

Challenges 2
• Scale
  – Data volume has been increasing much faster than Moore's Law
  – Storing the data is a problem
  – Processing the data in reasonable time requires substantial computer power
• Timeliness
  – We generally cannot store all the data
  – Filtering and compressing the data becomes important
  – This must be done without loss of usefulness

Challenges 3
• Privacy and data ownership
  – Health care has the most legislation concerning what may be shared and how it may be shared
  – After Edward Snowden revealed what the NSA was saving, there have been considerable privacy concerns
  – For example, most smartphones have some location awareness
  – This is good for finding a nearby restaurant and even better for spying on you

Up and Out
• Scaling up
  – Specialized hardware – expensive
  – Simplified software
  – Single point of failure
• Scaling out
  – Commoditized hardware
  – Specialized software for replication, querying and communication
    • This is the hard part
  – Any point may fail, but with no apparent loss of availability

Summary
• Big data is characterized by 4 Vs:
  – Volume
  – Velocity
  – Variety
  – Veracity
• The potential to gain useful information from it has already been astonishing
  – We have just scratched the surface
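As a closing illustration of the warehouse-style mining described in the retailer example, here is a toy aggregation over a few hypothetical ticket records (the store, product, and field names are invented); a real warehouse would run the equivalent GROUP BY query over billions of rows to surface sales trends.

```python
from collections import defaultdict

# Hypothetical ticket lines; a real warehouse holds every ticket from every store.
TICKETS = [
    {"store": "Fargo", "product": "widget", "qty": 3, "price": 2.50},
    {"store": "Fargo", "product": "gadget", "qty": 1, "price": 9.99},
    {"store": "Minot", "product": "widget", "qty": 5, "price": 2.50},
]

def revenue_by_product(tickets):
    """A tiny 'sales trends' query: total revenue per product."""
    totals = defaultdict(float)
    for t in tickets:
        totals[t["product"]] += t["qty"] * t["price"]
    return dict(totals)

if __name__ == "__main__":
    print(revenue_by_product(TICKETS))  # {'widget': 20.0, 'gadget': 9.99}
```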