Big Data
What is it? How will it affect business?
Copyright © 2014-2016 Curt Hill

What is big data?
• Large – more data than will fit in a single spreadsheet
• Complex – many different formats
  – Binary
  – JSON
  – Tab delimited
• Unstructured or semi-structured
  – Not easy for a machine to make sense of
  – Consider medical notes
  – Little metadata

Where does it come from?
• In the last two years we have generated 90% of the world’s data
• Most of this is machine-generated data
  – Sensors
  – Mobile devices
  – Smart devices
  – Web data

Why is big data important?
• Has the potential to revolutionize:
  – Science
  – Business
  – Government
• Let's consider some examples

First
• The first big data center was built in 1965 by the US government
• The goal was to store federal data
  – 742 million tax returns
  – 175 million sets of fingerprints
• This is not large by today's standards but was very impressive then

Science
• Astronomy
  – Sloan Digital Sky Survey has altered the jobs of astronomers
  – They used to spend a significant amount of time taking pictures of the sky
  – The data is now in a database
• Biology
  – A single Next Generation Sequencing machine can produce terabytes of data per day
• The Large Hadron Collider produces 11 TB per second
  – Only 2 TB can be retained

Business
• Web-focused companies use the vast quantity of data to customize advertising and content recommendations for their users
• Health care companies monitor and analyze data from hospital and home care devices
• Energy companies use consumption data to make their production scheduling more efficient
• Many other examples

Government
• Census data was perhaps the first big data application
• Los Angeles is using sensors and big data analysis to make decisions concerning traffic
• Use sensors to model and analyze
  seismic activity, weather data, etc.

4 Vs Define Big Data
• Volume
• Velocity
• Variety
• Variability or Veracity

Volume
• What is the most data that can be stored in a single disk drive?
  – There are physical boundaries
• We can push past these boundaries by distributing the data over many boxes
  – Scale up – increase the size of the box
  – Scale out – distribute over many boxes
• Then we have the well-known communication problems

Velocity
• The data is coming in fast
  – Faster than a person can structure it
• 1 terabyte of data for each New York Stock Exchange trading day
• 15 billion network connections in a day
• 100 sensors per car
• Store first
  – Then figure out if you want it

Variety
• Formats of data:
  – JSON
  – Tab delimited / CSV
  – Tweets and Facebook updates
• The formats of the future are yet to be revealed
• We need to be able to handle these

Variability or Veracity
• Making sense of the data on its own is problematic
• The meaning may change over the course of time
• Reliability issues must be considered
  – False signal from a sensor about to malfunction

Challenges
• If we had only one of these Vs to deal with, we could handle it successfully
• The problem is we often have to deal with two or more at a time
  – This may make traditional relational databases too slow to respond

Big Data Life Cycle
• Several phases:
  – Acquisition
  – Extraction and cleaning
  – Integration, Aggregation and Representation
  – Modeling and Analysis
  – Interpretation

Acquisition
• The data is usually recorded by sensors and transmitted to a computer
  – In the case of computer data the sensor is software: an application, a web server, a web browser or part of the network
• Often the data is so large that real-time filtering or compression is required to reduce it
  – E.g.
    Large Hadron Collider (11 TB/sec) or the upcoming Square Kilometre Array telescope (100 million TB/day)

Extraction and cleaning
• Data arrives in a variety of formats
• Consider health care data
  – Raw data from sensors, each in its own format
  – Admission data from an existing database
  – Physician comments
• These diverse sources must be formatted to be usable

Cleaning
• Sensor data may have errors from transmission or interference
• Transcription of handwritten information may reflect the bias of the author
• Part of the extraction process is to attempt to clean up the data
• There is no predefined way to do so
  – It depends on the source of the data, which determines the types of errors possible

Integration, Aggregation and Representation
• How is the data stored for use?
• The use of data from multiple sources complicates this
• Since the data represents many different components, it needs to be stored in a form useful for modeling and analysis

Provenance
• How and where the data was collected
  – Context information
• Multiple-use data may need its provenance recorded
  – Web data originally collected to inform customized advertising may be used to study traffic patterns
• Single-use data, such as health records, is easier
  – Even health records might be anonymized for statistical studies

Modeling and Analysis
• Once the data is established in a data warehouse, data mining can proceed
• Data mining extracts previously unknown and potentially useful information
• Big data raises issues not present in normal data mining
• Let us digress to traditional data mining

A Local Retailer
• Sales tickets are collected at the point-of-sale terminal
• These live in the store's transactional database to guide low- or mid-level management with the following information:
  – Income and expenses
  – Products selling poorly
    or well
• After a short time these are purged from this database

The Data Warehouse
• At a corporate data warehouse, the information no longer needed for day-to-day operations is accumulated
  – Every ticket from every store
• Once the data arrives it is retained for years
• Data mining is used to give insight into:
  – Sales trends
  – Types of shoppers
  – How product arrangement affects sales

Contrast
• Although the previous example is a big data application, it differs from many others
• The data in this example is well formed, accurate and from well-controlled sources
• Big data tends to be noisy, dynamic, inter-related and not always trustworthy
  – Fortunately, the magnitude of the data allows statistical methods to ignore the noise

Interpretation
• What does the data tell us?
  – This is the really hard part and must be done by a person
• Part of the problem is understanding the pipeline from acquisition to model
• This involves understanding the types of errors:
  – Data errors
  – Programming errors
  – Assumptions made in the models

Challenges of Big Data
• Heterogeneity
  – People are better at processing different types of data than machines
  – Provenance must also be preserved throughout the pipeline
• Inconsistency and Incompleteness
  – Sensors are notoriously unreliable
  – Statistics can help, but the modeling must account for the possibility

Challenges 2
• Scale
  – Data volume has been increasing much faster than Moore’s Law
  – Storing the data is a problem
  – Processing the data in reasonable times requires substantial computer power
• Timeliness
  – We generally cannot store all the data
  – Filtering and compressing the data becomes important
  – This must be done without loss of usefulness

Challenges 3
• Privacy and data ownership
  – Health care has the most legislation concerning what may be
    shared and how it may be shared
  – After Edward Snowden revealed what the NSA was saving, there have been considerable privacy concerns
  – For example, most smartphones have some location awareness
  – This is good for finding a restaurant nearby and even better for tracking you

Up and Out
• Scaling Up
  – Specialized hardware – expensive
  – Simplified software
  – Single point of failure
• Scaling Out
  – Commoditized hardware
  – Specialized software for replication, querying and communications
    • This is the hard part
  – Any point may fail, but with no apparent loss of availability

Summary
• Big data is characterized by 4 Vs
  – Volume
  – Velocity
  – Variety
  – Veracity
• Hadoop is an open source system
  – Replicates and distributes the data
  – Uses map and reduce scripts to process it
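
Map and reduce: a sketch
• The map and reduce idea from the summary can be illustrated in plain Python
• This is a minimal single-process sketch, not Hadoop's API
  – The function names (map_phase, shuffle, reduce_phase, run_job) and the tab-delimited sensor records are assumptions made for this example
  – On a real cluster the framework would run the map and reduce steps on many machines and handle the grouping between them

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: turn one tab-delimited record into a (sensor_id, reading) pair."""
    sensor_id, reading = record.split("\t")
    yield sensor_id, float(reading)


def shuffle(pairs):
    """Group values by key; stands in for the grouping a framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all readings for one sensor to their maximum."""
    return key, max(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input: "sensor id <TAB> reading"
sample_records = ["s1\t20.5", "s2\t18.0", "s1\t22.1", "s2\t17.4"]
print(run_job(sample_records))
```

• Each call to map_phase looks at a single record and each call to reduce_phase looks at a single key, which is what would let a framework spread the work over many boxes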