Big Data
What characteristics constitute Big Data
Copyright © 2014-2017 Curt Hill
What is big data?
• Large
– Overwhelms hand analysis by people
• Complex – many different formats
– Binary
– JSON
– Tab delimited
• Unstructured or semi-structured
– Not easy for a machine to make sense of
– Consider medical notes
– Little metadata
Where does it come from?
• In the last two years we have generated 90% of the world’s data
– How can this pace continue?
• Most of this is machine-generated data
– Sensors
– Mobile devices
– Smart devices
– Web data
Why is big data important?
• Has the potential to revolutionize:
– Science
– Business
– Government
• The hope is that using big data and data mining we will find knowledge that could not be found any other way
• Let’s consider some examples
First
• The first big data center was built in 1965 by the US government
• The goal was to store federal data
– 742 million tax returns
– 175 million sets of fingerprints
• This is not large by today’s standards but was very impressive then
Data Mining
• Fast claims they saved millions of dollars for a state by scanning tax returns for duplicates
• If two returns are essentially identical except for a few details like:
– SSN
– Address
• Then there is a good chance that they are fraudulent
• Such checks are nearly impossible for people to do by hand (a sketch of the idea follows)
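A minimal sketch of this kind of duplicate scan, assuming each return is a flat dictionary of field names to values; the ignored field names are illustrative and this is not the vendor’s actual method:

# Flag tax returns that are identical except for identifying fields
# such as SSN and address (hypothetical field names).
from collections import defaultdict

IGNORED_FIELDS = {"ssn", "address"}

def fingerprint(tax_return):
    """Key a return by every field except the ones a fraudster would change."""
    return tuple(sorted((k, v) for k, v in tax_return.items()
                        if k not in IGNORED_FIELDS))

def find_suspect_groups(returns):
    groups = defaultdict(list)
    for r in returns:
        groups[fingerprint(r)].append(r)
    # Any fingerprint shared by two or more returns deserves a closer look.
    return [group for group in groups.values() if len(group) > 1]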
Science
• Astronomy
– Sloan Digital Sky Survey has altered the jobs of astronomers
– They used to spend a significant amount of time taking pictures of the sky
– That data is now in a database
• Biology
– A single Next Generation Sequencing machine can produce terabytes of data per day
• The Large Hadron Collider produces 11 TByte per second
– Only 2 TByte can be retained
Business
• Web-focused companies use the vast quantity of data to customize advertising and content recommendations for their users
• Health care companies monitor and analyze data from hospital and home care devices
• Energy companies are using consumption data to make their production scheduling more efficient
• Many other examples
Government
• Census data was perhaps the first big data application
• Los Angeles is using sensors and big data analysis to make decisions concerning traffic
• Sensors are used to model and analyze seismic activity, weather data, etc.
4 Vs
• Volume
• Velocity
• Variety
• Variability or Veracity
Volume
• What is the most data that can be stored in a single box?
– There are physical boundaries
• We can push past these boundaries by distributing the data over many boxes
– Scale up – increase the box
– Scale out – distribute over many boxes
• With scale out we have the well-known communication problems (a sketch of the idea follows)
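A minimal sketch of the scale-out idea, assuming records are spread over many boxes by hashing a key; the node names and the choice of key are illustrative:

# Spread records over several "boxes" (nodes) so that no single
# machine has to hold everything.
import hashlib

NODES = ["box-0", "box-1", "box-2", "box-3"]   # hypothetical cluster nodes

def node_for(key):
    """Pick a node deterministically from the record's key."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Every client that hashes the same way agrees on where a record lives,
# but adding or removing a box forces data to move -- one of the
# communication problems that scale out brings with it.
print(node_for("customer-42"))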
Velocity
• The data is coming in fast
– Faster than can be structured by a person
• 1 terabyte of data for each New York Stock Exchange trading day
• 15 billion network connections on the internet
• 100 sensors per car
– How many cars?
• Store first
– Then figure out if you want it
Variety
• Formats of data:
– JSON, XML
– Tab delimited / CSV
– Tweets and Facebook updates
• The formats of the future are yet to be revealed
• We need to be able to handle any of these (a short sketch follows)
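A minimal sketch of coping with variety, assuming we read JSON and tab-delimited/CSV input into the same plain-dictionary record shape; the field names in the example are illustrative:

# Normalize two of the formats above into one common record shape.
import csv
import io
import json

def records_from_json(text):
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

def records_from_delimited(text, delimiter="\t"):
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Both sources end up as lists of dicts, so later stages do not care
# which format the data originally arrived in.
recs = records_from_json('[{"user": "a", "clicks": 3}]')
recs += records_from_delimited("user\tclicks\nb\t5\n")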
Variability or Veracity
• Making sense of the data on its own is problematic
• The meaning may change over the course of time
• The reliability issues cannot be dismissed
Challenges
• If we had only one of these Vs to deal with, we could handle it successfully
• The problem is that we often have to deal with two or more at a time
– This may make traditional relational databases too slow to respond
– This has led to the rise of NoSQL databases
Big Data Life Cycle
• Several phases:
– Acquisition
– Extraction and cleaning
– Integration, Aggregation and Representation
– Modeling and Analysis
– Interpretation
• Let us look at each of these
Acquisition
• The data is usually recorded by sensors and transmitted to a computer
– In the case of computer data the sensor is software: an application, a web server, a web browser or part of the network
• Often the data is so large that real-time filtering/compression is required to reduce it (a sketch follows)
– E.g. the Large Hadron Collider (11 TB/sec) or the upcoming Square Kilometre Array telescope (100 million TB/day)
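A minimal sketch of reducing a stream at acquisition time, under the assumption that only readings that change meaningfully are worth keeping; this illustrates the idea and is not the LHC’s actual trigger system:

# Keep a reading only when it moves more than `threshold`
# away from the last value we kept.
def reduce_stream(readings, threshold=0.5):
    last = None
    for value in readings:
        if last is None or abs(value - last) > threshold:
            last = value
            yield value          # kept; everything else is dropped at the source

# A slowly drifting sensor stores far fewer points than it samples.
print(list(reduce_stream([1.0, 1.1, 1.2, 2.0, 2.1, 3.5])))   # -> [1.0, 2.0, 3.5]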
Extraction and cleaning
• Data arrives in a variety of formats
• Consider health care data:
– Raw data from sensors, each in its own format
– Admission data from an existing database
– Physician comments
• These diverse sources must be formatted to be usable
Cleaning
• Sensor data may have errors from transmission or interference
• Transcription of handwritten information may reflect the bias of the author
• Part of the extraction process is to attempt to clean up the data (a small sketch follows)
• There is no predefined way to do so
– It depends on the source of the data, which determines the types of errors possible
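A minimal cleaning sketch for one kind of source: sensor readings with occasional transmission glitches. The valid range used here is an assumed, source-specific rule; a different source would need different checks:

# Keep readings inside a plausible range; everything else is a suspected error.
def clean_readings(readings, low=-40.0, high=60.0):
    kept, rejected = [], []
    for r in readings:
        (kept if low <= r <= high else rejected).append(r)
    return kept, rejected

kept, rejected = clean_readings([21.5, 22.0, 999.0, 20.8, -273.0])
# kept     -> [21.5, 22.0, 20.8]
# rejected -> [999.0, -273.0]   (flagged for review, not silently lost)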
Integration, Aggregation and Representation
• How is the data stored for use?
• The use of data from multiple sources complicates this
• Since the data represents many different components it needs to be stored in a form useful for modeling and analysis
Provenance
• How and where the data was collected
– Context information
• Data that will be used for multiple purposes may need to record its provenance (a small example follows)
– Web data originally collected to inform customized advertising may be used to study traffic patterns
• Single-use data, such as health records, is easier
– Even health records might be anonymized for statistical studies
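A minimal example of carrying provenance alongside the data, assuming each record is wrapped with where, when and why it was collected; the field names are illustrative:

# Wrap a measurement or event with its collection context.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ProvenancedRecord:
    data: dict                # the measurement or event itself
    source: str               # e.g. "web server log"
    collected_at: datetime
    original_purpose: str     # e.g. "customized advertising"

rec = ProvenancedRecord(
    data={"page": "/shoes", "dwell_seconds": 42},
    source="web server log",
    collected_at=datetime.now(timezone.utc),
    original_purpose="customized advertising",
)
# A later traffic-pattern study can check rec.original_purpose before reusing it.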
Modeling and Analysis
• Once the data is established in a data warehouse the data mining can proceed
• Data mining extracts previously unknown and potentially useful information
• Big data presents issues not present in normal data mining
• Let us digress to a traditional data mining example
A Local Retailer
• Sales tickets are collected at the point-of-sale terminal
• These live in the store database to guide mid-level management with the following information:
– Income and expenses
– Products selling poorly or well
• After a short time these are purged from this database
The Data Warehouse
• At a corporate data warehouse the information no longer needed for day-to-day operations is accumulated
– Every ticket from every store
• Once the data arrives it is retained for years or decades
• Data mining is used to give insight into:
– Sales trends
– Types of shoppers
– How product arrangement affects sales
Contrast
• Although the previous example is a big data application, it also differs from many others
• The data in this example is well formed, accurate and from well-controlled sources
• Big data tends to be noisy, dynamic, inter-related and not always trustworthy
– Fortunately, the magnitude of the data allows statistical methods to ignore the noise
Interpretation
• What does the data tell us?
– This is the really hard part and must be done by a knowledge worker
– Usually armed with several programs
• Part of the problem is understanding the pipeline from acquisition to model
• This involves understanding the types of errors:
– Data errors
– Programming errors
– Assumptions made in the models
Challenges of Big Data
• Heterogeneity
– People are better at processing different types of data than machines
– Also preserving the provenance throughout the pipeline
• Inconsistency and Incompleteness
– Sensors are notorious for giving an occasional false reading
– Statistics can help, but the modeling must account for the possibility
Challenges 2
• Scale
– Data volume has been increasing much faster than Moore’s Law
– Storing the data is a problem
– Processing the data in reasonable times requires substantial computer power
• Timeliness
– We generally cannot store all the data
– Filtering and compressing the data becomes important
– This must be done without loss of usefulness
Challenges 3
• Privacy and data ownership
– Health care has the most legislation concerning what may be shared and how it may be shared
– After Edward Snowden revealed what the NSA was saving, there have been considerable privacy concerns
– For example, most smart phones have some location awareness
– This is good for finding a nearby restaurant and even better for spying on you
Up and Out
• Scaling Up
– Specialized hardware – expensive
– Simplified software
– Single point of failure
• Scaling Out
– Commoditized hardware
– Specialized software for replication, querying and communications
– This is the hard part
– Any point may fail, but with no apparent loss of availability
Summary
• Big data is characterized by 4 Vs:
– Volume
– Velocity
– Variety
– Veracity
• The potential to gain useful information from it has already been astonishing
– We have just scratched the surface