Lesson 7 Big Data

Data sets grow rapidly, in part because they are
increasingly gathered by cheap and numerous
information-sensing mobile devices, aerial (remote
sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless
sensor networks. The world's technological per-capita
capacity to store information has roughly
doubled every 40 months since the 1980s; as of 2012,
2.5 exabytes (2.5×10^18 bytes) of data are generated every day.
Big data is a term for data sets that are so large or
complex that traditional data processing application
software is inadequate to deal with them.
Big data is high volume, high velocity, and/or high
variety information assets that require new forms of
processing to enable enhanced decision making,
insight discovery and process optimization.
The term "big data" often refers simply to the use
of predictive analytics, user behavior analytics, or
certain other advanced data analytics methods that
extract value from data, and seldom to a particular
size of data set.
Big data "size" is a constantly moving target, as of
2012 ranging from a few dozen terabytes to many
petabytes of data.
Big Data - Sources
• Internet search
• Social media
• Mobile devices
• Internet of Things
• Next-generation radio astronomy telescopes
• And others
Big Data - Characteristics
Big data is characterized by 'four Vs': volume, variety,
velocity and veracity. That is, big data comes in large
amounts (volume), is a mixture of structured and
unstructured information (variety), arrives at (often real-time) speed (velocity) and can be of uncertain
provenance, or origin (veracity).
Such information is unsuitable for processing using
traditional SQL-queried relational database management
systems (RDBMSs), which is why a constellation of
alternative tools -- notably Apache's open-source
Hadoop distributed data processing system, plus various
NoSQL databases and a range of business intelligence
platforms -- has evolved to service this market.
A typical big data application supported by cloud computing
The framework for big data in cloud computing
File Systems for Big Data
• Google File System (GFS) is a proprietary distributed
file system developed by Google to provide efficient,
reliable access to data using large clusters of
commodity hardware. Compared to traditional file
systems, GFS is designed and optimized to run in
data centers and to provide extremely high data
throughput, low latency and strong fault tolerance.
• Hadoop Distributed File System (HDFS) is inspired
by GFS and is the storage module of the open-source Apache Hadoop software framework.
Database Management - NoSQL
A NoSQL (originally referring to "non SQL", "non
relational" or "not only SQL") database provides a
mechanism for storage and retrieval of data which is
modeled in means other than the tabular relations
used in relational databases. NoSQL databases are
increasingly used in big data and real-time
web applications.
The data structures used by NoSQL databases (e.g.
key-value, wide column, graph, or document) are
different from those used by default in relational
databases, making some operations faster in NoSQL.
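To make the four data structures concrete, the sketch below writes one small record in each shape using plain Python containers. It is an illustration only: the keys, field names and the `followers_of` helper are hypothetical, and no real database client is involved.

```python
# Illustrative only: the same kind of "user" record in four NoSQL-style
# shapes, using plain Python containers instead of a real database.

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Wide column: row key -> column family -> column -> value; rows may be sparse.
wide_column = {
    "user:42": {"profile": {"name": "Ada", "city": "London"}},
    "user:43": {"profile": {"name": "Lin"}},  # no "city" column is stored
}

# Document: a nested, self-describing record (JSON-like).
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "London"},
    "tags": ["admin"],
}

# Graph: nodes plus labelled edges between them.
graph_nodes = {42: {"name": "Ada"}, 43: {"name": "Lin"}}
graph_edges = [(42, "FOLLOWS", 43)]


def followers_of(node_id):
    """Return the ids of nodes that have a FOLLOWS edge pointing at node_id."""
    return [
        src
        for src, label, dst in graph_edges
        if label == "FOLLOWS" and dst == node_id
    ]
```

A lookup such as `followers_of(43)` walks edges directly, whereas a relational design would typically express the same question as a join. That difference in access path is one reason some operations can be faster in a NoSQL model.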
DBMS based on NoSQL
• A database engine built on GFS that stores sets of
key-value pairs that are sparse, distributed,
durable and multidimensional.
• A store that provides tight control over the trade-offs between
consistency, availability and extendibility, and uses
consistent hashing.
• Open source: a column-oriented database built on HDFS that
supports the execution of MapReduce tasks and a Java API.
• Open source: a hybrid NoSQL database with a
hierarchical structure based on columns.
• Open source: a NoSQL database that supports flexible processing
of JSON-type documents.
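Consistent hashing, mentioned in the list above, can be sketched as a ring of hash points. The code below is a minimal toy version under stated assumptions: the `HashRing` class, the node names and the use of MD5 with 100 virtual points per node are all illustrative choices, not the design of any particular database.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with several virtual points per node."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `replicas` points for the node on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Usage: adding a node only moves keys onto the new node.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on `node-d`; keys that stayed put keep their original owner. With a plain `hash(key) % number_of_nodes` scheme, by contrast, most keys would change owner when a node is added.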
Environment of execution tools
MapReduce is a programming model and an associated
implementation for processing and generating big data
sets with a parallel, distributed algorithm on a cluster.
A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map
tasks in a completely parallel manner. The framework
sorts the outputs of the maps, which are then input to
the reduce tasks. Typically both the input and the output
of the job are stored in a file system. The framework
takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
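The split, map, sort and reduce flow described above can be sketched in a few lines of single-process Python. This is a teaching sketch, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle/sort: group all emitted values by key, in sorted key order."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(chunks):
    """Run map on each chunk independently, then shuffle and reduce."""
    mapped = [map_phase(chunk) for chunk in chunks]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


# The input is already split into independent chunks.
chunks = ["big data is big", "data sets grow", "big sets"]
counts = run_job(chunks)
# counts == {"big": 3, "data": 2, "grow": 1, "is": 1, "sets": 2}
```

In a real framework each call to `map_phase` would be a separate task on a separate machine, and the framework would handle the scheduling, monitoring and retrying that this sketch leaves out.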
The key aspect of the MapReduce algorithm is that if
every Map and Reduce is independent of all other
ongoing Maps and Reduces, then the operation can be
run in parallel on different keys and lists of data. On a
large cluster of machines, you can go one step further,
and run the Map operations on servers where the
data lives. Rather than copy the data over the network
to the program, you push out the program to the
machines. The output list can then be saved to the
distributed filesystem, and the reducers run to merge
the results.
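A minimal sketch of "pushing the program to the data" follows. It assumes a hypothetical three-node cluster modelled as a dictionary, with threads standing in for machines; the names `partitions`, `count_locally` and `merge` are illustrative only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" already holds one partition of the data.
partitions = {
    "node-1": ["error", "ok", "ok"],
    "node-2": ["ok", "error", "error"],
    "node-3": ["ok"],
}


def count_locally(node):
    """Map task shipped to a node: it reads only that node's partition."""
    return Counter(partitions[node])


def merge(partials):
    """Reduce: merge the small per-node summaries into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# The map tasks share no state, so they can run concurrently in any order.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_locally, partitions))

totals = merge(partials)
# totals["ok"] == 4 and totals["error"] == 3
```

Only the small `Counter` summaries travel to the reducer; the raw records never leave the node that stores them, which is the point of running Map where the data lives.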
Apache Spark is a framework for performing general
data analytics on a distributed computing cluster such as
Hadoop. It provides in-memory computation, which
makes data processing faster than with MapReduce. It
runs on top of an existing Hadoop cluster and accesses
the Hadoop data store (HDFS).
Spark Streaming is an extension of the core Spark
API that enables stream processing of real-time data.
In the annual Daytona Gray Sort Challenge, which
benchmarks the speed of data analysis systems, Spark
easily beat Hadoop MapReduce, sorting
100 terabytes of records in 23
minutes; Hadoop took over three times as long to
execute the same task, about 72 minutes. Spark
sorted 1 petabyte in 234 minutes, better than the
246 minutes that O(n log n) growth from the 100-terabyte result would predict.
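The 246-minute figure can be reproduced by scaling the 23-minute result under an O(n log n) cost model. The sketch below assumes n is measured in bytes with decimal units (100 TB = 10^14 bytes, 1 PB = 10^15 bytes); that reading is an inference that happens to match the quoted number, not something the source states.

```python
import math

t_100tb = 23.0    # minutes to sort 100 TB (from the text)
n_small = 100e12  # 100 TB in bytes, decimal units (assumption)
n_large = 1e15    # 1 PB in bytes, decimal units (assumption)


def scaled_time(t, n1, n2):
    """Extrapolate a runtime from size n1 to size n2 assuming O(n log n) cost."""
    return t * (n2 * math.log(n2)) / (n1 * math.log(n1))


predicted = scaled_time(t_100tb, n_small, n_large)
# 23 * 10 * (15 / 14) is about 246.4 minutes, versus the measured 234
```

The base of the logarithm does not matter here because it cancels in the ratio.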
Query Systems
Query times can be prohibitively high in a big data
environment, so query systems tailored for big data
have been proposed. They
provide interfaces that let users access the overall big
data system transparently.
Some solutions support retrieving
information from NoSQL databases through
SQL-like query languages.
Hive is a platform developed by Facebook. Users
query it in HiveQL, a language
similar to SQL. When users input a
query, the HiveQL statements are translated into
MapReduce jobs expressed as a directed acyclic
graph, which is submitted to Hadoop for execution.
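To show how a SQL-like query can be phrased as map and reduce steps, the sketch below hand-translates one query over a tiny in-memory table. It illustrates the idea only and does not show how Hive's compiler works; the `users` table, its columns and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical table "users"; the query being illustrated is:
#   SELECT city, COUNT(*) FROM users WHERE age >= 18 GROUP BY city
users = [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Lin", "city": "Beijing", "age": 29},
    {"name": "Sam", "city": "London", "age": 17},
    {"name": "Wei", "city": "Beijing", "age": 41},
]


def map_row(row):
    """Map stage: apply the WHERE filter and emit (group key, 1)."""
    if row["age"] >= 18:
        yield row["city"], 1


def reduce_group(city, ones):
    """Reduce stage: COUNT(*) for one GROUP BY key."""
    return city, sum(ones)


# Shuffle/sort: bring all pairs with the same key next to each other.
emitted = sorted(pair for row in users for pair in map_row(row))

result = [
    reduce_group(city, (count for _, count in group))
    for city, group in groupby(emitted, key=itemgetter(0))
]
# result == [("Beijing", 2), ("London", 1)]
```

A query with joins or several aggregations would need more than one such stage, which is why the translated jobs form a graph rather than a single map and reduce pair.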
Data Visualization
A primary goal of data visualization is to communicate
information clearly and efficiently via statistical
graphics and information graphics. Numerical data
may be encoded using dots, lines, or bars, to visually
communicate a quantitative message. Effective
visualization helps users analyze and reason about
data and evidence. It makes complex data more
accessible, understandable and usable.
The most frequently used statistical graphics include
scatter plots, histograms, and box plots.
An example of a scatter plot
An example of a box plot
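A box plot is drawn from five summary values: the minimum, the first quartile, the median, the third quartile and the maximum. The sketch below computes them with Python's standard `statistics` module; the `five_number_summary` helper and the sample data are illustrative, and the "inclusive" quartile method is one of several conventions in use.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
# summary == (1, 3.0, 5.0, 7.0, 9)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum (plotting libraries often shorten the whiskers and mark outliers separately).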
[3] Y. Zhang, J. Ren, J. Liu, C. Xu, H. Guo and Y. Liu, "A
Survey on Emerging Computing Paradigms for Big
Data," Chinese Journal of Electronics, vol. 26, no.
1, pp. 1-12, 2017.
• Listen to and read the interview “D-Wave Aims to
Bring Quantum Computing to the Cloud” from
ardware/dwave-aims-to-bring-quantumcomputing-to-the-cloud, and get to know more
about how this quantum computer works from