Big Data Techniques and Applications – A General Review
Dr. Jun Li, [email protected]
School of Mathematics and Computer Science, University of Wolverhampton
OUTLINE

- Big Data Concept
- Different Schools: Hadoop, HPCC, Splunk
- Databases and NoSQL
- Parallel/Distributed Computing & Databases
- Research Scenarios
BIG DATA CONCEPT

- Big Data is characterized by three Vs:
  - Volume: terabytes (10^12), yottabytes (10^24), brontobytes (10^27) and geopbytes (10^30)
  - Velocity: the speed at which the data is generated and processed
  - Variety: from unstructured (raw files and log files) to structured (relational databases), with different types such as messages, social media conversations, photos, sensor data, video and voice recordings
- Everything we do leaves a digital trace, which can be used and analysed. Because of their size and complexity, such data cannot be processed and analysed with traditional methods such as an RDBMS.
BIG DATA EXAMPLES

- A supermarket could use its loyalty card data, and monitor social media sites, to get an overall view of customer behaviour and preferences.
- Hospitals analyse medical data and patient records to predict whether a certain type of treatment is efficacious, e.g. fractal analysis of large collections of medical images.
- Calculate information entropy by language, person ID and characters, i.e. the Personal Information Entropy (pie), using data from social media, web pages, etc.
Fractal Analysis

- An image is called "fractal" if it displays self-similarity: the tree shown, for example, can be split into parts, each of which is (at least approximately) a reduced-size copy of the whole.
Fractal Analysis

- A possible characterisation of a fractal set is provided by the "box-counting" method.
Fractal Analysis

- The number of boxes is counted at several box sizes for one image, as shown by the blue curve in the slide's plot; the calculation is time-consuming.
- What if there are hundreds of thousands of images (in large storage)? What if we count with a sliding box, moved one pixel per step horizontally and vertically? (A minimal sketch follows.)
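To make the box-counting idea concrete, here is a minimal sketch (my own illustration, not the lecture's code), assuming a binary NumPy image whose foreground pixels are non-zero:

    # Count the boxes of a given size that contain foreground pixels, then
    # estimate the fractal dimension as the slope of log(count) vs log(1/size).
    import numpy as np

    def box_count(image, size):
        h, w = image.shape
        count = 0
        for i in range(0, h, size):
            for j in range(0, w, size):
                if image[i:i + size, j:j + size].any():
                    count += 1
        return count

    def fractal_dimension(image, sizes=(2, 4, 8, 16, 32)):
        counts = [box_count(image, s) for s in sizes]
        slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
        return slope

A sliding-box variant would evaluate the window at every pixel offset instead of on a disjoint grid, which is what makes the computation expensive at scale.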
What is Hadoop?

- An Apache open-source framework for distributed computing and data storage.
- Developed for large-scale computation and data processing on a network of commodity (i.e. affordable) hardware.
- Moves computation (i.e. applications) to the data, rather than moving data around.
(Figure slides: Hadoop Architecture, Hadoop Logical Deployment, Hadoop Physical Deployment, Hadoop Data Import/Export.)
Hadoop Architecture

- HDFS – the Hadoop Distributed File System
- MapReduce – a YARN-based system for parallel processing of large data sets
- YARN – a framework for job scheduling and cluster resource management, moving towards a distributed operating system
- HBase – a non-relational, distributed database
- Hive – a data warehouse infrastructure for data summarization, query and analysis
- Pig – a high-level platform for creating MapReduce programs using the Pig Latin language
HDFS – Hadoop Distributed File System

- HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines, in blocks (64 MB or 128 MB)
- Data awareness (metadata held in memory)
- Runs on top of native filesystems
HDFS Daemons

- Namenode: manages the file system's namespace and the metadata of file blocks
- Datanodes: store and retrieve data blocks and report to the namenode
- Secondary Namenode: takes snapshots of the primary namenode's directory information
HDFS – Hadoop Distributed File System

- Upload a file
- File distribution by locations and blocks (shown in the slide's diagram)
HBase

- HBase is a non-relational, distributed database running on top of HDFS
- A column-oriented key-value store (NoSQL)
- Supports random, real-time CRUD operations (unlike HDFS)
- Integrated with the MapReduce framework
- Not an ACID-compliant database
What is NoSQL?

- NoSQL: Not only SQL; schema-free
- Provides a mechanism for storage and retrieval of data modelled in structures other than the relational model, such as key-value, graph or document
- Applied in Big Data
  - NoSQL databases use MapReduce to query and index the database
  - MapReduce tasks are distributed among multiple nodes for parallel processing
What are Key-Value Pair Databases?

- KVP examples (a minimal sketch follows the tables):

  Key      | Value
  ---------|--------
  Color    | Blue
  Libation | Beer
  Hero     | Soldier

  Key                          | Value
  -----------------------------|----------------------------
  FacebookUser12345_Color      | Red
  TwitterUser67890_Color       | Brownish
  FoursquareUser45678_Libation | "White wine"
  Google+User24356_Libation    | "Dry martini with a twist"
  LinkedInUser87654_Hero       | "Top sales performer"
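As a minimal illustration (a plain Python dict standing in for a real key-value store), using the composite keys from the table:

    store = {}
    store["FacebookUser12345_Color"] = "Red"
    store["TwitterUser67890_Color"] = "Brownish"
    store["FoursquareUser45678_Libation"] = "White wine"

    # Lookup is by exact key only: no schema, no joins, no cross-entity
    # queries beyond what the store itself offers.
    print(store.get("FacebookUser12345_Color"))   # Red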
What is the Column-oriented Data Model?

- Data is stored in columns, by block
- The primary key is the data
- Whole-row operations are assumed to be rare

(The slide's diagram contrasts the row-wise and column-wise layouts of the same four records.)
HBase Data Model

- Key: row key + column family + column qualifier
- E.g. a personal information table with column families (shown in the slide)
HBase

- Cells are stored by column family, as a file (HFile) on HDFS
- Cells that are not set are not stored (no NULLs)
- A table is made up of column families
HBase Data Model

- Create table
- Insert data
- Retrieve data

(The slide shows the corresponding HBase shell screenshots; a hedged Python sketch follows.)
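The following is a hedged sketch of the same three steps using the happybase Python client (table, family and column names are invented for illustration):

    import happybase

    connection = happybase.Connection('localhost')

    # Create table: one column family called 'info'.
    connection.create_table('person', {'info': dict()})

    # Insert data: a cell is addressed as row key + 'family:qualifier'.
    table = connection.table('person')
    table.put(b'row1', {b'info:name': b'Alice', b'info:city': b'London'})

    # Retrieve data: fetch the whole row, then pick a cell.
    row = table.row(b'row1')
    print(row[b'info:name'])   # b'Alice'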
NoSQL DATABASES

- Types of NoSQL databases:
  - Column
  - Document
  - Key-value pair
  - Graph
  - Multi-model
DATABASES

- Database models:
  - Hierarchical databases
  - Network databases
  - Relational databases
  - Object-oriented databases
  - Object-relational databases
  - Entity-Attribute-Value (EAV) data model
  - Semi-structured model
  - Associative model
  - Context model
HIERARCHICAL DATABASES

- The data is organized into a tree-like structure.
- An entity type corresponds to a table in the relational database model, and a record corresponds to a row.
HIERARCHICAL DATABASES

- The hierarchical model was used by IBM's first database, IMS (Information Management System), released in 1968.
- A hierarchical schema consists of record types and PCR types:
  - A record/segment is a collection of field values.
  - Records of the same type are grouped into record types.
  - A PCR type (parent-child relationship type) is a 1:N relationship between two record types.
HIERARCHICAL DATABASES

- PDBR – Physical Data Base Record type

(The slide's diagram shows a PCR hierarchy: a department record type (dname, dnumber, mgrname, mgrstartdate) with child record types employee (name, ssn, bdate, address) and project (pname, pnumber, plocation).)
HIERARCHICAL DATABASES – LOGICAL ORGANIZATION

- Logically organized as a PDB (Physical Data Base) – a collection of occurrence trees.
- In an occurrence tree, the root is a single record with multiple child records.

(The slide's example shows an occurrence tree rooted at the Math department record, with employee records such as Jones, Tom and Mary and project records such as MI125, plus further trees for the CS and IS departments.)
HIERARCHICAL DATABASES

- Physical organization in storage (1)
  - Sequential order using an array: "top-down, left-right"
HIERARCHICAL DATABASES

- Sequential method using a linked list instead of an array
HIERARCHICAL DATABASES

- Doubly linked list: one pointer to the first child, another to the next sibling
NETWORK DATABASES

- Very similar to the hierarchical model; the hierarchical model is a subset of the network model.
- However, child tables are allowed to have more than one parent.
NETWORK DATABASES

- Network database concepts:
  - Record – represents an object (e.g. customer, branch)
  - Set – represents a one-to-many relationship (e.g. a depositor set consisting of a customer record and its account records)
NETWORK DATABASES

- Data store structure – data is organized by set (the slide's diagram shows three set values).
NETWORK DATABASES

- DML commands (a toy sketch follows this list):
  - find – locates a record or set in the database
  - get – gets a copy of the record from the database
  - store – inserts a record into the database
  - modify – modifies the current record
  - erase – deletes the current record
  - connect – inserts a record into a set: connect <record> to <set>
  - disconnect – removes a record from a set: disconnect <record> from <set>
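As a toy illustration of the record/set idea (not a real CODASYL implementation), a set instance can be modelled as one owner record linked to many member records:

    # A set instance links one owner record to its member records;
    # connect/disconnect mirror the DML commands above.
    class SetInstance:
        def __init__(self, owner):
            self.owner = owner        # e.g. a customer record
            self.members = []         # e.g. its account records

        def connect(self, record):    # connect <record> to <set>
            self.members.append(record)

        def disconnect(self, record): # disconnect <record> from <set>
            self.members.remove(record)

    depositor = SetInstance(owner={"customer": "Jones"})
    acct = {"account": "A-101", "balance": 500}
    depositor.connect(acct)
    depositor.disconnect(acct)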
NETWORK DATABASES

- Advantages of the network database model:
  - Because it supports many-to-many relationships, records can be reached along multiple access paths
  - For complex data it is easier to use, thanks to the multiple relationships among the data
- Disadvantages of the network database model:
  - Difficult for first-time users
  - Difficult to alter, because changes to entered information can affect the entire database
MapReduce

- Now MapReduce 2.0 runs on YARN – Yet Another Resource Negotiator
YARN Daemons Deployment

- YARN replaced the resource management and job scheduling of classic MapReduce (the slide shows the YARN daemons).
Word Count
MapReduce
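The slide's word-count figure is not reproduced here; a minimal Hadoop Streaming-style sketch (an assumption, not the lecture's code) uses a mapper that emits (word, 1) pairs and a reducer that sums them:

    # mapper.py - reads text from stdin and emits (word, 1) pairs,
    # one per line, in Hadoop Streaming's tab-separated format.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - sums the counts per word; Hadoop Streaming delivers
    # the mapper output sorted by key, so equal words arrive together.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

These would be run under the hadoop-streaming jar, or tested locally with a pipeline such as: cat file | python mapper.py | sort | python reducer.py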
Parallel Computing:
Data Decomposition, Task Dependency and Interaction

- Sparse matrix-vector multiplication
  - Given an n × n sparse matrix A and a vector b, compute $y = A \times b$
  - In parallel, each element is computed as $y[i] = \sum_{j=1}^{n} A[i,j] \times b[j]$
  - Each process owns y[i], A[i,*] and b[i]
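A minimal sketch of the row-wise decomposition in Python (illustrative only; a real distributed implementation would also need to communicate the pieces of b between processes):

    # Each task computes one y[i] from the sparse row A[i,*] (stored as
    # {column: value}) and the vector b. A process pool maps rows to tasks.
    from multiprocessing import Pool

    A = {0: {0: 2.0, 3: 1.0},   # row -> {column: non-zero value}
         1: {1: 5.0},
         2: {0: 1.0, 2: 4.0},
         3: {3: 3.0}}
    b = [1.0, 2.0, 3.0, 4.0]

    def row_dot(i):
        # y[i] = sum over the non-zero j of A[i, j] * b[j]
        return sum(v * b[j] for j, v in A[i].items())

    if __name__ == "__main__":
        with Pool(4) as pool:
            y = pool.map(row_dot, sorted(A))
        print(y)   # [6.0, 10.0, 13.0, 12.0]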


Parallel Computing:
Exploratory Decomposition

- 15-puzzle problem
  - A tile can be moved into the blank position
  - Determine a path/sequence, or the shortest path/sequence, to the final configuration (here, a sequence from 1 to 15)
Parallel Computing:
Exploratory Decomposition
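The slide's search-tree figure is not reproduced; as a minimal sketch of the exploratory decomposition, the successors of a configuration can be generated and handed to separate search tasks (the board encoding is my own choice):

    # A state is a tuple of 16 entries in row-major order, 0 = blank.
    # Each successor slides one neighbouring tile into the blank; in an
    # exploratory decomposition each successor seeds an independent task.
    def successors(state):
        i = state.index(0)                 # position of the blank
        row, col = divmod(i, 4)
        result = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < 4 and 0 <= c < 4:
                j = r * 4 + c
                nxt = list(state)
                nxt[i], nxt[j] = nxt[j], nxt[i]
                result.append(tuple(nxt))
        return result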
Parallel Computing Design

- Decomposition techniques
  - As shown in the examples above
- Characteristics of tasks
  - Task generation (static or dynamic)
  - Task sizes (i.e. time required to complete, or data sizes)
  - Knowledge of task sizes
  - Inter-task relations (e.g. dependency, acyclic graphs) and interactions
- Mapping tasks to processes for load balancing
Parallel Computing Design

- Parallel algorithm models:
  - The data-parallel model
  - The task graph model
  - The work pool model
  - The master-slave model
  - The pipeline or producer-consumer model
MapReduce Workflows using Oozie

- Describes workflows as a set of XML and configuration files
- Has a coordinator engine that schedules workflows based on time and incoming data
- Provides the ability to re-run failed portions of a workflow
- No directed cycles are allowed
Hadoop Support for Relational Databases

- Hive provides an SQL-like query language named HiveQL, but NOT low-latency or real-time queries; it supports table partitioning (partitioning and bucketing).
- Pig Latin uses bags (tables), tuples and fields.
- Both run on HDFS and MapReduce (data is stored as files).
Hive to MapReduce and HDFS
Concurrency Control

- Aims to prevent transactions from conflicting with each other.
- Problems normally occur when more than one transaction tries to access the same record or set of records at the same time.
- Solutions (a minimal optimistic sketch follows this list):
  - Timestamping algorithms
  - Optimistic algorithms
  - Pessimistic algorithms
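A minimal single-process sketch of the optimistic approach (my own illustration): a transaction records the version it read and commits only if that version is unchanged:

    # Optimistic concurrency control in miniature: validate at commit time.
    record = {"value": 100, "version": 1}

    def commit(rec, read_version, new_value):
        if rec["version"] != read_version:
            return False               # conflict detected: abort and retry
        rec["value"] = new_value
        rec["version"] += 1
        return True

    v1 = record["version"]             # transaction T1 reads
    v2 = record["version"]             # transaction T2 reads concurrently
    print(commit(record, v1, 150))     # True:  T1 commits first
    print(commit(record, v2, 200))     # False: T2 must retry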
Distributed systems/databases

- Essential requirements:
  - Location transparency
    - Data naming scheme, or a dictionary in the case of databases
  - Data fragmentation & replication
    - An intelligent optimizer for fragmentation and queries, to minimize the cost (I/O cost + CPU cost + communication cost)
    - The update issue of replication
  - Transaction scheduling
    - ACID
    - Two-phase commit (coordinated by an agent)
    - Concurrency issues (where is the lock manager?)
- The above three requirements call for a distributed operating system
  - Is this the reason YARN was developed?
Distributed systems/databases

- Time synchronization and global state issues
  - We cannot synchronize clocks perfectly across a distributed system, so we cannot use physical time to determine the order of an arbitrary pair of events occurring within it.
  - Lamport logical time (a minimal sketch follows)
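A minimal Lamport logical clock sketch (my own illustration, not from the slides): counters increment on local events, and a receiver advances to max(local, sender) + 1, giving a consistent ordering of causally related events without synchronized physical clocks:

    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):                   # local event or message send
            self.time += 1
            return self.time

        def receive(self, sender_time):   # message arrival
            self.time = max(self.time, sender_time) + 1
            return self.time

    a, b = LamportClock(), LamportClock()
    t = a.tick()          # process A sends a message stamped 1
    print(b.receive(t))   # process B stamps the receive event as 2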
Distributed systems/databases

- To examine whether a particular property holds, e.g. to determine whether there is a deadlock, and for global debugging
  - Consistent global state (cause and effect)
Real-Time Stream Processing

- Spark
  - User interfaces, e.g. SQL (provided by Hive), and real-time streaming.
  - Transparent interfaces connect the lower-level components, e.g. YARN and HDFS.
  - At the client, a program is launched through a 'standalone manager':

    bin/spark-submit --master spark://host:7077 --executor-memory 10g myProgram.py

  - The driver converts the user program into tasks, i.e. a directed acyclic graph (DAG), then launches workers (executors) and schedules tasks onto them.
Real-Time Stream Processing

- Input data streams are split by time interval into batches T0, T1, T2, ..., Tn by Spark Streaming receivers, and the results are pushed to external systems.
- Input, processing and output are distributed over different worker nodes, scheduled by the driver.

(The slide's diagram shows a Driver Program holding the StreamingContext and SparkContext, and Worker Nodes whose executors run a long-lived receiver task for the input stream, replicate the received data, run Spark jobs over it, and output results in batches. A hedged PySpark sketch follows.)
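A hedged PySpark Streaming sketch of the setup described above (host, port and application name are placeholders):

    # Words arriving on a socket are split into 1-second micro-batches;
    # each batch is processed as a Spark job and the results printed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, batchDuration=1)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()        # results pushed out once per batch

    ssc.start()
    ssc.awaitTermination()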
Splunk

- Reads almost any type of data (even in real time) into Splunk's internal repository, adds indexes and creates events – the data unit in Splunk.
- Users can then set up metrics and dashboards (using Splunk) that support basic business intelligence, analytics and reporting on key performance indicators (KPIs).
- Queries use a NoSQL approach, reportedly based on Unix pipeline concepts, that does not involve or impose any predefined schema: the Search Processing Language (SPL).
Splunk Architecture

(The slide's architecture diagram outlines four numbered stages, from 1. Load through to 4. Functions & Interfaces.)
Conventional use cases

- Investigational searching
  - A Splunk app (or application) can be a simple search collecting events, a group of alerts categorized for efficiency (or for many other reasons), or an entire program developed using Splunk's REST API.
- Monitoring and alerting
  - Monitor any infrastructure (e.g. Windows event logs) in real time.
- Decision support analysis
Splunk Deployment

- Dedicated search head
  - A dedicated search head is an instance that handles search management functions, directing search requests to a set of search peers and then merging the results back to users.
- Forwarder
  - A forwarder gathers data from a variety of inputs and forwards it to a Splunk Enterprise server for indexing and searching.
HPCC

High-Performance Computing Cluster
HPCC

- The Thor cluster is for extract, transform, load (ETL) processing of the raw data, as well as large-scale complex analytics, and the creation of keyed data and indexes for the Roxie cluster.
- The Thor cluster is similar in its function, execution environment, filesystem and capabilities to Hadoop MapReduce.
- The Roxie cluster is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data access processing requirements of online applications.
- The Roxie cluster is similar in function and capabilities to Hadoop with HBase and Hive, and provides near-real-time, predictable queries.
Big Data Complexity and Lambda

- Operational complexity
  - E.g. index compaction needed at times on all nodes
- Eventual consistency complexity
  - E.g. two replicas each hold a count of 10; one increases it by 2 and the other by 1. What should the merged value be? (See the sketch after this list.)
- Lack of human-fault tolerance
  - Programming mistakes
- CAP theorem – you can have at most two of Consistency, Availability and Partition tolerance.
  - In our context: 'in a distributed system, it can be consistent or available but not both'.
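One standard answer to the replica question above is 13: keep one counter slot per replica (a grow-only "G-counter"), so that the merge is well defined as a per-slot maximum. A minimal sketch, not from the slides:

    # Each replica only increments its own slot; merging takes per-slot max.
    def merge(a, b):
        return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

    replica_a = {"a": 12, "b": 0}   # saw the shared 10, then added 2
    replica_b = {"a": 10, "b": 1}   # saw the shared 10, then added 1

    merged = merge(replica_a, replica_b)
    print(sum(merged.values()))     # 13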
Lambda Architecture

- Lambda builds Big Data systems as three layers.
- The batch layer runs parallel tasks on distributed datasets to produce batch views for the serving layer.
- The speed layer accepts changes to produce real-time views; it is intended to solve CAP, and the solution is trivial.
- Queries are answered by combining batch and real-time views.
- Lambda principle: 'data is immutable'.
An example of the strength of Lambda

- The batch and serving layers together solve the normalization vs. de-normalization issue (the slide contrasts a normalized schema with a de-normalized one).
Lambda Data Model

- Graph schema: (facts and properties) vs. (tables and fields)
- Physically stored by fact
Lambda architecture

Hadoop as an Enterprise Data Hub
What other NoSQL Databases are there?

- Types of NoSQL databases:
  - Column
  - Document-Oriented Database (DOD)
  - Key-value pair
  - Graph
  - Multi-model
Document Oriented Databases

- A DOD, a subclass of the key-value database, consists of a collection of documents.
- CouchDB – a JSON document-oriented database
  - JSON documents: everything stored in CouchDB boils down to a JSON document.
  - RESTful interface: from creation to replication to data insertion, every management and data task in CouchDB can be done via HTTP.
Document Oriented Databases

- JSON document – person.json (shown in the slide)
- CouchDB DML commands (a hedged HTTP sketch follows this list):
  - POST – creates a new record
  - GET – reads records
  - PUT – updates a record
  - DELETE – deletes a record
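A hedged sketch of these commands against CouchDB's HTTP API using the Python requests library (database name, document id and URL are placeholders; a POST to the database would instead create a document with a generated id):

    import requests

    base = "http://localhost:5984"
    requests.put(base + "/people")                      # create a database

    doc = {"name": "Alice", "order": ["beer", "wine"]}
    requests.put(base + "/people/person1", json=doc)    # create/update a record

    resp = requests.get(base + "/people/person1")       # read the record
    rev = resp.json()["_rev"]                           # current revision

    requests.delete(base + "/people/person1",           # delete needs the rev
                    params={"rev": rev})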
MapReduce Operation on DOD

- Map function – retrieve the orders from person.json
- Reduce function – calculate the sales of products

(The slide shows the corresponding view code, which is not reproduced here.)
NoSQL Doubts

- Concurrency control
  - Unavailability of ACID properties, therefore transactions are not reliably supported. (Is Neo4j an exception?)
- Data integrity
  - Inability to define parent-versus-child relationships (graphs can be complex), therefore data can become inconsistent
  - Absence of support for JOINs and cross-entity queries
Suggestions – 1

- RDBMS for transactional applications
- NoSQL/RDBMS for computational applications (e.g. sales record management)
- NoSQL for web-scale applications (e.g. web analytics)

Suggestions – 2: Polyglot Persistence

- Polyglot Persistence: using different data storage technologies for varying data storage needs
Information entropy

- Claude Shannon's information entropy is defined by

  $H(X) = -\sum_i P(x_i)\,\log_2 P(x_i)$   (1)

  where $P(x_i)$ is the probability of occurrence of $x_i$; H is an expected value that serves as a measure of uncertainty.

- For example, to calculate the entropy of the 26 English letters over a big corpus,

  $H(L) = -\sum_{i=1}^{26} P(l_i)\,\log_2 P(l_i)$   (2)

  where the $P(l_i)$ are the relative frequencies of the letters in the corpus and $-\log_2 P(l_i)$ is the number of bits needed to represent that probability; $H(L)$ is then the expected value. (A minimal computation sketch follows.)
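A minimal sketch of equation (2) in Python (the corpus string is a placeholder):

    # Estimate letter entropy from letter frequencies in a corpus.
    import math
    from collections import Counter

    corpus = "this is a small placeholder corpus".replace(" ", "")
    counts = Counter(corpus)
    total = sum(counts.values())

    H = -sum((n / total) * math.log2(n / total) for n in counts.values())
    print("H(L) = %.3f bits per letter" % H)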
Information entropy

- Shannon estimated the entropy of written English to be between 1.0 and 1.5 bits per character, based on clean English. In reality, the spoken and typed English on the Internet is full of noise, so the entropy should be higher.
- What about English words?
- What about other languages?
- What about each person?
- I expect each person to be associated with a unique number – a Personal Information Entropy (pie) in both the real and the virtual world, with featured computation beyond a language model (see more in the thesis 'Noisy Language Modelling Framework using Neural Network Techniques').
- Skynet is coming true and AlphaGo has beaten human players; we need to hide our 'pie'.
Questions – what is Big-Data?
References

- Hierarchical Model: http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/e.pdf
- Hierarchical Database: www.uwinnipeg.ca/~ychen2/databaseNotes/hierarchicalDB.ppt
- Network Model: http://codex.cs.yale.edu/avi/db-book/db5/slide-dir/appA.ppt and http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/d.pdf
- CouchDB – Getting Started: http://guide.couchdb.org/draft/tour.html
- Jiawei Han and Micheline Kamber (2006), Data Mining – Concepts and Techniques, 2nd Edition
- Ananth Grama et al. (2003), Introduction to Parallel Computing, 2nd Edition
- Jun Li (2009), 'Noisy Language Modelling Framework using Neural Network Techniques'
- Hadoop tutorial: http://www.coreservlets.com/hadoop-tutorial
- Holden Karau et al. (2015), Learning Spark, O'Reilly
- George Coulouris et al., Distributed Systems: Concepts and Design, 5th Edition
- Nathan Marz et al. (2015), Big Data – Principles and Best Practices of Scalable Realtime Data Systems (Lambda Architecture)
- Michael Manoochehri (2014), Data Just Right – Introduction to Large-Scale Data & Analytics
- See related references in the notes of each slide