Big Data Techniques and Applications – A General Review
Dr. Jun Li, [email protected]
School of Mathematics and Computer Science, University of Wolverhampton

OUTLINE
- Big Data Concept
- Different Schools: Hadoop, HPCC, Splunk
- Databases and NoSQL
- Parallel/Distributed Computing & Databases
- Research Scenarios

BIG DATA CONCEPT
Big Data is characterised by three Vs:
- Volume: terabytes (10^12), yottabytes (10^24), brontobytes (10^27) and geopbytes (10^30)
- Velocity: the speed at which the data is generated and processed
- Variety: from unstructured (raw files and log files) to structured (relational databases), with different types such as messages, social media conversations, photos, sensor data, video and voice recordings
Everything we do leaves a digital trace, which can be used and analysed. Because of their size and complexity, such data cannot be processed and analysed with traditional methods such as an RDBMS.

BIG DATA EXAMPLES
- A supermarket could use its loyalty card data, and monitor social media sites, to get an overall view of customer behaviour and preferences.
- Hospitals analyse medical data and patient records to predict whether a certain type of treatment is efficacious, e.g. fractal analysis of large numbers of medical images.
- Calculate information entropy by language, person ID and characters, i.e. Personal Information Entropy (PIE), using data from social media, web pages, etc.

Fractal Analysis
An image is called "fractal" if it displays self-similarity: the tree shown, for example, can be split into parts, each of which is (at least approximately) a reduced-size copy of the whole. A possible characterisation of a fractal set is provided by the "box-counting" method: the number of boxes containing part of the image is counted at different box sizes, as shown in the blue curve. The calculation is time-consuming. What if there are hundreds of thousands of images (in large storage)? What if we count with a sliding box – moving one pixel per step horizontally and vertically?
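The box-counting method above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: an "image" is assumed to be a set of filled pixel coordinates, and the dimension is estimated as the slope of log N(s) against log(1/s).

```python
from math import log

def box_count(pixels, size):
    """Count boxes of the given size that contain at least one filled pixel."""
    return len({(x // size, y // size) for x, y in pixels})

def fractal_dimension(pixels, sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension: least-squares slope of
    log N(s) versus log(1/s) over the given box sizes."""
    xs = [log(1.0 / s) for s in sizes]
    ys = [log(box_count(pixels, s)) for s in sizes]
    n = len(sizes)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A completely filled 16x16 square is 2-dimensional, so the estimate is 2.
square = {(x, y) for x in range(16) for y in range(16)}
print(round(fractal_dimension(square), 2))  # → 2.0
```

The "sliding box" variant (one-pixel steps) counts roughly one box per pixel per scale instead of one per grid cell, multiplying the work per image; across hundreds of thousands of images this is exactly the kind of load the distributed frameworks in the rest of these slides target.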
What is Hadoop
An Apache open-source framework for distributed computing and data storage, developed for large-scale computation and data processing on a network of commodity (i.e. affordable) hardware. It moves computation (i.e. applications) to the data rather than moving data around.

Hadoop Architecture
(Slides: Hadoop logical deployment, physical deployment, data import/export.)
- HDFS – Hadoop Distributed File System
- MapReduce – a YARN-based system for parallel processing of large data sets
- YARN – a framework for job scheduling and cluster resource management, moving toward a distributed operating system
- HBase – a non-relational, distributed database
- Hive – a data warehouse infrastructure for data summarisation, query and analysis
- Pig – a high-level platform for creating MapReduce programs using the language Pig Latin

HDFS – Hadoop distributed file system
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines in blocks (64 MB or 128 MB), with data awareness (metadata held in memory), and runs on top of native file systems.
HDFS daemons:
- Namenode: manages the file system's namespace/metadata of file blocks
- Datanodes: store and retrieve data blocks and report to the namenode
- Secondary namenode: takes snapshots of the primary namenode's directory information
(Slides: uploading a file; file distribution by location and block.)

HBase
HBase is a non-relational, distributed database running on top of HDFS:
- Column-oriented key-value store (NoSQL)
- Supports random real-time CRUD operations (unlike HDFS)
- Integrated with the MapReduce framework
- Not an ACID-compliant database

What is NoSQL
NoSQL: "Not only SQL", schema-free. It provides a mechanism for storage and retrieval of data that is modelled in data structures such as key-value, graph or document, rather than the RDBMS model.
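The HDFS block model described above can be made concrete with a toy calculation. This is an illustrative sketch, not the Hadoop implementation: the 128 MB default block size and 3-way replication match HDFS, but the round-robin placement below is a simplification (real HDFS placement is rack-aware) and the datanode names are invented.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    """Toy round-robin replica placement across datanodes."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

# A 300 MB file splits into three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # → 3
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Because each block lives on several datanodes, the scheduler can move the computation to whichever node already holds a replica, which is the "move computation to the data" point above.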
Applied in Big Data
NoSQL databases use Map/Reduce to query and index the database; Map/Reduce tasks are distributed among multiple nodes for parallel processing.

What is Key-Value Pair Databases
KVP examples:
  Key      → Value
  Color    → Blue
  Libation → Beer
  Hero     → Soldier

  Key                          → Value
  FacebookUser12345_Color      → Red
  TwitterUser67890_Color       → Brownish
  FoursquareUser45678_Libation → "White wine"
  Google+User24356_Libation    → "Dry martini with a twist"
  LinkedInUser87654_Hero       → "Top sales performer"

What is Column-oriented Data Model
- Stores data in columns, by block
- The primary key is the data
- Assumes whole-row operations are rare

HBase Data Model
Key: row _ column family _ column, e.g. a personal information table with column families. Cells are stored by column family as a file (HFile) on HDFS. Cells that are not set are not stored (no NULLs). A table is made of column families.
(Slides: create table, insert data, retrieve data.)

NoSQL DATABASES
Types of NoSQL databases: Column, Document, Key-value Pair, Graph, Multi-model.

DATABASES
Database models: hierarchical databases, network databases, relational databases, object-oriented databases, object-relational databases, the Entity-Attribute-Value (EAV) data model, the semi-structured model, the associative model, the context model.

HIERARCHICAL DATABASES
The data is organised into a tree-like structure. An entity type corresponds to a table in the relational database model, and a record corresponds to a row. IBM's first database, IMS (Information Management System), released in the 1960s, was hierarchical. A hierarchical schema consists of record types and PCR types. A record/segment is a collection of field values; records of the same type are grouped into record types. A PCR type (parent-child relationship type) is a 1:N relationship between two record types.
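The Map/Reduce pattern that NoSQL databases apply, as noted at the start of this section, can be simulated in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is an illustrative sketch with made-up input splits; in Hadoop the framework runs the map and reduce functions on different nodes and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one word."""
    return key, sum(values)

splits = ["big data big ideas", "data moves to computation"]  # two input splits
intermediate = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["big"], counts["data"])  # → 2 2
```

This is the same word-count shape the MapReduce slides later in the deck use as their standard example.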
HIERARCHICAL DATABASES
PDBR – Physical Data Base Record.
(Diagram: PCR types linking the record types department (dname, dnumber, mgrname, mgrstartdate), employee (name, ssn, bdate, address) and project (pname, pnumber, plocation).)

HIERARCHICAL DATABASES – LOGICAL ORGANIZATION
Logically organised in a PDB (Physical Data Base) – a collection of occurrence trees. In an occurrence tree, the root is a single record with multiple child records.
(Diagram: an occurrence tree rooted at the Math department record, with employee records (Jones, Tom, Mary) and project records as children; sibling trees for the CS and IS departments.)

HIERARCHICAL DATABASES – PHYSICAL ORGANIZATION
- Sequential order using an array – "top-down, left-right"
- Sequential method using a linked list instead of an array
- Doubly linked list: one pointer to the first child, another to the next sibling

NETWORK DATABASES
Very similar to the hierarchical model; the hierarchical model is a subset of the network model, but child tables are allowed to have more than one parent.
Network database concepts:
- Record – represents an object (e.g. customer, branch)
- Set – represents a one-to-many relationship (e.g.
a depositor set consisting of customer and account records)

NETWORK DATABASES
Data store structure – data is organised by set.
(Diagram: a data store with sets, each holding three set values.)

NETWORK DATABASES – DML COMMANDS
- find – locates a record or set in the database
- get – gets a copy of the current record from the database
- store – inserts a record into the database
- modify – modifies the current record
- erase – deletes the current record
- connect – inserts a record into a set: connect <record> to <set>
- disconnect – removes a record from a set: disconnect <record> from <set>

NETWORK DATABASES
Advantages of the network database model:
- Because it supports many-to-many relationships, any table record can be accessed easily
- For complex data, it is easier to use because of the multiple relationships among the data
Disadvantages of the network database model:
- Difficult for first-time users
- Alterations are difficult, because information entered can alter the entire database

MapReduce
Now MapReduce 2.0 on YARN – Yet Another Resource Negotiator. YARN replaces the resource management and job scheduling of MapReduce 1.
(Slides: YARN daemons, deployment, word count in MapReduce.)

Parallel Computing: Data decomposition, task dependency and interaction
Sparse matrix-vector multiplication: given an n x n sparse matrix A and a vector b, compute y = A × b. In parallel, each process calculates y[i] = Σ (j = 1..n) A[i,j] × b[j], and owns y[i], A[i,*] and b[i].

Parallel Computing: Exploratory Decomposition
The 15-puzzle problem: a number can be moved into the blank position. Determine a path/sequence, or the shortest path/sequence, to the final configuration (here, the sequence from 1 to 15).

Parallel Computing Design
Decomposition techniques; characteristics of tasks, as shown in the examples above:
- Task generation
(static or dynamic)
- Task sizes (i.e. time required to complete, or data sizes), and knowledge of task sizes
- Inter-task relations (i.e. dependency, acyclic) and interactions
- Mapping tasks to processes for load balancing

Parallel Computing Design – Parallel Algorithm Models
- The Data-Parallel Model
- The Task Graph Model
- The Work Pool Model
- The Master-Slave Model
- The Pipeline or Producer-Consumer Model

MapReduce Workflows using Oozie
Oozie describes workflows in a set of XML and configuration files, has a coordinator engine that schedules workflows based on time and incoming data, provides the ability to re-run failed portions of a workflow, and allows no directed cycles.

Hadoop Support for Relational Databases
Hive provides an SQL-like query language named HiveQL, but NOT low-latency or real-time queries; it supports table partitioning (partitioning and bucketing). Pig Latin uses bags (tables), tuples and fields. Both run on HDFS and MapReduce (data is stored as files).
(Slide: Hive to MapReduce and HDFS.)

Concurrency Control
Concurrency control prevents transactions from conflicting with each other. Problems normally occur if more than one transaction tries to access the same record or set of records at the same time. Solutions: timestamping algorithms, optimistic algorithms, pessimistic algorithms.

Distributed systems/databases – Essential Requirements
- Location transparency
- Data fragmentation & replication
- An intelligent optimizer for fragmentation and queries, to minimise the cost (I/O cost + CPU cost + communication cost)
- The update issue of replication
- Transaction scheduling
- A data naming scheme, or a dictionary in the case of databases
- ACID
- Two-phase commit (coordinated by an agent)
- Concurrency issues (where is the lock manager?)
The above three require a distributed operating system. Is this the reason YARN was developed?
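The sparse matrix-vector decomposition shown earlier (y[i] = Σ_j A[i,j] × b[j], with rows distributed among processes) can be sketched as follows. This is an illustrative serial simulation: the matrix values are made up, and the "processes" are just independent row partitions that a real system would run in parallel (e.g. with multiprocessing or MPI).

```python
# Sparse rows: A[i] maps column index j -> A[i][j]; absent entries are zero.
A = {0: {0: 2.0, 3: 1.0},
     1: {1: 4.0},
     2: {0: 1.0, 2: 3.0},
     3: {3: 5.0}}
b = [1.0, 2.0, 3.0, 4.0]

def row_task(rows, b):
    """One process's share of work: y[i] = sum over j of A[i,j] * b[j],
    skipping the zero entries that a sparse format never stores."""
    return {i: sum(a_ij * b[j] for j, a_ij in row.items())
            for i, row in rows.items()}

# Decompose by rows: process p owns the rows i with i % num_procs == p.
# The partitions share no output, so the tasks have no dependencies and
# could run concurrently; here they run sequentially for clarity.
num_procs = 2
partitions = [{i: row for i, row in A.items() if i % num_procs == p}
              for p in range(num_procs)]
partials = [row_task(part, b) for part in partitions]

y = [0.0] * len(b)
for part in partials:
    for i, value in part.items():
        y[i] = value
print(y)  # → [6.0, 8.0, 10.0, 20.0]
```

Load balancing here depends on the task-size knowledge listed above: rows with more non-zeros are bigger tasks, so a static round-robin split can be uneven.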
Distributed systems/databases
Time synchronisation and global state issues: we cannot synchronise clocks perfectly across a distributed system, so physical time cannot be used to find the order of an arbitrary pair of events occurring within it; Lamport logical time is used instead. To examine whether a particular property holds – e.g. to determine whether there is a deadlock, or for global debugging – a consistent global state (cause before effect) is needed.

Real-Time Stream Processing – Spark
- User interfaces, e.g. SQL (provided by Hive) and real-time streaming
- Transparent interfaces connect the lower-level components, e.g. YARN and HDFS
- At the client, a program is launched through a 'standalone manager':
  bin/spark-submit --master spark://host:7077 --executor-memory 10g myProgram.py
- Spark converts a user program into tasks, i.e. a directed acyclic graph (DAG), then launches workers (executors) and schedules tasks on them

Real-Time Stream Processing
Data is split by time interval: Spark Streaming receivers turn the input data streams into batches T0, T1, T2, …, Tn, and the results are pushed to external systems. Input, processing and output are distributed over different worker nodes, scheduled by the server.
(Diagram: a driver program with StreamingContext and SparkContext; worker nodes running executors with a long-running receiver task and processing tasks; received data is replicated; output results are produced in batches.)

Splunk
Splunk reads almost any type of data (even in real time) into its internal repository, adds indexes and creates events – the data unit in Splunk. Users can then set up metrics and dashboards (using Splunk) that support basic business intelligence, analytics, and reporting on key performance indicators (KPIs). A NoSQL query approach is used, reportedly based on Unix pipeline concepts, which does not involve or impose any predefined schema: the Search Processing Language (SPL).
(Diagram: Splunk architecture – 1. load … 4. functions & interfaces.)
Splunk – Conventional use cases
- Investigational searching
- Monitoring and alerting: monitor any infrastructure (e.g. Windows event logs) in real time
- Decision support analysis
A Splunk app (or application) can be a simple search collecting events, a group of alerts categorised for efficiency (or for many other reasons), or an entire program developed using Splunk's REST API.

Splunk Deployment
- A dedicated search head is an instance that handles search management functions, directing search requests to a set of search peers and then merging the results back to users.
- A forwarder gathers data from a variety of inputs and forwards it to a Splunk Enterprise server for indexing and searching.

HPCC – High-Performance Computing Cluster
- The Thor cluster is for extract, transform, load (ETL) processing of the raw data, as well as large-scale complex analytics and the creation of keyed data and indexes for the Roxie cluster. It is similar in function, execution environment, filesystem and capabilities to Hadoop MapReduce.
- The Roxie cluster is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data-access processing requirements of online applications. It is similar in function and capabilities to Hadoop with HBase and Hive, and provides near-real-time, predictable queries.

Big Data Complexity and Lambda
Operational complexity:
- Eventual consistency complexity, e.g. two replicas hold a count of 10; one increases it by 2 and the other by 1. What should the merged value be?
- Lack of human-fault tolerance, e.g. index compaction at times for all nodes; programming mistakes
CAP theorem – you can have at most two of Consistency, Availability and Partition tolerance. In our context: 'In a distributed system, it can be consistent or available but not both.'

Lambda Architecture
Lambda builds Big-Data systems as three layers.
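One standard answer to the replica-merge question above (two replicas start at 10; one adds 2, the other adds 1; the merged value should be 13) is a grow-only counter CRDT: each replica keeps its own count, and merging takes the element-wise maximum. This sketch is illustrative and not from the slides; the class and replica IDs are invented.

```python
class GCounter:
    """Grow-only counter CRDT: one count per replica; merge = per-replica max.
    Merging is commutative and idempotent, so replicas converge."""

    def __init__(self, replica_id, counts=None):
        self.replica_id = replica_id
        self.counts = dict(counts or {})

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        merged = dict(self.counts)
        for rid, count in other.counts.items():
            merged[rid] = max(merged.get(rid, 0), count)
        return GCounter(self.replica_id, merged)

    def value(self):
        return sum(self.counts.values())

# Both replicas start from a shared count of 10, recorded under replica "a".
r1 = GCounter("a", {"a": 10})
r2 = GCounter("b", {"a": 10})
r1.increment(2)  # one replica adds 2
r2.increment(1)  # the other adds 1
print(r1.merge(r2).value())  # → 13
```

Because the merge is deterministic regardless of order, neither update is lost and no coordination is needed, which is exactly the eventual-consistency complexity the slide points at.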
The Batch layer runs parallel tasks on distributed datasets to produce batch views for the Serving layer. The Speed layer accepts changes to produce real-time views, and is intended to address the CAP trade-off. Queries are answered by combining the batch and real-time views. The Lambda principle: 'Data is immutable.'

An example of the strength of Lambda: the Batch and Serving layers together solve the normalisation/de-normalisation issue (normalised vs de-normalised schemas).

Lambda Data Model
Graph schema: (facts and properties) vs (tables and fields); physically stored by fact.
(Slides: Lambda architecture; Hadoop as an enterprise data hub.)

What else – NoSQL Databases
Types of NoSQL databases: Column, Document-Oriented Database (DOD), Key-value Pair, Graph, Multi-model.

Document-Oriented Databases
A DOD, a subclass of key-value database, consists of a collection of documents. CouchDB is a JSON document-oriented database:
- JSON documents – everything stored in CouchDB boils down to a JSON document
- RESTful interface – from creation to replication to data insertion, every management and data task in CouchDB can be done via HTTP
(Slide: a JSON document, person.json.)
CouchDB DML commands:
- POST – creates a new record
- GET – reads records
- PUT – updates a record
- DELETE – deletes a record
MapReduce operation on a DOD: a map function retrieves orders from person.json; a reduce function calculates the sales of products.

NoSQL Doubts
- Concurrency control: without ACID properties, transactions are not reliably supported (is Neo4j an exception?)
- Data integrity: the inability to define parent-versus-child relationships (a graph could be complex) means data can be inconsistent
- Absence of support for JOINs and cross-entity queries

Suggestions – 1
- RDBMS for transactional applications
- NoSQL/RDBMS for computational applications (e.g. sales record management)
- NoSQL for web-scale applications (e.g.
web analytics)

Suggestions – 2: Polyglot Persistence
Polyglot persistence: using different data-storage technologies for varying data-storage needs.

Information entropy
Claude Shannon's information entropy is defined by
  H(X) = − Σ_i P(x_i) log2 P(x_i),   (1)
where P(x_i) is the probability of occurrence of x_i; H is an expected value, used as a measure of uncertainty. For example, to calculate the entropy of the 26 English letters in a big corpus,
  H(L) = − Σ_{i=1..26} P(l_i) log2 P(l_i),   (2)
where the P(l_i) are the relative occurrences of the letters in the corpus and log2 P(l_i) is the number of bits that can represent the probability; H(L) is then the expected value.

Information entropy
Shannon estimated the entropy of written English to be between 1.0 and 1.5 bits per character, based on clean English. But in reality the spoken and typed English on the Internet is full of noise, so the entropy should be higher. How about English words? How about other languages? How about each person? I expect each person is associated with a unique number – a Personal Information Entropy (PIE) in both the real and the virtual world, with featured computation beyond a language model (see more in the thesis, Noisy Language Modelling Framework Using Neural Network Techniques). Skynet is coming true and AlphaGo has beaten human players; we need to hide our 'PIE'.

Questions – what is Big Data?

References
- Hierarchical Model: http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/e.pdf
- Hierarchical Database: www.uwinnipeg.ca/~ychen2/databaseNotes/hierarchicalDB.ppt
- Network Model: http://codex.cs.yale.edu/avi/db-book/db5/slide-dir/appA.ppt and http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/d.pdf
- CouchDB – Get Started: http://guide.couchdb.org/draft/tour.html
- Jiawei Han and Micheline Kamber (2006), Data Mining – Concepts and Techniques, 2nd ed.
- Ananth Grama et al.
(2003), Introduction to Parallel Computing, 2nd ed.
- Jun Li (2009), 'Noisy Language Modelling Framework Using Neural Network Techniques'
- Hadoop tutorial: http://www.coreservlets.com/hadoop-tutorial
- Holden Karau et al. (2015), Learning Spark, O'Reilly
- George Coulouris et al., Distributed Systems: Concepts and Design, 5th ed.
- Nathan Marz et al. (2015), Big Data – Principles and Best Practices of Scalable Realtime Data Systems (Lambda Architecture)
- Michael Manoochehri (2014), Data Just Right – Introduction to Large-Scale Data & Analytics
See related references in the notes of each slide.