BIG DATA
Introduction
Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing mobile devices, aerial (remote-sensing) platforms, software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were generated every day.
Big data is a term for data sets that are so large or
complex that traditional data processing application
software is inadequate to deal with them.
Introduction
Big data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
The term "big data" often refers simply to the use
of predictive analytics, user behavior analytics, or
certain other advanced data analytics methods that
extract value from data, and seldom to a particular
size of data set.
Big data "size" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data.
Big Data - Sources
• Internet search
• Social media
• Mobile devices
• Internet of Things
• E-business
• Next-generation radio astronomy telescopes
• etc.
Big Data - Characteristics
Big data is characterized by the "four Vs": volume, variety, velocity and veracity. That is, big data comes in large amounts (volume), is a mixture of structured and unstructured information (variety), arrives at (often real-time) speed (velocity) and can be of uncertain provenance (veracity).
Such information is unsuitable for processing using
traditional SQL-queried relational database management
systems (RDBMSs), which is why a constellation of
alternative tools -- notably Apache's open-source
Hadoop distributed data processing system, plus various
NoSQL databases and a range of business intelligence
platforms -- has evolved to service this market.
[Figure: A typical big data application supported by cloud computing]
[Figure: The framework for big data in cloud computing]
File Systems for Big Data
• Google File System (GFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. Compared to traditional file systems, GFS is designed and optimized to run in data centers, providing extremely high data throughput, low latency and strong fault tolerance.
• Hadoop Distributed File System (HDFS) is inspired by GFS; it is the storage module of the open-source Apache Hadoop software framework.
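To make this concrete, below is a minimal sketch of reading and writing HDFS files from Python. It assumes the third-party `hdfs` client package and a WebHDFS endpoint; the host, port, user and paths are illustrative placeholders, not part of the lesson.

```python
# Minimal sketch: talking to HDFS from Python via the third-party `hdfs`
# package (pip install hdfs). The URL, user and paths are placeholders.
from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (port 9870 in Hadoop 3, 50070 in Hadoop 2).
client = InsecureClient('http://namenode:9870', user='hadoop')

# Write a small file into the distributed file system.
client.write('/data/sample.txt', data=b'hello big data\n', overwrite=True)

# Read it back; client.read() yields a context-managed file-like object.
with client.read('/data/sample.txt') as reader:
    print(reader.read())

# List a directory, like `hdfs dfs -ls /data` on the command line.
print(client.list('/data'))
```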
Database Management - NoSQL
A NoSQL (originally referring to "non-SQL", "non-relational" or "not only SQL") database provides a mechanism for the storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications.
The data structures used by NoSQL databases (e.g.
key-value, wide column, graph, or document) are
different from those used by default in relational
databases, making some operations faster in NoSQL.
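These models can be sketched with plain Python dictionaries. The snippet below is purely illustrative (no real database involved) and only shows how the key-value, document and wide-column models organize data differently from a fixed-schema table.

```python
# Illustrative only: NoSQL data models mimicked with Python dicts.

# Key-value model: an opaque value looked up by a single key.
kv_store = {'user:42:last_login': '2012-06-01T08:30:00Z'}

# Document model: values are nested, self-describing (JSON-like)
# documents, and documents in one store may have different fields.
doc_store = {
    'user:42': {'name': 'Alice', 'roles': ['admin'], 'age': 30},
    'user:43': {'name': 'Bob'},  # no 'roles' or 'age': no fixed schema
}

# Wide-column model: each row key maps to its own sparse set of columns.
wide_column = {
    'row1': {'info:name': 'Alice', 'info:city': 'Beijing'},
    'row2': {'info:name': 'Bob'},  # columns can differ per row
}

# A lookup touches one entry directly, with no joins and no fixed schema,
# which is why some operations are faster than in an RDBMS.
print(doc_store['user:42']['name'])
```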
Database Management - NoSQL
Representative NoSQL database systems and their developers:
• BigTable (Google): a database engine built on GFS that stores key-value pairs which are sparse, distributed, durable and multi-dimensional.
• Dynamo (Amazon): provides tight control over the tradeoffs between consistency, availability and scalability, using the technique of consistent hashing.
• HBase (open source): a column-oriented database built on HDFS that supports MapReduce jobs and a Java API.
• Cassandra (open source): a hybrid NoSQL database with a column-based hierarchical structure.
• MongoDB (open source): a document database that supports flexible processing of JSON-style documents.
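As an illustration of the last entry, here is a minimal sketch using the pymongo driver. It assumes a MongoDB server on localhost; the database and collection names are placeholders.

```python
# Sketch: MongoDB's flexible JSON-style documents via pymongo.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # placeholder server
collection = client['lesson7']['users']            # placeholder names

# Documents in one collection need not share a schema.
collection.insert_one({'name': 'Alice', 'interests': ['hadoop', 'spark']})
collection.insert_one({'name': 'Bob', 'age': 25})

# Query by field, much as a WHERE clause would in SQL.
print(collection.find_one({'name': 'Alice'}))
```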
Execution Tools
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
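A toy word count makes the model concrete. The pure-Python sketch below imitates the map, shuffle/sort and reduce phases that the framework normally distributes across a cluster; it illustrates the programming model, not Hadoop's implementation.

```python
# Toy word count showing the three MapReduce phases in plain Python:
# map (emit key-value pairs), shuffle/sort (group by key), reduce (merge).
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in one input chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    # Merge all counts emitted for one key into a single total.
    return (word, sum(counts))

chunks = ['big data big', 'data velocity volume data']

# Map each independent chunk (on a cluster these run in parallel).
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]

# The framework sorts map output so that equal keys are adjacent (shuffle).
pairs.sort(key=itemgetter(0))

# Reduce each group of pairs that share a key.
results = [reduce_phase(word, [n for _, n in group])
           for word, group in groupby(pairs, key=itemgetter(0))]
print(results)  # [('big', 2), ('data', 3), ('velocity', 1), ('volume', 1)]
```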
Execution Tools
The key aspect of the MapReduce algorithm is that if
every Map and Reduce is independent of all other
ongoing Maps and Reduces, then the operation can be
run in parallel on different keys and lists of data. On a
large cluster of machines, you can go one step further,
and run the Map operations on servers where the
data lives. Rather than copy the data over the network
to the program, you push out the program to the
machines. The output list can then be saved to the
distributed filesystem, and the reducers run to merge
the results.
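Because each map call is independent, a driver can hand chunks to separate workers. In the sketch below, the standard-library multiprocessing module stands in for the cluster scheduler; real frameworks additionally move the computation to the nodes that hold the data.

```python
# Sketch: independent map calls farmed out to worker processes.
from multiprocessing import Pool

def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

if __name__ == '__main__':
    chunks = ['big data big', 'data velocity volume data']
    with Pool(processes=2) as pool:
        # Each chunk is mapped in a different worker process, mirroring
        # "push the program out to the machines where the data lives".
        mapped = pool.map(map_phase, chunks)
    print(mapped)
```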
Execution Tools
Apache Spark is a framework for performing general data analytics on distributed computing clusters such as Hadoop. It provides in-memory computation, which speeds up data processing compared with MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS).
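The same word count written against Spark's Python API looks as follows; this is a minimal sketch that assumes an existing cluster, and the HDFS path is a placeholder.

```python
# Minimal PySpark word count; the HDFS path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile('hdfs:///data/sample.txt')  # read from HDFS
            .flatMap(lambda line: line.split())   # emit words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum counts per word

# Intermediate results stay in memory across stages, which is the main
# source of Spark's speedup over disk-based MapReduce.
print(counts.collect())
spark.stop()
```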
Spark Streaming is an extension of the core Spark
API that enables stream processing of real-time data
streams.
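A minimal sketch of Spark Streaming's canonical network word count is shown below; it assumes a text source on localhost port 9999 (for example, one started with `nc -lk 9999`).

```python
# Sketch: Spark Streaming word count over a socket text stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='streaming-wordcount')
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream('localhost', 9999)  # assumed text source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's counts

ssc.start()             # start consuming the stream
ssc.awaitTermination()  # run until stopped
```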
Execution Tools
In the annual Daytona GraySort challenge, which benchmarks the speed of data-analysis systems, Spark easily beat Hadoop MapReduce: it sorted 100 terabytes of records in 23 minutes, while Hadoop took over three times as long (about 72 minutes) for the same task. Spark also sorted 1 petabyte in 234 minutes, better even than O(n log n) scaling would predict (246 minutes).
Query Systems
To address the high time cost of data queries in big data environments, several query systems tailored for big data have been proposed. They provide interfaces that let users access the overall big data system transparently.
Query Systems
Some solutions support information retrieval from NoSQL databases by offering SQL-style query languages.
Hive is a platform developed by Facebook. Users write queries in HiveQL, an SQL-like language. When a user submits a query, the HiveQL statements are translated into MapReduce jobs, expressed as a directed acyclic graph, which are committed to Hadoop for execution.
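A sketch of submitting such a query from Python is shown below; it assumes the third-party PyHive driver and a running HiveServer2, and the host, port and table are placeholders.

```python
# Sketch: running HiveQL from Python via PyHive (pip install pyhive).
from pyhive import hive

conn = hive.connect(host='hive-server', port=10000)  # placeholder host
cursor = conn.cursor()

# SQL-like HiveQL; Hive compiles this into a DAG of MapReduce jobs
# and submits them to Hadoop for execution.
cursor.execute("""
    SELECT word, COUNT(*) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```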
Data Visualization
A primary goal of data visualization is to communicate
information clearly and efficiently via statistical
graphics and information graphics. Numerical data
may be encoded using dots, lines, or bars, to visually
communicate a quantitative message. Effective
visualization helps users analyze and reason about
data and evidence. It makes complex data more
accessible, understandable and usable.
Data Visualization
The most frequently used statistical graphics include scatter plots, histograms and box plots.
[Figure: An example scatter plot]
[Figure: An example box plot]
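The sketch below draws the two example plot types with matplotlib, using randomly generated data purely for illustration.

```python
# Sketch: a scatter plot and a box plot with matplotlib; random data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

ax1.scatter(x, y)  # scatter plot: relationship between two variables
ax1.set_title('Scatter plot')

ax2.boxplot([x, y], labels=['x', 'y'])  # box plot: distribution summary
ax2.set_title('Box plot')

plt.tight_layout()
plt.show()
```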
Assignment
• Listen to and read the interview "D-Wave Aims to Bring Quantum Computing to the Cloud" at http://spectrum.ieee.org/podcast/computing/hardware/dwave-aims-to-bring-quantum-computing-to-the-cloud, and learn more about how this quantum computer works at https://www.dwavesys.com/quantum-computing.