NTT DATA Big Data Reference Architecture
Ver. 1.0

Copyright © 2015 NTT DATA Corporation

Big Data Reference Architecture is a joint work of NTT DATA and EVERIS SPAIN, S.L.U.

Table of Contents

Chap.1 Advance of Big Data Utilization
Chap.2 NTT DATA Big Data Reference Architecture
Chap.3 Use cases of Big Data Reference Architecture
    3.1. Forecast of variation of financial market index by using SNS data
    3.2. Automation Tool for System Development in Design Phase
    3.3. Real-time Bridge Monitoring System
    3.4. Traffic Congestion Control System
Chap.4 Challenges of Big Data utilization and the features of BDRA

Figures

Fig.1: Cases of Big Data use
Fig.2: NTT DATA Big Data Reference Architecture (BDRA)
Fig.3: Layers of NTT DATA Big Data Reference Architecture
Fig.4: Patterns of analysis scenario

Chap.1 Advance of Big Data Utilization

It has been said that the world will be filled with a substantial amount of data and that the utilization of Big Data will therefore drive the competitiveness of enterprises. In fact, the Ministry of Internal Affairs and Communications in Japan estimated that the volume of data distributed in enterprises expanded 8.7 times in the 9 years from 2005 to 2013.
Data utilization already has established use cases. In the marketing domain, “Ad technology” is applied to Internet advertising and to demand forecasting for individuals. In the operational management and quality control domain, Big Data improves design accuracy in the manufacturing industry and operational efficiency in the transportation industry. (Figure 1)

The IoT (Internet of Things) is one of the most important subjects of Big Data utilization. Every product is connected to a network and equipped with sensors that report its condition. We can therefore collect that information in real time from a remote location and operate the product remotely. New services that use this information in real time will soon follow. In response, many enterprises are working harder than before on building mechanisms for accumulating, analyzing, and utilizing Big Data.

Figure 1: Cases of Big Data use

Utilization domain: Marketing
    Use case: DSP (Demand-Side Platform) for Internet advertising (Ad technology)
    Use case: Demand forecasting for individual produce management

Utilization domain: Business management and quality control
    Use case: Accuracy improvement in designing/machining operators in manufacturing industry
    Use case: Forecasting and management for growth conditions of livestock
    Use case: Optimization of operation schedule by onboard GPS data and number of passengers

Source: (Information and Communications in Japan, Ministry of Internal Affairs and Communications, Japan, 2014)

Chap.2 NTT DATA Big Data Reference Architecture

Looking at the mechanisms of data utilization around the world, individual technologies are already available, such as Hadoop, an infrastructure that supports distributed processing of large amounts of data, and CEP (Complex Event Processing), which supports real-time analysis.
Furthermore, some of these technologies are distributed as open source, so anyone can use them easily. However, the key to utilizing Big Data for business is not only gathering elemental technologies but also promptly combining them into a mechanism that fits the business purpose, and then expanding and developing that mechanism flexibly. NTT DATA Group has therefore systematized the Big Data Reference Architecture (BDRA), which draws on its global experience of developing Big Data solutions. (Figure 2) With BDRA, we can lay out a Big Data utilization policy that matches the purpose and the state of the existing systems of each enterprise.

Figure 2: NTT DATA Big Data Reference Architecture (BDRA)

This chapter introduces the framework of BDRA, which helps in understanding the use cases in the following chapter. The features of BDRA are discussed later.

BDRA is composed of three platforms and seven layers. The first is the data platform, which processes various data for analysis and contains three layers: “Information Gathering”, “Information Store”, and “Data Processing”. The second is the analytics platform, the core function for data utilization, which contains two layers: “Data Analytics” and “Information Utilization”. The third is the management platform for overall management, which contains two layers: “Governance” and “Infrastructure”. (Figure 3)

Figure 3: Layers of NTT DATA Big Data Reference Architecture

Category: Data Platform

    Layer: Information Gathering
    Overview: This layer contains functions that gather data generated and stored in various data sources, such as web media, sensors, and databases, and change it into a form that can be easily analyzed. It integrates different types of data by ETL, and it improves reliability, availability, and accessibility through messaging/replication and through information shared between different resources such as software and hardware.

    Layer: Information Store
    Overview: This layer contains database functions for flexibly storing and processing massive amounts of data. Examples are a distributed data store, which enables processing of massive amounts of data; an in-memory database, which enables high-speed processing; and NoSQL, which provides high scalability and flexibility.

    Layer: Data Processing
    Overview: This layer contains functions for high-speed processing of the massive amounts of data collected and for pre-processing before analysis. The core functions of a Big Data solution are in this layer, such as distributed parallel processing, which enables massive data processing, and complex event processing, which enables high-speed processing.

Category: Analytics Platform

    Layer: Data Analytics
    Overview: This layer contains functions for analyzing stored and collected data, such as correlation analysis, natural language analysis, and machine learning. For example, text mining and data mining are contained in this layer. Moreover, “BICLAVIS”, the analytics method originally developed by NTT DATA, optimizes various analytical methods and utilizes them in multiple ways.

    Layer: Information Utilization
    Overview: This layer contains functions for decision support based on the results of analysis. Data visualization, OLAP, and business process management are contained in this layer.

Category: Management Platform

    Layer: Governance
    Overview: This layer contains functions for data management, such as data quality control and data protection. It realizes data quality management through information lifecycle management, data profiling, master data management, and metadata management. From the viewpoint of data protection, it contains security management and auditing.

    Layer: Infrastructure
    Overview: This layer contains functions for both operation management and system management, for the purpose of managing reliability, availability, performance, and scalability.

Details about the data analytics method “BICLAVIS” are described in “Challenges of Big Data utilization and the features of BDRA”.

Chap.3 Use cases of Big Data Reference Architecture

We introduce four main cases that use the BDRA described in the previous chapter.

3.1. Forecast of variation of financial market index by using SNS data

The following section describes the “Twitter sentiment index”, which was developed through real-time analytics on a huge amount of data and which revealed a relation between a stock index and Twitter data.

Recently, the use of SNS data such as Twitter data has been becoming popular in the financial sector in the United States, and demand for such use is also increasing in Japan. To meet this demand, NTT DATA and NTT DATA Mathematical Systems developed the “Twitter sentiment index”, a numerical indicator of the proportion of positive or negative sentiments expressed in tweets relating to the stock market, obtained by extracting and analyzing Twitter data in real time. By extracting several million stock-related tweets over 35 months (from January 2011 to November 2013), we verified that there is a statistically significant correlation between the “Twitter sentiment index” and the “Nikkei 225 volatility index”.

The key points of Big Data utilization in this case are maintaining real-time analysis efficiently and selecting analytical technologies.
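The idea of a sentiment index and its correlation with a volatility index can be illustrated with a minimal Python sketch. This is not NTT DATA's implementation. The index definition used here (share of positive tweets among tweets labelled positive or negative), the labels, and the volatility values are all hypothetical assumptions for illustration, and the significance test mentioned in the text is omitted.

```python
from math import sqrt

def sentiment_index(labels):
    """Share of positive tweets among tweets labelled "pos" or "neg".

    Assumed definition for illustration; the paper does not give a formula.
    """
    pos = sum(1 for s in labels if s == "pos")
    neg = sum(1 for s in labels if s == "neg")
    total = pos + neg
    return pos / total if total else 0.5  # neutral when there is no signal

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)

# Hypothetical sentiment labels for stock-related tweets, one list per day.
daily_labels = [
    ["pos", "pos", "neg", "pos"],
    ["pos", "neg", "neg", "neg"],
    ["pos", "pos", "neg", "neg"],
    ["neg", "neg", "neg", "neg", "pos"],
    ["pos", "pos", "pos", "neg", "pos"],
]
# Hypothetical volatility index values for the same days.
volatility = [18.0, 27.5, 22.0, 30.0, 17.0]

index = [sentiment_index(day) for day in daily_labels]
r = pearson(index, volatility)
print("sentiment index per day:", index)
print("correlation with volatility:", round(r, 3))
```

In this made-up data, days with a lower share of positive tweets have higher volatility, so the coefficient is strongly negative. With real data the labels would come from text mining, and the series would cover many more days.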
To analyze in real time, the system needs a mechanism for quickly extracting data from a high volume of data. In addition, Japanese text takes more time to process than text in languages such as English: English separates words with spaces, whereas in Japanese the smallest word units must be determined from context. Real-time analysis of tweets is therefore realized by “Distributed Parallel Processing” in the Data Processing layer and “Distributed Data Store” in the Information Store layer of BDRA (specifically, the Hadoop Distributed File System). Another feature of this case is the integration of various technologies, such as “Text Mining” and “Data Mining” in the Data Analytics layer and “Rule Engine” in the Data Processing layer.

Furthermore, the data analytics method “BICLAVIS”, systematized by NTT DATA, assists in selecting efficient analysis methods. In this case, the “Evaluation and Factor Analysis” scenario pattern is used to evaluate the correlation between the “Twitter sentiment index” and the “Nikkei 225 volatility index”. Details about the data analytics method “BICLAVIS” are described in “Challenges of Big Data utilization and the features of BDRA”.

3.2. Automation Tool for System Development

The following section describes a case in which we introduced an automation tool for system development by constructing a flexible data model and using metadata management.

NTT DATA provides a total solution for open system development called “TERASOLUNA”, which realizes conventional IT systems with high quality in a short term, in response to changes in the business environment such as the progress of globalization.
We developed “TERASOLUNA DS” as one of these solutions. It gathers information for system development, such as design information, and contributes to optimizing system development and quality assurance by checking the consistency of design documents and accumulating design know-how. “TERASOLUNA DS” provides various functions: automated checks of consistency and notation variation among design documents, accelerated full-text search of design documents and source code, impact analysis for specification changes, and support for entering design documents. It drastically improves productivity in the design phase by reducing reviews and by helping to identify the range affected by specification changes or bugs.

In this case, the key points of Big Data utilization are handling schemas that are complex because document formats differ from project to project, and handling schema redesign when a new document format is added. All documents are first converted into XML files by “ETL” processing in the Information Gathering layer and are then stored in a NoSQL database. This creates a flexible, schema-independent data model.

In addition, “Metadata Management” in the Governance layer makes it possible to automate the consistency check on design documents efficiently and accurately. The volume of design documents in large-scale system development is enormous, reaching 40,000 files and 400,000 pages. By managing metadata about these design documents, such as structure, attributes, and recorded information, the tool checks and analyzes them more accurately than a manual review.

3.3. Real-time Bridge Monitoring System

The following section describes a case of applying high-speed processing to massive amounts of data, based on the service that monitors the Tokyo Gate Bridge in Japan and the Can Tho Bridge in Vietnam.

Bridges and roads are social infrastructure that supports people's lives, and they are expected to be safe at all times. The road administrator is therefore required to detect defects or damage in bridges and then decide on road traffic flow or specify available routes. NTT DATA continuously collects and analyzes various data in real time, such as strains of bridge beams and piers, using sensors placed on the bridges.

The key point of Big Data utilization in this case is processing large amounts of sensor data in a short period. By using “Complex Event Processing” in the Data Processing layer of BDRA, this system can quickly analyze the sensor data of more than 100 bridges with just one server. This matters in large-scale disasters, when road administrators need an overview of multiple bridges. The system is realized by combining technologies in each layer: “Data Mining” in the Data Analytics layer extracts abnormal patterns, and “Data Visualization” in the Information Utilization layer clearly visualizes the detected anomalies.

Additionally, we can improve the accuracy of anomaly detection by using the data analytics method “BICLAVIS” developed by NTT DATA. The abnormal values detected from the sensor data include measurement failures caused by sensor malfunction and external forces such as high winds and earthquakes. To distinguish such abnormal values from defects in bridges, we implemented pre-processing that removes low-frequency components and determination logic that uses lag correlation based on the positional relation between the sensors.
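The two steps named above, low-frequency component removal and lag correlation, can be sketched in plain Python. This is only an illustration under stated assumptions, not the logic of the actual monitoring system. A centred moving average stands in for the low-frequency filter, and the sensor readings, the window size, and the expected lag of 3 samples (which in practice would follow from sensor spacing) are hypothetical.

```python
from math import sqrt

def remove_low_frequency(series, window):
    """Subtract a centred moving average so that slow drift is removed."""
    half = window // 2
    cleaned = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        cleaned.append(series[i] - sum(series[lo:hi]) / (hi - lo))
    return cleaned

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def best_lag(upstream, downstream, max_lag):
    """Return (lag, r) for the delay at which downstream best follows upstream."""
    best, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        a = upstream[:len(upstream) - lag]  # upstream[t]
        b = downstream[lag:]                # downstream[t + lag]
        r = correlation(a, b)
        if r > best_r:
            best, best_r = lag, r
    return best, best_r

# Hypothetical strain readings: a slow drift (for example thermal) plus a
# load that reaches sensor B three samples after it reaches sensor A.
n = 60
drift = [0.01 * t for t in range(n)]
sensor_a = [drift[t] + (5.0 if 20 <= t < 25 else 0.0) for t in range(n)]
sensor_b = [drift[t] + (5.0 if 23 <= t < 28 else 0.0) for t in range(n)]

clean_a = remove_low_frequency(sensor_a, window=15)
clean_b = remove_low_frequency(sensor_b, window=15)
lag, r = best_lag(clean_a, clean_b, max_lag=8)

expected_lag = 3  # assumed value derived from sensor positions
consistent = abs(lag - expected_lag) <= 1 and r > 0.8
print("best lag:", lag, "correlation:", round(r, 3), "consistent:", consistent)
```

In this sketch a response that appears at the neighbouring sensor with the expected delay is treated as consistent with a real load or external force. A spike with no correlated response at the neighbour would then point to a measurement failure. That decision rule is an assumption made for the example.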
Anomaly detection uses BICLAVIS scenario patterns: “Outlier Detection” if it is possible to define the abnormal patterns and “False Detection” if it is difficult to define them.

3.4. Traffic Congestion Control System

The following section describes a case of easing traffic congestion by using simulation technology for massive data together with a prediction/control analytics model.

Traffic congestion is one of the biggest problems for both developed and developing countries. Congestion causes environmental problems such as fossil fuel consumption and CO2 emissions, as well as enormous time and financial losses. Many countries have a strong interest in reducing and easing traffic congestion; however, most measures are expensive and the effectiveness of each measure is unclear. In addition, these measures tend to achieve only partial optimization rather than overall optimization.

To solve this problem, NTT DATA developed a traffic simulation system that can evaluate the effectiveness of congestion-easing measures such as traffic light control and traffic restrictions. The system uses GPS data collected from car navigation systems and smartphones in each vehicle. In a traffic simulation we tested in Jilin, China, using the simulation results to ease congestion achieved a 27 percent improvement in bus service times.

The technology we developed is based on statistical traffic models of vehicles, roads, intersections, and traffic lights, and it reproduces the traffic environment on a computer. It also makes it possible to control the traffic lights with the green-light pattern that minimizes traffic congestion. The patterns are produced by traffic simulations and are evaluated by tuning the relevant parameters.
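Evaluating signal patterns by simulation and parameter tuning can be illustrated with a toy example. The sketch below is far simpler than the multi-agent simulator described in this section: it models one intersection with two approaches as fluid queues and sweeps a single parameter, the green time of one approach. All names, arrival rates, and the discharge rate are hypothetical assumptions.

```python
def simulate(green_ns, cycle=60, horizon=3600,
             arrival_ns=0.30, arrival_ew=0.15, saturation=0.5):
    """Return total queueing (vehicle-seconds) for one fixed signal plan.

    green_ns   : seconds of green per cycle for the north-south approach;
                 the east-west approach is green for the rest of the cycle
    arrival_*  : vehicles arriving per second on each approach (assumed)
    saturation : vehicles discharged per second of green (assumed)
    """
    queue_ns = queue_ew = 0.0
    total_delay = 0.0
    for t in range(horizon):
        queue_ns += arrival_ns
        queue_ew += arrival_ew
        if t % cycle < green_ns:
            queue_ns = max(0.0, queue_ns - saturation)
        else:
            queue_ew = max(0.0, queue_ew - saturation)
        total_delay += queue_ns + queue_ew
    return total_delay

def best_green_time(candidates):
    """Pick the candidate green time with the smallest simulated delay."""
    return min(candidates, key=simulate)

candidates = list(range(10, 55, 5))  # 10, 15, ..., 50 seconds
best = best_green_time(candidates)
for g in candidates:
    print(f"green_ns={g:2d}s  delay={simulate(g):10.0f} vehicle-seconds")
print("best green time for north-south:", best, "seconds")
```

With these assumed rates, the north-south approach needs at least 36 seconds of green per cycle and the east-west approach at least 18 seconds, so 40 seconds is the only candidate that keeps both queues from growing. A realistic evaluation would cover many intersections and time-varying demand, which is where the processing capacity discussed in this section is needed.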
Multi-agent simulation technology runs the multiple elements that make up the system on a computer and predicts the future. With this system, a traffic administrator can judge the effectiveness of traffic measures under various scenarios in advance. A traffic administrator can also identify the causes of the current traffic conditions, such as the roads and time slots that tend to cause congestion.

In this case, the key point of Big Data utilization is high-speed processing that lets the traffic simulation platform simulate a large traffic volume. By utilizing “Distributed Parallel Processing” in the Data Processing layer, this system can handle over one million vehicles. In addition to “Distributed Parallel Processing”, we combine functions in other layers of BDRA: “Real Time Capture” in the Information Gathering layer and “Data Visualization” in the Information Utilization layer. This allows the architecture to be developed efficiently and appropriately. Moreover, we use the analysis method “BICLAVIS” for the prediction and control analysis and adopt the “Risk Simulation” scenario pattern for this system.

Chap.4 Challenges of Big Data utilization and the features of BDRA

This chapter describes common issues of Big Data utilization found in the cases above and the features of BDRA.

・Issues of Big Data utilization

(1) The combination of multiple IT infrastructure technologies
A single IT infrastructure technology is not enough; enterprises that utilize Big Data need to combine data gathering, data storing, data processing, and data analysis. In particular, it has become common to choose among various technologies for data storage, including relational databases and NoSQL databases. Guidance is therefore required on how to store data and how to use the stored data.

(2) Data analysis of various industries
Enterprises in various industries, such as finance, IT, and social infrastructure, utilize Big Data. Furthermore, data analysis methods become more complicated as enterprises require more advanced analysis results. It is important to select the correct data analysis method in order to meet these requirements without slowing the business down.

(3) Assurance of the data quality
As mentioned in the case of the automation tool for system development, data stored by enterprises was not originally intended for analysis. It is therefore necessary to verify by data profiling whether the data is usable for analysis. It is also important to manage the lifecycle of data properly in order to obtain meaningful results from data analysis.

・The features of BDRA

BDRA has the following features to solve the three issues above.

(1) Comprehensive framework to realize rapid and flexible technology integration
BDRA systematizes knowledge of Big Data utilization, grounded in deep understanding, together with combinations of that knowledge. BDRA has verified combinations of vendor products and open source software (OSS), so it can support selecting combinations of products from different vendors. Frequently used product combinations are provided as sets, and it is also possible to select products that fit existing IT systems.

(2) “BICLAVIS” to realize systematic and efficient approach to analysis
NTT DATA developed “BICLAVIS”, a set of cross-industry data analytics methods, based on data analysis carried out in over 200 cases. Data analysis work tends to depend on individuals, while a wide range of industries and businesses seek data analysis. NTT DATA has therefore built a mechanism for gathering data analysis know-how and has systematized analytics models so that this know-how can be used across industries.
Specifically, we organize this know-how as patterns of analysis scenarios based on the purpose of analysis, categorizing and summarizing the purposes, procedures, and techniques of data analysis in a template for each scenario. (Figure 4) We can thus respond to requests from any industry.

Figure 4: Patterns of analysis scenario

Scenario pattern: Overview

Portent: Detect the signs of structure change and situation change from Big Data.
Anomaly Detection: Automatically detect an abnormal pattern in real time and stimulate early crisis response through alerting.
False Detection: Detect an illegal or outlier situation that is fitted to the definition of an abnormal pattern.
Outlier Detection: Detect a deviation from the standard or normal situation.
Prediction and Control: By clarifying the relation between cause and effect at work and estimating the change in result due to manipulating causes, understand the appropriate standard for the cause.
Profit Simulation: Estimate the effect of work restructuring measures and prioritize them by simulation.
Risk Simulation: Assess risks by business modeling with uncertainties and prioritize them.
Optimization: Select the measures that maximize performance by the optimization method.
Risk Hedge: Support risk reduction with the risk scattered method.
Targeting: Extract the targets to be approached, such as potential customers, in order to maximize cost-effectiveness.
Credit Control: Determine the default risk of individuals or bankruptcy risk of enterprises.
Evaluation and Factor Analysis: Weigh up the various objects and identify the factors.
Context Awareness: Recommend the product and the service through analysis of behavior and preferences in advance.
Process Trace: Extract the process of growth and development and identify the accelerator or inhibitor.

(3) “Governance” layer to establish Big Data governance
BDRA has abundant governance functions required for the utilization of Big Data, such as functions for improving data reliability and security. As the saying “Garbage in, garbage out” goes, inaccurate data yields only meaningless results. For improving data reliability in particular, we specify that data profiling is carried out before data cleansing and before rules are fixed. The system for managing master data has also been confirmed. Furthermore, various security functions are provided to protect data. Personal data has recently been the subject of much discussion, so security is necessary to utilize Big Data with peace of mind. BDRA provides security management as a series of methods with various audit points, such as IT audit, information security audit, and Data-Centric Audit and Protection (DCAP).

As described above, BDRA aggregates know-how about various architectures and technology integration for the utilization of Big Data. NTT DATA has so far provided many architectures based on BDRA, and we will continue to improve it.

NTT DATA Corporation
Toyosu Center Building, 3-3, Toyosu 3-chome, Koto-ku, Tokyo 135-6033, Japan
http://www.nttdata.com

The display of the “™(TM)” mark or the “®(R)” mark might be omitted in this paper.