NTT DATA
Big Data Reference Architecture
Ver. 1.0
Copyright © 2015 NTT DATA Corporation
Big Data Reference Architecture is a joint work of
NTT DATA and EVERIS SPAIN, S.L.U.
NTT DATA Big Data Reference Architecture
Table of Contents
Chap.1 Advance of Big Data Utilization
Chap.2 NTT DATA Big Data Reference Architecture
Chap.3 Use cases of Big Data Reference Architecture
3.1. Forecast of variation of financial market index by using SNS data
3.2. Automation Tool for System Development in Design Phase
3.3. Real-time Bridge Monitoring System
3.4. Traffic Congestion Control System
Chap.4 Challenges of Big Data utilization and the features of BDRA
Figure
Fig.1: Cases of Big Data use
Fig.2: NTT DATA Big Data Reference Architecture (BDRA)
Fig.3: Layers of NTT DATA Big Data Reference Architecture
Fig.4: Patterns of analysis scenario
Chap.1 Advance of Big Data Utilization
It has been said that the world will be filled with enormous amounts of data and that the
utilization of Big Data will therefore drive the competitiveness of enterprises.
In fact, the Ministry of Internal Affairs and Communications in Japan estimated that the volume
of data distributed in enterprises grew 8.7-fold in the 9 years from 2005 to 2013.
Data utilization has several use cases. In the marketing domain, these include “Ad technology”,
which is applied to Internet advertising, and demand forecasting for individuals. In the
operational management and quality control domain, they include improving design accuracy in
the manufacturing industry and improving operational efficiency in the transportation industry.
(Figure 1)
The IoT (Internet of Things) is one of the most important subjects of Big Data utilization. Every
product is connected to a network and equipped with sensors that report its condition. We can
therefore collect this information in real time from a remote location and control the product.
New services that use this information in real time will soon follow.
In response, many enterprises are working harder than before on building mechanisms for
accumulating, analyzing, and utilizing Big Data.
Figure 1: Cases of Big Data use

| Utilization domain | Use case |
|---|---|
| Marketing | DSP (Demand-Side Platform) for Internet advertising (Ad technology) |
| Marketing | Demand forecasting for individual produce management |
| Business management and quality control | Accuracy improvement in designing/machining operators in manufacturing industry |
| Business management and quality control | Forecasting and management for growth conditions of livestock |
| Business management and quality control | Optimization of operation schedule by onboard GPS data and number of passengers |

Source: (Information and Communications in Japan, Ministry of Internal Affairs and Communications, Japan, 2014)
Chap.2 NTT DATA Big Data Reference Architecture
Looking at how data is utilized around the world, individual technologies are already
available, such as Hadoop, an infrastructure that supports distributed processing of large
amounts of data, and CEP (Complex Event Processing), which supports real-time analysis.
Furthermore, some of these technologies are distributed as open source, so anyone can use
them easily.
However, the key to utilizing Big Data for business is not merely gathering elemental
technologies but promptly combining them into a mechanism that fits the business purpose,
and then flexibly expanding and developing it.
NTT DATA Group has therefore systematized the Big Data Reference Architecture (BDRA), which
draws on its global experience of developing Big Data solutions. (Figure 2)
By using BDRA, we can formulate a policy for Big Data utilization in accordance with the
purpose and the state of the existing systems in each enterprise.
Figure 2: NTT DATA Big Data Reference Architecture (BDRA)
This section introduces the framework of BDRA, which helps in understanding the use cases in
the following chapter. The features of BDRA are discussed later.
BDRA is composed of three platforms and seven layers. The first is the data platform, which
processes the various data for analysis and contains three layers: “Information Gathering”,
“Information Store”, and “Data Processing”. The second is the analytics platform, which is
the core function for data utilization and contains two layers: “Data Analytics” and
“Information Utilization”. The third is the management platform for total management, which
contains two layers: “Governance” and “Infrastructure”. (Figure 3)
Figure 3: Layers of NTT DATA Big Data Reference Architecture

| Category | Layer | Overview |
|---|---|---|
| Data Platform | Information Gathering | This layer contains functions that gather various data generated and stored in various data sources such as web media, sensors and databases, changing them into a form that can be easily analyzed. It integrates different types of data by ETL, and improves reliability, availability, and accessibility through messaging/replication and information sharing between different resources such as software and hardware. |
| Data Platform | Information Store | This layer contains database functions for flexibly storing and processing massive amounts of data. For example, a distributed data store, which handles massive amounts of data; an in-memory database, which processes data at high speed; and NoSQL, which provides high scalability and flexibility, are contained in this layer. |
| Data Platform | Data Processing | This layer contains a function for high-speed processing of massive amounts of collected data and a pre-processing function for analysis. For example, the core functions of a Big Data solution, such as distributed parallel processing, which handles massive data, and complex event processing, which processes data at high speed, are in this layer. |
| Analytics Platform | Data Analytics | This layer contains functions for analyzing stored and collected data, such as correlation analysis, natural language analysis and machine learning. For example, text mining and data mining are contained in this layer. Moreover, the analytics method “BICLAVIS”, originally developed by NTT DATA, optimizes various analytical methods and utilizes them in multiple ways. |
| Analytics Platform | Information Utilization | This layer contains functions for decision support with the results of analysis. Data visualization, OLAP, and business process management are contained in this layer. |
| Management Platform | Governance | This layer contains functions for data management such as data quality control and data protection. It realizes data quality management through information lifecycle management, data profiling, master data management, and metadata management. From the data protection point of view, it contains security management and auditing. |
| Management Platform | Infrastructure | This layer contains functions that realize both operation management and system management for the purpose of managing reliability, availability, performance, and scalability. |
Details about the data analytics method “BICLAVIS” are described in “Challenges of Big Data utilization and the features of BDRA”.
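As an illustration, the three platforms and seven layers in Figure 3 can be written down as a
small lookup table. The Python sketch below is not part of BDRA itself; the names
`BDRA_LAYERS`, `platform_of`, and `group_by_platform` are hypothetical. It takes the components
that a solution uses, each tagged with its layer, and groups them by platform. The sample input
lists the components this paper names for the SNS use case in section 3.1.

```python
from collections import defaultdict

# Platforms and layers exactly as listed in Figure 3.
BDRA_LAYERS = {
    "Data Platform": ["Information Gathering", "Information Store", "Data Processing"],
    "Analytics Platform": ["Data Analytics", "Information Utilization"],
    "Management Platform": ["Governance", "Infrastructure"],
}


def platform_of(layer):
    """Return the platform that contains the given layer."""
    for platform, layers in BDRA_LAYERS.items():
        if layer in layers:
            return platform
    raise KeyError(f"unknown BDRA layer: {layer}")


def group_by_platform(components):
    """Group components by platform; `components` maps component name -> layer name."""
    grouped = defaultdict(list)
    for component, layer in components.items():
        grouped[platform_of(layer)].append(component)
    return dict(grouped)


# Components named in use case 3.1 of this paper, tagged with their layers.
sns_case = {
    "Distributed Parallel Processing": "Data Processing",
    "Distributed Data Store": "Information Store",
    "Rule Engine": "Data Processing",
    "Text Mining": "Data Analytics",
    "Data Mining": "Data Analytics",
}
```

Calling `group_by_platform(sns_case)` shows that this use case draws on the data platform and
the analytics platform, which is one simple way to record which parts of the architecture a
project relies on.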
Chap.3 Use cases of Big Data Reference Architecture
This chapter introduces four main cases that use the BDRA described in the previous chapter.
3.1. Forecast of variation of financial market index by using SNS data
This section describes the “Twitter sentiment index”, which was developed through real-time
analytics of a huge amount of Twitter data, and the relation it revealed between the index
and a stock index.
Recently, the use of SNS data such as Twitter data has been growing in the financial sector in
the United States. Demand for such use is also increasing in Japan.
To meet this demand, NTT DATA and NTT DATA Mathematical Systems developed the “Twitter
sentiment index”, a numerical indicator of the proportion of positive or negative sentiments
expressed in tweets relating to the stock market, obtained by extracting and analyzing Twitter
data in real time.
We verified that there is a statistically significant correlation between the “Twitter sentiment
index” and the “Nikkei 225 volatility index” by extracting several million stock-related
tweets over 35 months (from January 2011 to November 2013).
The key points of Big Data utilization in this case are maintaining real-time analysis
efficiently and selecting analytical technologies.
Real-time analysis requires a mechanism for quickly extracting data from a high volume of
data. In addition, Japanese text takes more time to process than text in other languages:
English and similar languages separate words with spaces, whereas in Japanese the smallest
word units have to be determined from context. Therefore, real-time analysis of tweets is
realized by “Distributed Parallel Processing” in the Data Processing layer and “Distributed
Data Store” in the Information Store layer (specifically, the Hadoop Distributed File System)
in BDRA. Another feature of this case is the integration of various technologies such as
“Text Mining” and “Data Mining” in the Data Analytics layer and “Rule Engine” in the Data
Processing layer.
Furthermore, the data analytics method “BICLAVIS”, systematized by NTT DATA, assists in
selecting efficient analysis methods. In this case, an “Evaluation and Factor Analysis”
scenario pattern is used to evaluate the correlation between the “Twitter sentiment index”
and the “Nikkei 225 volatility index”.
Details about the data analytics method “BICLAVIS” are described in “Challenges of Big Data
utilization and the features of BDRA”.
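The paper does not give the formula behind the “Twitter sentiment index” or the statistical
test that was used. The Python sketch below therefore rests on two assumptions: the index is
taken to be the net share of positive tweets among tweets already labeled positive or
negative, and the relation to a volatility series is measured with the Pearson correlation
coefficient. The function names are hypothetical, and the labeling step (text mining of
Japanese tweets) is outside the sketch.

```python
from math import sqrt


def sentiment_index(labels):
    """Net share of positive tweets in [-1, 1].

    `labels` holds one of 'pos', 'neg', or 'neu' per tweet (assumed labeling).
    """
    pos = sum(1 for label in labels if label == "pos")
    neg = sum(1 for label in labels if label == "neg")
    if pos + neg == 0:
        return 0.0  # no opinionated tweets in this period
    return (pos - neg) / (pos + neg)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

For example, `sentiment_index(["pos", "pos", "neg", "neu"])` gives (2 - 1) / (2 + 1), about
0.33. Computing one index value per day and passing the daily series to `pearson` together
with a daily volatility series yields a single coefficient; judging statistical significance
would need an additional test that is not shown here.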
3.2. Automation Tool for System Development in Design Phase
This section describes a case in which we introduced an automation tool into system
development through flexible data model construction and the use of metadata management.
NTT DATA provides a total solution for open system development called “TERASOLUNA”,
which delivers conventional IT systems with high quality in a short time, as required by
changes in the business environment such as the progress of globalization. We developed
“TERASOLUNA DS” as one of its solutions. It gathers information for system development,
such as design information, and contributes to optimizing system development and quality
assurance by checking the consistency of design documents and accumulating design know-how.
“TERASOLUNA DS” provides various functions: automated checks of consistency and notation
variability among design documents, accelerated full-text search of design documents and
source code, impact analysis for specification changes, and input support for design
documents. It drastically improves productivity in the design phase by reducing reviews and
helping to identify the range affected by specification changes or bugs.
The key points of Big Data utilization in this case are the complex schemas that result from
document formats differing between projects, and the schema redesign required whenever a new
document format is added. Here, all documents are first converted into XML files by “ETL”
processing in the Information Gathering layer and are then stored in a NoSQL database.
This yields a flexible, schema-independent data model.
In addition, “Metadata Management” in the Governance layer makes it possible to automate
consistency checks on design documents efficiently and accurately. The volume of design
documents in large-scale system development is enormous, reaching 40,000 files and
400,000 pages. By managing metadata such as the structure, attributes, and recorded
information of these design documents, the tool performs checks and analysis more accurately
than a manual review.
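A minimal sketch of a metadata-based check is shown below in Python. It is not how
“TERASOLUNA DS” is implemented, which the paper does not describe. It assumes that each design
document has already been converted to XML and that data items appear as `<item>` elements with
`name` and `type` attributes; that schema and all function names are hypothetical. The sketch
extracts this metadata and reports notation variants (one item spelled in several ways) and
type conflicts across documents.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_metadata(xml_text):
    """Yield (name, type) for each <item> element of one design document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield item.get("name", ""), item.get("type", "")


def normalize(name):
    """Fold case and drop separators so 'Customer_ID' and 'customerId' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def check_consistency(docs):
    """Return (notation_variants, type_conflicts); `docs` maps document id -> XML text."""
    spellings = defaultdict(set)
    types = defaultdict(set)
    for xml_text in docs.values():
        for name, type_name in extract_metadata(xml_text):
            key = normalize(name)
            spellings[key].add(name)
            types[key].add(type_name)
    variants = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}
    conflicts = {k: sorted(v) for k, v in types.items() if len(v) > 1}
    return variants, conflicts


# Two made-up design documents for illustration only.
docs = {
    "screen.xml": '<design><item name="Customer_ID" type="string"/>'
                  '<item name="orderDate" type="date"/></design>',
    "table.xml": '<design><item name="customerId" type="int"/>'
                 '<item name="orderDate" type="date"/></design>',
}
```

Here `check_consistency(docs)` flags the customer identifier, which is spelled two ways and
declared with two different types, while `orderDate` passes. Because the check runs on
extracted metadata rather than on document layout, adding a new document format only requires
a new conversion to XML.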
3.3. Real-time Bridge Monitoring System
This section describes a case of applying high-speed processing to massive amounts of data,
based on the monitoring services for the Tokyo Gate Bridge in Japan and the Can Tho Bridge in
Vietnam.
Bridges and roads are social infrastructure that supports people's lives, and they are expected
to be safe at all times. The road administrator is therefore required to detect defects or
damage in bridges and to decide on road traffic flow or specify available routes accordingly.
NTT DATA continuously collects and analyzes various data in real time, such as the strain of
bridge beams and piers, by using sensors placed on the bridges.
The key point of Big Data utilization in this case is processing large amounts of sensor data in
a short period. By using “Complex Event Processing” in the Data Processing layer in BDRA,
this system can quickly analyze the sensor data of more than 100 bridges with just one server,
which matters in large-scale disasters, when road administrators need an overview of
multiple bridges. The system is realized by combining technologies in each layer: “Data Mining”
in the Data Analytics layer to extract abnormal patterns, and “Data Visualization” in the
Information Utilization layer to clearly visualize the detected anomalies.
Additionally, we can improve the accuracy of anomaly detection by using the data analytics
method “BICLAVIS” developed by NTT DATA. The abnormal values detected from the sensor
data include measurement failures caused by sensor malfunction and the effects of external
forces such as high winds and earthquakes. To distinguish such abnormal values from
defects in bridges, we implemented pre-processing for low-frequency component removal and
determination logic using lag correlation based on the positional relation between the sensors.
Anomaly detection uses two BICLAVIS scenario patterns: “False Detection” if the abnormal
patterns can be defined, and “Outlier Detection” if they are difficult to define.
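The two pre-processing ideas named above can be sketched in a few lines of Python. The paper
does not specify the filter or the determination logic, so this is only one plausible reading:
low-frequency components are removed by subtracting a centered moving average, and lag
correlation is computed as the time shift at which two sensor series correlate best. The
function names and the window and lag parameters are hypothetical.

```python
from math import sqrt


def remove_low_frequency(signal, window):
    """Subtract a centered moving average, a simple stand-in for a high-pass filter."""
    half = window // 2
    filtered = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        filtered.append(signal[i] - sum(signal[lo:hi]) / (hi - lo))
    return filtered


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(a, b, max_lag):
    """Return (lag, r): the delay of series b behind series a, in samples, with the highest r.

    Both series must have the same length, and max_lag must be smaller than that length.
    """
    best_shift, best_r = 0, pearson(a, b)
    for lag in range(1, max_lag + 1):
        r = pearson(a[:-lag], b[lag:])  # compare a[t] with b[t + lag]
        if r > best_r:
            best_shift, best_r = lag, r
    return best_shift, best_r
```

One way such a result could feed determination logic: if two sensors at known positions show
the same disturbance at the delay their spacing suggests, the disturbance is more likely an
external load than a fault in a single sensor. That rule is an assumption made here for
illustration, not a description of the deployed system.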
3.4. Traffic Congestion Control System
This section describes a case of easing traffic congestion by using simulation technology for
massive data and a prediction/control analytics model.
Traffic congestion is one of the biggest problems for both developed and developing
countries. Congestion causes environmental problems such as fossil fuel consumption and CO2
emissions, as well as enormous losses of time and money. Many countries have a strong
interest in reducing and easing traffic congestion; however, most measures are expensive
and the effectiveness of each measure is unclear. In addition, these measures tend to achieve
only partial optimization rather than overall optimization.
To solve this problem, NTT DATA developed a traffic simulation system that can evaluate the
effectiveness of congestion-easing measures such as traffic light control and traffic
restrictions. The system uses GPS data collected from the car navigation systems and
smartphones in each vehicle. In a traffic simulation we tested in Jilin, China, using the
simulation results to ease traffic congestion achieved a 27 percent improvement in bus
service times.
The technology we developed is based on statistical traffic models of vehicles, roads,
intersections, and traffic lights, and reproduces the traffic environment on a computer. It
also makes it possible to control the traffic lights with the green-light timing pattern that
minimizes traffic congestion. The patterns are produced by traffic simulations and evaluated
by tuning relevant parameters.
Multi-agent simulation technology runs the multiple elements that make up the system on a
computer and predicts the future. With this system, a traffic administrator can judge the
effectiveness of traffic measures under various scenarios in advance. A traffic administrator
can also identify the causes of current traffic conditions, such as the roads and time slots
that tend to cause traffic congestion.
In this case, the key point of Big Data utilization is high-speed processing that lets the
traffic simulation platform simulate a large traffic volume. By using “Distributed
Parallel Processing” in the Data Processing layer, this system can handle over one million
vehicles. In addition to “Distributed Parallel Processing”, we combine functions in other
layers of BDRA: “Real Time Capture” in the Information Gathering layer and “Data
Visualization” in the Information Utilization layer. This makes development of the
architecture efficient and appropriate.
Moreover, in this case we use the analysis method “BICLAVIS” for the data analysis of
prediction and control, and adopt the “Risk Simulation” scenario pattern for this system.
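The idea of producing signal patterns by simulation and evaluating them by tuning parameters
can be illustrated with a far simpler model than the multi-agent simulation described above.
The Python sketch below is a toy queue model, not NTT DATA's system: it assumes a single
intersection with two approaches, a fixed signal cycle, and at most one vehicle leaving per
tick on the approach that has a green light. The function names, the arrival sequences, and
the cost measure (total vehicle-ticks spent queuing) are all hypothetical.

```python
def simulate(green_ns, cycle, arrivals_ns, arrivals_ew, ticks):
    """Return the total vehicle-ticks spent queuing at a two-approach intersection.

    The north-south approach has green for the first `green_ns` ticks of each cycle,
    the east-west approach for the rest. Arrival sequences repeat cyclically.
    """
    queue_ns = queue_ew = 0
    waiting = 0
    for t in range(ticks):
        queue_ns += arrivals_ns[t % len(arrivals_ns)]
        queue_ew += arrivals_ew[t % len(arrivals_ew)]
        if t % cycle < green_ns:
            queue_ns = max(0, queue_ns - 1)  # north-south green: one vehicle leaves
        else:
            queue_ew = max(0, queue_ew - 1)  # east-west green: one vehicle leaves
        waiting += queue_ns + queue_ew
    return waiting


def best_green_split(cycle, arrivals_ns, arrivals_ew, ticks):
    """Try every green duration from 1 to cycle - 1 and return the one with least waiting."""
    return min(
        range(1, cycle),
        key=lambda green: simulate(green, cycle, arrivals_ns, arrivals_ew, ticks),
    )
```

With one north-south arrival per tick and no east-west traffic, `best_green_split(4, [1],
[0], 8)` selects a green time of 3 ticks, the longest the cycle allows. A realistic simulator
would model individual vehicles across a road network and would need distributed processing
at the scale the paper mentions; the sketch only shows the evaluate-and-select loop.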
Chap.4 Challenges of Big Data utilization and the features of BDRA
This chapter describes common issues of Big Data utilization found in the cases described
above and the features of BDRA.
・Issues of Big Data utilization
(1) The combination of multiple IT infrastructure technologies
A single IT infrastructure technology is not enough; enterprises that utilize Big Data need
to combine data gathering, data storing, data processing, and data analysis. In particular,
it has become common to choose the means of data storage from among various technologies,
including relational databases and NoSQL databases. Guidance is therefore required on how to
store data and how to use stored data.
(2) Data analysis of various industries
Enterprises in various industries, such as finance, IT, and social infrastructure, utilize
Big Data. Furthermore, data analysis methods are becoming more complicated as enterprises
require more advanced analysis results. It is important to select the right data analysis
method in order to meet these requirements without slowing the business down.
(3) Assurance of the data quality
Data stored by enterprises was not originally intended for analysis, as we mentioned in the
case of the automation of system development. It is therefore necessary to verify by data
profiling whether the data is usable for analysis. It is also important to manage the data
lifecycle properly in order to obtain meaningful results from data analysis.
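As a concrete illustration of issue (3), data profiling can start with very simple
per-column statistics. The Python sketch below is a minimal example under stated assumptions:
records are dictionaries with scalar values, and empty strings count as missing. The function
name `profile` and the chosen statistics are hypothetical and are not taken from BDRA.

```python
from collections import Counter


def profile(records):
    """Return per-column missing rate, distinct count, and value types.

    `records` is a list of dicts; a key that is absent, None, or '' counts as missing.
    """
    columns = sorted({key for record in records for key in record})
    report = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [v for v in values if v not in (None, "")]
        report[column] = {
            "missing_rate": 1 - len(present) / len(values),
            "distinct": len(set(present)),
            "types": Counter(type(v).__name__ for v in present),
        }
    return report
```

A column with a high missing rate, or one that mixes `int` and `str` values, is a sign that
the data needs cleansing before it can support analysis.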
・The features of BDRA
BDRA has the following features to solve the three issues above.
(1) Comprehensive framework to realize rapid and flexible technology integration
BDRA systematizes knowledge of Big Data utilization, based on deep understanding, together
with ways of combining that knowledge. Combinations of vendor products and open source
software (OSS) have been verified in BDRA, so it can support the selection of combinations of
products from different vendors. Frequently used combinations of products are provided as
sets; in addition, products can be selected to fit existing IT systems.
(2) “BICLAVIS” to realize systematic and efficient approach to analysis
NTT DATA developed the cross-industry data analytics method “BICLAVIS” based on data
analysis carried out in over 200 cases. Data analysis work tends to depend on individuals,
while a wide range of industries and businesses seek data analysis. NTT DATA therefore built
a mechanism for gathering data analysis know-how and systematized the analytics models so
that this know-how can be used across industries. Specifically, we organize the know-how as
patterns of analysis scenario based on the analysis purpose, by categorizing and summarizing
the purposes, procedures, and techniques of data analysis in a template for each scenario.
(Figure 4) This allows us to respond to requests from any industry.
Figure 4: Patterns of analysis scenario

| Scenario Pattern | Overview |
|---|---|
| Portent | Detect the signs of structure change and situation change from Big Data. |
| Anomaly Detection | Automatically detect an abnormal pattern in real time and stimulate early crisis response through alerting. |
| False Detection | Detect an illegal or outlier situation that is fitted to the definition of an abnormal pattern. |
| Outlier Detection | Detect a deviation from the standard or normal situation. |
| Prediction and Control | By clarifying relation between the cause and effect at work and estimating the change in result due to manipulating causes, understand the appropriate standard for the cause. |
| Profit Simulation | Estimate the effect of work restructuring measures and prioritize them by simulation. |
| Risk Simulation | Assess risks by business modeling with uncertainties and prioritize them. |
| Optimization | Select the measures that maximize performance by the optimization method. |
| Risk Hedge | Support risk reduction with the risk scattered method. |
| Targeting | Extract the targets to be approached, such as a potential customer in order to maximize the cost-effectiveness. |
| Credit Control | Determine the default risk of individuals or bankruptcy risk of enterprises. |
| Evaluation and Factor Analysis | Weigh up the various objects and identify the factors. |
| Context Awareness | Recommend the product and the service through analysis of behavior and preferences in advance. |
| Process Trace | Extract the process of growth and development and identify the accelerator or inhibitor. |
(3) “Governance” layer to establish Big Data governance
BDRA has abundant governance functions required for the utilization of Big Data, such as
functions for improving data reliability and security. As the saying “Garbage in, garbage
out” goes, inaccurate data produces only meaningless results.
In particular, to improve data reliability, we define data profiling as a step to be carried
out before data cleansing and the fixing of rules. A system for managing master data has also
been verified.
Furthermore, various functions are provided from the viewpoint of security in order to protect
data. Personal data has recently been the subject of much discussion, so security is
necessary to utilize Big Data with peace of mind. BDRA includes security management as a
series of methods with various audit points, such as IT audit, information security audit,
and Data-Centric Audit and Protection (DCAP).
As described above, BDRA aggregates know-how about various architectures and technology
integration for the utilization of Big Data. NTT DATA has so far provided many architectures
based on BDRA, and we will continue to improve it.
NTT DATA Corporation
Toyosu Center Building,
3-3, Toyosu 3-chome, Koto-ku,
Tokyo 135-6033, Japan
http://www.nttdata.com
Copyright © 2015 NTT DATA Corporation
The display of the “™(TM)” mark or the “®(R)” mark might be omitted in this paper.