Mathematics and Computers in Sciences and Industry
Proposal of knowledge discovery platform for big
data processing in manufacturing
Lukas Spendla, Lukas Hrcka, Pavol Tanuska
Faculty of Materials Science and Technology
Slovak University of Technology
Trnava, Slovakia
[email protected], [email protected], [email protected]
Abstract—In this paper, we describe an approach to building a
Data Lake based knowledge discovery platform. The proposal
focuses on integrating Data Lake based storage, built on the
Hadoop framework and NoSQL systems, into a traditional data
warehouse discovery platform, while preserving the well-proven
and robust data warehouse decision support and analytic tools.
The proposed knowledge discovery platform processes data from
all hierarchical control levels in manufacturing and can be used
to address the main manufacturing issues in the knowledge
discovery domain.
Keywords—knowledge discovery; data warehouse; data lake;
Hadoop; manufacturing; hierarchical control
I. INTRODUCTION
The current trend in manufacturing is marked by a large
increase in the amount of data originating from the field level
of hierarchical control. This increase is mainly due to the
implementation of new automation technologies and machines
based on the Internet of Things concept, a part of Industry 4.0,
which enables direct communication with upper control levels.
Each parameter of a manufacturing process is represented by
a large amount of production data applicable in information or
control systems at various levels. Although most manufacturing
companies gather these data, they do not use them further as
information or knowledge in the decision support process.
This is one of the reasons for the urgent need to store and
process large quantities of data while still being able to work
with them flexibly. These needs are reflected by current big
data technologies based on NoSQL systems and the Hadoop
framework. However, integrating these new technologies into a
company structure disrupts the well-established architecture
based on data warehouses, which represents a proven and robust
solution from the company decision support point of view.
Therefore, these new technologies must be integrated into
manufacturing companies in a way that allows users to preserve
the currently used solutions based on the data warehouse
concept, while exploiting the advantages of the deployed NoSQL
or Hadoop solution.
II. HIERARCHICAL CONTROL MODEL
Current information and control systems primarily employ a
hierarchical (pyramid) architecture, integrated as a whole with
elements of physical and logical distribution, thus providing
open and scalable solutions. Many hierarchical control systems
are built as multiprocessor control systems enabling both
horizontal and vertical communication. Intelligent features
arising from deploying sensors and actuators have been
intensely utilised recently, with direct hierarchical relations
being transformed into network relations. Emerging tendencies,
such as connecting previously independent systems, which leads
to new behaviour attributes, are strongly reflected in current
systems. [1] [2]
ISBN: 978-1-61804-327-6
Fig. 1. Hierarchy of the industrial control system
Process control is therefore nowadays also implemented using
control systems with a hierarchical structure. The model of a
complex control process, the so-called pyramid model, is shown
in Fig. 1. [1] [2]
At all levels of the production process control model, large
amounts of data are produced, collected and stored, often
resulting in data redundancy. Still, the fact that different
levels produce different types of data needs to be respected.
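As a rough orientation, the pyramid model and the kind of data each level produces can be written down as a small data structure. This is only a sketch: the names `ControlLevel`, `LevelProfile`, `PYRAMID` and `systems_at_or_below` are hypothetical, and the one-line data descriptions are condensed from the level descriptions in Section II rather than taken from any standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class ControlLevel(IntEnum):
    """Levels of the pyramid model, ordered from bottom to top."""
    CONTROL = 1      # technology (field) level
    PRODUCTION = 2   # supervisory level
    MANAGEMENT = 3   # planning and management level


@dataclass(frozen=True)
class LevelProfile:
    level: ControlLevel
    typical_systems: tuple
    data_character: str


# Illustrative summary; the wording of data_character is an assumption.
PYRAMID = (
    LevelProfile(ControlLevel.CONTROL, ("sensors", "actuators", "PLC"),
                 "real-time samples, noisy, often redundant"),
    LevelProfile(ControlLevel.PRODUCTION, ("SCADA/HMI", "MES"),
                 "aggregated process data, alarms, history"),
    LevelProfile(ControlLevel.MANAGEMENT, ("ERP",),
                 "predominantly structured, transactional data"),
)


def systems_at_or_below(level):
    """Return all systems located at the given level or any lower one."""
    return [s for p in PYRAMID if p.level <= level for s in p.typical_systems]
```

Ordering the levels as an `IntEnum` makes statements such as "data from the production level and below" directly expressible as a comparison.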
A. Control Level
The technology (control) level is the lowest layer of the
pyramid model of hierarchical process control and constitutes
the basic interface with production. It consists of production
lines, machines and equipment, which include integrated sensors
and actuators communicating over a technology network with
control computers, mainly PLCs (Programmable Logic
Controllers). [3] At this level, the collection and primary
processing of technological parameters is carried out. Data are
collected in real time with different sampling times, which
results in large amounts of data to be saved or archived for
further use. The cyclic data collection traditionally used,
which transmits every sample rather than only the signal
differences, leads to redundancy.
Collected data are often noisy because of the industrial
manufacturing environment and contain errors stemming from the
primary processing of technological information. Removing these
adverse effects, filtering the required data and subsequently
validating them are necessary tasks.
B. Production (Supervisory) Level
The supervisory level of the pyramid model is a higher
(intermediate) level of hierarchical process control,
alternatively called the SCADA / HMI (Supervisory Control and
Data Acquisition / Human Machine Interface) level. It is used
for the primary collection and integration of process data, and
for monitoring, visualization, evaluation and direct
intervention in the process. [3] At this level of control,
system data are mostly stored in SCADA systems for a specific
purpose: processing alarms, monitoring mixing ratios, batch
processes and data history, as well as archiving the
operational variables. All other data are stored at higher
levels of the pyramid model of hierarchical process control.
MES systems, typical representatives of information systems
at the production level, are responsible for obtaining and
collecting data from manufacturing. The obtained data are
processed and stored in real time, in an aggregated form, in a
data storage (mostly a transactional SQL database). The data
are saved in the database in a structured form containing the
current value of the variable, its timestamp and its validity
(VTQ = Value / Time / Quality).
C. Management Level
The management level of the pyramid model covers the previous
levels. It consists of database resources for higher levels of
control, the management information system and tools for
internet visualization. It is the level of planning and
management. At this level, the data are archived and processed,
and long-term production decisions are made. [3] At the
management level of the pyramid model, data are not collected
directly from the manufacturing process, but are transferred in
transaction mode from the information system of a real-time
interface using the ERP Gateway. As the ERP does not operate
continuously, continuous data transfer is carried out by the
ERP Gateway. As a result, huge volumes of predominantly
structured data arise in ERP systems.
III. PROBLEMS IDENTIFICATION IN MANUFACTURING
From the application area point of view, the manufacturing
process does not focus only on production itself, but extends
across and integrates data from all hierarchical control
levels. Effective process control and management require not
only production data, but also data on customers, resources and
suppliers from the upper hierarchical control levels.
Manufacturing can therefore generate large amounts of data
suitable for data mining and / or the knowledge discovery
process, which might provide suitable means to deal with the
problems arising in the field of production systems. Existing
approaches in manufacturing utilising data mining techniques
can be divided into five main application areas [4]:
• Quality analysis of products to correlate output quality
and system parameters, esp. machine settings, in order
to identify causes of deteriorating product quality
• Failure analysis of production resources, esp. machines,
to analyse causes of errors and prevent breakdowns in
the future
• Maintenance analysis to enhance the availability of
production resources, e.g. by optimized maintenance
planning
• Production planning and scheduling analysis to
improve planning quality, e.g. by a higher capacity
utilisation of production resources
• Strategic planning and scheduling analysis to
improve customer relationships and increase sales, e.g.
by identification of customer behaviour
Each of these application areas covers multiple applied
techniques and also different approaches from the hierarchical
model point of view. Therefore, it is impossible to identify a
single specific technique and approach that needs to be applied
to optimize or solve issues in selected application areas. It
should be noted that all application areas span multiple
hierarchical control levels and therefore cannot be assigned to
one specific level. For this reason, we identified the
dependencies based on various research applied in the
manufacturing area utilising data mining techniques.
Generally, the data mining approaches used in the main
application areas in manufacturing mostly rely on manual
processing of a specific data collection to analyse specific
manufacturing process aspects in various manufacturing-specific
cases, e.g. machines, equipment, products, quality, etc. Most
of the approaches can be integrated into real-time support
systems; however, they mainly focus only on the approaches and
methods themselves. [5][6][7]
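The quality analysis application area listed above (correlating output quality with machine settings) can be illustrated with a very small example. The sketch below is an assumption-laden illustration, not a method taken from the cited works: the batch records, the setting names `temperature` and `pressure`, and the helper `pearson` are all hypothetical, and a plain Pearson correlation stands in for whatever technique a real study would select.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical batches: machine settings and share of defective parts in %.
batches = [
    {"temperature": 180.0, "pressure": 5.0, "defect_rate": 1.0},
    {"temperature": 185.0, "pressure": 5.2, "defect_rate": 1.5},
    {"temperature": 190.0, "pressure": 4.9, "defect_rate": 2.0},
    {"temperature": 195.0, "pressure": 5.1, "defect_rate": 2.5},
    {"temperature": 200.0, "pressure": 5.0, "defect_rate": 3.0},
]

defects = [b["defect_rate"] for b in batches]

# Rank the settings by the strength of their linear relation to defects.
ranking = sorted(
    ((name, pearson([b[name] for b in batches], defects))
     for name in ("temperature", "pressure")),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

In this made-up data set the temperature setting ranks first, which is the kind of hint a quality analysis would then examine in more detail. Correlation alone does not establish the cause of deteriorating quality.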
IV. CURRENT STATE OF KNOWLEDGE DISCOVERY IN
MANUFACTURING AREA
At present, the application of data mining and knowledge
discovery is very broad. However, according to recent studies
[8][9], data mining is mostly employed in fields like
marketing, consumer analytics, finance, telecommunication,
insurance, health care, etc. The share of data mining usage in
manufacturing is usually between 9 and 10 percent. The major
part of this share is accounted for by large international
industries.
The weakness of the current form of manufacturing often lies
in the subjective perception of global production aims
(profitability, production efficiency, plant productivity and
product quality), frequent and often unforeseen variations in
both manufacturing parameters and variables, subjective
decision making, and the vast amount of unstructured data
provided by various information systems. [10]
Multiple systems operate at various hierarchical control
levels, each using its own databases, mostly independent of
each other. [11] Very often there is no defined relationship
between data in the individual systems, e.g. the manufactured
product identifier has a different numbering schema and order
across control level data, SCADA, MES and ERP data. Therefore,
it is necessary to integrate these data in order to perform
analytic reporting and the knowledge discovery process.
In most large manufacturing companies, a data warehouse is
used to store the data from various company systems. Data
integrated in the data warehouse serve as the basis for
decision support, through the corresponding data mart or
decision support tools [12]. Therefore, the ETL process
transforming data into the data warehouse for further use in
business intelligence and analytic tools is extremely
important.
The obtained data are accessed through data marts, created
through the ETL process from the data warehouse, which provide
an organised view of the data from various business
perspectives. [12] Data marts for various company-specific
aspects, like management, manufacturing, quality, etc.,
provide the basis for the decision making process.
It should be noted that the data in data marts are not always
integrated in the data warehouse itself. In other words, the
data warehouse and data mart data can be separate. In order to
obtain a complete view of company data for reporting and
knowledge discovery, the data stored in data marts must
supplement the company data stored in the data warehouse.
The discovery platform must be set over all company data
stored in data marts. If the data warehouse does not integrate
all data, the discovery platform must be able to obtain and
process them. In companies, the discovery platform is mostly
used for KPI based reporting and quality assurance [13]. The
main advantages of data mining and knowledge discovery have
still not been fully exploited.
A variety of knowledge discovery and analytic tools is
available for use in the discovery platform. All major software
tools provide connectors for relational databases and data
warehouses. However, most of the data operations must be
handled by the tools themselves, whether in a standalone
workstation or a client-server solution.
The main advantage of this approach is robustness and
stability, owing to its widespread deployment and long-term
real-world experience in various enterprise areas. This factor
is important because the company KPIs, which affect company
management and the business bottom line, are based on data from
the data warehouse provided by the data marts.
The data mining and knowledge discovery process is usually
based on a data warehouse integrating all data required in this
process. This concept of the analytic environment is captured
in Fig. 2.
Fig. 2. Current state of knowledge discovery platform in manufacturing
The knowledge discovery process in a company analytic
environment is usually performed according to a company
methodology, which can be specific to each individual company.
In recent years, however, more and more companies have started
to adopt the CRISP-DM methodology. Most companies do not adopt
this methodology strictly and usually modify it to suit their
needs. Since the knowledge discovery and data mining
methodology is part of company know-how and not publicly
accessible, it is impossible to generalise it as a whole.
Fig. 3. Relationship between different phases of CRISP-DM [15]
However, most companies preserve the continuity of the main
phases of the CRISP-DM model, as shown in Fig. 3. Hence, the
CRISP-DM methodology is, with a certain degree of abstraction,
applicable in any manufacturing industry. [14]
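The continuity of the CRISP-DM phases can be made concrete as a small transition table. The sketch below assumes the six phases and the arrows of the commonly published CRISP-DM process diagram; the figure itself is not reproduced in this text, so the transition set is an assumption, and the names `CRISP_DM` and `is_valid_path` are hypothetical.

```python
# Phases of CRISP-DM mapped to the phases that may follow them.
# Assumption: transitions follow the commonly published process diagram.
CRISP_DM = {
    "business understanding": ["data understanding"],
    "data understanding": ["business understanding", "data preparation"],
    "data preparation": ["modeling"],
    "modeling": ["data preparation", "evaluation"],
    "evaluation": ["business understanding", "deployment"],
    "deployment": [],
}


def is_valid_path(path):
    """Check that every consecutive pair of phases is an allowed transition."""
    if not path or any(phase not in CRISP_DM for phase in path):
        return False
    return all(nxt in CRISP_DM[cur] for cur, nxt in zip(path, path[1:]))


# The main sequence, in the order the phases are declared above.
main_sequence = list(CRISP_DM)

# A typical iteration: modeling sends the analyst back to data preparation.
rework = ["data preparation", "modeling", "data preparation", "modeling",
          "evaluation"]
```

A company-specific variant of the methodology could be checked against this table to see whether it still preserves the main phase order, for example with `is_valid_path(main_sequence)`.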
Data mining is now used in many different areas of
manufacturing engineering to extract knowledge for use in
predictive maintenance, fault detection, design, production,
quality assurance, scheduling and decision support systems.
Data can be analysed to identify hidden patterns in the
parameters controlling manufacturing processes or to determine
and improve the quality of products. This clearly indicates
that data mining can be used in many different application
areas of manufacturing. [16]
However, the manufacturing process produces a huge amount of
data stored in databases containing an enormous number of
records. Every record has attributes that need to be explored
to discover useful information and knowledge. All of these
factors clearly demonstrate that choosing the right methods is
crucial to the successful discovery of knowledge. [17]
Nowadays, many types of methods, techniques and algorithms are
used in the data mining process. KDnuggets [18] carried out a
survey among companies utilising data mining algorithms.
According to this survey, the most used algorithms are:
decision trees (rules), regression, clustering, statistics,
visualization and time series.
Fig. 4. Survey of methods and algorithms usage in data analysis [18]
Many of the methods are usable in several areas, but it is
very important to perform a detailed analysis of the tasks to
be solved, because methods are not universally applicable and
depend on the problem at hand.
V. KNOWLEDGE DISCOVERY APPROACH PROPOSAL IN
MANUFACTURING AREA
The common analytics environment at most big manufacturing
companies includes a data warehouse, or a collection of
federated data marts, which house and integrate the data for
the knowledge discovery process. It includes a range of
analysis functions as well as business intelligence and
analytics tools enabling decision support through ad hoc
queries, dashboards and data mining.
Large manufacturing companies with large investments in their
data warehouses have neither the resources nor the will to
replace an existing environment that works well and does what
it was designed to do. The majority of large companies utilise
a coexistence strategy combining the best of the data warehouse
and analytics environment with the new trends in big data
solutions.
Many companies want to continue to rely on data warehouses
for standard BI and analytics reporting, including sales
reports, customer dashboards, risk history, etc. The
coexistence strategy allows companies to use the data warehouse
for its standard workload and for storing historical data, so
as to maintain robust traditional business intelligence and
analytics tools. [19]
Despite the robustness of traditional business intelligence
and analytics tools, semi-structured and unstructured data from
the data collection process do not fit well into traditional
data warehouses. Furthermore, data warehouses may not be able
to handle the processing of frequently or even continually
updated big data sets. As a result, organisations are looking
for ways to collect, store and analyse big data sets. A newer
class of technologies, including the Hadoop framework and NoSQL
systems, is often deployed for this task. [20] In some cases,
these technologies are used as staging areas for data before
they are loaded into a data warehouse, often in a summarised
form that is more suitable for relational structures. Big data
solution vendors are increasingly pushing the concept of a
Hadoop Data Lake that is used as the central repository for raw
data streams present in the company. [21]
This coexistence approach, incorporating the Data Lake as the
central repository, serves as a baseline for our knowledge
discovery approach in the manufacturing area, captured in
Fig. 5.
The proposed knowledge discovery analytic environment is
based on the common data warehouse approach. The data warehouse
integrates various data from heterogeneous systems across
various hierarchical control levels. These heterogeneous data
are extracted, transformed and integrated into the data
warehouse using the ETL process. This approach is mostly
suitable for discontinuous and non-real-time data from higher
hierarchical control levels.
The data marts, created from the data warehouse data, provide
an organised view of the data from a business unit perspective
(like management, manufacturing, quality, etc.) and provide the
basis for the decision making process in the selected area. The
data loaded into data marts need to be extracted and
transformed to create a data structure suitable for further
use.
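The path from heterogeneous source systems through the data warehouse to a data mart can be sketched with an in-memory SQLite database. Everything specific here is an assumption made for illustration: the MES and ERP extracts, their differing product identifier formats, the table names `dw_product` and `dw_production`, the view `mart_quality` and the helper `normalise_id` are hypothetical and do not describe any particular company system.

```python
import sqlite3

# Hypothetical source extracts; the ID formats are illustrative assumptions.
mes_rows = [("P-0042", 120, 3), ("P-0007", 80, 0)]   # (mes_id, produced, rejected)
erp_rows = [("42", "Gear housing"), ("7", "Shaft")]  # (erp_id, product name)


def normalise_id(raw_id):
    """Transform step: map 'P-0042' and '42' to the same integer key 42."""
    return int(raw_id.split("-")[-1])


def run_etl(conn):
    """Load both extracts into warehouse tables and derive a data mart."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE dw_product (product_key INTEGER PRIMARY KEY, name TEXT)")
    cur.execute(
        "CREATE TABLE dw_production "
        "(product_key INTEGER, produced INTEGER, rejected INTEGER)")
    cur.executemany(
        "INSERT INTO dw_product VALUES (?, ?)",
        [(normalise_id(i), name) for i, name in erp_rows])
    cur.executemany(
        "INSERT INTO dw_production VALUES (?, ?, ?)",
        [(normalise_id(i), p, r) for i, p, r in mes_rows])
    # Data mart: a quality-oriented view built on the warehouse tables.
    cur.execute("""
        CREATE VIEW mart_quality AS
        SELECT p.name AS product,
               SUM(f.produced) AS produced,
               ROUND(100.0 * SUM(f.rejected) / SUM(f.produced), 2) AS reject_pct
        FROM dw_production AS f
        JOIN dw_product AS p USING (product_key)
        GROUP BY p.name
    """)
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(conn)
quality = conn.execute("SELECT * FROM mart_quality ORDER BY product").fetchall()
```

The transform step is where the differing product numbering schemas of the source systems are reconciled; without it, the join between production and product data would return nothing.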
Fig. 5. Knowledge discovery analytic environment proposal
The ETL process transforming data from the Data Lake into the
data warehouse is performed only for data not transferred into
the data warehouse directly. The main use of this particular
ETL process is loading the manufacturing data from the field
level, stored in the Data Lake, into the data warehouse. [20]
The Data Lake, based on the Hadoop framework, provides central
data storage for raw manufacturing data. The Data Lake extracts
and loads data from heterogeneous database systems and stores
them in a raw (original) form. Therefore, the data do not have
to be transformed to be stored in the Data Lake. In the Hadoop
Data Lake cluster, subsets of the data can be analysed using
batch query tools, stream processing software and
SQL-on-Hadoop technologies that run interactively or use ad hoc
queries in SQL.
The discovery platform in this environment is built on the
data integrated in the Data Lake. Owing to the use of a Hadoop
cluster, this environment provides higher performance when
working with big data sets than the traditional data warehouse.
Another big advantage is the availability of raw manufacturing
data that cannot be easily stored in the data warehouse.
The range of tools for a discovery platform over a Hadoop
cluster is not very wide. Most of the standard knowledge
discovery tools cannot connect to a Hadoop cluster other than
through SQL-on-Hadoop solutions. In this way, however, most of
the data manipulation operations must be performed by the tool
itself and not by the Hadoop cluster. To utilise the full data
manipulation performance of Hadoop, software manufacturers
offer add-ons or software solutions able to perform selected
sets of operations and algorithms directly in a Hadoop cluster.
This approach is preferred, since the discovery platform must
be able to process the big sets of collected data.
One of the biggest issues in obtaining manufacturing data is
the way data from the field level of hierarchical control are
collected and processed. Although all these data serve as a
basis for decision support at higher hierarchical control
levels, the field level data used are usually aggregated into
data more suitable for a particular decision support task.
Therefore, their suitability for business intelligence or
analytic tools is very limited.
However, with the increasing number of sensors connected to
the network in the production chain, it is becoming easier to
collect production chain data. This capability is provided by
the Field Level Bus.
The Field Level Bus collects data from various industrial
control systems and loads them into the Data Lake storage.
Owing to the large amounts of periodic or continuous data
collected at this level, the Data Lake builds on Hadoop cluster
technology, which is the most suitable solution for storing the
raw field level data.
The main task of the Field Level Bus is preparing the data,
which is a fundamental step for the further use of field level
data, as the data can be collected from various sensors, PLCs,
devices, systems, etc.
Data collected at the field level can also be inconsistent.
Therefore, transforming the collected data into cleaned forms
storable in the Data Lake storage is necessary. The Field Level
Bus addresses the need for data analysis aimed at cleaning the
raw data. [22]
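The kind of cleaning the Field Level Bus is expected to perform can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the proposed platform: readings are modelled in the VTQ form (Value / Time / Quality) mentioned for the production level, and the names `Sample` and `clean`, the plausibility range and the deadband threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Sample:
    """One field-level reading in VTQ form (Value / Time / Quality)."""
    value: float
    time: float      # seconds since start of recording (assumed unit)
    quality: bool    # True if the source marked the reading as valid


def clean(samples: Iterable[Sample], low: float, high: float,
          deadband: float) -> List[Sample]:
    """Drop invalid or out-of-range readings, then suppress repeats.

    A reading is kept only if it differs from the last kept reading by
    more than `deadband`, which removes the redundancy of cyclic sampling.
    """
    kept: List[Sample] = []
    for s in sorted(samples, key=lambda s: s.time):
        if not s.quality or not (low <= s.value <= high):
            continue  # flagged as bad or physically implausible
        if kept and abs(s.value - kept[-1].value) <= deadband:
            continue  # no significant change since the last kept reading
        kept.append(s)
    return kept


raw = [
    Sample(20.0, 0.0, True),
    Sample(20.1, 1.0, True),    # within deadband of 20.0, redundant
    Sample(999.0, 2.0, True),   # outside the plausible range, treated as error
    Sample(21.0, 3.0, False),   # flagged as bad by the source
    Sample(21.2, 4.0, True),
]
cleaned = clean(raw, low=-40.0, high=150.0, deadband=0.5)
```

Because a Data Lake is meant to keep data in raw form, a real deployment would need to decide whether such filtering is applied before storage or only when the data are read; the sketch does not settle that design question.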
VI. CONCLUSION
The knowledge discovery analytic platform proposed in this
paper incorporates novel trends and methods used in knowledge
discovery in the manufacturing area. The traditional data
warehouse approach to a knowledge discovery platform is
supplemented with a Hadoop cluster to store big data collected
at the field level of hierarchical control.
The proposed analytic platform preserves the robustness and
well-proven technology of traditional business intelligence and
analytic tools, and creates space for knowledge discovery in
frequently and continually updated manufacturing data in raw
form. Therefore, it represents an ideal compromise between
existing traditional tools and the need for a strong business
intelligence, reporting and analytic platform.
The main disadvantage is the necessity of integrating all data
in a Data Lake, which makes it difficult to ensure the
integrity and security of company data. For traditional
relational databases and data warehouses, various approaches,
methods and tools for maintaining the integrity and security of
company data are available. In a Data Lake represented by a
Hadoop cluster, all data need to be integrated together, and
the discovery platform must have access to all the data. This
is one of the main issues to be addressed when implementing a
Data Lake.
The proposed approach focuses on all hierarchical control
levels in manufacturing. Therefore, the manufacturing area as a
whole represents the main application area of this approach.
With a certain degree of abstraction, the approach can also be
applied in other industrial fields where large amounts of data
need to be collected frequently or continuously.
ACKNOWLEDGMENT
This publication is the result of the implementation of the
project VEGA 1/0673/15: “Knowledge discovery for hierarchical
control of technological and production processes” supported by
the VEGA.
REFERENCES
[1] J. Jadlovský, S. Laciňák, M. Čopík and J. Ilkovič, “Technological level of flexible manufacturing system control,” Acta Electrotechnica Informatica, vol. 11, No. 1, pp. 20-24, 2011.
[2] P. Tanuška, P. Važan, M. Kebísek and D. Jurovatá, “Knowledge discovery from production databases for hierarchical process control,” International Journal of Mechanical, Aerospace, Industrial, Mechatronic and Manufacturing Engineering, vol. 7, No. 11, 2013.
[3] J. Jadlovský, J. Laciňák, J. Chovaňák and J. Ilkovič, “Návrh distribuovaného systému riadenia pružnej výrobnej linky,” In: International Conference – Cybernetics and informatics. Vyšná Boca. 2010.
[4] C. Gröger, F. Niedermann and B. Mitschang, “Data Mining-driven Manufacturing Process Optimization,” In: Proceedings of the World Congress on Engineering 2012 Vol III, WCE 2012. Hong Kong: Newswood 2012, pp. 1475-1481.
[5] K. Wang, S. Tong, B. Eynard, L. Roucoules and N. Matta, “Fuzzy systems and knowledge discovery,” FSKD, 2007.
[6] P. Michalik, J. Štofa and I. Zolotová, “Testing the properties of K-means algorithm for data mining applications,” In: LINDI 2013: 5th IEEE International Symposium on Logistics and Industrial Informatics: Proceedings: September 5-7, 2013, Wildau, Germany. Piscataway: IEEE, 2013, pp. 99-102. ISBN 978-1-4799-1257-5.
[7] G. Köksal, İ. Batmaz and M. C. Testik, “A review of data mining applications for quality improvement in manufacturing industry,” Expert Systems with Applications, 38 (10) (2011), pp. 13448–13467.
[8] Rexer Analytics, “Data miner survey – 2013 survey summary report,” 2014, [cit. 20.06.2015]. Available online: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html.
[9] KDnuggets, “Data Mining Community’s Top Resource,” 2014, [cit. 23.06.2015]. Available online: http://www.kdnuggets.com.
[10] B. Vorhies, “The big deal about big data: what’s inside – structured, unstructured, and semi-structured data in data magnum blog,” 2013. [cit. 25.06.2015] Available online: http://data-magnum.com/the-big-dealabout-big-data-whats-inside-structured-unstructured-and-semistructured-data/.
[11] X.Z. Wang, “Data mining and knowledge discovery for process monitoring and control advances in industrial control,” Springer Science & Business Media, 2012. 251p. ISBN 978-1-44710-421-6.
[12] X. Zhu, “Knowledge discovery and data mining: challenges and realities,” Challenges and Realities. 2007. Idea Group Inc (IGI). 290p. ISBN 978-1-59904-252-7.
[13] H. Chen, R. H. L. Chiang, and V. C. Storey, “Business intelligence and analytics: from big data to big impact,” MIS Q. 36, 4 (December 2012), 1165-1188.
[14] P. Chapman, P. Kerber, J. Clinton, J. Khabaza, T. Reinartz, C. Shearer and R. Wirth, “The CRISP-DM Process Model,” Discussion Paper. 0503-99. March 1999.
[15] “What is the CRISP-DM Methodology,” [cit. 20.06.2015]. Available online: http://www.sv-europe.com/crisp-dm-methodology/.
[16] J.A. Harding, M. Shahbaz, S. Srinivas and A. Kusiak, “Data mining in manufacturing: a review,” Journal of Manufacturing Science and Engineering, Transactions of ASME, 128(4), 969–976, 2006.
[17] A.K. Choudhary, M.K. Tiwari and J.A. Harding, “Data mining in manufacturing: a review based on the kind of knowledge,” In: Journal of Intelligent Manufacturing. Leicestershire: Loughborough University’s Institutional Repository. 20 (5), pp. 501–521. 2009.
[18] KDnuggets, “Algorithms for data analysis/data mining. Which methods/algorithms did you use for data analysis?,” [cit. 19.06.2015]. Available online: http://www.kdnuggets.com/polls/2011/algorithms-analytics-datamining.html.
[19] T.H. Davenport and J. Dyché, “Big data in big companies,” Thomas H. Davenport and SAS Institute Inc, May 2013.
[20] W. Fan and A. Bifet, “Mining big data: current status, and forecast to the future,” SIGKDD Explor. Newsl. 14, 2 (April 2013), 1-5.
[21] M. Rouse, “Big data analytics,” [cit. 20.06.2015]. Available online: http://searchbusinessanalytics.techtarget.com/definition/big-dataanalytics.
[22] S. Zhang, C. Zhang and Q. Yang, “Data preparation for data mining,” Applied Artificial Intelligence, An International Journal, Volume 17, Issue 5-6, 2003. Taylor & Francis, 2003. doi: 10.1080/713827180.