
Mathematics and Computers in Sciences and Industry
ISBN: 978-1-61804-327-6

Proposal of knowledge discovery platform for big data processing in manufacturing

Lukas Spendla, Lukas Hrcka, Pavol Tanuska
Faculty of Materials Science and Technology
Slovak University of Technology
Trnava, Slovakia

Abstract—In this paper, we describe an approach to building a Data Lake based knowledge discovery platform. The proposal focuses on integrating Data Lake based storage, built on the Hadoop framework and NoSQL systems, into a traditional data warehouse discovery platform, while preserving the well-proven and robust data warehouse decision support and analytic tools. The proposed knowledge discovery platform processes data from all hierarchical control levels in manufacturing and can be used to address the main manufacturing issues in the knowledge discovery domain.

Keywords—knowledge discovery; data warehouse; data lake; hadoop; manufacturing; hierarchical control

I. INTRODUCTION

The current trend in manufacturing is marked by a large increase in the amount of data originating from the field level of hierarchical control. This increase is mainly due to the implementation of new automation technologies and machines based on the Internet of Things concept, a part of Industry 4.0, which enables direct communication with upper control levels. Each parameter of the manufacturing process is represented by a large amount of production data applicable in information or control systems at various levels. Although most manufacturing companies gather these data, they do not use them further as information or knowledge in the decision support process.

This is one of the reasons for the urgent need to store and process large quantities of data while still being able to work with them flexibly. These needs are reflected by current big data technologies based on NoSQL systems and the Hadoop framework. However, integrating these new technologies into a company structure disrupts the well-established architecture based on data warehouses. This architecture represents a proven and robust solution from the company decision support point of view. Therefore, these new technologies must be integrated into manufacturing companies in a way that allows users to preserve the currently used solutions based on the data warehouse concept, while exploiting the advantages of the deployed NoSQL or Hadoop solution.

II. HIERARCHICAL CONTROL MODEL

Current information and control systems primarily employ a hierarchical (pyramid) architecture, integrated as a whole with elements of physical and logical distribution, thus providing open and scalable solutions. Many hierarchical control systems are built as multiprocessor control systems enabling both horizontal and vertical communication. Intelligent features arising from the deployment of sensors and actuators have been intensively utilised recently, with direct hierarchical relations being transformed into network relations. Emerging tendencies, such as connecting previously independent systems, which leads to new behaviour attributes, are strongly reflected in current systems. [1] [2]

Process control is therefore nowadays also implemented using control systems with a hierarchical structure. The model of the complex control process, the so-called pyramid model, is shown in Fig. 1. [1] [2]

Fig. 1. Hierarchy of the industrial control system

At all levels of the production process control model, large amounts of data are produced, collected and stored, often resulting in data redundancy. Still, the fact that different levels produce different types of data needs to be respected.

A. Control Level

The technology (control) level is the lowest layer of the pyramid model of hierarchical process control and constitutes the basic interface with production. It consists of production lines, machines and equipment, which include integrated sensors and actuators communicating over the technology network with control computers, mainly PLCs (Programmable Logic Controllers). [3]

At this level, the collection and primary processing of technological parameters is carried out. Data are collected in real time, with different sampling times, which results in large amounts of data to be saved or archived for further use. The traditionally used cyclic data collection, which collects data without transmitting only the signal differences, leads to redundancy. Collected data are often noisy because of the industrial environment and contain errors stemming from the primary processing of technological information. Removing these adverse effects, filtering the required data and subsequently validating them are therefore necessary tasks.

B. Production (Supervisory) Level

The supervisory level of the pyramid model is a higher (intermediate) level of hierarchical process control, alternatively called the SCADA / HMI (Supervisory Control and Data Acquisition / Human Machine Interface) level. It is used for the primary collection and integration of process data, and for monitoring, visualization, evaluation and direct interference in the process. [3]

At this level of control, system data are mostly stored in SCADA systems with a clear purpose: processing alarms, monitoring mixing ratios, batch processes and data history, as well as archiving the operational variables. All other data are stored at higher levels of the pyramid model of hierarchical process control.

MES systems, typical representatives of information systems at the production level, are responsible for obtaining and collecting data from manufacturing. The obtained data are processed and stored in real time, in an aggregated form, in data storage (mostly a transactional SQL database). The data are saved in the database in a structured form containing the current value of the variable, its timestamp and its validity (VTQ = Value / Time / Quality).

C. Management Level

The management level of the pyramid model covers the previous levels. It consists of database resources for higher levels of control, a management information system and tools for internet visualization. It is the level of planning and management. At this level, the data are archived and processed, and long-term decisions for production are made. [3]

At the management level of the pyramid model, data are not collected directly from the manufacturing process, but are transferred in transaction mode from the real-time information system interface using the ERP Gateway. As the ERP does not operate continuously, continuous data transfer is carried out through the ERP Gateway. Therefore, huge volumes of predominantly structured data arise in ERP systems.

III. PROBLEMS IDENTIFICATION IN MANUFACTURING

From the application area point of view, the manufacturing process does not focus only on production itself, but extends to and integrates data from all hierarchical control levels. Effective process control and management requires not only production data, but also data on customers, resources and suppliers from the upper hierarchical control levels. Manufacturing can therefore generate large amounts of data suitable for data mining and / or the knowledge discovery process, which might provide suitable means of dealing with the problems arising in the field of production systems.

Existing approaches in manufacturing utilising data mining techniques can be divided into five main application areas [4]:

• Quality analysis of products to correlate output quality and system parameters, esp. machine settings, in order to identify causes of deteriorating product quality
• Failure analysis of production resources, esp. machines, to analyse causes of errors and prevent breakdowns in the future
• Maintenance analysis to enhance the availability of production resources, e.g. by optimized maintenance planning
• Production planning and scheduling analysis to improve planning quality, e.g. by a higher capacity utilisation of production resources
• Strategic planning and scheduling analysis to improve customer relationships and increase sales, e.g. by identification of customer behaviour

Each of these application areas covers multiple applied techniques and also different approaches from the hierarchical model point of view. Therefore, it is impossible to identify one specific technique and approach that needs to be applied to optimize or solve issues in a selected application area. It should be noted that all application areas span multiple hierarchical control levels, so it is impossible to assign them to one specific level. For this reason, we identified the dependencies based on various research works in the manufacturing area that utilise data mining techniques.

Generally, the data mining approaches used in the main application areas in manufacturing mostly rely on manual processing of a specific data collection to analyse specific aspects of the manufacturing process in various manufacturing-specific cases, e.g. machines, equipment, products, quality, etc. Most of the approaches can be integrated into real-time support systems; however, they mainly focus only on the approaches and methods themselves. [5][6][7]

IV. CURRENT STATE OF KNOWLEDGE DISCOVERY IN MANUFACTURING AREA

At present, the application of data mining and knowledge discovery is very broad. However, according to recent studies [8][9], data mining is mostly employed in fields like marketing, consumer analytics, finance, telecommunication, insurance, health care etc. The usage of data mining in manufacturing is usually between 9 and 10 percent. The major part of this share is created by large international industries.

The weakness of the current form of manufacturing often lies in the subjective perception of global production aims (profitability, production efficiency, plant productivity and product quality), frequent and often unforeseen variations in both manufacturing parameters and variables, subjective decision making, and the vast amount of unstructured data provided by various information systems. [10]

The data mining and knowledge discovery process is usually based on a data warehouse integrating all data required in this process. This concept of analytic environment is captured in Fig. 2. The main advantage of this approach is its robustness and stability, due to widespread deployment and long-term real-world experience in various enterprise areas. This factor is important because the company KPIs, which affect the company management and business bottom line, are based on data from the data warehouse provided by the data marts.

Multiple systems operate at various hierarchical control levels, each using its own databases, mostly independent of each other. [11] Very often there is no defined relationship between the data in each system, e.g. the manufactured product identifier has a different numbering schema and order across control level data, SCADA, MES and ERP data. Therefore, it is necessary to integrate these data in order to perform analytic reporting and the knowledge discovery process.
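The identifier mismatch described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the SCADA, MES and ERP identifiers differ only in prefixes and zero padding; the record layouts, field names and the `normalise_product_id` and `integrate` helpers are hypothetical and do not come from the paper. A real integration would more likely rely on an explicit mapping table.

```python
import re
from collections import defaultdict


def normalise_product_id(raw_id: str) -> str:
    """Map a system-specific identifier to a common key.

    Assumption: the systems differ only in prefixes and zero padding,
    so keeping the digits and stripping leading zeros is sufficient.
    """
    digits = re.sub(r"\D", "", raw_id)
    return digits.lstrip("0") or "0"


def integrate(sources):
    """Merge rows from several systems into one record per product."""
    merged = defaultdict(dict)
    for system, rows in sources.items():
        for row in rows:
            key = normalise_product_id(row["product"])
            for field, value in row.items():
                if field != "product":
                    # Prefix each attribute with its source system.
                    merged[key][f"{system}.{field}"] = value
    return dict(merged)


# Hypothetical records describing the same product in three systems.
sources = {
    "scada": [{"product": "P-000123", "temperature_c": 71.5}],
    "mes": [{"product": "123", "cycle_time_s": 42.0}],
    "erp": [{"product": "ERP0000123", "customer": "ACME"}],
}
integrated = integrate(sources)
print(integrated)
```

Prefixing each attribute with its source system keeps the origin of every value visible after integration, which helps when the same quantity is recorded at more than one control level.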
In most large manufacturing companies, a data warehouse is used to store the data from various company systems. Data integrated in the data warehouse serve as the basis for decision support, through the corresponding data mart or decision support tools [12]. Therefore, the ETL process that transforms data into the data warehouse for further use in business intelligence and analytic tools is extremely important.

Fig. 2. Current state of knowledge discovery platform in manufacturing

The knowledge discovery process in a company analytic environment is usually performed according to a company methodology, which can be specific to each individual company. In recent years, however, more and more companies have started to adopt the CRISP-DM methodology. Most companies do not adopt this methodology strictly, but modify it to suit their needs. Since the knowledge discovery and data mining methodology is part of company know-how and is not publicly accessible, it is impossible to generalise it as a whole.

The obtained data are accessed through data marts, created through the ETL process from the data warehouse, which provide an organised view of the data from various business perspectives. [12] Data marts for various company-specific aspects, such as management, manufacturing, quality, etc., provide the basis for the decision making process. It should be noted that the data in data marts are not always integrated in the data warehouse itself. In other words, the data warehouse and data mart data can be separated. In order to obtain a complex view of company data for reporting and knowledge discovery, the data stored in data marts must supplement the company data stored in the data warehouse. The discovery platform must be set over all company data stored in data marts. If the data warehouse does not integrate all data, the discovery platform must be able to obtain and process them.
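The step that derives a data mart from warehouse data, as described above, can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names `dw_production` and `mart_quality`, their columns and the sample figures are hypothetical, and a production data warehouse would use a dedicated ETL tool rather than a single SQL statement.

```python
import sqlite3

# In-memory database standing in for the company data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical warehouse table with integrated production data.
    CREATE TABLE dw_production (
        product_id TEXT,
        line       TEXT,
        produced   INTEGER,
        rejected   INTEGER
    );
    INSERT INTO dw_production VALUES
        ('123', 'L1', 100, 4),
        ('124', 'L1', 100, 6),
        ('125', 'L2', 200, 2);

    -- Hypothetical quality data mart: one aggregated row per line.
    CREATE TABLE mart_quality AS
        SELECT line,
               SUM(produced) AS produced,
               SUM(rejected) AS rejected,
               ROUND(100.0 * SUM(rejected) / SUM(produced), 2)
                   AS reject_rate_pct
        FROM dw_production
        GROUP BY line;
    """
)

rows = conn.execute(
    "SELECT line, reject_rate_pct FROM mart_quality ORDER BY line"
).fetchall()
for line, rate in rows:
    print(line, rate)
```

The mart holds only the aggregated, business-oriented view (here a reject rate per production line), while the detailed records stay in the warehouse table.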
In companies, the discovery platform is mostly used for KPI-based reporting and quality assurance [13]. The main advantages of data mining and knowledge discovery have still not been fully exploited. A variety of knowledge discovery and analytic tools is available for use in a discovery platform. All major software tools provide connectors for relational databases and data warehouses. However, most of the data operations must be handled by the tools themselves, whether on a standalone workstation or in a client-server solution.

Although companies modify the CRISP-DM methodology to suit their needs, most of them preserve the continuity of its main phases, as shown in Fig. 3. Hence, the CRISP-DM methodology is, with a certain degree of abstraction, applicable in any manufacturing industry. [14]

Fig. 3. Relationship between different phases of CRISP-DM [15]

Data mining is now used in many different areas of manufacturing engineering to extract knowledge for use in predictive maintenance, fault detection, design, production, quality assurance, scheduling, and decision support systems. Data can be analysed to identify hidden patterns in the parameters controlling manufacturing processes or to determine and improve the quality of products. This clearly indicates that data mining can be used in many different application areas of manufacturing. [16]

However, the manufacturing process produces a huge amount of data stored in databases containing an enormous number of records. Every record has attributes that need to be explored to discover useful information and knowledge. All of these factors clearly demonstrate that choosing the right methods is crucial to the successful discovery of knowledge. [17]

Nowadays, many types of methods, techniques and algorithms are used in the data mining process. KDnuggets [18] carried out a survey among companies that utilise data mining algorithms. According to this survey, the most used algorithms are: decision trees (rules), regression, clustering, statistics, visualization and time series.

Fig. 4. Survey of methods and algorithms usage in data analysis [18]

Many of the methods are exploitable in several areas, but it is very important to perform a detailed analysis of the tasks to be solved, because the methods are not universally applicable and depend on the problem at hand.

V. KNOWLEDGE DISCOVERY APPROACH PROPOSAL IN MANUFACTURING AREA

The common analytics environment at most big manufacturing companies includes a data warehouse, or a collection of federated data marts, which house and integrate the data for the knowledge discovery process. It includes a range of analysis functions and business intelligence and analytics tools enabling decision support through ad hoc queries, dashboards and data mining.

Large manufacturing companies with large investments in their data warehouses have neither the resources nor the will to replace an existing environment that works well and does what it was designed to do. The majority of large companies utilise a coexistence strategy that combines the best of the data warehouse and analytics environment with the new trends in big data solutions.

Many companies want to continue to rely on data warehouses for standard BI and analytics reporting, including sales reports, customer dashboards, risk history, etc. The coexistence strategy allows companies to keep using the data warehouse for its standard workload and for storing historical data, thus retaining robust traditional business intelligence and analytics tools. [19]

Despite the robustness of traditional business intelligence and analytics tools, semi-structured and unstructured data from the data collection process do not fit well into traditional data warehouses. Furthermore, data warehouses may not be able to handle the processing of frequently or even continually updated big data sets. As a result, organisations are looking for possibilities to collect, store and analyse big sets of data. A newer class of technologies, including the Hadoop framework and NoSQL systems, is often deployed for this task. [20] In some cases, these technologies are used as staging areas for data before they are transformed into a data warehouse, often in a summarised form that is more suitable for relational structures. Big data solution vendors are increasingly pushing the concept of a Hadoop Data Lake that is used as the central repository for the raw data streams present in the company. [21]

This coexistence approach, incorporating a Data Lake as the central repository, serves as the baseline for our knowledge discovery approach in the manufacturing area, captured in Fig. 5.

The proposed knowledge discovery analytic environment is based on the common data warehouse approach. The data warehouse integrates various data from heterogeneous systems across various hierarchical control levels. These heterogeneous data are extracted, transformed and integrated into the data warehouse using the ETL process. This approach is mostly suitable for discontinuous and non-real-time data from higher hierarchical control levels.

The data marts, created from the data warehouse data, provide an organised view of the data from a business unit perspective (such as management, manufacturing, quality, etc.) and provide the basis for the decision making process in the selected area. The data loaded into data marts need to be extracted and transformed to create a data structure suitable for further use.

Fig. 5. Knowledge discovery analytic environment proposal

The ETL process transforming data from the Data Lake into the data warehouse is performed only for data not transferred into the data warehouse directly. The main use of this particular ETL process is loading the manufacturing data from the field level, stored in the Data Lake, into the data warehouse. [20]

The Data Lake, based on the Hadoop framework, provides central data storage for raw manufacturing data. The Data Lake extracts and loads data from heterogeneous database systems and stores them in a raw (original) form. Therefore, the data do not have to be transformed to be stored in the Data Lake. In the Hadoop Data Lake cluster, subsets of the data can be analysed using batch query tools, stream processing software and SQL-on-Hadoop technologies that run interactively or using ad hoc queries in SQL.

One of the biggest issues in obtaining manufacturing data is the way data are collected and processed at the field level of hierarchical control. These data serve as a basis for decision support at higher hierarchical control levels, but the field level data used are usually aggregated into a form more suitable for a particular decision support task. Therefore, the suitability of the data for business intelligence or analytic tools is very limited. However, with the increasing number of sensors connected to the network in the production chain, it is becoming easier to collect the production chain data. This feature is provided by the Field Level Bus.

The discovery platform in this environment is built on the data integrated in the Data Lake. Due to the use of a Hadoop cluster, this environment provides higher performance when working with big data sets than the traditional data warehouse. A big advantage is also the availability of raw data from manufacturing that cannot be easily stored in the data warehouse.

The offer of tools for a discovery platform over a Hadoop cluster is not very wide.
Most of the standard knowledge discovery tools cannot connect to Hadoop cluster using SQL-on-Hadoop solutions. However, this way most of the data manipulation operations must be performed by the tool itself and not by the Hadoop cluster. In order to utilise the full data manipulation performance of Hadoop, software manufacturers offer add-ons or software solutions able to perform selected sets of operations and algorithms directly in a Hadoop cluster. This approach is preferred, since the discovery platform must be able to process the big sets of collected data.

The Field Level Bus collects data from various industrial control systems and loads them into the Data Lake storage. Due to the large amounts of periodic or continuous data collected at this level, the Data Lake builds on the Hadoop cluster technology, which is the most suitable solution for storing the raw field level data. The main task of the Field Level Bus is preparing the data, which is a fundamental step for the further use of field level data, as the data can be collected from various sensors, PLCs, devices, systems, etc.

Data collected at the field level can also be inconsistent. Therefore, transforming the collected data into cleaned forms storable in the Data Lake storage is necessary. The Field Level Bus addresses the need for data analysis aimed at cleaning the raw data. [22]

VI. CONCLUSION

The knowledge discovery analytic platform proposed in this paper incorporates novel trends and methods used in knowledge discovery in the manufacturing area. The traditional data warehouse approach to a knowledge discovery platform is supplemented with a Hadoop cluster to store big data collected at the field level of hierarchical control.

The proposed analytic platform preserves the robustness and well-proven technology of traditional business intelligence and analytic tools, and creates space for knowledge discovery in frequently and continually updated manufacturing data in a raw form. Therefore, it represents an ideal compromise between existing traditional tools and the need for a strong business intelligence, reporting and analytic platform.

The main disadvantage is the necessity of integrating all data in a Data Lake, which makes it difficult to ensure the integrity and security of company data. In traditional relational databases and data warehouses, various approaches, methods and tools for maintaining the integrity and security of company data are available. In a Data Lake represented by a Hadoop cluster, all data need to be integrated together, and the discovery platform must have access to all the data. This is one of the main issues to be addressed when implementing a Data Lake.

The proposed approach focuses on all hierarchical control levels in manufacturing. Therefore, the manufacturing area as a whole represents the main application area of this approach. With a certain degree of abstraction, the approach can also be applied in other industrial fields where large amounts of data need to be collected frequently or continuously.

ACKNOWLEDGMENT

This publication is the result of implementation of the project VEGA 1/0673/15: “Knowledge discovery for hierarchical control of technological and production processes” supported by the VEGA.

REFERENCES

[1] J. Jadlovský, S. Laciňák, M. Čopík and J. Ilkovič, “Technological level of flexible manufacturing system control,” Acta Electrotechnica Informatica, vol. 11, no. 1, pp. 20-24, 2011.
[2] P. Tanuška, P. Važan, M. Kebísek and D. Jurovatá, “Knowledge discovery from production databases for hierarchical process control,” International Journal of Mechanical, Aerospace, Industrial, Mechatronic and Manufacturing Engineering, vol. 7, no. 11, 2013.
[3] J. Jadlovský, J. Laciňák, J. Chovaňák and J. Ilkovič, “Návrh distribuovaného systému riadenia pružnej výrobnej linky,” In: International Conference – Cybernetics and informatics. Vyšná Boca. 2010.
[4] C. Gröger, F. Niedermann and B. Mitschang, “Data Mining-driven Manufacturing Process Optimization,” In: Proceedings of the World Congress on Engineering 2012 Vol III, WCE 2012. Hong Kong: Newswood, 2012, pp. 1475-1481.
[5] K. Wang, S. Tong, B. Eynard, L. Roucoules and N. Matta, “Fuzzy systems and knowledge discovery,” FSKD, 2007.
[6] P. Michalik, J. Štofa and I. Zolotová, “Testing the properties of Kmeans algorithm for data mining applications,” In: LINDI 2013: 5th IEEE International Symposium on Logistics and Industrial Informatics: Proceedings, September 5-7, 2013, Wildau, Germany. Piscataway: IEEE, 2013, pp. 99-102. ISBN 978-1-4799-1257-5.
[7] G. Köksal, İ. Batmaz and M. C. Testik, “A review of data mining applications for quality improvement in manufacturing industry,” Expert Systems with Applications, 38 (10) (2011), pp. 13448–13467.
[8] Rexer Analytics, “Data miner survey – 2013 survey summary report,” 2014, [cit. 20.06.2015]. Available online: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html.
[9] KDnuggets, “Data Mining Community’s Top Resource,” 2014, [cit. 23.06.2015]. Available online: http://www.kdnuggets.com.
[10] B. Vorhies, “The big deal about big data: what’s inside – structured, unstructured, and semi-structured data in data magnum blog,” 2013, [cit. 25.06.2015]. Available online: http://data-magnum.com/the-big-dealabout-big-data-whats-inside-structured-unstructured-and-semistructured-data/.
[11] X. Z. Wang, “Data mining and knowledge discovery for process monitoring and control advances in industrial control,” Springer Science & Business Media, 2012. 251p. ISBN 978-1-44710-421-6.
[12] X. Zhu, “Knowledge discovery and data mining: challenges and realities,” Idea Group Inc (IGI), 2007. 290p. ISBN 978-1-59904-252-7.
[13] H. Chen, R. H. L. Chiang and V. C. Storey, “Business intelligence and analytics: from big data to big impact,” MIS Q. 36, 4 (December 2012), 1165-1188.
[14] P. Chapman, P. Kerber, J. Clinton, J. Khabaza, T. Reinartz, C. Shearer and R. Wirth, “The CRISP-DM Process Model,” Discussion Paper. 0503-99. March 1999.
[15] “What is the CRISP-DM Methodology,” [cit. 20.06.2015]. Available online: http://www.sv-europe.com/crisp-dm-methodology/.
[16] J. A. Harding, M. Shahbaz, S. Srinivas and A. Kusiak, “Data mining in manufacturing: a review,” Journal of Manufacturing Science and Engineering-Transactions of ASME, 128(4), pp. 969–976, 2006.
[17] A. K. Choudhary, M. K. Tiwari and J. A. Harding, “Data mining in manufacturing: a review based on the kind of knowledge,” In: Journal of Intelligent Manufacturing. Leicestershire: Loughborough University’s Institutional Repository. 20 (5), pp. 501–521, 2009.
[18] KDnuggets, “Algorithms for data analysis/data mining. Which methods/algorithms did you use for data analysis?,” [cit. 19.06.2015]. Available online: http://www.kdnuggets.com/polls/2011/algorithms-analytics-datamining.html.
[19] T. H. Davenport and J. Dyché, “Big data in big companies,” Thomas H. Davenport and SAS Institute Inc, May 2013.
[20] W. Fan and A. Bifet, “Mining big data: current status, and forecast to the future,” SIGKDD Explor. Newsl. 14, 2 (April 2013), 1-5.
[21] M. Rouse, “Big data analytics,” [cit. 20.06.2015]. Available online: http://searchbusinessanalytics.techtarget.com/definition/big-dataanalytics.
[22] S. Zhang, C. Zhang and Q. Yang, “Data preparation for data mining,” Applied Artificial Intelligence: An International Journal, vol. 17, no. 5-6, 2003. Taylor & Francis, 2003. doi: 10.1080/713827180.
