Big Data Analytical Platform (BDAP) - Final Project

ID          Name               Responsible for application   Responsible for survey
313734436   ויבורוב מיכאל       Apache Spark                  3D Printing
200600583   בורדייניק יניב      Apache Storm
200240588   פיראס אבו ג'בל      HPCC                          Identifying
200904399   ג'וזף עון           Google BigQuery               Interactive Code

1 About the Domain

1.1 Background About The Domain

Big Data Analytics Platforms

Big Data concerns massive, heterogeneous, autonomous sources with distributed and decentralized control. These characteristics make it an extreme challenge for organizations using traditional data management mechanisms to store and process these huge data-sets. Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists can analyze huge volumes of data that conventional analytics and business intelligence solutions cannot grasp.

Big-data platforms and big-data analytics software focus on providing efficient analytics for extremely large data sets. These analytics help organizations gain insight by turning data into high-quality information, providing deeper insights about the business situation. This enables the business to take advantage of the digital universe.

This work is intended to describe software platforms and tools available today to support an endeavor to discover hidden knowledge in the big data paradigm. Such analytical findings are expected to lead to new business and science opportunities.

• The Benefits of Big Data Analytics
❖ Enterprises are increasingly looking to find actionable insights in their data. Many big data projects originate from the need to answer specific business questions. With the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management.

• The Challenges of Big Data Analytics
❖ For most organizations, big data analysis is a challenge. Consider the sheer volume of data and the different formats of the data (both structured and unstructured) that are collected across the entire organization, and the many different ways different types of data can be combined, contrasted and analyzed to find patterns and other useful business information.
❖ The first challenge is in breaking down data silos to access all the data an organization stores in different places and often in different systems. A second big data challenge is in creating platforms that can pull in unstructured data as easily as structured data. This massive volume of data is typically so large that it is difficult to process using traditional database and software methods.

• Big Data Requires High-Performance Analytics
❖ To analyze such a large volume of data, big data analytics is typically performed using specialized software tools and applications for predictive analytics, data mining, text mining, forecasting and data optimization. Collectively these processes are separate but highly integrated functions of high-performance analytics. Using big data tools and software enables an organization to process extremely large volumes of data that a business has collected, to determine which data is relevant and can be analyzed to drive better business decisions in the future.
1.2 Reviews From The Literature

1.2.1 Terms

Big Data: Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured, time sensitive or simply very large cannot be processed by relational database engines.

Business Intelligence: Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes, using computing technologies for the identification, discovery and analysis of business data such as sales revenue, products, costs and incomes.

Analytics: The process of collecting, processing and analyzing data to generate insights that inform fact-based decision-making. In many cases it involves software-based analysis using algorithms. Big data analytics visualization is a visual representation of the insights gained from analysis.

Visualization: Big data visualization refers to the implementation of more contemporary visualization techniques to illustrate the relationships within data. Visualization tactics include applications that can display real-time changes and more illustrative graphics, thus going beyond pie, bar and other charts. These illustrations veer away from the use of hundreds of rows, columns and attributes toward a more artistic visual representation of the data.

Data Mining: Computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Data Scientist: Term used to describe an expert in extracting insights and value from data. It is usually someone who has skills in analytics, computer science, mathematics, statistics, creativity, data visualisation and communication, as well as business and strategy.

Distributed Computing: A software system in which components located on networked computers communicate and coordinate their actions by passing messages.

Distributed File System: Data storage system designed to store large volumes of data across multiple storage devices (often cloud-based commodity servers), to decrease the cost and complexity of storing large amounts of data.

Computer Cluster: A computer cluster consists of a set of loosely or tightly connected computers that work together, with each node set to perform the same task, controlled and scheduled.

Cluster: A special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Clusters are known for boosting the speed of data analysis applications. They are also highly scalable.

Data Warehouse: A data warehouse (DW) is a collection of corporate information and data derived from operational systems and external data sources. A data warehouse is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.

Analytics Platform: Software that provides the tools and computational power needed to build and perform many different analytical queries.

Anonymization: Data anonymization is the process of destroying tracks, or the electronic trail, on the data that would lead an eavesdropper to its origins.
An electronic trail is the information that is left behind when someone sends data over a network.

Concurrency: The ability to execute multiple processes at the same time.

Data Analysis: Process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.

Computer Cluster: A computer cluster consists of a loosely or tightly connected set of computers that work together so they can be viewed as a single system. Each node (computer) is set to perform the same task, controlled and scheduled by software.

Data Parallelization: This form of parallelism focuses on the distribution of data sets across multiple computation programs. In this form, the same operations are performed by different parallel computing processors on the distributed data subsets.

Task Parallelization: This form of parallelism covers the execution of computer programs across multiple processors on the same or multiple machines. It focuses on executing different operations in parallel to fully utilize the available computing resources in the form of processors and memory.

Resilient Distributed Datasets: A technique to distribute data between computer clusters in a manner that supports fault tolerance.

Fault Tolerance: Fault-tolerant describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service. Fault tolerance can be provided with software, embedded in hardware, or provided by some combination of the two.

Batch Processing: Batch processing is a general term used for frequently used programs that are executed with minimum human interaction. Batch process jobs can run without any end-user interaction or can be scheduled to start up on their own as resources permit.

Structured vs. Unstructured Data: Structured data is basically anything that can be put into a table and organized in such a way that it relates to other data in the same table. Unstructured data is everything that cannot: email messages, social media posts and recorded human speech.

Map-Reduce: Refers to the software procedure of breaking up an analysis into pieces that can be distributed across different computers in different locations. It first distributes the analysis (map) and then collects the results back into one report (reduce).

Hadoop: Apache Hadoop is one of the most widely used software frameworks in big data. It is a collection of programs which allow storage, retrieval and analysis of very large data sets using distributed hardware (allowing the data to be spread across many smaller storage devices rather than one very large one).

1.2.2 Domain Ontology

Business Intelligence And Analytics (BI & A): From Big Data To Big Impact. In this research we can see how Business Intelligence & Analytics impacts the data-related problems to be solved in business organizations. As a data-centric approach, BI & A has its roots in the longstanding database management field. It relies on various data collection, extraction and analysis technologies. The relation between BI & A and Big Data is now stronger than ever. Many organizations have realized that they need to be prepared for the transformation from normal data management to big data management, which means that the organization's platforms need to be ready for it. Most of the platforms share the same idea: build a job for the data streaming process, divide the data between the nodes, and afterwards execute the relevant queries on the new data.

● http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf
● http://www.informationweek.com/big-data/big-data-analytics/16-top-bigdata-analytics-platforms/d/d-id/1113609
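The Map-Reduce term defined in section 1.2.1, and the "divide the data between the nodes, then execute the queries" idea above, can be illustrated with a small, self-contained example. The following is a minimal sketch in plain Python, not tied to any of the surveyed platforms; the function and variable names are illustrative only. A map phase is applied independently to each chunk of input (in a real cluster, each chunk would sit on a different node) and a reduce phase merges the partial results.

    from collections import Counter
    from functools import reduce

    def map_phase(chunk):
        """Map: turn one chunk of raw text into partial (word, count) results."""
        return Counter(chunk.split())

    def reduce_phase(left, right):
        """Reduce: merge two partial results into one."""
        return left + right

    # Each chunk could live on a different node in a real cluster.
    chunks = ["big data analytics", "big data platforms", "analytics platforms"]

    partial_results = [map_phase(c) for c in chunks]      # distributed map step
    word_counts = reduce(reduce_phase, partial_results)   # collect and reduce step

    print(word_counts)
    # Counter({'big': 2, 'data': 2, 'analytics': 2, 'platforms': 2})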
1.2.3 Literature Review

For many years, companies and organizations used traditional data warehouses to analyze business activities and improve their decision-making processes [3, 4, 5, 7]. In recent times, many new complex types of data have emerged, and the rate at which much of the data is being created forces organizations to turn to advanced techniques for processing the data, like cleansing, pre-processing, job parallelizing and so on [1, 2, 11].

Generally, it became apparent that in order to continue producing business insights from the data organizations keep gathering at an increasing rate, the need for advanced tools for data analysis became more and more pressing. This need is answered by a newly emerged form of data analysis systems: Big Data systems [3, 11]. Big Data is therefore a term associated with the new types of workloads and underlying technologies needed to solve business problems that we could not previously support due to technology limitations, prohibitive cost, or both [3]. Naturally, organizations investing in these novel and promising analytics techniques expect to harness the benefits and see data-driven insights across all levels of the organization. For example, using GPS-enabled navigation one can expect to get real-time traffic updates, route suggestions and so on [10].

Three major aspects of analytics architecture for Big Data systems are recognized: Unified Information Management, Real-Time Analytics and Intelligent Processes [3, 4, 5, 7]. These aspects are handled by different systems from different vendors, including open-source organizations [6]. These aspects may be more or less important in different fields of application of a Big Data Analytics system. For example, a behavioral threat detection system may put a special accent on the real-time analytics aspect [8], while a healthcare analytical system may see free-structured data performance as a very important aspect of the system [9], and a security analytics system may require a very strong suite for the information management part [11]. Valuable capabilities of Big Data Analytical platforms are noted and defined [12], e.g. information delivery, analysis and integration, and existing tools are compared and evaluated. Interestingly enough, one of the major challenges for the commercial tools is that many of them are limited to specific use cases [12].

1.3 Domain Boundaries

In reality, Big Data Analytics platforms are used in different areas for different tasks. The main areas are:
● Software solutions for managing data storage, manipulation and retrieval tasks.
● Frameworks, platforms and tools that facilitate development and execution of analytics processes that bear business value.
● Various techniques, libraries and software products that facilitate data mining processes.
● Reporting and visualization tools and applications.

It is important to note that all four of these areas are interconnected. Many industry-known solutions encapsulate several or all of them. Our focus, however, is the second bullet - Big Data Analytics Platforms. The applications in this field are designed to facilitate analytical tasks and techniques by creating an abstraction layer between the developer and the underlying robust software and hardware infrastructure that is capable of handling resource-consuming analysis tasks in the Big Data paradigm.
We are not going to allocate considerable attention to issues of storing and accessing big data, visualization techniques and solutions, or particular data mining techniques applied to the big data realm. Our focus will be on big data analytics, which aims to uncover meaningful patterns in data.

1.3.1 Applications Within The Domain

● HPCC (High Performance Computing Cluster): a massively parallel-processing computing platform that solves Big Data problems. Link: http://hpccsystems.com/ (open source).
● Google BigQuery: a RESTful web service that enables interactive analysis of massively large data-sets, working in conjunction with Google storage. Link: https://cloud.google.com/ (not open source).
● Spark: a cluster computing framework well suited to machine learning algorithms. Includes several components to handle task management and data access. Link: https://spark.apache.org/ (open source).
● Storm: a distributed real-time computation system aimed at easy and reliable processing of unbounded streams of data in real time. Link: http://storm.apache.org/ (open source).

1.3.2 Relevant Applications Outside The Domain

● QlikView: self-service data visualization and guided analytics that lead to insights that ignite good ideas. Why outside: the complexity of QlikView depends on the amount of data the tool is trying to analyze; it is oriented to BI.
● Microsoft Excel: empowers users to gain actionable insights with data from many sources; Excel supports technologies such as Power Query, Power Pivot and Power View to perform dynamic analysis on the combined data set. Why outside: many users question Excel's ability to handle Big Data analytics.
● InfiniDB: a scalable, software-only columnar database management system for analytic applications. Why outside: InfiniDB is an RDBMS and, in this category, does not support Big Data analytics.
● Apache Cassandra: an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Why outside: Cassandra does not support large-scale analytics, although Apache has claimed it will support Big Data in the future.
● HIVE: data warehouse software that facilitates querying and managing large data-sets residing in distributed storage. Why outside: HIVE represents only a part of a whole Big Data system; the data warehouse only hosts the data, while the analysis is made by a different tool.
● Weka: a collection of machine learning algorithms for data mining tasks; Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Why outside: Weka is a machine learning tool for data mining that may offer some visualization, but it is not an analytics platform.
● Gephi: open source software for network visualization and analysis. Why outside: Gephi does not support Big Data volumes and is not oriented towards running data processing.
● Sisense: business intelligence that enables non-technical users to join and analyze large data sets from multiple sources (Sisense was founded in 2004 in Tel-Aviv). Why outside: Sisense is not a Big Data framework, but software for making data analysis and visualizations about the company.

1.4 Similarities And Differences

1.4.1 Similarities Between The Applications

1. Data access provider: any platform aimed at data analysis must be equipped with a module that is able to access the data. At least one of two options is used:
   · Stream access.
   · Persistent data access.
2. Process/task management facility: parallelization; the frameworks/platforms are required to provide at least one of:
   · Data parallelization management.
   · Task parallelization management.
3. Analytics design: provide a way to design a job (query) for processing the data, using external tools, a built-in facility, or both. This job must be defined using one of the methods:
   · Batch execution.
   · Stream processing.
   · Interactive execution.
4. Environment management: provide the ability to manage cluster and node configurations.

1.4.2 Differences Between The Applications

1. Reporting facility: a reporting facility is optional for the area we are discussing.
2. Different methods of data access: several methods for accessing the data may be supported by different tools. Platforms that support persistent data access are required to support at least one of file-system or SQL-like access methods. In-memory data access and Map-Reduce functionality are optional features in the domain of Big Data Analytics Platforms.
3. Clustering: clustered organization of platform nodes is optional.
4. External tools support: different approaches to incorporating external tools.
5. Environment requirements differ from one application to another.
6. Different interfaces: some applications are web applications while others are desktop applications.

1.4.3 Feature Diagram

2 Domain Model

2.1 Mandatory vs. Optional First Order Elements

● Cluster (Class Diagram) - Optional, Many. The central logical part of the environment.
● Job (Class Diagram) - Mandatory, Many. The artifact of the developer's work.
● Chunk (Class Diagram) - Optional, Many. A part of a job; optional because not all of the platforms may support job splitting.
● Node (Class Diagram) - Mandatory, Many. Integral part of the environment; responsible for processing the jobs/chunks.
● DataStorage (Class Diagram) - Mandatory, Many. Part of the system that is responsible for providing/writing data to different data sources.
● Admin (Class Diagram) - Mandatory, Single. User entity responsible for system maintenance.
● Developer (Class Diagram) - Mandatory, Single. User entity responsible for creating and executing jobs.
● User Management (Use Case) - Optional, Single. Part of the maintenance procedures in the system; optional because some systems may rely on external user management.
● Manage Environment (Use Case) - Mandatory, Single. General maintenance procedures.
● Monitoring/Reporting (Use Case) - Optional, Many. Support activity; optional because this is not a core purpose of the system and may be delegated to external tools.
● Data Storage (Use Case) - Mandatory, Many. Utilization of data access; provided as a core service.
● Design Job (Use Case) - Mandatory, Many. Core activity by the developers.
● Execute Job (Use Case) - Mandatory, Many. Also a core activity by the developers.
● Allocate Resources (Use Case) - Mandatory. Internal activity; a service provided by the system.
● Config Job (Use Case) - Mandatory. Activity by the developer; responsible for describing the environment for job execution.
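To make the structural relationships implied by the table above more tangible, the following is a small illustrative sketch in Python (our own illustration, not the project's class diagram): a cluster groups many nodes and storages, a job may be split into chunks, and nodes process jobs or chunks. All class and field names are assumptions made for the example.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataStorage:
        """Provides/writes data to different data sources (mandatory, many)."""
        uri: str

    @dataclass
    class Chunk:
        """Part of a job; optional, since not all platforms support job splitting."""
        payload: str

    @dataclass
    class Job:
        """The artifact of the developer's work (mandatory, many)."""
        name: str
        chunks: List[Chunk] = field(default_factory=list)

    @dataclass
    class Node:
        """Processes jobs or chunks (mandatory, many)."""
        host: str
        assigned_jobs: List[Job] = field(default_factory=list)

    @dataclass
    class Cluster:
        """Central logical part of the environment (optional, many)."""
        nodes: List[Node] = field(default_factory=list)
        storages: List[DataStorage] = field(default_factory=list)

    # Hypothetical usage: one node, one storage, one job split into two chunks.
    cluster = Cluster(nodes=[Node(host="node-1")], storages=[DataStorage(uri="hdfs://data")])
    job = Job(name="word-count", chunks=[Chunk(payload="part-0"), Chunk(payload="part-1")])
    cluster.nodes[0].assigned_jobs.append(job)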
2.2 Variation Points & Variants At First Degree Elements

Variation point (type and diagram), its characteristics, and the variation options with rules for applying them:

● Data Access (Use Case) - different systems support different data sources. Variants: Persistent, Memory, Stream.
● Persistent (Use Case) - may not be supported in some systems. Variants: Map Reduce, Database, File System.
● Memory (Class Diagram) - may not be supported in some systems. Variants: Resilient.
● Infrastructure Architecture (Use Case, Class Diagram) - different systems may have different infrastructure. Variants: Single Server, Clusters.
● Node Roles - different systems may handle different types of nodes. Variants: Dedicated, Equal Roles.
● Management (Use Case, Class Diagram) - the sort and level of management may differ between the systems. Variants: Users Management, Servers Management.
● Language (Use Case) - system dependent. Variants: Internal Language, External Language.
● External Language (Use Case) - system dependent. Variants: Scala, JavaScript, R, Java.
● Execution Method (Use Case) - system dependent. Variants: Stream, Batch, Interactive.
● Monitoring (Use Case, Class Diagram) - system dependent. Variants: Reports, Graphs.

2.3 Domain Models

2.4 Principal Use Case diagram for the BDAP domain

2.5 Principal Class diagram for the BDAP domain

2.6 Principal Sequence diagram for the BDAP domain

3 Applications

3.1 Apache Spark

3.1.1 Apache Spark Background

Apache Spark is a cluster computing framework designed to meet performance requirements in the Big Data realm. Apache Spark is built on top of cluster management and distributed storage systems and is able to interface with several existing infrastructure platforms such as Hadoop, Cassandra and S3.

3.1.2 Apache Spark Requirements

● RAM: minimum 8GB per server.
● CPU: Spark is a highly scalable system; 8-16 cores per server are recommended.
● Local disk space: 4-8 disks per node are recommended.
● External storage systems: HDFS/DB.
● Network communication between the components: 10Gbps is recommended.

3.1.3 Apache Spark Models

Legend for reading color coding:

3.1.3.1 Principal Use Case diagram for Apache Spark

3.1.3.2 Principal Class diagram for Apache Spark

3.1.3.3 Principal Sequence diagram for Apache Spark running a non-interactive job
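As a concrete illustration of the non-interactive batch job flow referred to above, here is a minimal PySpark word-count job. It uses only the public Spark API (SparkSession and RDD transformations); the application name and HDFS paths are illustrative placeholders, and cluster-specific configuration (master URL, memory, cores) is assumed to be supplied when the job is submitted via spark-submit.

    from operator import add
    from pyspark.sql import SparkSession

    # Entry point to Spark; configuration such as the master URL and executor
    # memory is normally passed via spark-submit rather than hard-coded here.
    spark = SparkSession.builder.appName("bdap-word-count").getOrCreate()
    sc = spark.sparkContext

    # Read a text file from the external storage system (the HDFS path is a placeholder).
    lines = sc.textFile("hdfs:///data/input.txt")

    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with a count of 1
                   .reduceByKey(add))                     # sum the counts per word

    # Collect a small sample to the driver and write the full result back to storage.
    for word, count in counts.take(10):
        print(word, count)
    counts.saveAsTextFile("hdfs:///data/word_counts")

    spark.stop()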
3.2 Apache Storm

3.2.1 Storm Background

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations. Some specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cyber security analytics and threat detection.

3.2.2 Storm Requirements

● Internet connection.
● Linux machine.
● Java 6 installed on the machine.
● Python 2.6 installed on the machine.
● Storm cluster installation (where the actual topology runs).
● Storm client installation (required for topology management).

3.2.3 Storm Models

Legend for reading color coding:

3.2.3.1 Storm Use Case Diagram

3.2.3.2 Storm Class Diagram

3.2.3.3 Storm Sequence Diagram

3.3 HPCC

3.3.1 HPCC Background

HPCC (High Performance Computing Cluster) is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. It stores and processes large quantities of data, processing billions of records per second using massive parallel processing technology. Large amounts of data across disparate data sources can be accessed, analyzed and manipulated in fractions of seconds. HPCC functions as both a processing and a distributed data storage environment, capable of analyzing terabytes of information.

The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

3.3.2 HPCC Requirements

● Enables massive amounts of data across disparate databases to be accessed, analyzed and manipulated in fractions of seconds.
● Functions as both a processing and distributed data storage environment.
● The HPCC platform is designed to work on simple commodity hardware using a simple commodity operating system.
● Plugs into any computing language.
● Works over the Internet or over a private network.
● Operates on either distributed or centralized systems.

3.3.3 HPCC Models

Legend for reading color coding:

3.3.3.1 HPCC Use Case Diagram

3.3.3.2 HPCC Class Diagram

3.3.3.3 HPCC Sequence Diagram

3.4 Google BigQuery

3.4.1 Background

Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. You simply move your data into BigQuery and let the service handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.

3.4.2 Requirements

These requirements were required in the Python Codelab and are also required for this lab:
● A laptop/notebook computer.
● Python 2.7.x.
● App Engine SDK: https://developers.google.com/appengine/downloads
● Development environment: an IDE, or a text editor and command-shell pair.

In addition, this BigQuery Dashboard Codelab requires:
● Access to the command line (shell) on your computer.
● A valid Google Account.
● Access to the BigQuery service. You can sign up for BigQuery using these instructions.
● The gviz_data_table library: https://pypi.python.org/pypi/gviz_data_table/
● Google APIs Client Library for Python: https://github.com/google/google-api-python-client/

3.4.3 BigQuery Models

3.4.3.1 BigQuery Use Case Diagram

3.4.3.2 BigQuery Class Diagram

3.4.3.3 BigQuery Sequence Diagram
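As an illustration of the interactive, SQL-like querying described in 3.4.1, the following minimal sketch runs an aggregation over one of BigQuery's public sample tables. It uses the current google-cloud-bigquery client library rather than the older Google APIs Client Library listed in the codelab requirements, and it assumes Google Cloud credentials and a default project are already configured in the environment; the dataset and query are only examples.

    from google.cloud import bigquery

    # Credentials and the default project are picked up from the environment
    # (e.g. GOOGLE_APPLICATION_CREDENTIALS); both are assumed to be configured.
    client = bigquery.Client()

    # A simple aggregation over a public sample dataset (an append-only table).
    query = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """

    # query() starts a job; result() waits for it and returns an iterator of rows.
    for row in client.query(query).result():
        print(row.corpus, row.total_words)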
3.5 Differences Between The Applications

Variance OVM

Criteria                 Apache Spark            Apache Storm            HPCC    BigQuery
Execution Method         All available options   All available options
Built-in Tools           support                 support
External Tools           support                 support
Environment Management   support                 support
User Management          no support              support
Task Parallel            support                 support
Data Parallel            support                 support
Fault Tolerance          support                 support
Monitoring/Reporting     no support              support
Stream Data Access       support                 support
Persistent Data Access   support                 support
Map Reduce               support                 support
In-Memory Resilience     support                 support

4 Papers Reviewed And Their Relations To The Domain

4.1 3D Printing and Big Data Analytics Platforms

Papers reviewed:
● Mathieu Acher, Benoit Baudry, Olivier Barais and Jean-Marc Jézéquel. Customization and 3D Printing: A Challenging Playground for Software Product Lines.
● Thomas Thum, Sven Apel, Christian Kastner, Ina Schaefer and Gunter Saake. A Classification and Survey of Analysis Strategies for Software Product Lines.

After being around for more than a decade, 3D printing (also called additive manufacturing) technologies have become more and more popular and widely available to the general public. They are used in a vast variety of areas such as archaeology, industrial prototyping and design, medicine, rapid manufacturing and many more. It is also speculated that the next evolution of 3D printing will be "customization for the masses". By printing one product at a time, 3D printing provides the means and tools to fit the customer's identity and preferences.

The research focus of the paper is Thingiverse, the most popular website and virtual community for sharing user-created 3D printable design files. This community encompasses the methods for creating, controlling and managing digital models of 3D-printable objects. Currently Thingiverse counts more than 160,000 members and stores more than 200,000 downloadable digital designs. Thingiverse employs an SPLE-like approach to 3D-printable objects: the objects are presented in the form of design files, and they can be customized and combined in order to fit any particular set of requirements. We have found the following SPLE characteristics and properties to be salient in the 3D printing paradigm:
● Customization and variation management.
● Focus on the means of efficiently designing and maintaining similar products.
● Mastering base customization, relying heavily on software technologies.
● Design analysis: type checking, static analysis, model checking and theorem proving.
● SPL strategies for design specification and variability management: domain wide, family wide, product based and family based.

BDAP and 3D Printing in SPLE Perspective

Platforms for Big Data analytics are very complex systems that often require very strong underlying hardware. Numerous existing frameworks, tools and applications are utilized as components in the aggregative Big Data Analytical system. The target audience of such systems is data analysts who use a variety of interconnected data-crunching services and very often do not have the expertise to install and maintain the building blocks of such systems. This encourages the market of BDAP systems to support high customizability and a modular architecture through a very simple and intuitive interface. For example, Amazon cloud services allow a subscriber to create a new analytical cluster, configure the software components and underlying hardware, and run it in less than 5 minutes of work.

This paradigm is made very beneficial by the means and techniques described by advanced SPL engineering, and it is quite similar to the situation we observe in the Thingiverse community: a variety of 3D objects are available for customization and composition into a desirable design for printing, in the same manner as a variety of software and hardware components are available to be customized and interconnected into one complex ready-to-go system. Reviewing 3D printable object design through the lens of SPLE helped us to see some abstraction of the SPLE approach and gave us practical tools for the later investigation of the BDAP domain. Decomposing a complex BDAP system into functional modules and discussing each of them in relation to the others was facilitated by the previously learned application of SPLE principles.

4.2 Defining In-active Code - Overview

Software companies provide several products to several companies; however, they also sometimes provide the same product to several companies but with specific functionality for each company. Basically, this product has the same basic functionality; however, differences might appear in the specific requirements of each customer.
In order to handle this variability problem, software companies develop automated approaches to manage the common functionality of a software product. They then refine the source code to develop a solution for a particular customer. The approach presented in this paper to solve the variability issue in a software product line comprises three steps:

1. Computing the system dependence graph (SDG): the SDG is a graph consisting of nodes and edges which together show the program elements and the different dependencies between them. The approach in this paper relies on the SDG.
2. Extracting presence conditions for the conditional system dependence graph (CSDG): conditions are added to the SDG dependencies to describe the program's variation points; usually a condition is a Boolean formula over configuration options.
3. Identifying inactive code: finally, when the CSDG is ready, the identification of the inactive code for a specific product variant is accomplished using the CSDG and the product's concrete configuration. The product configuration assigns values to the condition formulas in the CSDG, and according to these values active and inactive code can be identified by a reachability analysis.

In order to evaluate the approach, the authors checked effectiveness, accuracy, and performance. The evaluation was performed using the approach as implemented and customized for Keba's KePlast product line.

Effectiveness: It has been shown that the cost of code maintenance increases with source code size and complexity. The goal, then, is to evaluate to what extent the approach can ease maintenance tasks for application engineers by reducing code size. To check the approach, it was used on the code of two different product variants. Results show that there is a significant amount of inactive code for most of the configuration options; the inactive code also scatters over several files for a majority of the configuration options. Therefore, the inactive code will not be obvious to a developer performing a maintenance task.

Accuracy: The accuracy of the approach was evaluated regarding the identification of inactive code for configuration options. The approach's results were compared to a domain expert's results. According to the comparison, the approach has a very high level of accuracy, sometimes better than the domain expert (in cases where the code is scattered over many files). However, the domain expert can identify irrelevant code which the approach cannot, e.g. assignments to variables which are never read; the expert also removed declarations when the declared element was not used anymore.

Performance: The run-time performance of the approach's implementation was evaluated. The measurements were made for 15 different product variants. Performance measurements were conducted to determine the times needed to parse the source code, to build the CSDG, to extract the presence conditions and to perform the configuration-based pruning. The results of the evaluation show that the performance of the configuration-aware analysis is sufficient for application in industrial settings.
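To make the configuration-based pruning of step 3 concrete, the following is a small illustrative sketch in Python (our own simplification, not the paper's implementation): a CSDG is modeled as a dependence graph whose edges carry presence conditions over configuration options, and code elements that become unreachable from the entry points under a concrete configuration are reported as inactive. The element names and the LOGGING option are hypothetical.

    # Minimal, illustrative CSDG pruning: nodes are code elements, edges are
    # dependencies guarded by presence conditions over configuration options.
    # This is a simplification of the paper's approach, not its implementation.

    def find_inactive(edges, entry_points, configuration):
        """Return code elements unreachable from the entry points under the
        given configuration (i.e. inactive code for this product variant)."""
        all_nodes = set(entry_points)
        for source, target, condition in edges:
            all_nodes.update((source, target))

        # Keep only edges whose presence condition holds for this configuration.
        active_edges = {}
        for source, target, condition in edges:
            if condition(configuration):
                active_edges.setdefault(source, []).append(target)

        # Simple reachability analysis over the pruned graph.
        reachable, worklist = set(entry_points), list(entry_points)
        while worklist:
            node = worklist.pop()
            for succ in active_edges.get(node, []):
                if succ not in reachable:
                    reachable.add(succ)
                    worklist.append(succ)
        return all_nodes - reachable

    # Hypothetical example: 'log_call' is only present when the LOGGING option is on.
    edges = [
        ("main", "init", lambda cfg: True),
        ("main", "log_call", lambda cfg: cfg["LOGGING"]),
        ("log_call", "format_msg", lambda cfg: cfg["LOGGING"]),
    ]
    print(find_inactive(edges, ["main"], {"LOGGING": False}))
    # -> {'log_call', 'format_msg'} (a set, so the printed order may vary)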
So in conclusion, how does all of this relate to the course and to variability implementation? The paper presents an approach for identifying inactive code in product variants of a product line using a code analysis technique. In industrial product lines, customer-specific products are frequently developed in a two-stage process: first the required features are selected to create an initial product; then the code of the initial product is refined and adapted in a clone-and-own way to address specific customer requirements. The technique allows hiding inactive code in a product configuration to support application engineers. It is based on a system dependence graph (SDG) encoding all the global control and data dependencies in a program. Furthermore, the SDG is configuration-aware, i.e. it represents the variability of the system. The approach has currently been implemented for a programming language used in the domain of industrial automation. The approach also supports configuration parameters and branch statements in code, two widely used mechanisms for implementing variability in product lines. This work is regarded as a first step towards a comprehensive configuration-aware code analysis technology for product lines. The conditional SDG represents a strong basis for implementing diverse analysis techniques. Furthermore, the KePlast product line comes with a Java-based HMI framework, where analogous variability mechanisms are used.

5. Process and Conclusions

The work was conducted in three phases:

Phase I: Select Domain. In this phase we came to an agreement on one particular domain (Big Data Analytical Platforms). We then defined the domain boundaries and found systems that are related to the domain and those that are not. We discussed particular examples of systems that border the domain.

Phase II: Describe Particular Systems. During the second phase we investigated four different systems in our domain, extracting related features and describing the systems with UML diagrams.

Phase III: Describe Domain. In the last phase we modeled the domain of Big Data Analytical Platforms based on the descriptions of the four systems from the previous phase. We applied the domain models to the UML design of each of the four systems and highlighted common, optional and system-specific features and elements. We used the OVM modeling methodology for describing the variability allowed inside the domain.

Conclusions

During the work on this course we experienced an iterative process of modeling a software domain. The process began with discovering the domain, refining the domain borders, describing in detail particular members of the domain, and finally gathering all the information together into a domain model. A high-level review of this endeavor allows us to point out several key points relevant to the academic course:

● The iterative nature of this process forces one to review one's own earlier conclusions, with newly learned knowledge, over and over again.
● Precision in software domain definition is an important and non-trivial task for a software engineer. This task must be completed at the very early stages of the software design and planning process.
● During our work we learned the usage of the OVM modeling technique.

6. Bibliography

6.1 Bibliography For Definitions and Applications

● What is Business Intelligence? Retrieved from http://searchdatamanagement.techtarget.com/definition/business-intelligence
● What is Distributed Computing? Definition from WhatIs.com. (n.d.). Retrieved from https://en.wikipedia.org/wiki/Distributed_computing
● What is Computer Cluster? Retrieved from http://www.techopedia.com/definition/6581/computer-cluster
● What is Map Reduce? Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
● What is Hadoop?
Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/
● Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
● Apache Storm Application Details. Retrieved from http://hortonworks.com/hadoop/storm/
● What is Apache Storm? Retrieved from http://en.wikipedia.org/wiki/Storm_(event_processor)
● Apache Spark Application Details. Retrieved from https://spark.apache.org/
● What is Apache Spark? Retrieved from http://en.wikipedia.org/wiki/Apache_Spark
● Apache Spark details. Retrieved from https://databricks.com/spark/about
● Features in Apache Spark. Retrieved from http://java.dzone.com/articles/6-sparkling-features-apache
● HPCC Application Details. Retrieved from http://hpccsystems.com/
● HPCC Data Profiling Demo. http://hpccsystems.com/demos/data-profiling-demo

6.2 Bibliography For The Domain

● Hsinchun Chen, Roger H.L. Chiang & Veda C. Storey (2012). Business Intelligence And Analytics: From Big Data To Big Impact. Retrieved from http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf
● Feature Model to Orthogonal Variability Model Transformation Towards Interoperability Between Tools. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.188.8846&rep=rep1&type=pdf
● http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
● http://www.sas.com/en_us/insights/analytics/big-data-analytics.html
● http://www.predictiveanalyticstoday.com/bigdata-platforms-bigdata-analytics-software/
● http://www.webopedia.com/TERM/B/big_data_analytics.html
● http://www.smartdatacollective.com/bernardmarr/287086/big-data-22-key-terms-everyone-should-understand
● http://www.techopedia.com
● http://searchdisasterrecovery.techtarget.com

References

1. Jeffery Lucas, Uzma Raja, Rafay Ishfaq. How Clean is Clean Enough? Determining the Most Effective Use of Resources in the Data Cleansing Process. International Conference on Information Systems, Auckland, 2014.
2. Zhang Ruojing, Jayawardene Vimukthi, Indulska Marta, Sadiq Shazia, Zhou Xiaofang. A Data Driven Approach for Discovering Data Quality Requirements. ICIS 2014: 35th International Conference on Information Systems.
3. Mike Ferguson. Architecting A Big Data Platform for Analytics. White paper prepared for IBM, October 2012.
4. Oracle White Paper. Big Data & Analytics Reference Architecture. September 2013.
5. Hewlett Packard. Big Data Platform. http://www8.hp.com/il/en/software-solutions/big-data-platform-haven/index.html
6. Andrea Mostosi. The Big-Data Ecosystem Table. http://bigdata.andreamostosi.name/
7. Ericsson White Paper. Big Data Analytics. August 2013.
8. Stephan Jou. Towards a Big Data Behavioral Analytics Platform. ISSA Journal, August 2014.
9. Jimeng Sun, Chandan K. Reddy. Big Data Analytics for Healthcare. SIAM International Conference on Data Mining, Austin, TX, 2013.
10. Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz. Big Data, Analytics and the Path From Insights to Value. MIT Sloan Management Review.
11. Cloud Security Alliance. Big Data Analytics for Security Intelligence. September 2013.
12. Rita L. Sallam, Joao Tapadinhas, Josh Parenteau, Daniel Yuen, Bill Hostmann. Magic Quadrant for Business Intelligence and Analytics Platforms. Gartner, 20 February 2014.