Big Data Analytical Platform (BDAP) - Final Project

ID | Name | Responsible for application | Responsible for paper review
313734436 | ‫ויבורוב מיכאל‬ | Apache Spark | 3D Printing
200600583 | ‫בורדייניק יניב‬ | Apache Storm |
200240588 | ‫פיראס אבו ג'בל‬ | HPCC | Identifying
200904399 | ‫ג'וזף עון‬ | Google BigQuery | Interactive Code
1 About the Domain
1.1 Background About The Domain
Big Data Analytics Platforms
Big Data concerns massive, heterogeneous, autonomous sources with
distributed and decentralized control. These characteristics make it an
extreme challenge for organizations using traditional data management
mechanisms to store and process these huge data sets.
Big data analytics is the process of examining big data to uncover hidden
patterns, unknown correlations and other useful information that can be used
to make better decisions. With big data analytics, data scientists can analyze
huge volumes of data that conventional analytics and business intelligence
solutions can't grasp.
Big-data platforms and big-data analytics software focus on providing
efficient analytics for extremely large data sets. These analytics help
organizations gain insight by turning data into high-quality information,
providing deeper insights about the business situation. This enables the
business to take advantage of the digital universe.
This work is intended to describe software platforms and tools available today
to support an endeavor to discover hidden knowledge in the big data
paradigm. Such analytical findings are expected to lead to new business and
science opportunities.
• The Benefits of Big Data Analytics
❖ Enterprises are increasingly looking to find actionable insights into
their data. Many big data projects originate from the need to answer
specific business questions. With the right big data analytics platforms
in place, an enterprise can boost sales, increase efficiency, and
improve operations, customer service and risk management.
• The Challenges of Big Data Analytics
❖ For most organizations, big data analysis is a challenge. Consider the
sheer volume of data and the different formats of the data (both
structured and unstructured data) that is collected across the entire
organization and the many different ways different types of data can
be combined, contrasted and analyzed to find patterns and other
useful business information.
❖ The first challenge is in breaking down data silos to access all data an
organization stores in different places and often in different systems. A
second big data challenge is in creating platforms that can pull in
unstructured data as easily as structured data. This massive volume
of data is typically so large that it's difficult to process using traditional
database and software methods.
• Big Data Requires High-Performance Analytics
❖ To analyze such a large volume of data, big data analytics is typically
performed using specialized software tools and applications for
predictive analytics, data mining, text mining, forecasting and data
optimization. Collectively these processes are separate but highly
integrated functions of high-performance analytics. Using big data
tools and software enables an organization to process extremely large
volumes of data that a business has collected to determine which data
is relevant and can be analyzed to drive better business decisions in
the future.
1.2 Reviews From The Literature
1.2.1 Terms
Big Data: Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured, time sensitive or simply very large cannot be processed by relational database engines.

Business Intelligence: Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes, using computing technologies for the identification, discovery and analysis of business data such as sales revenue, products, costs and incomes.

Analytics: The process of collecting, processing and analyzing data to generate insights that inform fact-based decision-making. In many cases it involves software-based analysis using algorithms. Big data analytics visualization is a visual representation of the insights gained from the analysis.

Visualization: Big data visualization refers to the implementation of more contemporary visualization techniques to illustrate the relationships within data. Visualization tactics include applications that can display real-time changes and more illustrative graphics, thus going beyond pie, bar and other charts. These illustrations veer away from the use of hundreds of rows, columns and attributes toward a more artistic visual representation of the data.

Data Mining: Computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Data Scientist: Term used to describe an expert in extracting insights and value from data. It is usually someone who has skills in analytics, computer science, mathematics, statistics, creativity, data visualisation and communication as well as business and strategy.

Distributed Computing: A software system in which components located on networked computers communicate and coordinate their actions by passing messages.

Distributed File System: Data storage system designed to store large volumes of data across multiple storage devices (often cloud-based commodity servers), to decrease the cost and complexity of storing large amounts of data.

Computer Cluster: A computer cluster consists of a loosely or tightly connected set of computers that work together so they can be viewed as a single system. Each node (computer) is set to perform the same task, controlled and scheduled by software.

Cluster: A special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Clusters are known for boosting the speed of data analysis applications. They are also highly scalable.

Data Warehouse: A data warehouse (DW) is a collection of corporate information and data derived from operational systems and external data sources. A data warehouse is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.

Analytics Platform: Software that provides the tools and computational power needed to build and perform many different analytical queries.

Anonymization: Data anonymization is the process of destroying tracks, or the electronic trail, on the data that would lead an eavesdropper to its origins. An electronic trail is the information that is left behind when someone sends data over a network.

Concurrency: The ability to execute multiple processes at the same time.

Data Analysis: Process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.

Data Parallelization: This form of parallelism focuses on the distribution of data sets across multiple computation programs. The same operations are performed by different parallel computing processors on the distributed data subsets.

Task Parallelization: This form of parallelism covers the execution of computer programs across multiple processors on the same or multiple machines. It focuses on executing different operations in parallel to fully utilize the available computing resources in the form of processors and memory.

Resilient Distributed Datasets: A technique to distribute data between computer clusters in a manner that supports fault tolerance.

Fault Tolerance: Fault-tolerant describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service. Fault tolerance can be provided with software, embedded in hardware, or provided by some combination.

Batch Processing: Batch processing is a general term used for frequently used programs that are executed with minimum human interaction. Batch process jobs can run without any end-user interaction or can be scheduled to start up on their own as resources permit.

Structured vs. Unstructured Data: Structured data is basically anything that can be put into a table and organized in such a way that it relates to other data in the same table. Unstructured data is everything that can't: email messages, social media posts and recorded human speech.

Map-Reduce: Refers to the software procedure of breaking up an analysis into pieces that can be distributed across different computers in different locations. It first distributes the analysis (map) and then collects the results back into one report (reduce).

Hadoop: Apache Hadoop is one of the most widely used software frameworks in big data. It is a collection of programs which allow storage, retrieval and analysis of very large data sets using distributed hardware (allowing the data to be spread across many smaller storage devices rather than one very large one).
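The Map-Reduce entry above can be made concrete with a minimal, framework-free Python sketch (a toy illustration only, not the API of Hadoop or of any platform discussed later): the map step turns each input fragment into intermediate key/value pairs, and the reduce step aggregates all values that share a key.

from collections import defaultdict
from typing import Iterable, Tuple

def map_phase(document: str) -> Iterable[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> dict:
    """Reduce: group pairs by key and aggregate the values (here: sum the counts)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    # Each document could live on a different node; the map step is embarrassingly parallel.
    documents = ["big data needs big tools", "data tools for big data"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(intermediate))  # e.g. {'big': 3, 'data': 3, ...}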
1.2.2 Domain Ontology
Business Intelligence And Analytics (BI & A): From Big Data To Big Impact.
In this research we can see how Business Intelligence & Analytics impacts the
data-related problems to be solved in business organizations.
A data-centric approach, BI & A has its roots in the longstanding database
management field.
It relies on various data collection, extraction and analysis technologies.
The relation between BI & A and Big Data is now stronger than ever.
Many organizations have realized that they need to be prepared for the transformation
from normal data management to big data management, which means that the
organization's platforms need to be ready for it.
Most of the platforms share the same idea: build a job for the data streaming
process, divide the data between the nodes and afterwards execute the relevant
queries on the new data.
● http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf
● http://www.informationweek.com/big-data/big-data-analytics/16-top-bigdata-analytics-platforms/d/d-id/1113609
1.2.3 Literature review
For many years companies and organizations used traditional data warehouses to
analyze business activities and improve their decision-making processes [3, 4, 5, 7].
In recent times many new complex types of data have emerged, and the rate at which
much of the data is being created forces organizations to turn to advanced
techniques for processing the data, like cleansing, pre-processing, job parallelization
and so on [1, 2, 11]. Generally, it became apparent that, in order to continue producing
business insights from the data organizations keep gathering at an increasing rate, the
need for advanced tools for data analysis became more and more pressing. This
need is answered by the emergence of a new form of data analysis systems - Big Data systems
[3, 11]. Big Data is therefore a term associated with the new types of workloads and
underlying technologies needed to solve business problems that we could not
previously support due to technology limitations, prohibitive cost or both. [3]
Naturally, organizations investing in these novel and promising analytics techniques
expect to harness the benefits and see data-driven insights across all levels of the
organization. For example, using GPS-enabled navigation one can expect to get real-time traffic updates, route suggestions and so on. [10]
Three major aspects of Analytics Architecture for Big Data systems are recognized:
Unified Information Management, Real-Time Analytics and Intelligent Processes. [3,
4, 5, 7]. These aspects are handled by different systems from different vendors,
including Open-Source organizations [6]. These aspects may be more or less
important in different fields of application of the Big Data Analytics system. For
example, a behavioral threat detection system may put a special accent on the
real-time analytics aspect [8], while a healthcare analytical system may see free-structured data performance as a very important aspect of the system [9], and a security
analytics system may require a very strong suite for the information management part [11].
Valuable capabilities of Big Data Analytical platforms are noted and defined [12], e.g.:
Information Delivery, Analysis and Integration. Existing tools are compared and
evaluated. Interestingly enough, one of the major challenges for the commercial tools
is that many of them are limited to specific use cases [12].
1.3 Domain Boundaries
In reality Big Data Analytics platforms are used in different areas for different tasks.
Main areas are:
● Software solution for managing data storage, manipulation and retrieval
tasks.
● Frameworks, platforms and tools that facilitate development and execution of
analytics processes that bear business value.
● Various techniques, libraries, software products that facilitate data mining
processes.
● Reporting and visualization tools and applications.
It is important to note that all four of these areas are interconnected. Many industry-known solutions encapsulate several or all of them. Our focus, however, is the
second bullet - Big Data Analytics Platforms.
The applications from this field are designed to facilitate analytical tasks and
techniques by creating an abstraction layer between the developer and the
underlying robust software and hardware infrastructure that is capable of handling
resource-consuming analysis tasks in the Big Data paradigm.
We are not going to allocate considerable attention to issues of storing and
accessing big data, visualization techniques and solutions, or particular data
mining techniques applied to the big data realm. Our focus will instead be on big data
analytics, which aims to uncover meaningful patterns in data.
1.3.1 Applications Within The Domain
Application | Description | Link | Open Source?
HPCC | High Performance Computing Cluster: a massively parallel-processing computing platform that solves Big Data problems. | http://hpccsystems.com/ | Y
Google BigQuery | BigQuery is a RESTful web service that enables interactive analysis of massively large data sets, working in conjunction with Google Storage. | https://cloud.google.com/ | N
Spark | Cluster computing framework well suited to machine learning algorithms. Includes several components to handle task management, data access and more. | https://spark.apache.org/ | Y
Storm | Distributed real-time computation system aimed at easy and reliable processing of unbounded streams of data in real time. | http://storm.apache.org/ | Y
1.3.2 Relevant Applications Outside The Domain
Application | Description | Why Outside?
QlikView | Self-service data visualization and guided analytics that lead to insights that ignite good ideas. | The complexity of QlikView depends on the amount of data that the tool is trying to analyze. Oriented to BI.
Microsoft Excel | Empowers users to gain actionable insights with data from many sources. Excel supports technologies such as Power Query, Power Pivot and Power View to perform dynamic analysis on the combined data set. | Many users question Excel's ability to handle Big Data analytics.
InfiniDB | Scalable, software-only columnar database management system for analytic applications. | InfiniDB is an RDBMS and, in this category, does not support Big Data analytics.
Apache Cassandra | Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. | Cassandra does not support large scalability, although Apache has claimed that it will support Big Data in the future.
HIVE | Hive is data warehouse software that facilitates querying and managing large data sets residing in distributed storage. | HIVE is data warehouse software and represents only a part of a whole Big Data system. The data warehouse only hosts the data, while the analysis is done by a different tool.
Weka | Weka is a collection of machine learning algorithms for data mining tasks. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. | Weka is a machine learning tool for data mining that may offer some visualization, but it is not an analytics platform.
Gephi | Gephi is an open source software for network visualization and analysis. | Gephi does not support Big Data volumes and is not oriented towards running data processing.
Sisense | Business intelligence that enables non-technical users to join and analyze large data sets from multiple sources (Sisense was founded in 2004 in Tel Aviv). | Sisense is not a Big Data framework, but software for making data analysis and visualizations about the company.
1.4 Similarities And Differences
1.4.1 Similarities Between The Applications
1. Data Access provider:
Any platform aimed at data analysis must be equipped with a module that is
able to access the data. At least one of two options is used:
· Stream access.
· Persistent data access.
2. Process/Task management facility:
Parallelization of processing; the frameworks/platforms are required to provide at
least one of the following (see the sketch after this list):
· Data parallelization management.
· Task parallelization management.
3. Analytics design:
Provide a way for designing a job (query) for processing the data, using
external tools, a built-in facility or both. This job must be defined using one of
the following methods:
· Batch execution.
· Stream processing.
· Interactive execution.
4. Environment management:
Provide the ability to manage cluster and node configurations.
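The sketch referenced in point 2 above: a toy illustration of the two parallelization styles using only the Python standard library. Real platforms distribute this work across cluster nodes rather than local processes; the functions and data here are invented for illustration.

from concurrent.futures import ProcessPoolExecutor

def clean(partition):      # same operation, applied to every data partition
    return [x for x in partition if x is not None]

def profile(data):         # a different operation entirely
    return {"rows": len(data)}

if __name__ == "__main__":
    partitions = [[1, None, 2], [3, 4, None], [None, 5, 6]]

    with ProcessPoolExecutor() as pool:
        # Data parallelism: the same 'clean' step runs on each partition in parallel.
        cleaned = list(pool.map(clean, partitions))

        # Task parallelism: unrelated steps run side by side on different workers.
        f1 = pool.submit(clean, partitions[0])
        f2 = pool.submit(profile, partitions[1])
        print(cleaned, f1.result(), f2.result())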
1.4.2 Differences Between The Applications
1. Reporting facility
A reporting facility is optional for the area we are discussing.
2. Different methods of data access
Several methods for accessing the data may be supported by different
tools. Platforms that support persistent data access are required to support at
least one form: file-system or SQL-like access methods.
In-memory data access and Map-Reduce functionality are optional
features in the domain of Big Data Analytics Platforms.
3. Clustering
Clustered organization of platform nodes is optional.
4. External tools support
Different approaches to incorporating external tools.
5. Environment requirements differ from one application to another.
6. Different interfaces; some applications are web applications while others are
desktop applications.
1.4.3 Feature Diagram
2 Domain Model
2.1 Mandatory vs. Optional First Order Elements
Element | Diagram | Mandatory/Optional | Many/Single | Explanation
Cluster | Class Diagram | Optional | Many | The central logical part of the environment.
Job | Class Diagram | Mandatory | Many | The artifact of the developer's work.
Chunk | Class Diagram | Optional | Many | Part of a job. Optional because not all of the platforms may support job splitting.
Node | Class Diagram | Mandatory | Many | Integral part of the environment. Responsible for processing the jobs/chunks.
Data Storage | Class Diagram | Mandatory | Many | Part of the system that is responsible for providing/writing data to different data sources.
Admin | Class Diagram | Mandatory | Single | User entity responsible for system maintenance.
Developer | Class Diagram | Mandatory | Single | User entity responsible for creating and executing jobs.
User Management | Use Case | Optional | Single | Part of the maintenance procedure in the system. Optional because some systems may rely on external user management.
Manage Environment | Use Case | Mandatory | Single | General maintenance procedures.
Monitoring/Reporting | Use Case | Optional | Many | Support activity. Optional because this is not a core purpose of the system and may be delegated to external tools.
Data Storage | Use Case | Mandatory | Many | Utilization of data access. Provided as a core service.
Design Job | Use Case | Mandatory | Many | Core activity by the developers.
Execute Job | Use Case | Mandatory | Many | Also a core activity by the developers.
Allocate Resources | Use Case | Mandatory | | Internal activity. Service provided by the system.
Config Job | Use Case | Mandatory | | Activity by the developer. Responsible for describing the environment for job execution.
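As an informal companion to the table, the first-order elements could be sketched as plain Python classes. The names and the mandatory/optional roles follow the table above; the fields and method bodies are illustrative assumptions only and do not belong to any of the surveyed platforms.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:                 # optional element: a splittable piece of a Job
    payload: str

@dataclass
class Job:                   # mandatory: the artifact the Developer creates
    name: str
    chunks: List[Chunk] = field(default_factory=list)

@dataclass
class Node:                  # mandatory: processes jobs/chunks
    host: str
    def process(self, work: object) -> str:
        return f"{self.host} processed {work}"

@dataclass
class Cluster:               # optional: groups many Nodes
    nodes: List[Node] = field(default_factory=list)

@dataclass
class DataStorage:           # mandatory: provides/writes data for jobs
    uri: str

@dataclass
class Developer:             # mandatory, single: designs and executes jobs
    name: str
    def design_job(self, name: str) -> Job:
        return Job(name)

if __name__ == "__main__":
    dev = Developer("analyst")
    job = dev.design_job("daily-aggregation")
    cluster = Cluster(nodes=[Node("node-1"), Node("node-2")])
    print(cluster.nodes[0].process(job))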
2.2 Variation Points & Variants At First Degree Elements

Variation point name | Diagram (name + type) | Variation point characteristics | Variation options and rules for applying them
Data Access | Use Case | Different systems support different data sources | Persistent, Memory, Stream
Persistent | Use Case | May not be supported in some systems | Map Reduce, Database, File System
Memory | Class Diagram | May not be supported in some systems | Resilient
Infrastructure Architecture | Use Case, Class Diagram | Different systems may have different infrastructure | Single Server, Clusters
Node Roles | | Different systems may handle different types of nodes | Role Dedicated, Equal Roles
Management | Use Case, Class Diagram | The sort and level of management may differ between the systems | Users Management, Servers Management
Language | Use Case | System dependent | Internal Language, External Language
External Language | Use Case | System dependent | Scala, JavaScript, R, Java
Execution Method | Use Case | System dependent | Stream, Batch, Interactive
Monitoring | Use Case, Class Diagram | System dependent | Reports, Graphs
2.3 Domain Models
2.4 Principal Use Case diagram for the BDAP domain
2.5 Principal Class diagram for the BDAP domain
2.6 Principal Sequence diagram for the BDAP domain
3 Applications
3.1 Apache Spark
3.1.1 Apache Spark Background
Apache Spark is a cluster computing framework designed to meet the
performance requirements of the Big Data realm. Apache Spark is built
on top of cluster management and distributed storage systems and is
able to interface with several existing infrastructure platforms such as
Hadoop, Cassandra and S3.
3.1.2 Apache Spark Requirements
● RAM: minimum 8 GB per server.
● CPU: Spark is a highly scalable system; 8-16 cores per server are recommended.
● Local disk space: 4-8 disks per node are recommended.
● External storage systems: HDFS/DB.
● Network communication between the components: 10 Gbps is recommended.
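For a feel of how a job is expressed on this platform, here is a minimal PySpark batch job. It assumes the pyspark package and a reachable Spark installation; the file name and column name are placeholders, not part of the original project.

# A minimal PySpark batch job (file and column names are illustrative only).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bdap-demo")
         .master("local[*]")          # use a cluster URL instead of local[*] in production
         .getOrCreate())

# Read a structured data set and run a simple aggregation in parallel.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
(df.groupBy("event_type")
   .count()
   .orderBy("count", ascending=False)
   .show())

spark.stop()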
3.1.3 Apache Spark Models
Legend for reading color coding:
3.1.3.1 Principal Use Case diagram for Apache Spark
3.1.3.2 Principal Class Diagram for Apache Spark
3.1.3.3 Principal Sequence diagram for Apache Spark running a non-interactive job
3.2 Apache Storm
3.2.1 Storm Background
Storm is a distributed real-time computation system for processing large volumes of
high-velocity data.
Storm is powerful for scenarios requiring real-time analytics, machine learning and
continuous monitoring of operations. Some specific new business opportunities
include real-time customer service management, data monetization, operational
dashboards, and cyber security analytics and threat detection.
3.2.2 Storm Requirements
● Internet connection.
● Linux machine.
● Java 6 installed on the machine.
● Python 2.6 installed on the machine.
● Storm cluster installation (this is where the actual topology runs).
● Storm client installation (required for topology management).
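To convey the spout/bolt topology idea without the full Java setup, here is a plain-Python sketch. It is not the Storm API (Storm topologies are normally written in Java or via its multi-lang protocol); it only mimics the data flow of a topology, where an unbounded source feeds transformation and aggregation steps.

import itertools
import random
from collections import Counter

def sentence_spout():
    """Spout: pretend-endless source of raw events (here: random sentences)."""
    samples = ["storm processes streams", "streams never end", "storm is real time"]
    while True:
        yield random.choice(samples)

def split_bolt(sentences):
    """Bolt: split each sentence tuple into word tuples."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words, report_every=10):
    """Bolt: keep a running word count and emit periodic snapshots."""
    counts = Counter()
    for i, word in enumerate(words, start=1):
        counts[word] += 1
        if i % report_every == 0:
            yield dict(counts)

if __name__ == "__main__":
    # Wire the "topology": spout -> split bolt -> count bolt, take 3 snapshots.
    stream = count_bolt(split_bolt(sentence_spout()))
    for snapshot in itertools.islice(stream, 3):
        print(snapshot)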
3.2.3 Storm Models
Legend for reading color coding:
3.2.3.1 Storm Use Case Diagram
3.2.3.2 Storm Class Diagram
3.2.3.3 Storm Sequence Diagram
3.3 HPCC
3.3.1 HPCC Background
HPCC (High Performance Computing Cluster) is an open-source, data-intensive
computing system platform developed by LexisNexis Risk Solutions. It stores and
processes large quantities of data, processing billions of records per second using
massively parallel processing technology.
Large amounts of data across disparate data sources can be accessed, analyzed
and manipulated in fractions of seconds. HPCC functions as both a processing and a
distributed data storage environment, capable of analyzing terabytes of information.
The HPCC platform includes system configurations to support both parallel batch
data processing (Thor) and high-performance online query applications using
indexed data files (Roxie). The HPCC platform also includes a data-centric
declarative programming language for parallel data processing, called ECL.
3.3.2 HPCC Requirements
● Enables massive amounts of data across disparate databases to be accessed, analyzed and manipulated in fractions of seconds.
● Functions as both a processing and distributed data storage environment.
● The HPCC platform is designed to work on simple commodity hardware using a simple commodity operating system.
● Plugs into any computing language.
● Works over the Internet or over a private network.
● Operates on either distributed or centralized systems.
3.3.3 HPCC Models
Legend for reading color coding:
3.3.3.1 HPCC Use Case Diagram
3.3.3.2 HPCC Class Diagram
3.3.3.3 HPCC Sequence Diagram
3.4 Google BigQuery
3.4.1 Background
Querying massive datasets can be time consuming and expensive without the
right hardware and infrastructure. Google BigQuery solves this problem by
enabling super-fast, SQL-like queries against append-only tables, using the
processing power of Google's infrastructure. Simply move your data into
BigQuery and let Google handle the hard work. You can control access to both the
project and your data based on your business needs, such as giving others the
ability to view or query your data.
3.4.2 Requirements
These requirements were required in the Python Codelab and are also required for this lab:
● A laptop/notebook computer
● Python 2.7.x
● App Engine SDK: https://developers.google.com/appengine/downloads
● Development environment: an IDE or a text editor and command-shell pair
In addition, this BigQuery Dashboard Codelab requires:
● Access to the command line (shell) on your computer.
● A valid Google Account.
● Access to the BigQuery service. You can sign up for BigQuery using these instructions.
● The gviz_data_table library: https://pypi.python.org/pypi/gviz_data_table/
● Google APIs Client Library for Python: https://github.com/google/google-api-python-client/
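The codelab above targets the older google-api-python-client library; as a rough, hedged illustration of the same idea with the current google-cloud-bigquery client, a SQL query against a public sample table can be run in a few lines. The project name below is a placeholder, and credentials are assumed to be configured in the environment.

# Rough sketch using the google-cloud-bigquery client (pip install google-cloud-bigquery).
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # placeholder project id

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

# Submit the query job and iterate over the result rows.
for row in client.query(query).result():
    print(row.word, row.total)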
3.4.3 BigQuery Models
3.4.3.1 Use Case Diagram
3.4.3.2 BigQuery Class Diagram
3.4.3.3 BigQuery Sequence Diagram
3.5 Differences Between The Applications
Variance OVM
Criteria | Apache Spark | Apache Storm | HPCC | BigQuery
Execution Method | All available options | All available options | |
Built-in Tools | support | support | |
External Tools | support | support | |
Environment Management | support | support | |
User Management | no support | support | |
Task Parallel | support | support | |
Data Parallel | support | support | |
Fault Tolerance | support | support | |
Monitoring/Reporting | no support | support | |
Stream Data access | support | support | |
Persistent data access | support | support | |
Map Reduce | support | support | |
In-Memory resilience | support | support | |
4 Papers reviewed and their relations to the Domain
4.1 3D Printing and Big Data Analytics Platforms
Papers reviewed:
● Mathieu Acher, Benoit Baudry, Olivier Barais and Jean-Marc Jézéquel. Customization and 3D Printing: A Challenging Playground for Software Product Lines.
● Thomas Thüm, Sven Apel, Christian Kästner, Ina Schaefer and Gunter Saake. A Classification and Survey of Analysis Strategies for Software Product Lines.
After being around for more than a decade, 3D printing (also called additive
manufacturing) technologies have become more and more popular and widely
available to the general public. It is used in a vast variety of areas such as archaeology,
industrial prototyping and design, medicine, rapid manufacturing and many more.
It is also speculated that the next evolution of 3D printing will be "customization
for the masses". By printing one product at a time, 3D printing provides the
means and tools to fit the customer's identity and preferences.
The research focus of the paper is Thingiverse, the most popular website and
virtual community for sharing user-created 3D printable design files. This
community encompasses the methods for creating, controlling and managing digital
models for 3D-printable objects. Currently Thingiverse consists of more than
160,000 members and stores more than 200,000 downloadable digital designs.
Thingiverse employs an SPLE-like approach to the 3D-printable objects. The
objects are presented in the form of design files; they can be customized and
combined in order to fit any particular set of requirements.
We have found the following SPLE characteristics and properties to be salient in
the 3D printing paradigm:
● Customization and variation management
● Focus on the means of efficiently designing and maintaining similar products
● Mastering base customization, with heavy reliance on software technologies
● Design analysis: Type checking, Static analysis, Model checking and Theorem proving
● SPL strategy for design specification and variability management: Domain wide, Family wide, Product based and Family based
BDAP and 3D Printing in SPLE perspective
Platforms for Big Data analytics are very complex systems that often require very
strong underlying hardware. Numerous existing frameworks, tools and applications
are utilized as components in the aggregative Big Data Analytical system. The
target audience of such systems is data analysts who use a variety of
interconnected data-crunching services and very often do not have the expertise to
install and maintain the building blocks of such systems. This encourages the
market of BDAP systems to support high customizability and modular
architecture through a very simple and intuitive interface. For example, Amazon
cloud services allow the subscriber to create a new analytical cluster, configure
software components and underlying hardware, and run it in less than 5 minutes
of work. This paradigm is made very beneficial by the means and techniques
described by advanced SPL Engineering, and it is quite similar to the situation
that we observe in the Thingiverse community: a variety of 3D objects is available
for customization, and composing them into a desirable architecture for printing is performed
in much the same manner as a variety of software and hardware components is available to
be customized and interconnected into one complex ready-to-go system.
Reviewing 3D printable object design through the lens of SPLE helped us to
see some abstraction of the SPLE approach and gave us practical tools for the later
investigation of the BDAP domain. Decomposing a complex BDAP system into
functional modules and discussing each of them in relation to the others was
facilitated by the previously learned application of SPLE principles.
4.2 Defining In-active Code – Overview
Software companies provide several products to several companies; however,
they also sometimes provide the same product to several companies, but with
functionality specific to each company. Basically this product has the same basic
functionality; however, differences might appear in the specific requirements of
each customer. In order to handle this variability problem, software companies
develop automated approaches to manage the common functionality of a
software product. They then refine the source code to develop a solution for a
particular customer.
The approach presented in this paper to solve the variability issue in software
product lines comprises three steps:
1. Computing the system dependence graph (SDG): the SDG is a graph
consisting of nodes and edges which together show the program
elements and the different dependencies between them. The approach in
this paper relies on the SDG.
2. Extracting presence conditions for the conditional system dependence
graph (CSDG): adding conditions on the SDG dependencies to describe
the program variation points; usually a condition is a Boolean
formula over configuration options.
3. Identifying inactive code: finally, when the CSDG is ready, the
identification of the inactive code for a specific product variant is
accomplished using the CSDG and the product's concrete configuration.
The product configuration assigns values to the condition formulas in the
CSDG, and according to these values active and inactive code can be
identified by a reachability analysis (a small sketch of this idea follows below).
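To make step 3 concrete, below is a small sketch of the reachability idea: given dependence edges annotated with presence conditions over configuration options, everything not reachable from the entry points under a concrete configuration is reported as inactive. The graph, option names and condition encoding are invented for illustration and do not reproduce the paper's implementation.

from collections import deque

def active_nodes(edges, entry_points, configuration):
    """Reachability over a conditional system dependence graph (CSDG).

    edges: list of (source, target, presence_condition) where the condition is a
           function of the configuration dict (True means the edge exists).
    Returns the set of nodes reachable from the entry points; everything else
    is considered inactive code for this configuration.
    """
    reachable = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for src, dst, condition in edges:
            if src == node and condition(configuration) and dst not in reachable:
                reachable.add(dst)
                queue.append(dst)
    return reachable

if __name__ == "__main__":
    # Toy program: main calls heating code only if the 'HEATING' option is on.
    edges = [
        ("main", "init",        lambda cfg: True),
        ("main", "heating_ctl", lambda cfg: cfg.get("HEATING", False)),
        ("heating_ctl", "pid",  lambda cfg: True),
    ]
    all_nodes = {"main", "init", "heating_ctl", "pid"}
    active = active_nodes(edges, ["main"], {"HEATING": False})
    print("inactive:", all_nodes - active)   # {'heating_ctl', 'pid'}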
In order to evaluate the approach they checked effectiveness, accuracy, and
performance. The evaluation was performed using the approach as implemented
and customized for Keba’s KePlast product line.
Effectiveness:
It has been shown that the cost of code maintenance increases with source code
size and complexity. The goal is therefore to evaluate to what extent the approach can
ease maintenance tasks for application engineers by reducing code size.
To check the approach, it was used on the code of two different product variants.
Results show that there is a significant amount of inactive code for most of the
configuration options; the inactive code also scatters over several files for a
majority of the configuration options. Therefore, the inactive code will not be
obvious to a developer when performing a maintenance task.
Accuracy:
The approach's accuracy was evaluated regarding the identification of inactive code for
configuration options. The approach's results were compared to a domain expert's
results. According to the comparison, the approach has a very high level of
accuracy and is sometimes better than the domain expert (in cases where the code
is scattered across many files). However, the domain expert can identify irrelevant
code which the approach cannot, e.g. assignments to variables which are never
read; the expert also removed declarations when the declared element was not used
anymore.
Performance:
The run-time performance of the approach's implementation was evaluated. The
measurements were made for 15 different product variants. Performance
measurements were conducted to determine the times needed to parse the
source code, to build the CSDG, to extract the presence conditions and to
perform the configuration-based pruning.
The results of the evaluation show that the performance of the configuration-aware analysis is sufficient for application in industrial settings.
In conclusion, how does all this relate to the course and to variability
implementation?
We presented an approach for identifying inactive code in product variants of a
product line using a code analysis technique.
In industrial product lines customer-specific products are frequently developed in
a two-stage process: first the required features are selected to create an initial
product.
Then the code of the initial product is refined and adapted in a clone-and-own
way to address specific customer requirements.
The technique allows hiding inactive code in a product configuration to support
application engineers. It is based on a system dependence graph (SDG)
encoding all the global control and data dependencies in a program.
Furthermore, the SDG is configuration-aware, i.e., representing the variability of
the system.
The approach has currently been implemented for a programming language used
in the domain of industrial automation.
The approach also supports configuration parameters and branch statements in
code – two widely used mechanisms for implementing variability in product lines.
This work is regarded as a first step towards comprehensive configuration-aware
code analysis technology for product lines.
The conditional SDG represents a strong basis for implementing diverse analysis
techniques.
Furthermore, the KePlast product line comes with a Java-based HMI framework,
where analogous variability mechanisms are used.
5. Process and Conclusions
The work was conducted in three phases:
Phase I: Select Domain
In this phase we came to an agreement on one particular domain (Big
Data Analytical Platforms). We then defined the domain boundaries and found
systems that are related to the domain and those that are not. We
discussed particular examples of the systems that border the domain.
Phase II: Describe Particular Systems
During the second phase we worked on investigating four different systems
in our domain, extracting related features and describing the
systems with UML diagrams.
Phase III: Describe Domain
In the last phase we modeled the domain of Big Data Analytical Platforms
based on the descriptions of the four different systems from the previous phase. We
applied the domain models to the UML design of each of the four
systems and highlighted common, optional and system-specific
features and elements. We used the OVM modeling methodology for
describing the variability allowed inside the domain.
Conclusions
During the work on this course we experienced an iterative process of
modeling a software domain. The process began from discovering the
domain, refining the domain borders, describing in detail particular
participants of the domain and finally gathering all the information
together into a domain model.
A high-level review of this endeavor allows us to point out several
key points relevant to the academic course:
● The iterative nature of this process forces one to review one's own
earlier conclusions against newly learned knowledge over and over
again.
● Precision in software domain definition is an important and
non-trivial task for the software engineer. This task must be completed
in the very early stages of the software design and planning
process.
● During our work we learned the use of the OVM modeling technique.
6. Bibliography
6.1 Bibliography For Definitions and Applications
● What is Business Intelligence? Retrieved from http://searchdatamanagement.techtarget.com/definition/business-intelligence
● What is Distributed Computing? Retrieved from https://en.wikipedia.org/wiki/Distributed_computing
● What is a Computer Cluster? Retrieved from http://www.techopedia.com/definition/6581/computer-cluster
● What is MapReduce? Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
● What is Hadoop? Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/
● Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
● Apache Storm application details. Retrieved from http://hortonworks.com/hadoop/storm/
● What is Apache Storm? Retrieved from http://en.wikipedia.org/wiki/Storm_(event_processor)
● Apache Spark application details. Retrieved from https://spark.apache.org/
● What is Apache Spark? Retrieved from http://en.wikipedia.org/wiki/Apache_Spark
● Apache Spark details. Retrieved from https://databricks.com/spark/about
● Features in Apache Spark. Retrieved from http://java.dzone.com/articles/6-sparkling-features-apache
● HPCC application details. Retrieved from http://hpccsystems.com/
● HPCC Data Profiling Demo: http://hpccsystems.com/demos/data-profiling-demo
6.2 Bibliography For The Domain
● Hsinchun Chen, Roger H.L. Chiang & Veda C. Storey (2012). Business Intelligence and Analytics: From Big Data to Big Impact. Retrieved from http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf
● Feature model to orthogonal variability model transformation towards interoperability between tools. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.188.8846&rep=rep1&type=pdf
● http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
● http://www.sas.com/en_us/insights/analytics/big-data-analytics.html
● http://www.predictiveanalyticstoday.com/bigdata-platforms-bigdata-analytics-software/
● http://www.webopedia.com/TERM/B/big_data_analytics.html
● http://www.smartdatacollective.com/bernardmarr/287086/big-data-22-key-terms-everyone-should-understand
● http://www.techopedia.com
● http://searchdisasterrecovery.techtarget.com
References
1. Jeffery Lucas, Uzma Raja, Rafay Ishfaq. How Clean is Clean Enough? Determining the Most Effective Use of Resources in the Data Cleansing Process. International Conference on Information Systems, Auckland, 2014.
2. Zhang Ruojing, Jayawardene Vimukthi, Indulska Marta, Sadiq Shazia, Zhou Xiaofang. A Data Driven Approach for Discovering Data Quality Requirements. ICIS 2014: 35th International Conference on Information Systems.
3. Mike Ferguson. White paper prepared for IBM: Architecting A Big Data Platform for Analytics. October 2012.
4. An Oracle White Paper, September 2013. Big Data & Analytics Reference Architecture.
5. Hewlett Packard. Big Data Platform. http://www8.hp.com/il/en/software-solutions/big-data-platform-haven/index.html
6. Andrea Mostosi. The Big-Data Ecosystem Table. http://bigdata.andreamostosi.name/
7. Ericsson White Paper, August 2013. Big Data Analytics.
8. Stephan Jou. Towards a Big Data Behavioral Analytics Platform. ISSA Journal, August 2014.
9. Jimeng Sun, Chandan K. Reddy. Big Data Analytics for Healthcare. SIAM International Conference on Data Mining, Austin, TX, 2013.
10. Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz. Big Data, Analytics and the Path From Insights to Value. MIT Sloan Management Review.
11. Cloud Security Alliance. Big Data Analytics for Security Intelligence. September 2013.
12. Rita L. Sallam, Joao Tapadinhas, Josh Parenteau, Daniel Yuen, Bill Hostmann. Magic Quadrant for Business Intelligence and Analytics Platforms. Gartner, 20 February 2014.