Extract Transform Load (ETL) Offload
Reduce Growing Data Movement Cost

Ample White Paper
A leading Big Data analytics company
The Need for ETL (Extract Transform Load) Offload
In the EDW (Enterprise Data Warehouse), the ETL (Extract-Transform-Load) architecture is one of the most vital
processes in the gathering and storing of bulk data. The three separate functions of the ETL process are combined
into a single programming tool, as illustrated in Figure 1. Each of the green arrows pointing right represents one of the processing stages of the ETL process, with data originating from either structured or unstructured sources. The top three arrows represent the processing of structured data and the bottom three arrows represent the processing of unstructured data.
Figure 1
Extraction – The first stage of the ETL process extracts data from various disparate source systems. The
data is either structured or unstructured and typically exists in a variety of data types and formats. The
processing pattern followed in this first step is a ‘one-to-one’ function where source data is obtained as
it is, with a minimum of validation rules. Common sources include ERP, CRM and SCM systems; flat files from IMS (Information Management System), VSAM (Virtual Storage Access Method) or ISAM (Indexed Sequential Access Method); weblogs; social media files; and machine logs.
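For illustration only, the minimal Python sketch below shows this one-to-one extraction pattern against a relational table and a delimited flat file. The database path, table name, log file and the 'id' field used for light validation are hypothetical placeholders, not part of any Ample connector.

    # Minimal extraction sketch (illustrative): pull source rows "as is" with
    # only light validation. Paths, table and field names are hypothetical.
    import csv
    import sqlite3

    def extract_from_database(db_path, table):
        """One-to-one extraction from a relational source, no transformation."""
        with sqlite3.connect(db_path) as conn:
            conn.row_factory = sqlite3.Row
            for row in conn.execute(f"SELECT * FROM {table}"):  # table name trusted here
                yield dict(row)

    def extract_from_flat_file(log_path):
        """One-to-one extraction from a delimited flat file (e.g. a web log)."""
        with open(log_path, newline="") as fh:
            for record in csv.DictReader(fh):
                if record.get("id"):      # minimal validation: require an identifier
                    yield record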
Transformation – The second stage of ETL processing is transformation, where many business rules are applied to the extracted data to prepare it for loading into the target system. Enterprise data from multiple source systems is cleaned and transformed for consumption by the many business units in an enterprise.
The key activities in data transformation (a minimal sketch follows this list) are:

• Standardizing cryptic codes into consistent, business-specific definitions
• Calculating and deriving new values from existing values for easy consumption
• Locating and merging data for joining purposes
• Filtering out duplicate data
• Aggregating or summarizing multiple similar rows
• Generating surrogate keys for storing data uniquely in the warehouse platform
• Pivoting and re-arranging multiple columns into multiple rows, or multiple rows into multiple columns
• Validating that the data is correct
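The sketch below illustrates several of these activities, assuming simple order records; the code table, field names and surrogate-key scheme are hypothetical examples rather than the Ample framework's own transformations.

    # Minimal transformation sketch (illustrative). Standardizes a cryptic
    # status code, derives a value, filters duplicates and adds a surrogate key.
    import itertools

    STATUS_CODES = {"A": "ACTIVE", "I": "INACTIVE", "P": "PENDING"}  # cryptic -> business term
    _surrogate = itertools.count(1)                                  # surrogate key sequence

    def transform(records):
        seen = set()
        for rec in records:
            key = (rec["customer_id"], rec["order_date"])
            if key in seen:                  # filter duplicate rows
                continue
            seen.add(key)
            yield {
                **rec,                                                  # keep source fields
                "sk": next(_surrogate),                                 # surrogate key
                "status": STATUS_CODES.get(rec["status"], "UNKNOWN"),   # standardize codes
                "net_amount": rec["gross"] - rec["discount"],           # derive a new value
            }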
Load – The third stage of the ETL process is the load, which transfers data into target systems such as the Data Warehouse (DW) or the Enterprise Data Warehouse (EDW). The load phase enforces the constraints defined in the database schema as data quality checks, including unique values, referential integrity and required values.
There are several patterns for loading data, depending on the needs of the enterprise:

• Overwrite existing data with the extracted cumulative data.
• Update historical data with daily, weekly or monthly extracts.
• Append newly extracted data at more frequent intraday intervals.
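As a rough illustration of these three patterns, the sketch below uses SQLite as a stand-in target; the table layout and column names are hypothetical, and a real warehouse load would also enforce referential integrity.

    # Minimal load sketch (illustrative): overwrite, update or append rows
    # into a stand-in warehouse table, with uniqueness and NOT NULL constraints.
    import sqlite3

    def load(db_path, rows, mode="append"):
        """rows: iterable of dicts with keys sk, status, net_amount."""
        with sqlite3.connect(db_path) as conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS sales (
                                sk INTEGER PRIMARY KEY,   -- unique values
                                status TEXT NOT NULL,     -- required value
                                net_amount REAL)""")
            insert = "INSERT INTO sales (sk, status, net_amount) VALUES (:sk, :status, :net_amount)"
            if mode == "overwrite":
                # Pattern 1: override existing data with the extracted cumulative set.
                conn.execute("DELETE FROM sales")
            elif mode == "update":
                # Pattern 2: update historical rows in place, matched on the key.
                insert = insert.replace("INSERT", "INSERT OR REPLACE", 1)
            # Pattern 3 (default "append"): add newly extracted intraday rows.
            conn.executemany(insert, rows)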
ETL Challenges and Limitations
The support and maintenance of EDW processes have become more complex due to the rapid growth of data sources, which at the current rate doubles almost every year. For example, new data sources may be added to the data warehouse monthly, if not weekly. In addition, the IoT (Internet of Things) is adding a tremendous amount of data to the Internet, and experts estimate that the IoT will consist of almost 50 billion objects by the year 2020. Synchronization is a common problem when data needs to be distributed consistently among several databases. Objects are typically processed in their databases sequentially, and slow database replication is sometimes the primary method of transferring data between databases. Problems arise when the window available to extract and load the data is limited. A further complication is that existing EDW platforms are generally not scalable; consequently, they cannot process incoming data rapidly enough. To summarize, the challenges faced by current ETL platforms are as follows:

• They expend too much time processing high-velocity data.
• They cannot support high volumes of data on existing data platforms.
• They cannot support the variety of data types found in semi-structured and unstructured data.
• Moving a terabyte of data on existing platforms is prohibitively expensive.
Solution: The Ample Big Data ETL Offload Process
Ample Big Data solves the problems inherent in traditional data warehouse technologies by employing a NoSQL (Not Only SQL) database design, so that massive amounts of data can be stored and processed without the need to specify a schema when writing that information. NoSQL is based on the concept of distributed databases, where unstructured data may be stored across multiple processing channels and often across multiple servers. Unlike relational databases, which are highly structured, NoSQL databases are unstructured, trading off stringent consistency requirements for speed and agility. This distributed architecture allows NoSQL databases to scale horizontally: as data volume continues to increase, more hardware can be added to keep up with processing without degrading performance.
The Hadoop software ecosystem enables massively parallel computing. It is an enabler of NoSQL distributed databases, allowing data to be spread across servers with little reduction in performance. Because Hadoop is schema-free, with 'no schema on write' as a key characteristic of the implementation, it can manage structured and unstructured data with ease and aggregate multiple sources to enable deep analysis. Hadoop makes it unnecessary to pre-define the data schema before loading; data can be loaded as is, regardless of whether it has an explicit or implicit structure. After loading, the data can be integrated and then prepared for business consumption through the access layer.
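The following sketch illustrates this 'no schema on write' pattern, assuming a PySpark stack on the Hadoop ecosystem; the HDFS paths and column names are hypothetical.

    # Minimal schema-on-read sketch (illustrative), assuming PySpark on Hadoop.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

    # Land raw, semi-structured events "as is": no schema is declared on write;
    # Spark infers one only when the data is read.
    raw = spark.read.json("hdfs:///landing/clickstream/2024-06-01/")

    # Integrate and prepare the data for the access layer after loading.
    curated = (raw
               .withColumn("event_date", F.to_date("event_ts"))
               .groupBy("event_date", "page")
               .agg(F.count("*").alias("views")))

    curated.write.mode("overwrite").parquet("hdfs:///curated/page_views/")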
Due to the economies of scale that Big Data represents, enterprises no longer need to discard unused
data that may otherwise provide value when later analyzed from multiple perspectives. Ample has built multiple connectors to Big Data and NoSQL platforms such as Hadoop, Cassandra and MongoDB, as well as to SQL Server, for rapid bulk data ingestion. For most enterprises, parallel processing and unstructured data ingestion with Big Data technology can be leveraged to enhance existing ETL capabilities.
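The connectors themselves are Ample-proprietary and not shown here; as a generic illustration of rapid bulk ingestion into a NoSQL store, the sketch below batches inserts through the public pymongo driver, with the connection URI, database name and batch size as hypothetical placeholders.

    # Generic bulk-ingestion sketch (illustrative), not the Ample connector.
    from pymongo import MongoClient

    def bulk_ingest(records, uri="mongodb://localhost:27017", batch_size=10_000):
        coll = MongoClient(uri)["staging"]["raw_events"]
        batch = []
        for rec in records:
            batch.append(rec)
            if len(batch) >= batch_size:
                coll.insert_many(batch, ordered=False)  # unordered inserts parallelize better
                batch = []
        if batch:                                       # flush the final partial batch
            coll.insert_many(batch, ordered=False)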
Ample has built a proven framework called the 'Ample Data Bus' to address the problems that arise during ETL processing. The Ample Data Lake can resolve these problems holistically, or it can work toward a solution step by step, depending on the nature of the problem being analyzed. Ample has benchmarked its Data Lake solutions at every stage of the process: by terabyte utilization, by CPU workload and by I/O limitations. The multiple channels through which customers interact have their own unique
data silos with data sets existing in multiple data structures, data formats and types. Ample has
developed the models necessary for extracting, loading and transforming this varied data, making it
available for analytic processing.
Ample Big Data ETL Offload Benefits
The Ample Big Data framework saves time while delivering more comprehensive and optimized ETL
solutions. Because of the inherent nature of the Big Data platform and the infrastructure used for the
Data Bus process, scaling of these technologies proves to be extremely cost-effective.
• Enterprises no longer need to discard large data sets.
• Business effort spent on logic and analytic functions for ETL jobs is reduced by 70 to 80 percent.
• ETL performance increases forty- to fifty-fold.
• Queries that process corresponding and related data are simpler.
• Data growth is more predictable and consequently more manageable.
• The cost of transferring and processing terabytes of data is reduced.
Conclusion
The advent of Big Data technologies has brought many opportunities and challenges. Among these opportunities, the ETL offload process is one of the most prominent applications in the realm of Big Data. The exponential growth in data volumes and data types is the main reason the Big Data platform is well suited to ETL offloading.
The focus on data-driven strategies has never been greater than it is now, when data ingestion, data processing, data mining and data visualization are integrated using Big Data technologies and environments. It is difficult, if not impossible, to perform this level of analysis on traditional platforms because of non-scalable ETL platform costs, ETL complexity and data modeling complexity. In addition, semi-structured and unstructured data sources pose further technological challenges for traditional ETL platforms. An added challenge is that any solution to these issues must be delivered with quality data oversight and a high degree of data governance.

Why compromise on quality, principles, best practices and utilization when the Ample ETL Framework provides a comprehensive solution for services based on data-driven strategic needs?