Extract Transform Load (ETL) Offload: Reduce Growing Data Movement Cost
Ample White Paper
A leading Big Data analytics company

The Need for ETL (Extract Transform Load) Offload

In the EDW (Enterprise Data Warehouse), the ETL (Extract-Transform-Load) architecture is one of the most vital processes in gathering and storing bulk data. The three separate functions of the ETL process are combined into a single programming tool, as illustrated in Figure 1. Each of the green arrows pointing right represents one of the processing stages of the ETL process, with data originating from either structured or unstructured sources. The top three arrows represent processing of structured data and the three arrows at the bottom represent processing of unstructured data.

Figure 1

Extraction – The first stage of the ETL process extracts data from various disparate source systems. The data is either structured or unstructured and typically exists in a variety of data types and formats. The processing pattern followed in this first step is a 'one-to-one' function where source data is obtained as-is, with a minimum of validation rules. Common sources include ERP, CRM and SCM systems; flat files from IMS (Information Management System), VSAM (Virtual Storage Access Method) or ISAM (Indexed Sequential Access Method); weblogs; social media files; and machine logs.

Transformation – The second stage of ETL processing is transformation, where business rules are applied to the extracted data to ready it for loading into the target stage. Enterprise data from multiple source systems is cleaned and transformed for consumption by many business units in an enterprise.
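The three-stage flow described above can be sketched as a minimal pipeline. This is an illustrative sketch only, not any vendor's implementation; the source systems, record fields and code mappings below are hypothetical:

```python
# Minimal sketch of the three ETL stages over in-memory records.

def extract(sources):
    """Extraction: pull records from disparate sources as-is (one-to-one)."""
    for source in sources:
        yield from source

def transform(records, code_map):
    """Transformation: standardize cryptic codes, derive values, drop duplicates."""
    seen = set()
    for rec in records:
        key = (rec["customer_id"], rec["order_id"])
        if key in seen:                                         # filter duplicate data
            continue
        seen.add(key)
        rec["status"] = code_map.get(rec["status"], "UNKNOWN")  # standardize cryptic codes
        rec["total"] = rec["qty"] * rec["unit_price"]           # derive a new value
        yield rec

def load(records, target):
    """Load: append the cleaned records to the target store."""
    target.extend(records)

# Hypothetical extracts from two source systems (e.g., CRM and ERP);
# the first ERP record duplicates the CRM record.
crm = [{"customer_id": 1, "order_id": 10, "status": "A", "qty": 2, "unit_price": 5.0}]
erp = [{"customer_id": 1, "order_id": 10, "status": "A", "qty": 2, "unit_price": 5.0},
       {"customer_id": 2, "order_id": 11, "status": "X", "qty": 1, "unit_price": 9.0}]

warehouse = []
load(transform(extract([crm, erp]), {"A": "ACTIVE", "X": "CANCELLED"}), warehouse)
```

In a real deployment each function would sit in front of a database connector rather than an in-memory list, but the one-to-one extract, rule-driven transform and append-style load follow the same shape.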
The key activities in data transformation are:

• Standardizing cryptic codes to consistent, business-specific definitions
• Calculating and deriving new values from existing values for easy consumption
• Locating and merging data for joining purposes
• Filtering duplicate data
• Aggregating or summarizing data across multiple similar rows
• Generating surrogate keys for storing data uniquely in the warehouse platform
• Pivoting and re-arranging multiple columns into multiple rows, or multiple rows into multiple columns
• Validating that the data is correct

Load – The third stage of the ETL process is the load, which transfers data into target systems such as the Data Warehouse (DW) or the Enterprise Data Warehouse (EDW). The load phase enforces the database constraints defined in the database schema as data quality constraints, including unique values, referential integrity constraints and required values. There are several different patterns that can be used for loading data, depending on the needs of the enterprise. These patterns are capable of the following:

• They can override existing data with extracted cumulative data.
• They can update extracted historical data from daily, weekly or monthly updates.
• They can add new extracted data from more frequent intraday intervals.

ETL Challenges and Limitations

The support and maintenance of EDW processes have become more complex due to the growth of data sources, which at the current rate doubles almost every year. For example, new data sources may be added to the data warehouse monthly, if not weekly. In addition, the IoT (Internet of Things) is adding a tremendous amount of data to the Internet, and experts estimate that the IoT will consist of almost 50 billion objects by the year 2020. Synchronization is a common problem when data needs to be consistently distributed among several databases.
Currently, objects are processed in their databases sequentially, and sometimes the slow process of database replication may be involved as the primary method of transferring data between databases. Problems arise when the time to extract and load the data is limited. A further complication is that existing EDW platforms are generally not scalable; consequently they cannot process incoming data rapidly enough. To summarize, the challenges faced by current ETLs are as follows:

• They expend too much time when processing high-velocity data.
• They cannot support high volumes of data on existing data platforms.
• They are not able to support the variety of data types derived from semi-structured and unstructured data.
• The cost of moving a terabyte of data is prohibitively expensive on existing platforms.

Solution: The Ample Big Data ETL Offload Process

Ample Big Data solves the problems inherent in traditional data warehouse technologies by employing the NoSQL (Not Only SQL) database design, so that massive amounts of data can be stored and processed without the need to specify a schema when writing that information. NoSQL is based on the concept of distributed databases, where unstructured data may be stored across multiple processing channels, and often across multiple servers. Unlike relational databases, which are highly structured, NoSQL databases are unstructured, trading off stringent consistency requirements for speed and agility. This distributed architecture allows NoSQL databases to be horizontally scalable: as data volume continues to increase, more hardware is added to keep up with processing without slowing performance. The Hadoop software ecosystem allows for massive parallel computing. It is an enabler of NoSQL distributed databases that allows data to be spread across servers with little reduction in performance.
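The horizontal-scaling idea above, spreading records across servers so that capacity grows by adding nodes, can be illustrated with a simple hash-partitioning sketch. The node names and the hash-modulo scheme are illustrative only; production systems such as Cassandra use consistent hashing so that adding a node reshuffles far fewer keys:

```python
import hashlib

def partition_for(key, nodes):
    """Route a record key to one node by hashing it (illustrative hash-modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Three hypothetical storage nodes in a distributed cluster.
nodes = ["node-a", "node-b", "node-c"]

# Each key deterministically lands on exactly one node, so the data set
# is spread across the cluster rather than held on a single server.
placement = {k: partition_for(k, nodes) for k in ["user:1", "user:2", "user:3", "user:4"]}

# Scaling out is adding hardware: with a fourth node, the same routing
# function now spreads the keys across four servers instead of three.
scaled_nodes = nodes + ["node-d"]
```

Because every client computes the same placement from the key alone, no central coordinator sits in the write path, which is what lets throughput grow roughly with the number of nodes.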
Because Hadoop is schema-free (a key characteristic of the implementation is 'no schema on write'), it can manage structured and unstructured data with ease, and aggregate multiple sources to enable deep analysis. Hadoop makes it unnecessary to pre-define the data schema before loading; data can be loaded as-is, regardless of whether it has an explicit or implicit structure. After loading, the data can be integrated and then prepared for business consumption through the access layer. Due to the economies of scale that Big Data represents, enterprises no longer need to discard unused data that may later provide value when analyzed from multiple perspectives.

Ample has built multiple connectors to platforms such as Hadoop, Cassandra, MongoDB and SQL Server for rapid bulk data ingestion. Parallel processing and unstructured data ingestion with Big Data technology can be leveraged to enhance the existing ETL capabilities of most enterprises. Ample has built a proven framework called the 'Ample Data Bus' to address the problems that arise during ETL processing. The Ample Data Lake can resolve these problems holistically, or it can determine a solution in a step-by-step fashion, depending on the nature of the problem being analyzed. Ample has benchmarked its Data Lake solutions at every stage of the process: by terabyte utilization, by CPU workload and by I/O limitations. The multiple channels where customers interact each have their own unique data silos, with data sets existing in multiple data structures, formats and types. Ample has developed the models necessary for extracting, loading and transforming this varied data, making it available for analytic processing.

Ample Big Data ETL Offload Benefits

The Ample Big Data framework saves time while delivering more comprehensive and optimized ETL solutions.
Because of the inherent nature of the Big Data platform and the infrastructure used for the Data Bus process, scaling these technologies proves to be extremely cost-effective:

• Enterprises no longer need to discard large data sets.
• The effort business teams spend on logic and analytic functions for ETL jobs is reduced by 70 to 80 percent.
• ETL performance is increased forty to fifty fold.
• Queries that process corresponding and related data are simpler.
• Data growth is more predictable and consequently more manageable.
• The cost of transferring and processing terabytes of data is decreased.

Conclusion

The advent of Big Data technologies has provided us with many opportunities and challenges. Among these opportunities, the ETL offload process has been applied most prominently in the realm of Big Data. The exponential growth in data volume and in data types is the main reason the Big Data platform is appropriate for ETL offloading. Focus on data-driven strategies has never been as prevalent as it is now, when data ingestion, data processing, data mining and data visualization are integrated using Big Data technologies and environments. It is difficult, if not impossible, to perform this level of analysis on traditional platforms because of non-scalable ETL platform costs, ETL complexities and data modeling complexities. In addition, semi-structured and unstructured data sources pose further technological challenges for traditional ETL platforms. An added challenge is that any solution to these issues must be accomplished with quality data oversight and a high degree of data governance. Why compromise on quality, principles, best practices and utilization when the Ample ETL Framework provides a comprehensive solution for services based on data-driven strategic needs?