The ETL (extract, transform, and load) process is not a rule or a compulsory model for data integration and governance. Rather, it is one of the most effective approaches to extracting data from various sources, transforming that data into a compatible form, and finally loading it into the data warehouse schema. A wide range of ETL tools has emerged alongside the growth of big data and the high demand for business intelligence. In this blog, we will compare several prevalent ETL tools (AWS Glue, Pentaho, Talend, etc.), investigate their pros and cons, and present practical cases for each tool.
What are ETL tools?
Extract-Transform-Load (ETL) tools are specialized tools that are responsible for extracting data from multiple sources, cleansing, transforming, customizing, and importing it to a data warehouse.
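As a minimal illustration of the three stages, the Python sketch below extracts records from an in-memory source, cleanses and normalizes them, and loads them into a target list. The record shapes and field names are hypothetical, not from any particular tool:

```python
# Minimal sketch of the three ETL stages, assuming an iterable of raw
# dicts as the "source" and an in-memory list as the "warehouse".

def extract(source):
    """Pull raw records from a source (here: an iterable of dicts)."""
    return list(source)

def transform(records):
    """Cleanse and normalize: drop incomplete rows, standardize types."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in records
        if r.get("name") and r.get("amount") is not None
    ]

def load(records, warehouse):
    """Append the transformed rows to the target store."""
    warehouse.extend(records)
    return warehouse

raw = [{"name": " alice ", "amount": "10.5"}, {"name": "", "amount": "3"}]
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # → [{'name': 'Alice', 'amount': 10.5}]
```

Real ETL tools add scheduling, connectors, and error handling around this same extract-transform-load skeleton.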
Since the 1970s, organizations have exploited diverse data repositories to archive business information. The market for ETL tools and solutions is estimated to expand significantly from 2018 to 2026. The growth of Big Data, the Internet of Things (IoT), demand for cloud computing, and ever-larger business data volumes all accelerate the importance of ETL tools.
Recently, many organizations have taken advantage of ETL tools to manage big data sources’ volume, variety, and velocity.
According to Talend, more than 50% of enterprise data now resides in the cloud, underscoring the influence of external data sources in every company. It is therefore urgent to adopt modern tools that can efficiently process and integrate data as it enters the data warehouse and accommodate growing workloads.
How to evaluate ETL tools?
Currently, various technology providers, such as AWS, IBM, Oracle, Talend, etc., offer ETL solutions. However, each enterprise has to decide which ETL tools are most efficient and best match its operations. A standard framework for comparing ETL tools was constructed from academic research, after referencing different articles, journals, and books. The finalized framework has four categories, as below:
- Price
- Functionality
- Ease of use
- Architecture
The first category is “Price”. The Oracle expert Abramson rated cost as one of the most important criteria for assessing an ETL tool. This criterion involves several fundamental costs that organizations have to take into account:
- License cost: incurred when buying a license.
- OS cost: the cost of the operating system the tool runs on.
- Support cost: incurred when requesting additional service support.
- Hardware cost: the money needed to buy the hardware required to run the program.
According to Mark Madsen, an author at Information Management Direct, the “Functionality” category was constructed to check whether a tool can actually process an organization’s data.
It covers a dozen functionalities, including:
- Basic processing support
- On-demand support
- Secure Packages
- ETL reporting
- Data Warehouse support
Ease of use
The category “Ease of Use” determines the usability of the ETL tools. According to Mark Madsen, it is difficult to establish criteria here because every user has different preferences on how a program should work (Madsen, 2008). After research, the following criteria were established for comparison in this category:
- Completeness of the GUI (graphical user interface): whether a good visual interface is provided.
- Custom code: whether the user can enter source code or highly customize the process.
- Integrated toolset: if the ETL tool is not integrated into one program, whether backup tools or add-ons can be purchased for the product.
- Debugging support: whether users can easily set breakpoints to analyze errors (Madsen, 2008).
- Source control: whether the user can easily select and integrate different sources.
According to the Clickstream Data Warehouse book, the “Architecture” category covers the hardware and operating systems (OS) supported by the software, in terms of platform, backup, and performance. Although each ETL tool advances a different architecture, the authors of the research identified several integral criteria in this area.
4 types of ETL tools
Recently, many ETL tools have been transformed and developed to satisfy various needs and requirements of users. These appliances fall into different categories based on several criteria such as functionality, structure, volume performance, etc. According to Hubspot, there are four basic types of ETL tools, including Open-source, Enterprise, Cloud-based, and Custom ETL tools.
Open-source ETL tool
Open-source tools are freely available and created by software developers. Each tool has distinct characteristics in terms of quality, integration, ease of use, adoption, and availability of support. Recently, there have been many open-source options for organizations to take into account, namely Pentaho, Hadoop, Hevo Data, Airbyte, etc. This blog will take Pentaho and Talend Open Studio as examples of open-source ETL tools.
Pentaho Kettle Enterprise, also known as Pentaho Data Integration, is the platform’s ETL component. It extracts data from numerous sources, transforms it, and loads it into an Enterprise Data Warehouse (EDW), whether a relational database or a NoSQL database. Organizations can also use the Pentaho tool for further transformation from the EDW to data marts or analytic databases.
The table below indicates a comparison between Pentaho and other ETL tools:
| Feature | Pentaho Kettle Enterprise | Informatica PowerCenter | Microsoft SSIS | Talend Integration Suite | IBM DataStage | Clover ETL |
| --- | --- | --- | --- | --- | --- | --- |
| Easy to use | Yes | No | No | No | No | No |
| Multi-role team collaboration | No | No | No | No | No | No |
| Large-volume performance | Yes | Yes | Yes | Yes | Yes | Yes |
| Plug-in external programs | No | Yes | No | Yes | Yes | No |
Case study – GEM’s Client ETL project
Background: GEM is currently implementing ETL as well as EL (extract and load) processes for its client, a company in the telecommunications sector. The client requires daily, hourly, or even minute-level data updates to keep business operations running properly.
Problem: the data comes from various sources and is too large to manage daily. Pentaho is one of the options that can address the client’s demands, and GEM uses the ETL and EL processes to help the client retrieve and govern this massive amount of data.
This work comprises two phases: phase 1, the EL process, and phase 2, the ETL process, as shown in the map below. Throughout both phases, the GEM team takes advantage of the Pentaho tool to run the EL and ETL processes.
- Phase 1: raw data is extracted from the data sources that the client provides in various forms (structured, unstructured, etc.). The GEM team then loads all of the extracted data into a data warehouse. The loading step delivers a customized Output 1 according to the client’s requirements, which becomes the data source for Phase 2.
- Phase 2: GEM then draws data out of Output 1, stores it regardless of type, and transforms it according to the client’s needs. The final result is Output 2, consisting of data marts or data lakes. The converted data is the material for later data visualization and reporting.
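The two phases can be modeled as a short pipeline. In practice GEM runs these steps in Pentaho Data Integration, so the Python below is only an illustrative sketch of the data flow, with hypothetical telecom record shapes:

```python
# Illustrative model of the two phases. Phase 1 (EL) lands raw records
# in the warehouse unchanged; Phase 2 (ETL) reads that output and
# transforms it into a mart. Record shapes are hypothetical.

def phase1_el(raw_sources):
    """Extract from every source and load as-is into the warehouse (Output 1)."""
    output1 = []
    for source in raw_sources:
        output1.extend(source)  # no transformation in the EL phase
    return output1

def phase2_etl(output1):
    """Extract Output 1, transform per client needs, load into a mart (Output 2)."""
    return [
        {"subscriber": rec["id"], "minutes": round(rec["seconds"] / 60, 2)}
        for rec in output1
        if "seconds" in rec  # skip records the client does not need
    ]

sources = [[{"id": 1, "seconds": 90}], [{"id": 2, "seconds": 30}, {"id": 3}]]
mart = phase2_etl(phase1_el(sources))
print(mart)  # → [{'subscriber': 1, 'minutes': 1.5}, {'subscriber': 2, 'minutes': 0.5}]
```

Separating the EL landing step from the later transformation lets the raw data be reloaded or re-transformed without re-extracting from the client’s systems.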
Results: the GEM team has progressively worked on the project, which has yielded favorable results. Specifically, the client can standardize data processing and governance efficiently, so it makes rational predictions and decisions faster.
Cloud-based ETL tool
Many cloud service providers (CSPs), such as Amazon AWS, Google Cloud Platform, and Microsoft Azure, have developed ETL tools on their own infrastructure. This has resulted from the proliferation of cloud computing and integration-platform-as-a-service (iPaaS). Currently, AWS holds the largest market share among cloud-based ETL tools.
AWS Glue is a prevalent cloud-based ETL tool that supports both visual and code-based users, helping them deliver sound business intelligence. The serverless platform provides multiple features, such as the AWS Glue Data Catalog for discovering data across the organization and AWS Glue Studio for visually composing, running, and managing ETL pipelines.
The map below illustrates how to create, run, and assess the ETL process without writing code, thanks to AWS Glue Studio. Initially, you compose ETL jobs that move and transform data using the drag-and-drop editor, and AWS Glue automatically generates the code for them. Additionally, the job run console of AWS Glue Studio lets you manage ETL execution and track progress.
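Glue jobs authored visually can also be triggered programmatically. The sketch below assembles the arguments for boto3's `glue.start_job_run()` call; the job name and the `--run_date` argument are hypothetical, and the actual AWS call is kept in a separate function since it requires credentials:

```python
# Sketch of triggering a Glue job programmatically. The job name and
# argument key below are hypothetical; the job itself would have been
# authored visually in AWS Glue Studio.

def build_job_run_request(job_name, run_date):
    """Assemble the keyword arguments for glue.start_job_run()."""
    return {
        "JobName": job_name,
        # Custom Glue job arguments are conventionally prefixed with "--"
        "Arguments": {"--run_date": run_date},
    }

def start_glue_job(request):
    """Submit the run; needs AWS credentials, so it is not executed here."""
    import boto3  # local import: only needed when actually calling AWS
    glue = boto3.client("glue")
    response = glue.start_job_run(**request)
    return response["JobRunId"]

request = build_job_run_request("daily-sales-etl", "2024-01-01")
```

A scheduler or trigger would then pass `request` to `start_glue_job` and poll the run state from the Glue Studio console or via the API.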
According to academic research, the graph below indicates that cloud services have become more prevalent around the world: between Q3 2019 and Q3 2020, spending on cloud services increased by 32.72%. Three leaders dominate the market, namely AWS, Google, and Microsoft. Based on Fig 6, AWS accounted for the largest share, growing from $9.075 billion to $11.68 billion, followed by Microsoft at $6.935 billion in Q3 2020.
| Cloud service provider | Pros | Cons |
| --- | --- | --- |
| AWS | Breadth and depth of services; developer functionality; economic benefits for customers; gold standard for reliability and security; dominant market position; sizeable and growing offerings; helpful for huge organizations; worldwide reach | Cost can be prohibitive; not easy to use; pricing is hard to govern; technical support fees |
| Microsoft Azure | Adjustable billing; Platform-as-a-Service (PaaS) is a strong suit of Microsoft; accurate and expandable; high availability; cost-effective compared with the competition; compatible with Microsoft devices and software; integrates with public and private clouds | Inefficient documentation; imperfect management tools; comparatively hard to use; data transfer costs; requires platform expertise |
| Google Cloud Platform | More favorable pricing than competitors; live migration of virtual machines; adjustable pricing model; deep technology expertise; continuous innovation and a strong standing in cloud computing | Safety and privacy concerns; limited control and flexibility; vendor lock-in; fewer features and services; historically less enterprise-focused |
Case study – Burt Corporation
Burt Corporation, a start-up data company, delivers data intelligence and analytics solutions to many major online publishers, helping them understand and optimize their online marketing strategies. The company has implemented various tools such as AWS Glue, Amazon Redshift, and Amazon Athena to foster efficient data collection, processing, analysis, and decision-making capabilities.
Commercial ETL tool (Enterprise)
Commercial tools distinguish themselves from the other two types by two notable features: modification and data inputs. In terms of modification, while open-source software grants basic rights to the general public, commercial tools can be modified only by the organization that created them. Additionally, unlike cloud-based tools, which only process online data sources, commercial ones accept sources from both online and offline databases.
Commercial tools offer graphical user interfaces (GUIs) for designing and executing ETL pipelines. They also support relational databases and non-relational formats such as JSON and XML, event-streaming sources, etc.
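To make the "non-relational formats" concrete, the stdlib sketch below flattens a small JSON payload and an XML payload into the same row shape that a relational load step would expect; both payloads are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

def rows_from_json(payload):
    """Parse a JSON array of objects into (id, name) tuples."""
    return [(item["id"], item["name"]) for item in json.loads(payload)]

def rows_from_xml(payload):
    """Parse <customer id="..">name</customer> elements into the same shape."""
    root = ET.fromstring(payload)
    return [(int(c.get("id")), c.text) for c in root.findall("customer")]

json_src = '[{"id": 1, "name": "Ann"}]'
xml_src = '<customers><customer id="2">Bob</customer></customers>'
rows = rows_from_json(json_src) + rows_from_xml(xml_src)
print(rows)  # → [(1, 'Ann'), (2, 'Bob')]
```

Commercial ETL suites perform this kind of normalization through connectors configured in the GUI rather than hand-written parsers.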
The next section will investigate Informatica PowerCenter as an example for a better understanding of enterprise ETL tools.
Currently, Informatica is an industry leader in ETL. It has the best-in-class data integration products for quickly integrating data and applications.
Informatica PowerCenter is an on-premise ETL tool that can link to a number of different legacy database systems. The tool also allows for data governance, monitoring, master data management, and masking. PowerCenter is a batch-based ETL application with a cloud counterpart, and users can run it on servers on the company’s own premises. Informatica also provides a number of data management and software-as-a-service options.
Informatica is an ETL tool that can be used to build corporate data warehouses. It also provides a range of data masking, deduplication, merging, consistency, and ETL products. The tool allows users to connect to and view data from a variety of sources, as well as perform data processing.
| Pros | Cons |
| --- | --- |
| Easier for server managers to schedule jobs; Informatica’s archive manager helps with data preservation and recovery; Informatica is a mature, eminent business data integration framework | Informatica is somewhat expensive: costlier than DataStage but cheaper than Ab Initio; users must pay single- and multi-user licensing costs; only available on a commercial basis; incorporating custom code through Java transformations is relatively complicated |
Case study – Rabobank
Rabobank is a multinational banking and financial services company. The organization is a global leader in food and agricultural financing and sustainability-oriented banking, with a mission to grow a better world together.
Rabobank’s key challenges were to:
- Understand data lineage across the bank to create business value and growth opportunities while driving digital transformation
- Comply with regulatory requirements for financial services, such as BCBS 239 for risk data aggregation and reporting
- Promote data governance and improve overall data quality to help increase the value of data assets and maintain that value over time
“Informatica Enterprise Data Catalog helps us increase and promote the value of our data assets, which are the foundation of everything we do,” said Anil Bandarupalli, Solution Architect for Data Management at Rabobank.
With Informatica, Rabobank was able to:
- Inventory and catalog data using Informatica Enterprise Data Catalog for end-to-end data lineage and complete tracking of data movement
- Give employees an easy-to-use interface and simple search to quickly discover data and its lineage for auditors or internal projects
- Reduce the duration of root cause analysis processes to empower data stewards to identify and remediate data quality issues faster
Custom ETL tool
Despite the widespread use of GUI-based solutions, some organizations choose hand-coded ETL tools for their ETL processes. In some contexts, the custom approach can be cheaper, faster, and more attainable than GUI-based tools.
Enterprises would build their custom ETL tools based on programming languages such as Python, R, Java, etc. In academic research, the authors introduced several custom tools, namely Pygrametl and Petl (Python-based), Scriptella (Java-based), etl (R-based).
The most notable example is petl, a general-purpose Python library for extracting, transforming, and loading tables of data, distributed under the MIT License.
The design of petl is easy and convenient for users, which makes the tool preferable for working with mixed, unfamiliar, or heterogeneous data. In addition, you can build tables in Python from various data sources such as CSV, XLS, HTML, TXT, JSON, etc., and drive them to the prescribed data storage. Another benefit of petl is that it can be used for migrating between SQL databases efficiently and smoothly.
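A distinctive part of petl's design is that its table pipelines are lazy: rows only flow when the result is materialized. The stdlib sketch below imitates that design for illustration (it is not petl's actual API), using petl's convention of a header row followed by data rows:

```python
# Stdlib imitation of petl's lazy table pipeline (not petl's real API).
# A "table" is an iterable whose first row is the header, as in petl.

def from_rows(rows):
    """Wrap a list of rows as a lazily consumed table."""
    return iter(rows)

def convert(table, field, fn):
    """Lazily apply fn to one column, identified via the header row."""
    header = next(table)
    idx = header.index(field)
    yield header
    for row in table:
        row = list(row)
        row[idx] = fn(row[idx])
        yield row

source = [("name", "amount"), ("ann", "10"), ("bob", "3")]
pipeline = convert(from_rows(source), "amount", int)  # nothing runs yet
result = list(pipeline)  # materializing pulls rows through the pipeline
print(result)  # → [('name', 'amount'), ['ann', 10], ['bob', 3]]
```

Because each step is a generator, arbitrarily long chains of transforms can process tables larger than memory, which is the same property that makes petl suitable for heterogeneous bulk data.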
ETL tools are of immense importance in the field of business intelligence, and choosing the right one can drastically affect business outcomes. It is therefore important to select an ETL tool according to business requirements and budget. This blog presented a review of the distinct features of some mainstream ETL tool suites and how these tools are applied in practice. The market for ETL tools will continue to expand due to the demand for data integration and governance; more importantly, these tools are constantly being transformed and upgraded, which requires practitioners to keep learning and adapting.
GEM Corporation is a leading IT service provider that empowers its business clients in their digital transformation journey. Based in Hanoi, Vietnam, GEM is characterized by competent human resources, an extensive and highly adaptive tech stack, and an excellent ISO-certified, CMMI-based delivery process. GEM has therefore been trusted by both start-ups and large corporations from many global markets across different domains.