What is ETL Testing?

A database is one of the most important assets an organization may own, and the data contained within that database is generally invaluable. It is common now for an organization to collect and analyze real-time data for corporate decision making, reporting, data mining and for reviewing historical trends. To be utilized as such, data must be gathered from a number of different sources, then transformed and stored in a single database so that the organization can monitor, analyze and report on the data. This model is referred to as the data warehouse (we will not get into the complexities of ETL/data warehouse architecture). The data warehouse is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, and enables an organization to consolidate data from several sources.

This leads us to the importance of ETL testing. For those who are unfamiliar with the term, ETL stands for Extract-Transform-Load, and is the process for consolidating all of the data from various sources into the data warehouse model. Because data is coming from various sources, and in various forms, it is vital that the ETL process is working correctly. In software development, the cost of finding and fixing a defect rises exponentially as the development lifecycle progresses. In the data warehouse model, the cost related to the failure of discovering a defect early in the process is often worsened by the fact that key business decisions are potentially being made based on the data presented.

How Do We Test ETL?

Testing, preferably by an independent party, should be undertaken to verify and validate the ETL process, thereby ensuring the quality, completeness and robustness of the data warehouse. There are a variety of tools that can be used for ETL testing, but that decision can be left to the organization and the test team. There are several levels of testing that should be performed during ETL testing. Some levels of testing are outlined below, but this should not be considered an exhaustive list of testing activities.

Requirements Testing

    • Are the requirements complete?
    • Are the requirements testable?
    • Are the requirements clear (is there any ambiguity)?

Data Validation Testing

    • Compare record counts between data sources
    • Ensure that the ETL application properly rejects, replaces with default values and reports invalid data
    • Verify that data is transformed correctly according to system requirements and business rules
    • Compare unique values of key fields between source data and warehouse data
    • Ensure that all projected data is loaded into the data warehouse without any data loss or truncation
    • Test the boundaries of each field to find any database limitations

Integration Testing

    • Verify the sequence and outcome of ETL batch jobs
    • Verify that ETL processes function with upstream and downstream processes
    • Verify the initial load of records on data warehouse
    • Verify any incremental loading of records at a later date for newly inserted or updated data
    • Test the rejected records that fail ETL rules
    • Test error log generation

Report Testing

    • Verify report data with the data source
    • Create SQL queries to verify source/target data
    • Verify field-level data

User Acceptance Testing

    • Verify that the business rules have been met
    • Confirm that the system is acceptable to the client

Performance Testing

    • Verify that data loads and queries are executed within anticipated timeframes
    • Verify that maximum anticipated volume of data is loaded within an acceptable timeframe
    • Verify load times with various amounts of data to predict scalability

Regression Testing

    • Ensure that current functionality stays intact whenever new code is released

Test Team Resources

Defining the right resources to test an ETL application is a key to success. It is imperative to have at least one resource on the test team with in-depth experience in databases and data models. This resource should be involved early and often in the project planning phase, not just at the end of the project during the testing phase. This will keep the test team abreast of requirements, design and process changes. Test resource allocation should correlate directly with the size of the data warehouse team.

Challenges

Testing an ETL application differs from conventional testing. Below is a list of some challenges that may face the ETL test team:

    • Unclear or ambiguous requirements
    • Incompatible and duplicate data
    • Data loss during the ETL process
    • Unavailability of a dedicated test environment
    • Complexity of the data
    • Test data – mining and creation
    • Missing business flow information

 

Report Testing

    • Requirements are the blueprint for understanding and communicating the business rules. If requirements are not concise and complete, implementation will be difficult.
    • Automating processes whenever possible will save tremendous amounts of time.
    • The entire ETL team should share the same tools, repositories and defect tracking system.
    • Do not over-simplify the testing efforts that are needed in order for the project to be a success.

Conclusion

It is vital that the organization recognizes that the test phase is part of the data warehouse life-cycle. The test phase should be planned at the beginning of the project. ETL testing requires a lot of “nose to the grindstone” effort on behalf of the test team which should be comprised of testers, developers, business analysts, database administrators and end-users. It is the responsibility of the entire project team to ensure the quality and success of the ETL application.