Thu Aug 08 2024

Enhancing Data Quality with dbt Unit Testing

By Andy Ho

In the complex and evolving world of data, ensuring the integrity and reliability of our data pipelines through robust testing mechanisms is increasingly crucial.

dbt (data build tool) has been at the forefront of this evolution, bringing software engineering practices into data workflows. dbt has long provided data tests and constraints, and the recent introduction of unit testing in dbt Core v1.8 and dbt Cloud adds a valuable new tool to our utility belt. This feature enhances the quality and reliability of our data transformations, helping maintain the stakeholder trust needed for confident decision-making.

My name's Andy, and I work as an analytics engineer at Xomnia. In this blog post I'll discuss:

  • dbt and its role in modern data workflows
  • The concept of Test-Driven Development (TDD) and its benefits
  • How unit testing applies to dbt, and why it is needed
  • Practical applications and best practices for implementing unit tests in your dbt framework

What is dbt?

Figure 1: dbt DAG (Directed Acyclic Graph) visualizing data model dependencies (Courtesy of dbt)

For those new to it, dbt, or data build tool, is designed to handle the 'T' (Transform) in ELT (Extract, Load, Transform) processes. It leverages SQL queries and scripts to transform raw data into well-structured, analysis-ready datasets through a unified framework with modular, reusable, and testable SQL code.

dbt is available in two flavors:

  1. dbt Core: An open-source CLI version run locally or integrated into existing data infrastructure.
  2. dbt Cloud: A fully-managed, web-based service offering collaborative development, job scheduling, and enterprise-grade features.

We won't go through the entire ecosystem in our discussion, but some of dbt's key features include:

  1. Modular and reusable code with macros: dbt promotes writing modular SQL code, utilizing macros as parameterized SQL snippets to enforce DRY (Don’t Repeat Yourself) principles. This approach breaks down complex transformations into reusable models, ensuring efficiency and consistency across projects (a short sketch follows this list).
  2. SQL-based transformations: dbt allows you to write SQL queries to transform your data, making it accessible to a wide range of data professionals proficient in SQL.
  3. Dependency management: dbt automatically manages the dependencies between your models, ensuring that your transformations are executed in the correct order.
  4. Job Scheduling: dbt Cloud offers advanced, built-in job scheduling capabilities, allowing you to automate and manage your data transformation pipelines efficiently.
  5. Testing and documentation: dbt provides a framework for writing tests to validate your data transformations and generate documentation for your data models.
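
To make point 1 concrete, here is a minimal sketch of a macro, modeled on the cents-to-dollars example from dbt's documentation; the macro, model, and column names here are hypothetical:

    -- macros/cents_to_dollars.sql: one definition of the conversion,
    -- reused everywhere instead of repeating the arithmetic in each model
    {% macro cents_to_dollars(column_name, precision=2) %}
        round({{ column_name }} / 100.0, {{ precision }})
    {% endmacro %}

    -- In a model (hypothetical names):
    -- select {{ cents_to_dollars('amount_cents') }} as amount_dollars
    -- from {{ ref('payments') }}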

Now that we've established what dbt is and its key features, let's explore how we can enhance our data pipelines through unit testing.

Unit Testing in Traditional Software Development

Unit testing is a concept brought over from software engineering that involves testing individual units or components of a system in isolation. A unit refers to the smallest testable part of any software, such as a single function, method, or class.

Typically, unit testing is done within a Test-Driven Development (TDD) environment that aims to cover all possible use cases and edge cases, ensuring that your code behaves as intended under all circumstances.

Highlighted in Figure 2, TDD involves writing tests before the actual code, then writing the code that passes those tests. We'll return to this approach later in the blog, but its essential value is that it forces you to think deeply about the core functionality of what you're writing and what you want it to achieve.

Figure 2: Simple Test-Driven Development (TDD) workflow

Benefits of Test-Driven Development (TDD):

  • Improved Code Quality: Unit tests enforce better coding practices and ensure that each part of the codebase works as intended. By focusing on small, testable units of code, TDD promotes modular and loosely coupled designs.
  • Simplified Debugging: When a test fails, it's easier to identify the specific part of the code that caused the failure, making debugging faster and more efficient.
  • Early Error Detection: By testing individual components in isolation, unit tests catch errors early on in the development process. This prevents small issues from becoming major problems down the line.
  • Documentation: Unit tests serve as documentation for the code, providing examples of how different parts of the codebase are intended to be used and behave.
  • Design Improvement: Writing tests before implementation forces developers to think about the design and interface of their code upfront. This often leads to better-structured, more maintainable code.

Now, here's a fun general tip: if everything seems to work perfectly right off the bat, be suspicious. In my experience, that usually means I've missed something.


Unit Testing in dbt: A Paradigm Shift

In the dbt framework, unit testing is adapted to the unique needs of the data transformation workflow, focusing on ensuring that your SQL transformations, and the data they produce, are correct. This is done by validating your transformation logic against sample data that simulates different scenarios and edge cases.

Unit testing allows you to make updates and ship code changes confidently and efficiently, providing a low-cost check before you materialize your model.

Why Do We Need Unit Testing?

At the beginning of the blog, I mentioned that while dbt’s long-standing testing mechanisms (end-to-end and generic data tests) are very helpful for data quality, they have certain limitations around granularity that become increasingly apparent as data projects grow in complexity and scale. These limitations include:

  • Limited Coverage: Schema tests in dbt ensure data integrity by checking constraints like uniqueness and non-null values, but fall short in validating transformation logic. Complex business logic and calculations within transformations can go untested, leading to potential errors.
  • Manual Validation: Without automated unit tests, developers manually verify complex transformations by running queries and inspecting results, which is time-consuming and error-prone.
  • Resource Intensive: Data tests in dbt run after a model is materialized. This requires significant resources to load and process when working with large datasets.

Unit testing in dbt helps to address data quality gaps by:

1. Validating Complex Logic: Unit tests allow you to verify intricate business logic and calculations within your transformations, ensuring accuracy before materialization.

2. Automating Validation: By replacing manual verification with automated unit tests, you save time and reduce human error in the validation process.

3. Testing Efficiently: Unit tests run on small, representative datasets, providing quick feedback without the need to process large volumes of data.

4. Detecting Errors Early: By catching issues before model materialization, unit tests help prevent errors from propagating through your data pipeline.

One additional consideration that doesn't sit exactly within data quality gaps, but is worth noting, is cost. Unit tests run on small, representative datasets. This means you can validate complex logic without processing your entire data warehouse, saving both time and computational resources. Additionally, by catching errors before materialization, unit tests prevent the need to rerun large, expensive transformations over billions of rows. This is especially crucial in cloud environments where compute time directly translates to costs.

What is the Difference Between Unit Tests, Constraints, and Data Tests?

In the dbt ecosystem, there are multiple ways to ensure data quality and integrity. While we won't dive into exhaustive details, it's crucial to understand the key differences between unit tests, constraints, and data tests.

  1. Constraints, as detailed in dbt's documentation, are rules applied directly to columns in your models. dbt constraints help enforce data integrity at the database level, ensuring that only valid data is inserted or updated in your tables.
  2. Data tests, on the other hand, are SQL queries that return failing rows. dbt data tests are typically used to validate data quality after a model has been materialized, checking for conditions like uniqueness, non-null values, or specific business rules.
  3. Unit tests, which we've focused on in this blog, validate the logic of your transformations before they're applied to your full dataset.

Table 1: Comparison of unit tests, constraints, and data tests in dbt

Table 1 compares the differences between unit tests, data tests, and constraints in dbt. These quality checks should be used intentionally to ensure the reliability and quality of your data transformation processes. It’s not about choosing one over the other, but about leveraging each type to strengthen the quality of your data. By understanding and utilizing these different types of tests and constraints, you can build a robust data quality framework that catches issues at various stages of your data pipeline, from transformation logic to final data validation.
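
To make the distinction concrete, here's a minimal sketch showing a constraint and a data test declared side by side. The model and column names follow the example introduced in the next section, and the specific checks are assumed for illustration; note that dbt only enforces column constraints when the model's contract is enforced:

    models:
      - name: customer_id_validator
        config:
          contract:
            enforced: true          # required for dbt to enforce constraints
        columns:
          - name: customer_id
            data_type: varchar
            constraints:
              - type: not_null      # checked by the database at build time
            data_tests:
              - unique              # checked by a query after materialization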

Practical Application of Unit Testing in dbt

Now that we've clarified the distinctions between different types of tests in dbt, let's dive into a practical example to see unit testing in action. The SQL query in Figure 4 below shows how a retail platform might use RegEx within dbt to validate that each customer ID follows the required format, three uppercase letters followed by three digits, and to categorize each ID as valid or invalid.

Figure 4: Sample RegEx for customer ID validation
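
Since the figure is a screenshot, here's a sketch of what that model's SQL might look like. The input and column names are taken from the unit test discussed next, and RegEx function names vary by warehouse (REGEXP_LIKE below is Snowflake-style syntax, so adjust for your platform):

    -- models/customer_id_validator.sql: flag IDs matching the required
    -- pattern of three uppercase letters followed by three digits
    select
        customer_id,
        regexp_like(customer_id, '^[A-Z]{3}[0-9]{3}$') as valid_customer_id,
        case
            when regexp_like(customer_id, '^[A-Z]{3}[0-9]{3}$') then 'valid'
            else 'invalid'
        end as customer_id_status
    from {{ ref('customer_metrics') }}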

In retail analytics, accurate customer identification is crucial for various business processes, from personalized marketing to fraud detection. Invalid customer IDs can lead to misattributed sales, skewed analytics, and potential regulatory compliance issues. By implementing rigorous validation through unit tests in the following sections, we not only ensure data integrity but also safeguard critical business operations that rely on this data.

Implementing Unit Tests

To test our customer ID validation logic, we create the unit test shown in Figure 5.

Figure 5: dbt unit test configuration for customer ID validation
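
The figure itself isn't reproduced here, so below is a sketch of the configuration it shows, following dbt's unit test spec; the sample rows are illustrative:

    unit_tests:
      - name: test_customer_id_validation
        model: customer_id_validator
        given:
          - input: ref('customer_metrics')
            rows:
              - {customer_id: "ABC123"}  # valid format
              - {customer_id: "ab123"}   # lowercase and too short
              - {customer_id: "123ABC"}  # digits before letters
        expect:
          rows:
            - {customer_id: "ABC123", valid_customer_id: true, customer_id_status: "valid"}
            - {customer_id: "ab123", valid_customer_id: false, customer_id_status: "invalid"}
            - {customer_id: "123ABC", valid_customer_id: false, customer_id_status: "invalid"}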

Let's break down, step by step, what's happening in our YAML file:

  1. unit_tests: This top-level key indicates that we're defining unit tests.
  2. name: This descriptive name helps identify the purpose of our test.
  3. model: This specifies which dbt model we're testing, matching the name of our SQL model file (without the .sql extension).
  4. given: This section defines our test input data.
     • input: ref('customer_metrics'): Specifies that we're providing input for the 'customer_metrics' table.
     • rows: Here, we define our test input data with a mix of valid and invalid customer IDs to test different scenarios.
  5. expect: This section outlines what we expect our model to output given the input data.
     • rows: Lists the expected output for each input row, including the customer_id, the expected valid_customer_id boolean, and the expected customer_id_status. Including both valid and invalid customer IDs lets us comprehensively test our model's logic.

To enable a centralized approach to managing your tests (see Figure 6), unit tests can be defined in the model YAML file alongside other data tests and constraints. Additionally, dbt's unit testing framework only requires you to define the input data relevant to your test: in this case, we provide only the customer_id column, even though our actual customer_metrics table contains additional columns.

Figure 6: dbt YAML configuration - unit tests and data tests for customer ID validator
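
As a sketch of that centralized layout, the same schema.yml file might hold the unit test alongside column-level data tests; the specific data tests here are assumed for illustration:

    # models/schema.yml (sketch)
    models:
      - name: customer_id_validator
        columns:
          - name: customer_id
            data_tests:
              - not_null
          - name: customer_id_status
            data_tests:
              - accepted_values:
                  values: ['valid', 'invalid']

    unit_tests:
      - name: test_customer_id_validation
        model: customer_id_validator
        # ...as shown in Figure 5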

Before we continue, you may have noticed that we wrote our SQL logic before creating the unit test. While this approach still provides the benefits of unit testing, it doesn't fully align with the TDD principles shared at the beginning of the blog, where tests are written first.

However, the dbt documentation itself presents its examples in this order, SQL first. In practice, we can adapt TDD principles to fit the reality of data work: we often don't know exactly what our data will look like until we start working with it.

Based on my experience, here's a workflow that balances the benefits of test-driven thinking with the practical needs of data transformation:

  1. Explore and analyze the data to understand its structure and content.
  2. Develop initial transformation logic based on this understanding.
  3. Identify key assumptions and edge cases in your transformation.
  4. Write unit tests to validate these assumptions and handle edge cases.
  5. Refine and iterate your transformation logic based on test results.

Figure 7: Adapted TDD workflow for dbt data transformations

This approach allows us to maintain the spirit of TDD, thinking critically about our expectations and desired outcomes, while acknowledging the exploratory nature of data work.


When Should You Apply Unit Tests?

While our example above covers one case of RegEx unit testing for customer ID validation, let's explore some other key scenarios when you should consider applying unit tests:

  1. Date Mathematics: When performing calculations or comparisons involving dates and times.
  2. Window Functions: When using analytical functions like ROW_NUMBER(), RANK(), or LAG() to perform calculations across rows.
  3. Complex CASE WHEN Statements: Particularly when there are multiple conditions or nested logic.
  4. Truncation Operations: If you're rounding or truncating numerical values, which could lead to precision issues.
  6. RegEx Usage: As in our example above, pattern matching and data validation logic are good candidates for unit tests.
  6. Custom Logic: When you're writing specialized dbt logic to process input data, similar to creating a function.
  7. Edge Case Handling: To ensure your model correctly handles scenarios not yet encountered in production data.
  8. Stable, Well-Defined Transformations: For data models with clear, stable business logic that you expect to maintain long-term.
  9. Critical Data Pipelines: For transformations that feed into critical reports, dashboards, or decision-making processes.

Supported Data Formats for Unit Tests

Now that we know when to apply unit tests, it's important to understand the different ways we can provide data for these tests. Currently, dbt supports three main formats for flexible mock data in unit tests:

  1. dict (default): Inline dictionary values, as shown in our example above.
  2. csv: Inline CSV values or a reference to a CSV file.
  3. sql: Inline SQL query or a reference to a SQL file.

Each of these formats has its own advantages and use cases. The 'dict' format is often the most straightforward for simple tests, while 'csv' and 'sql' can be useful for more complex scenarios or larger datasets.
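
For example, the input rows from our earlier test could be rewritten inline as CSV; this is a sketch with illustrative values:

    given:
      - input: ref('customer_metrics')
        format: csv
        rows: |
          customer_id
          ABC123
          ab123
          123ABC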

For a detailed explanation of these formats and how to use them, we recommend referring to the official dbt documentation on data formats for unit tests.

Running Unit Tests

To run unit tests on our model, use the following command:

Figure 8: Running unit tests in dbt
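
Based on the breakdown that follows, the command in the figure reads as follows (quoting the selector keeps most shells from mangling it):

    dbt test --select "customer_id_validator,test_type:unit"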

Let’s break down what is happening:

  1. dbt test: This is the base command to run tests in dbt.
  2. --select customer_id_validator: This selects your specific model.
  3. test_type:unit: This filters to run only unit tests.

Essentially, by combining these selectors, you're telling dbt to run only the unit tests associated with the customer_id_validator model.

Something worth noting is that dbt Labs strongly recommends running unit tests only in development or CI environments. Since the inputs are static, there's no need to use additional compute cycles in production. Use them in development for a test-driven approach and in CI to ensure changes don't break existing functionality.

Unit Test Exit Codes

After running your unit tests, dbt will provide feedback in the form of exit codes, represented by two simple indicators:

  • Pass (0): Indicates that all unit tests have passed successfully.
  • Fail (1): Indicates that one or more unit tests have failed.

These straightforward exit codes make it easy to integrate unit testing into your CI/CD pipelines. For instance, in a CI environment, a non-zero exit code (1) would typically trigger a pipeline failure, allowing for quick identification of issues in your development process and alerting the team before reaching production.
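
As a sketch, a CI step (GitHub Actions syntax assumed here) needs nothing beyond running the command, since the runner treats dbt's non-zero exit code as a step failure:

    # Hypothetical CI step: fails the workflow if any unit test fails
    - name: Run dbt unit tests
      run: dbt test --select "test_type:unit"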


Best Practices for Unit Tests in dbt

  1. Comprehensive Testing: After fixing an error, always rerun all test cases, not just the one that previously failed. Changes made to fix one issue might inadvertently affect other parts of your model. Always test all your inputs to ensure consistent outputs across your entire model.
  2. Use descriptive test names: Make your test names clear and specific to easily identify what's being tested.
  3. Cover edge cases: Include tests for boundary conditions and unusual scenarios.
  4. Regularly review and update tests: As your models evolve, make sure your tests stay relevant.

Conclusion

The power of unit testing in dbt is clear. It enhances the reliability and quality of data transformation processes, allowing teams to catch errors early, simplify debugging, and ensure each component of the data pipeline works as intended. As data pipelines grow in complexity, comprehensive testing strategies become paramount.

Remember, the goal isn't to choose between unit tests, data tests, or constraints, but to leverage the strengths of each. By integrating unit testing alongside existing quality assurance measures, data teams can build more reliable, maintainable, and trustworthy transformations, ultimately leading to better decision-making and business outcomes.

It's time to elevate your data pipelines. Start by identifying one critical model in your dbt project and write your first unit test today. Every test you write is a step towards more trustworthy data and confident decision-making. Make unit testing an integral part of your workflow, and watch the quality and reliability of your data pipelines soar.

Xomnia is the leading data and AI consultancy in The Netherlands. If you are interested in what we can do for your business, contact us!