Unit Testing in Databricks
What is it, How to Implement, and Examples
Abhishek Agrawal
Azure Data Engineer
Q1. What is Unit Testing?
Unit testing is the practice of testing individual units of code (usually
functions or methods) to ensure they work as expected. In the context of
Databricks, unit testing focuses on testing specific parts of your data
engineering pipeline, such as PySpark transformations and business logic.
Key Points:
Ensures that individual functions perform as expected.
Helps catch errors early in the development cycle.
Improves code reliability and maintainability.
Example workflow using unittest:
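Both examples below exercise a small helper, add_column, that appends a constant-valued column to a DataFrame. The tests call this function but do not define it; the sketch below shows one plausible implementation (the name and signature are taken from the test calls, the body is an assumption):

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def add_column(df: DataFrame, col_name: str, value) -> DataFrame:
    # Append a new column holding a constant value (assumed implementation)
    return df.withColumn(col_name, lit(value))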
import unittest
from pyspark.sql import SparkSession

class TestDataFrameFunctions(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Create a Spark session and a sample DataFrame for testing
        cls.spark = SparkSession.builder.appName("UnitTesting").getOrCreate()
        cls.df = cls.spark.createDataFrame([(1, 'John'), (2, 'Jane')], ["id", "name"])

    def test_add_column(self):
        # Test the add_column function (defined above / imported from the module under test)
        df_with_new_col = add_column(self.df, "age", 25)
        self.assertIn("age", df_with_new_col.columns)  # Check that the new column is added
        # Check that the new column has the correct value
        self.assertEqual(df_with_new_col.select("age").distinct().collect()[0][0], 25)

    @classmethod
    def tearDownClass(cls):
        # Stop the Spark session after tests are complete
        cls.spark.stop()

if __name__ == '__main__':
    unittest.main()
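If you execute this test class inside a Databricks notebook cell rather than as a standalone script, unittest.main() will try to exit the Python process. A common workaround (an assumption about your setup, adjust as needed) is to disable the exit and pass an empty argv:

unittest.main(argv=[''], verbosity=2, exit=False)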
Example workflow using pytest:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Provide a single Spark session shared across the whole test session
    return SparkSession.builder.appName("PySpark Unit Test").getOrCreate()

def test_add_column(spark):
    # add_column is the helper under test (import it from your module)
    df = spark.createDataFrame([(1, 'John'), (2, 'Jane')], ["id", "name"])
    df_with_new_col = add_column(df, "age", 25)
    assert "age" in df_with_new_col.columns
Explanation:
pytest.fixture: Creates one Spark session that is reused across the whole test session (scope="session").
test_add_column: Validates that the new column is added to the DataFrame.
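The pytest test above only checks that the column exists; to also verify its value, as the unittest example does, an extra test could look like this (a sketch under the same add_column assumption):

def test_add_column_value(spark):
    df = spark.createDataFrame([(1, 'John'), (2, 'Jane')], ["id", "name"])
    df_with_new_col = add_column(df, "age", 25)
    # Every row should carry the constant value 25 in the new column
    assert df_with_new_col.filter(df_with_new_col["age"] != 25).count() == 0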
To run the tests from a Databricks notebook, invoke pytest through the %sh magic command:

%sh
pytest path/to/test_file.py
Use Mock Data: Test against small, isolated datasets so you can simulate various scenarios, including edge cases, quickly and deterministically (see the sketch below).
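For instance, a mock dataset can cover an edge case such as an empty input DataFrame. The schema and the add_column helper are assumptions carried over from the examples above:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def test_add_column_on_empty_df(spark):
    # Mock data: an empty DataFrame with the same schema as the production data
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ])
    empty_df = spark.createDataFrame([], schema)
    df_with_new_col = add_column(empty_df, "age", 25)
    # The column should exist even when there are no rows
    assert "age" in df_with_new_col.columns
    assert df_with_new_col.count() == 0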
Key Takeaways:
Unit tests validate individual PySpark functions and business logic, catching errors early and keeping pipelines reliable.
Both unittest and pytest work in Databricks; share one Spark session via setUpClass or a session-scoped fixture.
Run test files from a notebook cell with %sh pytest.
Keep test data small and mocked so tests stay fast and repeatable.