Setting up CommCare Data Pipeline with the Azure Ecosystem

Introduction

This guide provides an overview of establishing a data pipeline to extract, transform, and load (ETL) data from CommCare into the Azure ecosystem. The setup combines several Azure services for data integration, transformation, and storage to support data management and analytics. Here’s the ETL workflow breakdown:

Extract: Retrieve data from CommCare and save to Azure Data Lake.

Transform: Process and transform the data using Databricks.

Load: Load the transformed data into Azure SQL Database.

Here’s a quick video showcasing the data pipeline workflow: https://www.loom.com/share/07c8d69de40c458b98132c56ca9f7b04

Data Pipeline Overview

Data collected in CommCare can flow into the Azure ecosystem in multiple ways. The assumptions for this tutorial are the following:

  1. Form data is being pulled into Azure via CommCare APIs

  2. Azure setup includes:

    1. Azure Data Lake Storage (ADLS) with storage containers for storing raw JSON of CommCare data. See https://learn.microsoft.com/en-gb/azure/storage/blobs/data-lake-storage-introduction

    2. Azure SQL Server for the final output of formatted CommCare data in structured tables. See https://learn.microsoft.com/en-us/azure/azure-sql/database/?view=azuresql

    3. Azure Data Factory (ADF) for setting up the API pull from CommCare to save raw JSON to ADLS. See https://learn.microsoft.com/en-us/azure/data-factory/introduction

    4. Azure Databricks for processing raw JSON into structured data for copying to SQL Server. See https://learn.microsoft.com/en-us/azure/databricks/introduction/

Resource setup on Azure is outside the scope of this tutorial, as partners may choose to set up their Azure environments in different ways. Please use the links above for guidance if you are setting up for the first time.

Using CommCare APIs to Create Raw JSON in ADLS via ADF

  1. Navigate to Azure Data Factory and set up a new pipeline. Choose Move and Transform from Activities and select Copy Data.

  2. Create a new source dataset by selecting REST API.

  3. Enter the CommCare Form Data API endpoint. Choose Anonymous authentication, add a header named ‘Authorization’, and enter ‘ApiKey username:apikey’ in the value field. Use your CommCare username and the API key associated with that project space, for example: ‘ApiKey rrath@dimagi.com:123xyz’. (A quick way to verify the endpoint and header is sketched after this list.)

  4. Create a sink dataset by selecting the ADLS storage path of choice.
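
Before running the pipeline, it can help to verify the endpoint and Authorization header outside of ADF. Below is a minimal Python sketch; the project space, username, and API key are placeholders, and it assumes the v0.5 Form Data API endpoint:

    import requests

    # Placeholders: replace with your project space, CommCare username, and API key.
    DOMAIN = "your-project-space"
    USERNAME = "user@example.com"
    API_KEY = "123xyz"

    url = f"https://www.commcarehq.org/a/{DOMAIN}/api/v0.5/form/"
    headers = {"Authorization": f"ApiKey {USERNAME}:{API_KEY}"}

    response = requests.get(url, headers=headers, params={"limit": 10})
    response.raise_for_status()

    # The response is paginated JSON: pagination info in "meta", forms in "objects".
    payload = response.json()
    print(payload["meta"]["total_count"], "forms available")

A 200 response with a populated "objects" array confirms the same credentials and endpoint configured in the ADF source dataset.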

Navigate to the file location in ADLS to confirm that the JSON file was saved.

We have successfully saved raw data from CommCare to ADLS with these steps.

Processing Data with Azure Databricks and Moving to SQL Server

Create a new notebook in your Databricks workspace, set up a connection to the JSON file in Data Lake Storage, and process it for copying to SQL Server.

The series of Spark jobs below walks through processing the data and copying it to a SQL database.

Step 1: Setup the connection with ADLS blob container with the JSON data
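
One common approach is to set the storage account key in the Spark configuration, reading it from a Databricks secret scope. The storage account name and secret scope below are hypothetical:

    # Hypothetical storage account and secret scope; replace with your own.
    spark.conf.set(
        "fs.azure.account.key.commcarestorage.dfs.core.windows.net",
        dbutils.secrets.get(scope="adls-secrets", key="storage-account-key"),
    )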

Step 2: Inspect raw JSON data from API pull
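
A sketch of reading the file saved by ADF, assuming the hypothetical container and path below:

    # "multiline" is needed because the API response is a single JSON object.
    raw_df = spark.read.option("multiline", "true").json(
        "abfss://raw-json@commcarestorage.dfs.core.windows.net/commcare/forms.json"
    )
    raw_df.printSchema()  # expect a "meta" struct and an "objects" array
    raw_df.show(truncate=False)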

Step 3: Explode the objects array to access nested objects and fields (all form data)
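
The Form Data API returns forms in an "objects" array; exploding it yields one row per submitted form:

    from pyspark.sql.functions import col, explode

    # One row per form submission.
    forms_df = raw_df.select(explode(col("objects")).alias("form"))
    forms_df.printSchema()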

Step 4: Flatten the nested JSON and identify the fields for loading into SQL
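
The fields worth extracting depend entirely on your form design; the selections below are hypothetical examples of pulling top-level metadata and a nested case ID:

    from pyspark.sql.functions import col

    # Hypothetical field names; adjust to match your form's structure.
    flat_df = forms_df.select(
        col("form.id").alias("form_id"),
        col("form.received_on").alias("received_on"),
        col("form.form.case.`@case_id`").alias("case_id"),
    )
    flat_df.show(truncate=False)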

Step 5: Explode the arrays for each field and flatten the data so that it outputs individual records per row in SQL
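
For repeat groups (arrays within a form), a second explode produces one row per entry. The "visit" repeat group and its fields below are hypothetical:

    from pyspark.sql.functions import col, explode_outer

    # explode_outer keeps forms that have no entries in the repeat group.
    records_df = forms_df.select(
        col("form.id").alias("form_id"),
        explode_outer(col("form.form.visit")).alias("visit"),
    ).select(
        "form_id",
        col("visit.date").alias("visit_date"),
        col("visit.notes").alias("visit_notes"),
    )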

Step 6: Copy data to SQL
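
A sketch using Spark's built-in JDBC writer; the server, database, table, and secret names are placeholders:

    # Hypothetical server and database; keep credentials in a secret scope, not in code.
    jdbc_url = (
        "jdbc:sqlserver://commcare-sql.database.windows.net:1433;"
        "database=commcare_db;encrypt=true"
    )

    (records_df.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.commcare_visits")
        .option("user", dbutils.secrets.get(scope="adls-secrets", key="sql-user"))
        .option("password", dbutils.secrets.get(scope="adls-secrets", key="sql-password"))
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .mode("append")
        .save())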

The final output in the SQL database is a structured table with one record per row.

We have successfully transformed and copied CommCare data to a SQL database with these steps.

Validation and Monitoring

  • Validate Data in Azure SQL Server:

    • Perform validation checks to ensure the data in SQL Server is accurate and complete.

    • Compare the processed data with the raw data to verify the transformations (a count-comparison sketch follows this list).

  • Set Up Monitoring and Logging:

    • Configure monitoring and logging in Azure Data Factory and Databricks.

    • Use Azure Monitor and Databricks logs to track pipeline performance and address any issues promptly.
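
As one example of a validation check, the sketch below reads the table back over JDBC and compares the number of distinct forms loaded into SQL with the number of forms extracted in Databricks; it reuses the hypothetical jdbc_url, secret names, and forms_df from the earlier notebook cells:

    # Read the loaded table back (jdbc_url as defined in Step 6).
    sql_df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.commcare_visits")
        .option("user", dbutils.secrets.get(scope="adls-secrets", key="sql-user"))
        .option("password", dbutils.secrets.get(scope="adls-secrets", key="sql-password"))
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load())

    loaded_forms = sql_df.select("form_id").distinct().count()
    extracted_forms = forms_df.count()
    assert loaded_forms == extracted_forms, (
        f"Expected {extracted_forms} forms in SQL, found {loaded_forms}"
    )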

Conclusion

By following these steps, you can efficiently move data from CommCare into the Azure ecosystem, leveraging familiar tools and processes to manage your data pipeline. This tutorial demonstrates one specific ETL workflow for connecting CommCare to the Azure ecosystem.