Data, ETL, Azure, ADF
April 20, 2021
One of the biggest challenges we face working in big data environments is moving and structuring data. It has been said that as much as 80% of a data professional’s time is spent doing Extract-Transform-Load (ETL), leaving only 20% of their time to do the important work of Analytics and Insights. Modern tools such as Azure Data Factory (ADF) seek to flip those percentages so that data professionals can spend more time providing value.
Here at Causeway Solutions, we have been leveraging ADF for several years and have succeeded in lowering the time spent performing ETL even further. According to Microsoft, ADF is a “fully managed, serverless data integration solution for ingesting, preparing, and transforming all your data at scale."
In this article, we will focus on a simple strategy you can use to improve the reusability of the ADF Pipelines you build for data movement.
Datasets are the primary means of describing data for both sources and sinks. A Dataset's properties are determined by the type of Linked Service it connects to. For instance, an Azure SQL Dataset will include properties such as Schema and Table, while a Blob Storage Dataset will include properties such as Container, Directory, and File.
Most Datasets also have the ability to define a Schema. This is great if you need to access the columns during processing or if you want to enforce a particular schema. In those cases, Datasets tend to be highly focused or tied to a specific data source (like a SQL Table or CSV file). In many scenarios, however, like moving data from an SFTP site to Blob Storage, a schema is not necessary for the given ETL operation. Whether you need a schema or not, parameterizing a Dataset will make it reusable across a variety of activities and scenarios. Reusable Datasets will limit the number of ADF resources you need to manage and help simplify your overall solution.
The most efficient means of moving data files in their natural state is binary transfer, so ADF includes a Binary Dataset type. Binary does not have a schema and does no data transformation or translation, so it is ideal for moving files from one place to another, even if the service types are completely different.
Start by creating a Binary Dataset and connecting it to a Blob Storage Linked Service:
Next, open the Parameters tab and add the necessary parameter variables:
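If you prefer to work with the Dataset's JSON definition instead of the UI, the same parameters can be declared directly in the `parameters` section. A sketch of that section is below; the parameter names Container, Directory, and FileName are our choices for this example, not requirements:

```json
"parameters": {
    "Container": { "type": "string" },
    "Directory": { "type": "string" },
    "FileName":  { "type": "string" }
}
```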
Next, switch back to the Connection tab. If you click inside one of the File path boxes, you’ll expose the “Add dynamic content” option:
Click on the “Add dynamic content” link to open the editor. Under the “Parameters” section, select the related item:
Click on the “Finish” button to return to the Dataset editor, and the parameter reference will appear in the box:
Repeat this to reference the rest of the parameters:
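Behind the UI, each parameter reference is stored as an expression in the Dataset's JSON definition. A sketch of what the resulting file path section looks like for a Binary Dataset on Blob Storage, assuming the parameter names used above:

```json
"typeProperties": {
    "location": {
        "type": "AzureBlobStorageLocation",
        "container":  { "value": "@dataset().Container", "type": "Expression" },
        "folderPath": { "value": "@dataset().Directory", "type": "Expression" },
        "fileName":   { "value": "@dataset().FileName",  "type": "Expression" }
    }
}
```

The `@dataset().ParameterName` syntax is how ADF expressions reach the Dataset's own parameters at runtime.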
NOTE: While this example parameterizes every value, doing so is certainly not required. For instance, if you are working against a known container, you could hardcode the container value while parameterizing the directory. You could also reference only the container and directory to create a folder-level Dataset.
To demonstrate this, we’ll build the ADF version of “Hello World” - copying a blob from one Blob Storage container/directory to another container/directory in the same account using Copy Activity. Another great use case is downloading files from an SFTP server to Blob Storage. In our version, thanks to parameters, we will be able to use the same Dataset for both the Source and the Sink.
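As a sketch of what that Copy Activity looks like in pipeline JSON, the same parameterized Dataset is referenced as both the input and the output, with different parameter values on each side (the Dataset name ParamBinaryBlob and the container, directory, and file values here are illustrative):

```json
{
    "name": "CopyBlobToBlob",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "BinarySource" },
        "sink":   { "type": "BinarySink" }
    },
    "inputs": [{
        "referenceName": "ParamBinaryBlob",
        "type": "DatasetReference",
        "parameters": { "Container": "input",  "Directory": "landing",   "FileName": "hello.txt" }
    }],
    "outputs": [{
        "referenceName": "ParamBinaryBlob",
        "type": "DatasetReference",
        "parameters": { "Container": "output", "Directory": "processed", "FileName": "hello.txt" }
    }]
}
```

Because the parameters are supplied at the activity level, one Dataset definition serves both roles; adding a second Dataset for the sink would only duplicate resources.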
Executing the pipeline will now copy the Blob from one container/directory to a different container/directory.
While brief, this introduction should serve to get you started with parameterized Datasets. When combined with pipeline parameters, variables, expressions, and other activities such as Get Metadata, they can go a long way towards making your ADF patterns more robust and reusable. Until next time, Happy Coding!
Ready to learn more? Contact Causeway Solutions to get started!