How to Create an Azure Data Lake Linked Service in Azure Data Factory V2

Active Directory, ADF, ADLA, Azure, Data Factory, Data Lake, U-SQL
April 30, 2018
Here at Causeway, we’ve been working with Azure Data Lake (ADL) as our big data analytics solution for a while now. We store national-scale data from a variety of sources, and over time we have developed analytics routines and workloads that push the boundaries of what a single SQL Server can comfortably handle. By moving these workloads to ADL, we have been able to execute jobs in minutes that previously took hours or even days in SQL Server. Needless to say, we’ve become quite enamored of Data Lake: our collective professional opinion is that ADL is simply amazing. That does not mean, however, that it is without its challenges.

One of those challenges we ran into recently was creating an Azure Data Lake Linked Service to execute U-SQL tasks in an Azure Data Factory pipeline. It took quite a bit of research and trial and error to get it right, so hopefully this post will help you get over the hurdle a bit faster. If you are reading this article, odds are good you are already familiar with Azure’s Data Lake and Data Factory services, but for completeness we’ll begin with a brief overview of each.
Azure Data Lake is Microsoft’s cloud hosted big data platform. ADL is made up of two major and independent components. First is Azure Data Lake Storage (ADLS), a massive scale, Active Directory secured, HDFS-compliant storage system. The second component, and the one we at Causeway are chiefly concerned with, is Azure Data Lake Analytics (ADLA), a massively parallelized analytics job execution service. While ADLA supports a number of languages such as R and Python, the primary language for ADLA is U-SQL, a hybrid language that merges T-SQL and C# into a single syntax. The power and flexibility of U-SQL is one of ADLA’s most compelling features, and this article will focus on preparing to execute U-SQL tasks from Azure Data Factory.
Azure Data Factory (ADF) is Microsoft’s cloud hosted data movement and orchestration tool. ADF is used to define, manage, and monitor ETL and ELT pipelines. ADF offers a variety of scheduling options and supports connections to numerous data sources, including both cloud and on-premises options. This article will focus on ADF V2, which is superior to V1, but still in Preview. ADF V2 development is not currently supported in Visual Studio, so our examples will use the online Data Factory tooling. In the Azure Portal, I’ve created a new Data Factory and opened the “Author & Monitor” window:
As an orchestration tool, ADF manages activities across a variety of data sources such as Azure Blob Storage, Azure SQL, Azure Data Lake, HDInsight (Hadoop), SSIS, on-premises systems, and more. To communicate with these sources, ADF requires a connection definition known as a Linked Service. Each source type has its own requirements, which makes sense when you consider that connecting to a Storage Account has different requirements than connecting to an on-premises SQL Server.
For most source types, creating a Linked Service is pretty straightforward. On the surface, this is no different for Azure Data Lake. To get started, expand the “Data Lake Analytics” menu in the pipeline editor and drag a U-SQL activity (the only option) onto the pipeline:
In the lower pane, select the “ADLA Account” tab:
This will open the “New ADLA Linked Service” blade:
Most of these options are self-explanatory and simple to configure. The “From Azure subscription” selection mode is particularly useful, provided you have access to the subscription that hosts the Data Lake. Select the appropriate “Azure subscription” and “Data Lake Analytics account name”.
The last two options in the blade are far less self-explanatory, “Service principal ID” and “Service principal key”:
These values are required to create the Linked Service, which is required to allow the Data Factory to execute U-SQL Scripts in Azure Data Lake. There is quite a bit of back end configuration required to acquire these values.
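For reference, what you are filling in through the blade is ultimately stored by ADF as a JSON linked service definition. The sketch below builds roughly that document; the type name `AzureDataLakeAnalytics` and its `typeProperties` follow the documented ADF V2 schema, while every concrete value is a placeholder of ours:

```python
import json

# Approximate shape of the JSON definition ADF V2 stores for an ADLA
# Linked Service. All values below are placeholders, not real identifiers.
linked_service = {
    "name": "AzureDataLakeAnalyticsLinkedService",
    "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
            "accountName": "ouradlaaccount",               # ADLA account name
            "servicePrincipalId": "<application-id-guid>", # from the AAD app
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<generated-key>",                # copied at key creation
            },
            "tenant": "<aad-tenant-id>",
            "subscriptionId": "<azure-subscription-id>",
            "resourceGroupName": "<resource-group>",
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

Seeing the definition this way makes it clear why the blade insists on the two service principal values: without them, ADF has no identity with which to reach ADLA.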
A Service Principal is essentially a proxy identity for a service: it allows one Azure service to authenticate to another. In our case, we need a service principal to allow ADF to connect to ADLA. Given that, you’d think you could go to the Azure Portal and find a “Create Service Principal” feature, but no such thing exists. In fact, when we first started this process, there was precious little documentation to be found regarding Service Principals, though a recent addition to the documentation is very useful. That documentation is similar to what I’m about to outline, but since it isn’t specific to Data Lake, there are a few changes you’ll need to connect ADF to ADLA. Creating a service principal boils down to three steps: creating an Azure Active Directory (AAD) application, generating a key in the application, and granting the application permissions.
We’ll start by registering a new application in AAD. Go to the Azure portal and select “Azure Active Directory”, and then “App registrations”:
This will open the App registrations blade. Select “New application registration”:
This will open the “Create” blade. Enter a name for the app and a Sign-on URL. Application type should be “Web app/API”. Press the “Create” button (not shown below). NOTE: the URL can be (and most likely will be) a placeholder; it does not need to point to anything real:
Once the app is created, you should be taken to the Registered app blade. If not, go back to the list of registered apps in your AAD and open the new app.
Remember the “Service principal ID” value we needed for the ADLA Linked Service? Copy the “Application ID” value in the registered app and paste it into the “Service Principal ID” box in the Linked Service:
Back in the registered app blade, open the Settings and select the Keys button:
This will open the “Keys” blade:
Under the “Passwords” section, enter a Description (the name of the key) and select an expiration value, then press “Save”:
When you save the key, it is VERY IMPORTANT that you copy the generated key value immediately as this is the ONLY TIME the key will be visible:
NOTE: I recommend pasting this into a text file in case something goes wrong in the rest of the setup and you need to access it again. If you lose it, don’t fret, you can always generate another Key.
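If you do stash the key in a local file while you work, it’s worth keeping it readable only by you. A minimal sketch (the file name and helper functions are our own, not part of any Azure tooling):

```python
import os

def stash_secret(path: str, secret: str) -> None:
    """Write a secret to a file readable/writable only by the current user."""
    # Create (or truncate) the file with owner-only permissions (0o600)
    # atomically, rather than chmod-ing after the fact.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(secret)

def read_secret(path: str) -> str:
    """Read a previously stashed secret back."""
    with open(path) as f:
        return f.read()

# Example with a dummy value standing in for the real generated key.
stash_secret("sp_key.txt", "example-service-principal-key")
print(read_secret("sp_key.txt"))
```

Delete the file once the Linked Service is working; the real key remains retrievable only by generating a new one.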
This value is the “Service principal key” required for the Linked Service:
At this point, you may think you have finished the Linked Service, but if you press the “Test Connection” button, the following error will let you know we aren’t quite there yet:
Return to the registered app, select “Settings” and open the “Required permissions” blade:
Click the “Add” button to open the “Add API access” blade:
Click “Select an API” to open the “Select an API” blade. Scroll down to locate and select “Azure Data Lake”, then press “Select”:
This will open the “Enable Access” blade. Check the “DELEGATED PERMISSIONS” box and press “Select”:
This will return you to the “Add API access” blade, press “Done”. Azure will spin for a moment as it adds the necessary permissions. Once it is complete, you should see Azure Data Lake in the Required permissions blade:
We’re done with the registered app now, so you can close those blades.
We’re almost there. So far we’ve created the registered app (Service Principal), generated a Service Principal Key, and given it Data Lake API access. Our final step is to grant the registered app permissions on the specific Data Lake, by giving it developer permissions on the Data Lake resource. Navigate to the Azure Portal page for the Data Lake Analytics account and select “Access control (IAM)”. Press the “Add” button:
This will open the “Add permissions” blade. For “Role”, select “Data Lake Analytics Developer”. For “Select”, enter the name of the registered app you created above and click it in the results to select it. Then press “Save”:
Now we need to add the registered app as a user of the Data Lake instance itself. If this seems confusing, just remember that IAM is an Azure resource-level (external) permission, so now we are creating a Data Lake-level (internal) permission.
Go to the Azure Portal page for the Data Lake Analytics account and select “Add user wizard” on the left navigation bar to open the “Add user wizard” blade. This should also automatically open the “Select user” blade. Under “Select”, search for and select the registered app name created above. Press the “Select” button:
This will advance you to the “Select a role” step. Select the “Data Lake Analytics Developer” role:
This will automatically advance you to the “Select catalog permissions” step. Change the Permissions for the desired Database from “None” to “Read and write” and press the “Select” button:
NOTE: Do NOT change the Catalog or master Database permissions!
This will automatically advance you to the “Select file permissions” step. Change all Account permissions from “This folder only” to “This folder and all children” and press the “Select” button:
NOTE: All the Read, Write, and Execute boxes should already be checked. If not, check them all.
This will advance you to the “Assign selected permissions” step. Press the “Run” button to assign the permissions:
The job may run for a while, depending on how many objects have to be updated: do NOT close the blade until the task completes. When it does, press the “Done” button:
You are (finally!) finished configuring the Service Principal, so you can close all the portal blades and return to the Data Factory.
If we return now to the Linked Service, we should be able to successfully test the connection:
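You can also sanity-check the service principal outside of ADF by requesting an AAD token with the standard OAuth2 client-credentials flow. The endpoint pattern and the `https://datalake.azure.net/` resource URI are the documented ones for Data Lake; the helper function itself is just our sketch, and the placeholder arguments are not real values:

```python
from urllib.parse import urlencode

def build_token_request(tenant_id: str, client_id: str, client_key: str):
    """Build the URL and form body for an OAuth2 client-credentials token
    request authenticating a service principal for Data Lake access."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,        # the "Service principal ID"
        "client_secret": client_key,   # the "Service principal key"
        "resource": "https://datalake.azure.net/",
    })
    return url, body

url, body = build_token_request("<tenant-id>", "<application-id>", "<key>")
print(url)
```

POSTing that body to the URL should return a JSON bearer token if the principal and key are valid, which is roughly what the “Test Connection” button verifies for you.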
If it doesn’t work, you may have had the Linked Service blade open too long, in which case you’ll need to close it and recreate the Linked Service. That happened to me as I was writing this tutorial – thankfully I had saved my Service principal ID and Key in a text file while I was working!
There’s no doubt that this process seems long and convoluted. The good news is that once you’ve done it a couple of times, it becomes tolerable. And at least you can refer back to this handy guide! I hope you find it useful.