THAT ESCALATED… REGISTER USQL ASSEMBLIES

Bill Skelly

Data, DataWarehouse, USQL

February 14, 2017

Well, we thought the next “chapter” of this USQL / Data Lake journey would be table query performance… but we quickly escalated from “hey let’s load this table from a small csv” to “WHERE DID THAT ASSEMBLY I JUST REGISTERED GO AND HOW DO I REFERENCE IT?!”.

In layman’s terms, we went from having another cup of coffee to opening the box of wine.

Better yet – given the lack of discoverable documentation on what we just did, it’s time to write this down again.

CREATE TABLE

So here’s the easy part… creating the table.

USE DATABASE Superior;
DROP TABLE IF EXISTS CorePartyModels;
CREATE TABLE IF NOT EXISTS CorePartyModels
(
CS_RegID Guid,
State string,
DateKey int?,
Date DateTime,
Client string,
Description string,
RParty float?,
DParty float?,
…,
INDEX idx_State
CLUSTERED (State)
DISTRIBUTED BY ROUND ROBIN);

So… technically that works. A table was created. We did, however, discover that in subsequent runs it doesn’t actually DROP the table NOR create a new one. But that quickly became the least of our problems so let’s flag that to come back and fix later.

Keys things to remember:

  1. It “worked” in the sense of working the first time. As I just noted, we have to go back and figure out why it doesn’t drop.
  2. You need to set an index field. We chose state since that’s how this data is best partitioned. Again – we’ll go back and test that.
  3. The data must be distributed. Since we don’t have a skew-free way to hash the data inherently, we just distributed based on round robin. Plenty of articles to read on this so no need to elaborate here.

THE CHALLENGE WE RAN INTO…

What would have been easy to do: Create a table that matches our CSV and just ram data into it.

What we chose to do: Create the table how we envision it looking down the road and nuance the data into it.

The table above contains a bunch of additional model fields in the ellipses section that it’s not worth bothering people with so lets just focus on the two model fields present: RParty and DParty. For the sake of argument, it’s possible for a person to have a scored value for RParty BUT NOT for DParty (or vice-versa).

To account for this, we wanted to evaluate the field that SHOULD contain this data in the CSV and, if present, convert the string to a float and insert. If not present, or if the string didn’t convert to a float we wanted to catch that error and return a null value.

HOW WE TACKLED…

We quickly realized that this would be one of our first use cases for the merged U-SQL and C# language that exists in Azure Data Lake. Some quick googling (bing?!) lead to very little information but it became clear we need to rely on the float.TryParse() method in c# to evaluate the data being passed in.

To invoke this method, we need to declare a float variable as output from the method that we can return in the case of a successful parse. We can’t do this inline with our U-SQL, so we need to create a user defined function. This is where we got lost… so, without dragging you along for our journey, here are the steps:

STEP 1: ADD A USQL CLASS LIBRARY TO OUR SOLUTION

Easy enough. Add -> New -> USQL Class Library

STEP 2: CREATE A PUBLIC CLASS AND A METHOD TO EXECUTE THE PARSE SCRIPT

public class USQLConversions
{
public static float? superior_TryFloatParse(string floatParse)
{
float outputfloat;
if (float.TryParse(floatParse, out outputfloat))
return outputfloat;
else
return float.NaN;
}

STEP 3: REGISTER THE ASSEMBLY TO THE AZURE DATA LAKE ANALYTICS DATABASE

This took a while.

Right click on the project -> Register Assembly -> Select appropriate ADLA resource and database -> Register

Being a administrative task, this simply consumes 1 ADLAU and executes pretty quickly based on above.

STEP 4: CONSUME THE ASSEMBLY IN YOUR SCRIPT

The assembly didn’t really register the first time we tried this. I mean, it said it did, but it didn’t.

We refreshed the ADLA portal several times in our server explorer and eventually the registered assembly appeared. Once it did, we could reference the assembly in our script. To do so, rather than adjusting the C# script associated with the USQL query (which we assumed was the correct path to take) you actually have to explicitly define the reference at the top of the USQL query itself:

REFERENCE ASSEMBLY [AssemblyName];

Once we did that we were able to reference the assembly inside of our USQL query itself:

SELECT CS_ID
, StateAbbreviation
, ModelDateKey
, RowCreateDateTime
, “ClientName” AS Client
, “Client Test” AS Description
,USQLClassLibrary.USQLConversions.superior_TryFloatParse(RParty) AS RParty
,USQLClassLibrary.USQLConversions.superior_TryFloatParse(DParty) AS DParty
,...

That allowed us to get the data converted and into the table… and took enough trial and error that we needed to get it written down so we could follow this breadcrumb back to success when we’re ready to ACTUALLY tackle this table.