As with all database technologies, performance partially comes down to how much horsepower you can throw at a problem. At some point we’ll have to deal with distribution, skew and partitioning of the inbound files, but the starting point for optimization is looking at the parallel processing capability of Data Lake Analytics – measured in ADLAUs (Azure Data Lake Analytics Units), or “parallelism,” depending on either whom you ask or which day of the week you ask, apparently.
The parallelism setting (I’m just going to stick with the term ADLAU, although I can’t tell if that has already been deprecated) is simply a measure of how many concurrent vertices you allow the query to consume… and, as such, pricing is a combination of TIME and ADLAUs used. The challenge – at least one we haven’t solved yet – is that you cannot consume more ADLAUs than the maximum number of vertices at a given stage of the execution plan… which can’t be determined until execution. So for now, set it and forget it.
But we did do some testing on performance and costs associated with the additional horsepower, and here’s what we found:
The data being analyzed is a basic voter registration database of 190MM registered voters in the United States distributed across 51 CSV files of varying size (essentially one per state, plus DC). This results in a dataset totaling 198.41GB, with individual files ranging from 17.6GB (California) to about 300MB (North Dakota). Hardly a big dataset, and hardly all the data that exists for these states – but hey, let’s start somewhere.
Once we established that single-file baseline, we joined in a single set of predictions for each voter: 51 CSV files (one per state) consisting of the same 190MM registered voters, with a total data size of 139.84GB.
Each of these queries has its base data stored in an Azure Blob w/ Standard Performance Read-Access Geo-Redundant Storage. We didn’t use Azure Data Lake Storage because, as of this writing, very little interacts with this type of storage, and we hope to utilize PolyBase and SQL Server 2016 to tie all this together (spoiler alert for later!).
We started with the following queries:
1. Give me a count of voters by political party for each state off of a national EXTRACT, using a wildcard to iterate the blob CSVs.
2. Give me a count of voters by political party for one state (CA) off of a single-CSV EXTRACT in blob.
3. Query the national EXTRACT to get a count of political parties in CA Congressional District 24, using a wildcard to iterate the blob CSVs.
4. Query the single-state EXTRACT to get a count of political parties in CA Congressional District 24.
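For reference, the first (wildcard) query looks roughly like this in U-SQL. This is a hypothetical sketch – the container names, paths, and column schema here are illustrative, not our actual layout:

```sql
// Hypothetical sketch of query 1: EXTRACT across all 51 state CSVs in
// blob storage using a file-set pattern, then count voters by state
// and party. {State} becomes a virtual column populated from each
// file name that matches the pattern.
@voters =
    EXTRACT VoterId string,
            Party string,
            CongressionalDistrict string,
            State string // virtual column from the file-set pattern
    FROM "wasb://voters@ourstorageaccount/{State}.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@counts =
    SELECT State,
           Party,
           COUNT(*) AS VoterCount
    FROM @voters
    GROUP BY State, Party;

OUTPUT @counts
TO "/output/party_counts_by_state.csv"
USING Outputters.Csv(outputHeader : true);
```

For the single-state versions you’d point the EXTRACT at one path (the CA file) instead of the file-set pattern, and the district queries just add a filter along the lines of `WHERE CongressionalDistrict == "24"`.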
We then ran essentially the same queries with the modeling data joined:
1. Give me a count of likely gun owners by political party for each state off of a national EXTRACT, using a wildcard to iterate the blob CSVs.
2. Give me a count of likely gun owners by political party for one state (CA) off of a single-CSV EXTRACT in blob.
3. Query the national EXTRACT to get a count of likely gun owners by political party in CA Congressional District 24, using a wildcard to iterate the blob CSVs.
4. Query the single-state EXTRACT to get a count of likely gun owners by political party in CA Congressional District 24.
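The joined variant, again as a hypothetical sketch: we’re assuming here that the prediction files share a `VoterId` key with the registration files and carry some model score whose threshold defines “likely gun owner” – the 0.5 cutoff and all names are illustrative:

```sql
// Hypothetical sketch of the joined query: voters joined to the
// per-state prediction files on VoterId, counting likely gun owners
// by state and party. Paths, columns, and the score threshold are
// all illustrative.
@voters =
    EXTRACT VoterId string,
            Party string,
            State string // virtual column from the file-set pattern
    FROM "wasb://voters@ourstorageaccount/{State}.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@predictions =
    EXTRACT VoterId string,
            GunOwnerScore double,
            ModelState string // virtual column from the file-set pattern
    FROM "wasb://models@ourstorageaccount/{ModelState}.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@gunOwners =
    SELECT v.State,
           v.Party,
           COUNT(*) AS LikelyGunOwners
    FROM @voters AS v
         INNER JOIN @predictions AS m
         ON v.VoterId == m.VoterId
    WHERE m.GunOwnerScore > 0.5
    GROUP BY v.State, v.Party;

OUTPUT @gunOwners
TO "/output/gun_owners_by_state_party.csv"
USING Outputters.Csv(outputHeader : true);
```

It’s this join-plus-aggregate shape that matters for what comes next.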
Time to go to work…
[Go grab a glass of wine or something… we did]
So, we really did work. And… ADLAUs do matter… to an extent.
First and foremost:
Don’t use 1 ADLAU. Just don’t. After watching Visual Studio almost catch fire, we abandoned that and started our benchmarking at 5 instead.
Second, here’s our work:
Extract From CSV w/ No Join Statement
Extract From CSV w/ Join Statement
So, there’s definitely a law of diminishing returns in the cost and time trade-off. In the first set of queries, going from 5 to 10 ADLAUs cuts execution time at least in half in each instance without adding to the cost of the query, and we see the same benefit jumping from 10 to 20 ADLAUs. However, jumping from 20 to 50 doesn’t show tremendous time savings (except for the third query), and we begin to see significant spikes in query cost. (Note: we’re basing cost here on the “off-the-shelf” 3 cents per analytical unit-hour displayed on their website, so cost ≈ ADLAUs × hours × $0.03 – e.g. 10 ADLAUs for 40 minutes and 20 ADLAUs for 20 minutes both bill the same $0.20. Go ahead and negotiate a better rate and redo the math above yourself if you need.)

In fact, there was essentially ZERO difference between 20 and 50 ADLAUs in queries 2 and 4, which is largely due to vertex allocation. Queries 1 and 3 had over 50 vertices, so they utilized all 50 ADLAUs in the final runs, whereas queries 2 and 4 had only 19 vertices and couldn’t use any more resources once we passed that ADLAU count. Presumably, no matter how many more ADLAUs we added, we’d never see an improvement in either query, since we had already hit the maximum vertex count in both.
Once we applied the join: HOLY THROTTLING LIMIT
Well, that was unexpected. We have two paths, each evaluating 51 streams, which aligns well with how the data is distributed in blob storage. There are 167 vertices on the left and 268 on the right, so our hopes are high that additional ADLAUs will help… BUT EVERYTHING FUNNELS DOWN TO A SINGLE VERTEX ON THE FINAL NODE.
In fact, for queries 1 and 3 the SV156_Combine step takes 20+ minutes regardless of how many ADLAUs you throw at it. So additional ADLAU allocation can speed up the time TO GET TO this node, but cannot speed up the time TO GET THROUGH this node. Ouch.
That throttling really kills any cost benefit we receive as we add more ADLAUs.
But I guess there is a silver lining: query 1 w/ the join took about 42 minutes to run in the Data Lake, while using the same tables in the same structure with the same syntax in SQL Server 2016 took 63 minutes. Throwing more ADLAUs at it, we got the Data Lake run down to 26.4 minutes for the low-low price of 66 cents. The SQL Server box runs about $4.16 per hour (roughly the length of this query), and I still have to pay for it while I’m over here typing this post and it’s presumably just sitting there idle… I can’t say the same about Data Lake.
I think we’re onto something here… time to get this data loaded into tables instead of extracts and see how that performs.