Joel Cochran
Data Engineering, SQLDW, Microsoft Fabric
September 9, 2024
As I celebrate my 7th anniversary at Causeway Solutions, I'm reflecting on the data engineering and technology changes I’ve experienced during my tenure.
Causeway Solutions was a customer of mine when I worked at Microsoft, and I was able to see first-hand the challenges they had with very large datasets. At the time, they were using large SQL Server VMs. I introduced them to PolyBase, which cut their ingestion processing time in half. This was one of my first real experiences in large-scale data engineering, and it was very rewarding. Soon after, I joined Causeway Solutions full-time.
Causeway Solutions had just begun adopting Azure Data Lake Analytics (ADLA), which was a Microsoft proprietary Massively Parallel Processing (MPP) offering in Azure. The programming language in that system was U-SQL, a mashup of C# and SQL. I remember really liking that language, but, apparently, I'm in the minority on that one. In fact, I once asked the Synapse team to bring it back, but the request was politely declined.
ADLA had its share of problems, but it was the first time we were able to really process large amounts of text files in Azure Blob Storage. Among ADLA's challenges, the complete lack of an interactive mode was the most problematic for us. Plus, there was the time I personally crashed the ADLA service by implementing a Regular Expression that was too large for the deployed version of .NET, but I digress.
Azure SQL Data Warehouse (SQLDW) Gen1 was too limited, but when Gen2 was released, it finally had the features we needed. We spent a fair amount of time and energy migrating our workloads to SQLDW, but it paid off in the long run. This was the first time we were able to run interactive queries over our data and get results in a reasonable timeframe. We learned a lot about performance tuning, partitioning, and distribution schemes in the process.
A couple of years into this phase, Azure Synapse arrived, and we became early (pre-GA) adopters. It was the natural successor to SQLDW, which was rebranded "Dedicated Pools" in Synapse but was really SQLDW Gen3.
As we learned more about Serverless SQL and Spark Notebooks, we quickly gravitated to a purely storage-based infrastructure. We rely heavily on Pipelines, Data Flows, and Notebooks to process frequent but irregular updates from numerous sources. Synapse has worked extremely well for us and serves today as our primary Data Engineering platform.
That does not mean we are set by any stretch! For the past year, we have worked with Microsoft Fabric, primarily in support of our Power BI and reporting needs. While the promise of Fabric is not yet fully realized, it is exciting to see what may be possible in the near future. As we push further into the Fabric realm, I fully expect more of our processes to migrate to "Fabric first" solutions.
That's five major systems in seven years. It has been a great time full of changes and challenges. I'm excited to see what comes next!
Causeway Solutions delivers the analytics and strategic data insights our clients need for successful marketing plans, business decisions, and political campaigns.