The analytics world seems like it’s at a turning point. In the last two years, new technologies like Snowflake and BigQuery have grown to prominence, finally delivering on the promise of a “single source of truth”, and forming large and excited communities around them.
Meanwhile, thanks to the likes of dbt & Fivetran, workflow and process improvements have allowed data engineers and analysts to spend less time fixing pipelines and more energy driving real outcomes at their organisations.
As we enter 2021, it feels as if organisations that have successfully adopted the modern data stack are finally in a position to truly harness the value of the data they produce.
Here is part one of our view on the ways we'll get even better at analytics this year.
In-database processing continues to grow
ELT leaves ETL for dead
The modern data revolution began with the rise of cloud data warehouses, and the removal of constraints on both storage and scalability. Instead of the traditional pattern of E-T-L, where you extract, transform and then load a minimal amount of data into your warehouse, modern data practitioners use E-L-T, replicating whole copies of source systems into their warehouses and running transformations on top.
There is a perfectly logical stack of reasons why, most concisely described here, but the biggest benefit is agility. With ETL, changing your analysis meant starting the entire process from the beginning. Now, transformation for analysis can almost happen on demand. If you’re moving to a modern data platform and not ELT-ing yet, we reckon you should.
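To make the pattern concrete, here is a minimal sketch of ELT in Python. It uses the standard library's SQLite as a stand-in for a cloud warehouse, and the table, view and column names (raw_orders, customer_revenue and so on) are made up for illustration. The raw rows are loaded untouched, and the transformation lives in a view on top of them.

```python
import sqlite3

# SQLite stands in for a cloud warehouse; all names below are hypothetical.
conn = sqlite3.connect(":memory:")

# Extract + Load: replicate the source rows as-is, with no reshaping on the way in.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
source_rows = [
    (1, "acme", 1250, "paid"),
    (2, "acme", 800, "refunded"),
    (3, "globex", 4300, "paid"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", source_rows)

# Transform: a view defined on top of the raw copy. Changing the analysis
# means redefining this view, not re-running the extract and load steps.
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    """
)

customer_revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(customer_revenue)
```

Because the raw table keeps every row, including the refunded order, a later question such as "what is our refund rate?" only needs a new view rather than a new pipeline.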
ETL is dead going into your warehouse and it’s equally irrelevant at the other end of the pipeline. However, modern data teams are still forced to hark back to old methods when data leaves the warehouse destined for operational systems. This remains a challenge and hampers all of those hard-won gains. Omnata is trying to solve this last piece.
Pre-built dbt packages for SaaS apps proliferate
In-database transformations led to the emergence of arguably one of the biggest and most organic open source communities in recent years: dbt, a project launched by Fishtown Analytics. The dbt Slack community just hit 10k members and their first online conference, Coalesce, drew thousands of modern data practitioners from around the world. It doesn’t appear to be slowing down.
Within this community, users openly share dbt packages and tips on how to best normalise and model data for analysis. Fivetran recently released a set of free dbt packages to transform the data after load, which makes total sense if you’re a best-in-class replication technology. We think other vendors will follow suit, releasing dbt packages for their own apps or exposures to support their related platforms.
The easier and faster these repetitive tasks become, the more time data engineers can spend on higher-value intelligence, like machine learning.
ML preprocessing moves in-database, too
We see in-database ML feature engineering as a natural extension to these trends. As Tristan of Fishtown Analytics points out, each step of building an ML model often occurs in different systems, creating slowness and adding cost.
In our view, if the architecture is right and the use case is common, then you shouldn’t need to transfer data out of your warehouse for preprocessing. Why move data out of your single source of truth if you don’t have to?
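As a rough illustration of what preprocessing inside the database can look like, the sketch below runs two common feature engineering steps, min-max scaling and a one-hot style flag, entirely in SQL. It again uses SQLite in place of a real warehouse, and the customers table and its columns are invented for the example. It is not the API of any particular dbt package, just the underlying idea.

```python
import sqlite3

# SQLite stands in for the warehouse; the table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, plan TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 20.0, "free"), (2, 35.0, "pro"), (3, 50.0, "free")],
)

# Both steps run where the data lives; only the finished features come back.
features = conn.execute(
    """
    SELECT
        c.id,
        -- min-max scaling: map age onto the range 0..1
        (c.age - s.min_age) / (s.max_age - s.min_age) AS age_scaled,
        -- one-hot style encoding of a single category value
        CASE WHEN c.plan = 'pro' THEN 1 ELSE 0 END AS plan_is_pro
    FROM customers AS c
    CROSS JOIN (
        SELECT MIN(age) AS min_age, MAX(age) AS max_age FROM customers
    ) AS s
    ORDER BY c.id
    """
).fetchall()
print(features)
```

In a real warehouse you would also want to guard against a zero range (every age identical), for example by wrapping the denominator in NULLIF(..., 0).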
Omnata co-founder James Weakley explored this territory in his post, Feature Engineering in Snowflake, and also built a dbt package to enable ML preprocessing directly in Snowflake, Redshift or BigQuery. In the spirit of contributing to the analytics community, we’ve open sourced it under Omnata Labs.
If you’re a data scientist or analytics engineer, give it a go! We’d love to get your take on our approach and hear what you’ve used it for.
If the technology is one leg forward, the process improvements are the rest of the dog. Making ML faster and less expensive to develop will give analytics teams more intel to operationalise and deploy. This last mile still has some creases to iron out.
To be continued...