This article is an almost-transcript of a talk I gave at the February 2021 edition of the DataEng meetup; available here.
About 5-6 years ago, I started coming across situations where I was building software that happened to be processing large datasets. I remember having to produce a time-sensitive data extract over hundreds of millions of records, which got me into the Hadoop/Spark space, where you could use those building blocks rather than write something entirely from scratch.
I’ve been in that space ever since, focusing on cloud data warehousing and machine learning, as opposed to business systems. I’m a big fan of open source. As an early Snowflake customer in Australia, I authored a bunch of open source projects to manage Snowflake (see the Snowflake Labs GitHub). I haven’t always been successful at operational analytics projects, but I’ve always learned from the mistakes. In fact, I’ve built a software company around my lived experience.
By the end of this article, regardless of your stack, hopefully you’ll have an appreciation for the general challenges when taking analytics projects into production and some ideas on how to overcome them. Or at the very least, some new things to consider.
What is operational analytics?
I think this is pretty much what most people now mean when they say analytics in business.
"We believe one of the biggest sources that we see clients struggling with is the last mile. That is the gap between great analytic output and actual changed behavior that creates value in the enterprise - whether it's a frontline worker, a manager, or even a machine."
Chris Brahm, Bain Consulting
This quote from Bain is particularly telling. Statistically speaking, according to the big consulting shops, you are likely to encounter challenges that you didn’t expect.
Before we dive in, it would be remiss of me not to recognise that most projects fail or die not because of technical challenges, but because of organisational ones. I haven’t been part of any project where we decided “we just can’t fix this bug, let’s wind it up”. It has always been something more like “we’re not sure we want to resource this in light of other priorities”.
You must lay solid foundations for success with the business decision maker.
Is the opportunity well-understood?
Do the stakeholders agree that it’s a problem worth solving?
Does it deliver significant enough cost savings?
Once you know that, you can have an idea of what your budget could be, and it might be more than just what’s sitting in the technology budget. Too often, technical people evaluate product pricing based solely on what they feel is reasonable or what they already have access to, rather than considering the benefit of the opportunity. The same is true for resourcing. You can always extend your workforce temporarily with contractors, so knowing the overall “size of the prize” might broaden your options.
Develop and nurture strong relationships with decision makers. This ensures they stop perceiving you as a cost centre, and will listen to your suggestions. Of course, you should listen carefully to them, too. The more you understand the costs and revenue of the business, the more impactful your ideas will be.
You also need to get in the mental space of the operational staff of the business. Don’t just assume that once you change their system everything takes care of itself. Test your assumptions first.
Part of this is asking: can the problem be solved economically? Being able to predict something doesn’t always mean you can do something about it.
Don’t forget the owners of operational systems either. As you plan out your data stack and how it fits in with everything else, don’t assume that they’re across it or that they agree with the technical vision. Keeping them looped in will ensure there are no surprises when the time comes to plug into their systems.
The path to now
First up, a short history lesson to set the scene.
This is a standard historical understanding of operational systems vs analytics. You can find hundreds of diagrams like this on the internet.
On one side, there was operational, where business processes were implemented, and on the other, analytics, where trends and patterns were discovered to inform leadership. Clearly this is simplified: both sides are now composed of various APIs (internal and external), serverless tech, and PaaS/SaaS. But you get the idea of the historical delineation.
The systems that back these functions evolved according to their selective pressures. Operational outages are costly, hence there were decades of huge investment into areas like version control, automated testing, etc. This grew to incorporate not only the applications themselves but the infrastructure they run on.
In contrast, in the analytics world, the big problems were inaccuracies in data, disparate data, and the ability to consume data. So the effort went into data warehouse modelling, data quality, and Business Intelligence tools.
So now that we want operational analytics, clearly, there is a need for the right side to catch up to the left in some areas. A bar needs to be met if these systems are to be trusted to occupy important operational roles.
You’re only as good as your weakest link. So where to start?
It’s a large space
Consider how big this ecosystem is, and how many products exist: Matt Turck’s diagram. There are clearly many ways to solve the challenges you’ll face, but we’ll just focus on some basic principles in this article.
The modern data stack
I submit to you, my biased view of the bare bones modern data stack.
Now, we can argue about products (Stitch vs Fivetran, Looker vs Power BI, etc.), but don’t get too hung up on this or we’ll end up back at the diagram above. Similarly, we don’t have time to incorporate specialised tools for security, data governance, observability, etc., nor will I specifically address unstructured data analytics.
I want to talk through the components, and the architecture as a whole, and what to consider from an operational point of view.
From left to right:
- Data acquisition from source systems, assuming an ELT pattern which is now the norm, lands data quickly into a staging area of your warehouse
- Next we have transformation and modelling of the data in the warehouse, which could include merging multiple data sources or potentially external ones
- Then all the customers lining up to feed off this data-centric architecture.
For other apps, we no longer need enterprise middleware, we can continue the data pipeline out to other apps. For third party apps, there are general purpose tools which will replicate data out from Snowflake to their APIs.
But you should first consider whether or not you actually need to move the data out of your warehouse. If you’re serving high-volume external traffic, there’s probably a good reason to, and a lot of third-party apps will require a local copy of the data. But otherwise, if, say, you’re a Snowflake user and can dedicate auto-scaling bandwidth, just use the client library for your language of choice.
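To make that concrete, here is a minimal sketch of serving modelled data straight out of the warehouse. The table and column names are hypothetical, and the connection is passed in (in practice you’d create it with `snowflake.connector.connect`), which also keeps the query logic testable:

```python
# Sketch: serving modelled data directly from the warehouse instead of
# replicating it out. Table and column names are illustrative only.
# In production, `conn` would come from the Snowflake connector, e.g.:
#   conn = snowflake.connector.connect(account=..., user=..., password=...)

def fetch_recommendations(conn, customer_id):
    """Read per-customer recommendations from a modelled table.

    `conn` is any DB-API 2.0 connection (such as one returned by
    snowflake.connector.connect), so this function is easy to unit test.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT product_id, score FROM analytics.recommendations "
            "WHERE customer_id = %s ORDER BY score DESC LIMIT 10",
            (customer_id,),
        )
        return cur.fetchall()
    finally:
        cur.close()
```

Because the connection is injected rather than created inside the function, the same code path runs against a stub in tests and against an auto-scaling Snowflake warehouse in production.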
Then there’s data science and advanced analytics where we’ll aim to make predictions based on the events of the past. The way we operationalise these predictions falls into two categories:
a) Real time, where we don’t have the luxury of waiting for the data pipeline. For example, dynamic information that we’ll show customers on our website.
b) Batch, where the predictions are a data asset that should be stored in the warehouse for reuse elsewhere, particularly by other apps. You want to avoid having long ML pipelines that serve just one purpose and you can also use micro-batches for near real time.
- Last but by no means least, we have the traditional BI use cases. They are still as important as ever for guiding decision makers, but we won’t focus here because there is limited downstream system use.
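For the batch category above, a micro-batch scoring loop can be sketched as follows. The model object and the warehouse I/O functions are placeholders I’ve introduced for illustration, not anything from the original talk:

```python
# Sketch of batch/micro-batch scoring: score only rows that arrived since
# the last run, and land the predictions back in the warehouse as a
# reusable data asset. `model`, `read_new_rows` and `write_predictions`
# are placeholders for your own model object and warehouse I/O.

def score_micro_batch(model, read_new_rows, write_predictions, last_watermark):
    """Score one micro-batch and return the new high-watermark."""
    rows = read_new_rows(last_watermark)   # e.g. WHERE loaded_at > watermark
    if not rows:
        return last_watermark              # nothing new; watermark unchanged
    predictions = [
        {"id": row["id"], "prediction": model.predict(row["features"])}
        for row in rows
    ]
    write_predictions(predictions)         # lands in a predictions table
    return max(row["loaded_at"] for row in rows)
```

Run this on a short schedule and you get near real-time behaviour while the predictions stay in the warehouse, available to every downstream consumer rather than a single ML pipeline.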
Building a reliable modern data stack
Let’s recap some of the foundational changes that have occurred in the software delivery space and how they can be applied to build robust operational analytics.
We want to deliver features safely:
- Version control
- Automated testing and deployment
- Resilient infrastructure
- Security built in
- Outsource (no “undifferentiated heavy lifting”)
- Scalability (public cloud)
We moved away from creating backups of code and embraced version control systems like git. This is foundational as it is not only an audit trail and method to allow collaboration, but it enables just about everything else we need.
Referring to the diagram above, from left to right again:
- Inbound data replication often isn’t subject to version control. For example, Fivetran has a point-and-click GUI to streamline setup. But then again, as a replication tool it doesn’t change much per data source once set up, so the risk is low.
- Snowpipe can be managed using git and Snowchange, one of my open-source projects. If you’re using the PaaS replication tools from your public cloud provider, these will probably have some way of expressing their configuration as code, which you should familiarise yourself with. Data transformation is where the big change risks are, and it’s where dbt really shines. If you’re wondering what all the fuss is about dbt, start with this benefit: it is driven off a code-based definition that resides in a version control system, usually git.
- On the consumption of the warehouse, I’ll reiterate, consider carefully the need to copy data out of it.
- Data science becomes tricky here. Notebooks are notoriously difficult to version control, and debate continues about whether they should be put into production. Netflix advocates for running notebook code in production, but respected architecture experts like Thoughtworks strongly discourage this practice. We’ll come back to this point soon. Managing change to machine learning APIs is definitely an emerging field. The versioning here concerns both the endpoint deployment and how the decision is made to deploy a new version. That decision is often made at the end of a training experiment, which makes it unique. SageMaker allows deployment of new instances through client libraries.
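That deployment decision can be pictured as a simple gate: only promote a newly trained model when its experiment metrics beat the version currently serving. The metric name and improvement margin below are illustrative, not from the talk:

```python
# Sketch of an experiment-driven deployment gate. A new model version is
# only published when it beats production by a margin, which ties the
# "version to deploy" decision to training metrics rather than a release
# schedule. Metric names and thresholds are illustrative.

def should_deploy(candidate_metrics, production_metrics, min_improvement=0.01):
    """Promote the candidate only if accuracy improves by at least the margin."""
    return (
        candidate_metrics["accuracy"]
        >= production_metrics["accuracy"] + min_improvement
    )
```

In a CI pipeline, this check would sit between the training job and the call that actually updates the endpoint (for example, via the SageMaker client libraries mentioned above).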
Automated testing and deployment
We moved away from manually testing our changes to using code to test our code. It’s important for agility: as the system grows in complexity, your pace of change doesn’t slow down.
We moved away from manual deployments to deployment tools that are fully automated. This now encompasses not only the software itself, but the infrastructure it runs on. Now it’s common to deploy an entire application stack from a code branch.
Again, from left to right:
- On the data replication side, there is lower risk due to the low frequency of changes. As data lands, like any interface between systems, there is a contract (whether defined or not). So, you do have to figure out how to uphold it, particularly across teams.
- With dbt, you have data tests. Nowadays, “data downtime” is a thing, since data quality issues in a data pipeline are as critical as logical errors in an app. If the impact is bad enough, they can force the whole system to be shut down. So when using dbt, ensure that the testing occurs in a blue-green fashion so that errors are detected at the right time. Here’s a dbt community article on how to do this. Zero copy clones in Snowflake make this pretty easy to achieve without much fuss. This is also where the ELT pattern is so valuable. Being able to rebuild your transformations off a raw staging area if you discover an issue or need to roll back is critically important.
- With tools that load, or provide live access to the modelled data, you can approach this in a couple of different ways. In fact, this really applies to any consumers on the right hand side. There’s another contract at this boundary, and I’ve solved this in the past by making it really clear in dbt that a particular model is being consumed by an outside system. You could write a data test that reflects the expectations of the consuming system and protects against breaking the contract. I once went a step further in dbt and wrote a test that provides a performance assurance. In that situation, a widely used Tableau dashboard filtered on a particular Snowflake column. I created a dbt test which asserted that the clustering depth when filtering on that column was low. If the columns didn’t change but the layout of the data was going to cause a slowdown, the deployment would fail. Aside from that, you need to understand how each downstream system tolerates schema changes and the extent to which it can handle them gracefully.
- Going back to pick on notebooks and change management. I’ve experimented with a few options over the years. One big difference I would call out is that in the application world, because they implement business logic, they can exist separately to production in every important way. However, with analytics systems, because their behaviour is critically linked to production data, it’s more complex. You can’t test a change to a machine learning model’s hyperparameters without the data. This is because until you observe the impact to model accuracy from the real data, you don’t know whether to deploy it, or not. One thing I tried as a Databricks customer was adding parameters to notebooks which caused different database connection configurations to apply. You could have a notebook in “dev” mode which couldn’t write back to production and only the service account user that executed the notebook in production had permission to write back to the prod database. This is still not ideal though. These days, I advocate for notebooks for data discovery and exploration but execute in a proper CI pipeline for production.
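The dev/prod parameter idea above can be sketched as a simple profile selector, where only the production profile points at a writable prod database. All the names here are illustrative, not the actual Databricks configuration I used:

```python
# Sketch of environment-parameterised notebook connections: a single
# parameter selects the connection profile, and only the production
# profile (run by the prod service account) can write back to prod.
# Database names and the config shape are illustrative.

CONFIGS = {
    "dev": {
        "database": "ANALYTICS_DEV",
        "read_only": True,    # dev runs can never write back to prod
    },
    "prod": {
        "database": "ANALYTICS",
        "read_only": False,   # only the prod service account executes this
    },
}

def connection_config(env):
    """Return the connection profile for the given environment."""
    if env not in CONFIGS:
        raise ValueError(f"unknown environment: {env}")
    return CONFIGS[env]
```

Pair this with database-level grants (the dev role simply lacks write permission on prod) so the code-level flag is a convenience, not the only line of defence.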
Resilient infrastructure

High availability, through the public cloud, entered the reach of every business. This includes removing single points of failure, but also designing for failure and not allowing failures to propagate through the entire stack.
Naturally, you need reliable data replication that can handle failures on either side. Just like with microservices in the application world, they must be designed for failure and graceful recovery. This is another key reason for the ELT pattern. Keep the replication layer nice and dumb, and don’t build your own replication tools unless you absolutely have to.
Data warehouses have come a really long way. All the cloud ones have great uptime and scalability. Snowflake is even multi-cloud, which not many other production systems can claim.
For all of these systems, it’s important to be able to detect and understand failure at the point where it occurs. This used to be called monitoring; now it has the fancy title of observability. So consider a centralised logging mechanism that provides this.
Security built in
Security is everyone’s concern. Unauthorized data access is a risk that must be carefully mitigated and built into the delivery mechanism.
Every cloud system has a shared security model; you should understand what falls to the vendor, and what falls to you.
However, with data analytics, there are some extra things to watch for:
- Analysts often have broad access to full datasets, along with the means of extracting en masse. Therefore it’s a risk that needs to be carefully managed. There’s no use building authentication and access controls in your data warehouse if people are routinely moving sensitive data in and out.
- I’ll balance this by saying, analysts need access to data to do their job. So simple blanket rules that work in the application space, like nobody has access to the prod database, are not practical. Instead, you have to put in the effort to instil a security culture with the right controls, but also remove the temptation for people to move data outside of where it’s protected.
Outsource

Amazon has a well-known phrase: stop doing “undifferentiated heavy lifting”. In other words, focus engineering effort on what brings unique value to the business you are in, and outsource the common tasks.
As it applies to analytics, outsourcing means leaning on vendors for the mundane parts of the stack.
Don’t underestimate the ongoing cost of bespoke integration. Further to this, when comparing vendors in the stack, make sure you do a trial and factor in how much effort they have put into the developer experience. A tool priced at $100/month that requires extra hours to configure or change is probably more expensive in total cost than a $200/month equivalent that doesn’t. Time is your biggest cost. So, keep an eye on what you spend your time doing and whether it’s spent in the right areas.
As far as reliability of infrastructure, your analytics pipeline is going to include a lot of SaaS and/or PaaS, so it’s up to you to ensure that these vendors offer the uptime you need. It will differ for each business and each system. This also extends to renting infrastructure via the public cloud. For all the talk about elasticity of the cloud, nowhere is it more relevant than the peaks and troughs of data processing.
Consider future scale
You should take a forward-looking approach to your data stack.
Data sizes will continue to grow, so make sure that your choice of tools can accommodate. A lot of tools in the data replication space are billed on usage, which is great, and absolutely the right model. But make sure you budget accordingly so that you’re not tempted to try to architect around the pricing later.
Choose a data warehouse that can scale rapidly and supports high concurrency. This is really the key to delivering value outside of traditional data warehousing.
Make sure you understand the storage constraints of your cloud apps before pumping data into them. This is so that you don’t end up with storage quota issues or awkward data models.
Finally, for any real-time inference APIs, make sure you develop a plan for how they will be retrained to counter model drift. Consider MLFlow or similar to track model changes over time. Virtually all examples end at the deployment of the first version.
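As one example of what such a plan might monitor, the Population Stability Index (PSI) compares the feature distribution the model was trained on with what the live API is actually seeing, and a threshold on it can trigger retraining. This is a generic sketch, and the thresholds are the common rule of thumb rather than anything from the talk:

```python
# Sketch of a drift statistic for a retraining plan: the Population
# Stability Index (PSI) between the training-time distribution of a
# feature and its live distribution. Common rule of thumb:
#   PSI < 0.1  -> stable; 0.1-0.25 -> moderate drift; > 0.25 -> retrain.
import math

def psi(expected, actual, eps=1e-6):
    """PSI over matching histogram buckets.

    `expected` and `actual` are bucket proportions (each summing to 1)
    for the training data and the live scoring data respectively.
    `eps` guards against empty buckets and log(0).
    """
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )
```

Logging this value per feature on every scoring run (for example as an MLflow metric alongside the model version) gives you an objective trigger for retraining, instead of waiting for someone to notice predictions getting worse.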