Hello everybody.
My name is Faisal Alshami.
I started my data career at TIER as a data analyst, and I slowly progressed into an analytics engineer.
I work heavily with dbt, which I really love.
I also focus a lot on operational analytics, a lot of stuff around fleet availability and operations out on the streets. A little fun fact about me: I always say good morning, no matter what time it is.
And there's a reference there.
So I want to speak about the table of contents of the things that I'll be discussing today.
So first I will discuss what an incremental model is.
What does that mean?
When do we use incremental models, when should we not use them, then some additional tips and tricks and maybe some small definitions around that, and at the end, how we use incremental models at TIER.
Let's start with the first part.
What are incremental models?
So incremental models are built as tables in your data warehouse.
So as opposed to being materialized as a view. What basically happens is that dbt runs the model in full the first time, and from the next run onwards you tell dbt what to filter for.
So which rows should it transform, instead of going and transforming the whole data set?
And basically this is really nice because it reduces the build time and maybe the computational complexity or heaviness, since you're only transforming the new records.
Of course, building incremental models requires a bit of extra configuration on your side and you should be a bit careful, but incremental models work really well with event-style data.
So if you look at the diagram, you have the source data on the left side and you add an incremental filter, which I will show you how to do later, that says, for example, show me all the data that comes in between these two dates, and then dbt will go on to transform only those rows and add them to the destination table.
When do we use incremental models?
If I want to make it really simple, it's basically to save on time and to save on money and resources.
So basically all the other reasons stem from that.
So if your source data is huge or your models are taking too long to run, or maybe you're having really heavy computations that are happening there, then incremental models might be the way.
Your historical data does not change a lot.
As I mentioned before, event style data is really good for that.
And basically if your models need to be updated frequently, so if you have a lot of models that are scheduled in dbt to run every hour or two, then maybe using incremental models is the way to go.
And then when not to use incremental models, technically it's the opposite of what I've mentioned.
So if your source data is not big enough to justify that extra configuration, or if your data is constantly changing and you need to update it, maybe incremental models are not the way.
If your models rely heavily on window functions, for example, then incremental models won't work.
It's not that you can't apply incremental models, it's just a bit more complex and maybe better avoided, because you have to integrate some more complex logic or some more steps into the model. Now I'll start off with how to incrementalize a model.
Just to recap on the incremental filter: basically what it does is, it's a macro that evaluates a true or false statement.
When is it true?
So once you write the normal dbt run command for the model, you have the incremental filter.
As you can see in the picture on the right, from lines 19 to 22, what it checks to be true is the following.
So number one is the configuration of the model.
So this is basically line two.
If you look at materialized, it should be materialized as incremental, right?
The second thing is that you have a destination table.
So this is what the macro assesses.
And then when you try to configure it, you can also define a unique key.
The unique key looks at the newly transformed records, so the ones under the incremental filter, and then it looks back at the table that you already have and checks if they're matching; if they are matching, it will update the existing data in the table.
So it's very simple to apply incremental models.
So if you look here on the right, this is a very simple select statement from a source, from a table.
We're done with our configuration, and at the end, in the where statement, I just added an incremental clause, and this is the syntax for how you do it.
Where the timestamp is greater than or equal to the maximum timestamp selected from this model.
The "this" here refers to the already existing destination table, so this model, not the source data. So this is basically it.
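Not the exact model from the slide, but a minimal sketch of the same pattern; the source, column, and model names here are placeholders:

```sql
-- A minimal incremental model (hypothetical names)
{{ config(materialized='incremental') }}

select
    event_id,
    event_timestamp,
    event_payload
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- this block only runs when the model is configured as incremental
  -- and the destination table already exists
  where event_timestamp >= (select max(event_timestamp) from {{ this }})
{% endif %}
```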
I'll go now over some tips around that.
And maybe to clarify some stuff here.
This, again, as I mentioned, refers to the model you're actually working on, that is, the destination table.
If you're using a lot of window functions, you might consider having one step that basically loads the data incrementally, and then another model that runs these window functions.
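A rough sketch of that split, with hypothetical model names: the incremental load happens in a first model like the one above, and a second, non-incremental model runs the window functions over its full output.

```sql
-- models/events_ranked.sql: hypothetical second step that applies window
-- functions over everything loaded by an incremental model in step one
{{ config(materialized='table') }}

select
    event_id,
    user_id,
    event_timestamp,
    -- the window function sees the full history, not just the latest increment
    row_number() over (partition by user_id order by event_timestamp) as event_rank
from {{ ref('stg_events_incremental') }}  -- hypothetical incremental model
```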
The unique key, this is an important one: the unique key does not ensure uniqueness in certain cases.
And you don't always need to add the unique key, because it can be a bit more computationally expensive.
This is the first set of tips.
You sometimes have to think, if you have a lot of CTEs, about where you should put the incremental filter, because it can impact the performance of your model.
Basically a normal dbt run runs the incremental model incrementally, so there's no need to do anything else.
If you, for example, added a new column, or you want to fully refresh the whole table, so rerun the whole table and all the computations and so on, you can run a command, which is the one you see at the bottom of the presentation: dbt run with the full-refresh flag, selecting the incremental model.
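For reference, the command looks like this, with the model name as a placeholder:

```
dbt run --full-refresh --select my_incremental_model
```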
And this is it for the tips.
So now I want to speak about the usage at TIER.
So here at TIER, we have more than 400 dbt models, out of which around 22% are incremental.
And on the right, this is just a graph on Looker.
We use the dbt API to get some stats and some data around dbt.
You can see that we have different runs: some run frequently, some hourly, some less often than every hour, and some at night.
Also, we have a lot of frequent models that run that way.
And a lot of them are incremental to save on time and resources, as I mentioned.
And we also run full refreshes on certain models, certain incremental models during the weekends.
I want to just mention something.
These are some of my models that I work with.
So I have a model that has more than a billion rows.
It's not computationally expensive, just a lot of case whens and unions.
And if I run it normally, as a full refresh, it takes around 55 minutes to run.
But now as an incremental model, where I think I have a 24-hour window of data to refresh, it runs in around 42 seconds.
And I have another model that's similar, a bit more computationally expensive or complex, let's say.
It also takes around 1.2 hours to run fully, but it takes one minute to run incrementally.
And then I have another model that's a bit smaller than that, but it has a lot of heavy computations.
It's running around 25 minutes normally, but now with incremental models, it runs around three minutes.
So you can see here, this is really good.
This is saving a lot of time.
I want to discuss a few approaches we use here.
First, there's the safe approach, which is the unique key.
So here we define a unique key.
We have the incremental statement and we basically tell the model: hey, select everything that's after the latest timestamp, and ensure that any unique key that already exists in the destination table gets updated.
This is the first safe, straightforward approach.
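A sketch of what that could look like; the key, columns, and upstream model are assumptions on my part, not the exact model from the slide:

```sql
-- Safe approach: incremental with a unique key (hypothetical names)
{{
    config(
        materialized='incremental',
        unique_key='ride_id'  -- rows whose key already exists in the destination get updated
    )
}}

select ride_id, ride_status, updated_at
from {{ ref('stg_rides') }}

{% if is_incremental() %}
  where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```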
The second approach is the pre-hook configuration.
So what are pre-hooks?
As you can see here on the picture on the left, there's a pre-hook at line five, basically a SQL statement that runs before the model runs.
We applied an incremental filter there.
So when the model runs, this is what basically happens: it starts the model, deletes 24 hours of data, and then the incremental logic pulls another 24 hours of data to process.
This is not a really good way to do things, but sometimes you have to do it because of certain limitations you have or the way the data is.
The first run of this will always fail.
This is just to note.
So you have to always make sure that you run it like on an ad hoc job or something and then run the models normally.
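A sketch of the pre-hook pattern as I understand it from the description; the table, the 24-hour window, and the Snowflake date functions are assumptions, not TIER's actual model:

```sql
-- Pre-hook approach: delete a window of data, then let the incremental
-- logic reload that same window (hypothetical names and window size)
{{
    config(
        materialized='incremental',
        pre_hook="delete from {{ this }}
                  where event_timestamp >= dateadd('hour', -24, current_timestamp())"
    )
}}

select event_id, event_timestamp, event_payload
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- pull back the same 24-hour window the pre-hook just deleted
  where event_timestamp >= dateadd('hour', -24, current_timestamp())
{% endif %}
```

Presumably that is also why the first run always fails: the pre-hook tries to delete from a destination table that does not exist yet.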
And then one of the last methods we have is incremental usage with unions.
This is good if you're using window functions.
So basically what we're doing here is, in the first CTE we're just taking the source and applying some incremental logic.
Let's say based on time: take everything after the latest timestamp in the destination table, and that's it.
And then in the second one, I take everything from the above CTE, but if I'm running it incrementally, I union it with all the data from the existing destination table, so instead of cutting it into two models, you run everything at once.
And then after that, you can continue with window functions and so on.
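A sketch of that union pattern with hypothetical names; the exclusion of already-loaded keys in the second CTE is my addition to keep the merge clean, not necessarily how the slide does it:

```sql
-- Incremental model with a union so window functions see the full history
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

with new_rows as (

    -- first CTE: plain incremental logic against the upstream model
    select event_id, user_id, event_timestamp
    from {{ ref('stg_events') }}
    {% if is_incremental() %}
      where event_timestamp >= (select max(event_timestamp) from {{ this }})
    {% endif %}

),

all_rows as (

    -- second CTE: on incremental runs, union the new rows with what is
    -- already in the destination table
    select event_id, user_id, event_timestamp from new_rows

    {% if is_incremental() %}
    union all
    select event_id, user_id, event_timestamp
    from {{ this }}
    where event_id not in (select event_id from new_rows)  -- avoid duplicate keys
    {% endif %}

)

select
    event_id,
    user_id,
    event_timestamp,
    row_number() over (partition by user_id order by event_timestamp) as event_rank
from all_rows
```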
The last method I want to talk about is like eating the elephant in chunks: instead of reprocessing the whole data set, you process only a set amount of data at a time.
What you have is the incremental logic, and what you're trying to do here is introduce a variable that, when you define it in the run command, which you can see at the bottom, gets applied in the where statement if it's incremental. So this is just an extra configuration that lets you load your data in chunks instead of having it run through all of history, because maybe that's not necessary.
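A sketch of that chunked load, assuming the value defined in the run is a dbt variable; the var name, column, and dates are my assumptions:

```sql
-- Eat the elephant in chunks: bound each incremental run with a run-time variable.
-- Example (hypothetical model name), then repeat with a higher chunk_end until caught up:
--   dbt run --select my_big_model --vars '{chunk_end: "2023-02-01"}'
{{ config(materialized='incremental') }}

select event_id, event_timestamp, event_payload
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  where event_timestamp >= (select max(event_timestamp) from {{ this }})
    -- upper bound supplied at run time; the default leaves normal runs unchanged
    and event_timestamp < '{{ var("chunk_end", "9999-12-31") }}'
{% endif %}
```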
And I think that's it from my side.
Any questions?
What are the best practices to fill historical data on incremental models?
For example, when you add new columns to a model over time, especially on huge models where a full refresh would take too much time, are there different ways you can do it? So if I understood correctly, what you can do is, as I mentioned, like the last method in the slide.
So with this one, what you can do is refresh it in chunks: add the new column first as an empty column, and then you can run it chunk by chunk, like the eating-the-elephant-in-chunks method that you saw.
So I think this is one way you can approach it.
Another way is maybe you can have the historical table built in one way, like in a separate model, and then all the newer data in a different model.
Depends on the usage and how people are using it.
Maybe that's not the best way, but yeah, I think the elephant method is the one that I would recommend, I guess.
And we used that a lot during the migration.
When we migrated from Redshift to Snowflake, we used it to build tables that we knew were going to be huge and might fail at the very end.
There's a new thing, merge by column.
I think it's a new feature where you can just update certain columns and you can figure it out.
I think it's in the dbt documentation.
There's another one that you can enable to add new columns.
So if your select is dynamic and it would generate a new column, you can just enable that in the config and it will create the column in the incremental model for you.
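I believe the features being referred to are the merge_update_columns and on_schema_change configs for incremental models; a hedged sketch, with placeholder names (check the dbt docs for your version):

```sql
{{
    config(
        materialized='incremental',
        unique_key='ride_id',
        -- only overwrite these columns when a matching unique key is merged
        merge_update_columns=['ride_status', 'updated_at'],
        -- add columns to the destination table when new ones appear in the select
        on_schema_change='append_new_columns'
    )
}}

select ride_id, ride_status, updated_at
from {{ ref('stg_rides') }}

{% if is_incremental() %}
  where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```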
Why do you regularly do full refresh runs on your incremental models?
Okay.
So we have different kinds of incremental models.
I think the ones with only pre-hooks, for example, these are good candidates to get refreshed, and I think it's also good to do a full refresh regularly.
But I just want to point out that we do not fully refresh everything.
We fully refresh certain models, not all models, unless it's very, very necessary.
I think this is it.
We don't have to fully refresh everything, but I think the ones that use the pre-hook method are prone to data leakage in some way, at least in my experience, and it's good to refresh them every once in a while. But there are also a lot of factors, like how does the source data look?
Is it stable?
And so on.
How do you use a materialized view versus an incremental model, especially in Snowflake?
We usually materialize the first layer of raw data as views and then almost everything else we materialize as tables.
A view is basically just the query running against the source, like a snapshot.
The tables might take a bit more time to build, and they're sometimes bigger, but they're computationally less expensive to query and the queries are faster.
So they take more to build, but querying is of course faster, let's say in Looker and so on.
Usually we have the first steps, the very raw tables, as views, and then everything else is built on top of that.
So what I was referring to is not a standard view, but a materialized view as an object in Snowflake.
You can create a materialized view that is actually physically storing the data.
So performance is also good.
And Snowflake is taking care of the easy transformations underneath.
So you can't use it for everything, but you can do easy mapping by timestamp, and all of that is possible.
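For context, a minimal sketch of such a Snowflake materialized view, outside of dbt; the names and the timestamp mapping are placeholders:

```sql
-- Snowflake materialized view: results are physically stored and kept up to
-- date by Snowflake; only simple single-table transformations are allowed
create materialized view analytics.raw_events_mv as
select
    event_id,
    event_payload,
    to_timestamp(event_timestamp_ms / 1000) as event_timestamp
from raw.events;
```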
And also, if I may add, if you have it in dbt, it's a little bit more transparent also for the rest of your data team, right?
It would depend on how big your data team is, I guess, but normally not everyone will have, let's say, admin rights to modify things. Having that view or that incremental model in dbt would give transparency and add version control; well, you can still have version control otherwise, but it would ensure that you need to use version control for deployment.
Thank you.