  • Hi, I'm Pete, the CEO of Dagster Labs.

  • Today we are launching Dagster Plus, the next evolution of Dagster Cloud.

  • Dagster Plus is a major rethinking of what a data orchestrator is, where it sits in the stack, and how data platforms are constructed.

  • Since our launch in August of 2022, we've seen enormous growth in both the Dagster Core open source project and our commercial product.

  • We now have hundreds of commercial customers and the fastest growing open source community in the category.

  • With this growth comes broader and deeper relationships with our diverse set of customers.

  • As we engaged with these customers throughout 2022 and 2023, we heard a few key challenges.

  • They were spending too much money on infrastructure, especially on the data warehouse, ETL tools, and AI workloads, and struggled to understand, optimize, and control their spend.

  • Their stacks were sprawling and complex with hundreds of different integrations and point solutions and varying levels of maintenance, and no single pane of glass for understanding the data platform as a whole.

  • Inspired by the data mesh architecture, they wanted to empower individual teams to work autonomously end-to-end, but struggled to balance this with the centralized needs of the data organization.

  • We realized that we could solve many of these problems for customers by making a significant investment in tooling that sits on top of the core data orchestration engine.

  • So we spent the last 12 months spinning up a team dedicated to these problems, building furiously, and iterating with design partners.

  • I'm proud of what the team has accomplished, and I'm excited to present the fruits of our labor to you today.

  • So let's dig into what Dagster Plus is specifically and why Dagster Labs is uniquely positioned to deliver this type of product.

  • First and foremost, Dagster Plus is the next evolution of Dagster Cloud, our enterprise-class hosted product.

  • This means it includes everything that was part of Dagster Cloud, including enterprise access control, seamless scale up to tens of thousands of assets and hundreds of thousands of partitions, and serverless or hybrid on-prem managed infrastructure.

  • This also means that we're retiring the Dagster Cloud brand today.

  • Second, Dagster Plus is a suite of tools that sits on top of Dagster's data orchestration engine and benefits from this deep integration.

  • Today we'll be focused on four primary areas of the product.

  • We've built a brand new data catalog and data discovery capability in Dagster Plus, which enables all users of the data platform to discover and leverage data autonomously.

  • Dagster Plus builds on Dagster Core's asset checks feature, introducing additional capabilities around anomaly detection, freshness, and alerting.

  • We've made massive improvements to our branch deployments feature from Dagster Cloud and are launching a killer new capability, change tracking, with Dagster Plus.

  • And finally, Dagster Insights enables data teams to manage their spend on tools like Snowflake, Fivetran, and OpenAI without doing any additional integration work.

  • With Dagster Plus, Insights is now generally available and comes with many new features and integrations.

  • The upcoming presentations will dive into each of these in detail, but before we do, I wanted to touch on why we, rather than our competitors, are uniquely positioned to address these challenges.

  • Dagster has historically been considered a data orchestrator.

  • All of Dagster's main alternatives, such as Apache Airflow, proudly declare themselves as workflow-oriented tools.

  • They are primarily focused on the scheduling of black box tasks.

  • While this makes these tools quite flexible and suitable for a variety of non-data applications, they are not specialized to the data domain.

  • In fact, we believe that there is an impedance mismatch between workflow-oriented tools like Airflow, which deal primarily with tasks, and the rest of the data platform, which deals primarily with data assets, like database tables, machine learning models, and files.

  • Dagster takes a different approach.

  • Dagster is the only widely adopted asset-oriented data orchestrator, built from the ground up with data assets and data awareness at its core.

  • This unique architecture enables Dagster Plus to deliver a category-redefining, integrated experience, bringing together unique capabilities into one tool in order to speed up your development work, reduce your costs, and simplify your data platform.

  • I'll get off my soapbox now.

  • Let's hear from Jared, who will tell us about the data catalog.

  • Thanks, Pete.

  • I'm Jared, the head of product at Dagster Labs.

  • Today we're announcing a brand new data catalog as part of Dagster Plus, but before we get into that, let's talk about the unfulfilled promise of data catalogs.

  • As your organization scales, the number of assets you have to manage also scales.

  • It doesn't take long before you end up with a large number of rarely used data models with poor documentation and questionable provenance.

  • The central premise of data catalogs is to resolve all this: help teams avoid rework and duplication, and help data practitioners work more independently by enabling them to discover and use trusted data assets.

  • Unfortunately, because standalone data catalogs don't have a native understanding of data operations, they have to create observations about the state of data assets from the exhaust of the loosely integrated tools that comprise the data platform.

  • As a result, the data engineering team ends up constantly troubleshooting accuracy and timeliness problems in syncing the state of their data assets to the data catalog.

  • All the while, they are trying to coax the rest of their stakeholders to follow a new set of development practices to ensure that it remains in sync.

  • Typically, the end result is a lack of adoption from the very downstream stakeholders that the data team was trying to help.

  • Unlike traditional workflow orchestrators, Dagster is asset oriented.

  • It places data assets at the core of the development framework.

  • The result is a powerful combination of context about data operations and the outputs of those operations.

  • This view of data assets and operations makes Dagster a natural tool for data discovery and documentation, creating a data catalog that serves the needs of all practitioners.

  • With the launch of Dagster+, teams who want to spend less time managing integrations for their tools and more time getting things done will find it particularly useful as a new system of record.

  • And while we're still early in our journey, we are already seeing that, for many teams, this eliminates the need for traditional point solution data catalogs.

  • The first major set of improvements focuses on the kinds of data that Dagster collects and displays at the asset definition level.

  • On the asset details page, you'll now be able to find a new variety of information about your data asset that helps you understand the status, technical state, definition, ownership, location, and more all at a glance.

  • You can see things like the latest status of your asset, including the data quality testing configured on the asset, its description, who owns the asset, what groups and tags are associated with it, as well as where the asset is located, its compute details, and the associated resources and compute kind.

  • And for structured assets, you'll now find a new rich set of details, including any raw SQL that defines the asset structure, metadata about column names, descriptions, and data types.

  • And an exciting new change that we're launching for enterprise customers is the ability to explore the lineage of individual columns on the page.

  • When automatically ingested from dbt or derived from user-provided metadata, Dagster can compute the relationships between individual columns and represent them on the page, similar to our asset lineage graph.
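
To make the user-provided path concrete, here is a minimal sketch of attaching column schema and column lineage metadata to an asset, assuming the `dagster/column_schema` and `dagster/column_lineage` metadata keys and Dagster's table metadata classes; the `orders` and `raw_orders` assets are hypothetical:

```python
import dagster as dg


@dg.asset(deps=["raw_orders"])  # hypothetical upstream asset
def orders() -> dg.MaterializeResult:
    # ... build the table, then attach catalog metadata to the materialization.
    return dg.MaterializeResult(
        metadata={
            # Drives the column list shown on the asset details page.
            "dagster/column_schema": dg.TableSchema(
                columns=[
                    dg.TableColumn("order_id", "int", description="Primary key"),
                    dg.TableColumn("customer_email", "string", description="Contains PII"),
                ]
            ),
            # Drives the per-column lineage view.
            "dagster/column_lineage": dg.TableColumnLineage(
                deps_by_column={
                    "order_id": [
                        dg.TableColumnDep(
                            asset_key=dg.AssetKey("raw_orders"),
                            column_name="id",
                        )
                    ],
                }
            ),
        }
    )
```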

  • Column lineage provides a clear understanding of how data moves and transforms within a system, allowing you to track things like the proliferation of PII or trace a mistake through its downstream tables.

  • This new feature will help you manage data quality and compliance, and increase the speed of your troubleshooting.

  • But our feature release doesn't stop with the asset data model and UI.

  • We're also releasing a new, more powerful search experience to help you quickly locate the most relevant data.

  • From this interface, you can search for data assets by metadata, including name, compute kind, asset group, asset owner, or any tag you've added, and browse all of your data assets by their associated metadata.

  • I want to call out a feature here that might get passed over, which is definition-level tags.

  • With definition-level tagging, you can more easily add cross-sectional data to organize and group your data assets.

  • So now you can tag all of your intermediary assets, or your assets with PII, or the assets that support your finance team, and make these assets easily discoverable by their relevant stakeholders.
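
A sketch of what definition-level tags (and owners) might look like in code, with hypothetical names:

```python
import dagster as dg


# Tags and owners are declared on the asset definition itself, which makes
# the asset filterable in the catalog, search, and lineage views.
@dg.asset(
    tags={"pii": "true", "layer": "intermediate", "team": "finance"},
    owners=["data-eng@example.com", "team:finance"],
)
def customer_payments():
    ...
```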

  • Dagster's data catalog experience is powerful because of how it combines context about data pipelines and the data assets they produce.

  • But we know that not everyone needs to look under the hood inside Dagster.

  • Sometimes your data practitioners just need to know where to find the right asset.

  • And that is why we're also launching a new capability we're calling Catalog Mode, which allows you to hide most of the operational views of Dagster, and instead focus on the assets that Dagster has documented.

  • This new feature can be turned on by default for your viewers in your organization, or turned off when they need to go deeper.

  • So that's our high-level introduction to the brand new data catalog and search features in Dagster+.

  • We expect this addition will make it easier for all teams on the data platform to find information about their data assets and status, and to operate more autonomously, while reducing the operational burden on the platform team.

  • You might find it helps eliminate the need for standalone tools, which will reduce context switching, help break down data silos, and reduce the total cost of ownership of your data platform.

  • We know that data cataloging is an important capability for teams, and we're excited to be rolling out this first version, but we're not stopping here.

  • Soon, the catalog will also consume data about assets that are not managed by Dagster and represent them in the data catalog as external assets.

  • I hope this first overview has you interested in the new data cataloging capabilities in Dagster+. Check out the companion blog to find out more.

  • I'm Sandy, the lead engineer on the Dagster project.

  • Today I'm going to talk about the new data reliability features that we're launching as part of Dagster+, and how they help you deliver trustworthy data to stakeholders across your entire data platform.

  • The job of a data orchestrator is to help you produce and maintain the data assets that your organization depends on.

  • Whether those assets are tables, machine learning models, or reports, in order to be useful they need to be trustworthy and reliable.

  • That is, they need to consistently contain up-to-date and high-quality data.

  • But bugs are inevitable, and pipeline developers don't control the data that enters their pipelines.

  • So the only realistic way to achieve trustworthy data is with monitoring and issue detection, so that issues can be addressed before they affect downstream consumers.

  • Most data teams struggle with data reliability because they don't have robust monitoring.

  • Workflow-oriented orchestrators like Airflow will report if your data pipeline hits an error during execution, but don't offer visibility into the quality, completeness, or freshness of the data that they're updating.

  • Teams that adopt stand-alone data reliability tools often end up abandoning them, because they're difficult to fit into practices for operating data.

  • Data engineers end up needing to visit both their orchestrator and their reliability tool to understand the health of their data pipeline.

  • And if they want to trace data reliability alerts to the DAGs that generated the data, they need to try to integrate their data reliability tool with their orchestrator.

  • This usually means contending with query-tagging schemes that are fragile and difficult to configure and maintain.

  • At Dagster Labs, we believe orchestration and data reliability go hand-in-hand.

  • Dagster's asset-oriented approach to data orchestration enables Dagster Plus to offer a full set of data reliability features.

  • These help you monitor the freshness of your data, the quality of your data, and changes to the schema of your data.

  • Dagster helps monitor both the source data that feeds your pipelines and the data produced by your pipelines.

  • Monitoring source data helps you find out early if your pipeline is going to run on stale, bad, or unexpected data.

  • And monitoring output data helps you ensure that the final product is on time and high-quality.

  • At the center of Dagster's data reliability stack is a feature called asset checks.

  • An asset check is responsible for determining whether a data asset meets some property, such as whether it contains the expected set of columns, contains no duplicate records, or is sufficiently fresh.

  • Asset checks can be executed in line with data pipelines or scheduled to execute independently.

  • Optionally, Dagster can halt your data pipeline when an asset check fails to avoid propagating bad data.
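
As a rough sketch of how this looks in code, using the asset check decorator API; the duplicate-counting query is left hypothetical:

```python
import dagster as dg


@dg.asset
def orders():
    ...


# blocking=True tells Dagster to halt downstream materializations when the
# check fails, so bad data is not propagated through the pipeline.
@dg.asset_check(asset=orders, blocking=True)
def orders_have_no_duplicates() -> dg.AssetCheckResult:
    num_duplicates = 0  # hypothetical: query the table for duplicate order IDs
    return dg.AssetCheckResult(
        passed=num_duplicates == 0,
        metadata={"num_duplicates": num_duplicates},
    )
```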

  • Asset checks were introduced as an experimental feature last year, and we've marked them as generally available in Dagster's recent 1.7 release.

  • Asset checks sit on top of Dagster's rich metadata system, which can be used to store any metadata, from row counts to timestamps to table column schema.

  • With Dagster+, asset checks can be used as a basis for alerting and reporting.

  • When an asset check fails, Dagster can notify the asset's owner.

  • And Dagster Plus Insights lets you understand the results of asset checks in aggregate.

  • For example, did we violate our asset freshness guarantees more times this week than last week?

  • Because Dagster makes it easy to see the checks that are defined for any asset, asset checks are also helpful for describing and enforcing data contracts.

  • For organizations following data mesh approaches, teams can use asset checks to communicate invariants about the data products that they expose to the rest of the organization.

  • Asset checks are already widely used in the Dagster community for enforcing data quality, often by wrapping dbt tests.
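
For dbt users, this typically requires no extra code: a standard dagster-dbt setup along these lines surfaces dbt tests as asset checks (behavior assumed as of recent dagster-dbt releases; the manifest path is project-specific):

```python
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


# dbt tests defined on these models are loaded as Dagster asset checks,
# so `dbt build` reports test outcomes as check results.
@dbt_assets(manifest="target/manifest.json")
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```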

  • In our latest release, we've expanded the utility of asset checks beyond data quality to also cover data freshness and data schema changes.

  • So let's talk about those.

  • Data freshness means tracking when data assets are updated and identifying when one of them is overdue for an update.

  • Dagster's asset orientation and metadata system make it straightforward to track when data assets are updated.

  • And once you're tracking data updates, you can set up freshness checks to identify when your data is overdue.

  • Freshness checks can either be based on rules or based on anomaly detection.

  • For rules-based freshness checks, you set limits on how out of date it's acceptable for your asset to be, for example, requiring an update every six hours.
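
A minimal sketch of a rules-based freshness check, assuming the freshness-check helpers that shipped around the Dagster 1.7 release and a hypothetical `daily_metrics` asset:

```python
from datetime import timedelta

import dagster as dg


@dg.asset
def daily_metrics():
    ...


# Fails if daily_metrics hasn't been updated within the last six hours.
freshness_checks = dg.build_last_update_freshness_checks(
    assets=[daily_metrics],
    lower_bound_delta=timedelta(hours=6),
)

defs = dg.Definitions(assets=[daily_metrics], asset_checks=freshness_checks)
```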

  • For anomaly detection freshness checks, Dagster Plus looks at the history of updates to your asset to determine whether recent behavior is consistent.

  • This is especially useful when you have many assets to monitor and don't want to figure out the specific rule that applies to each one.

  • As with all asset checks, you can get alerts when freshness checks fail, so you can be notified about freshness issues before they affect your stakeholders.

  • In addition to catching data freshness issues, Dagster can help catch changes in data schema.

  • If a column is removed from a table, or if its type changes, then any software or dashboard that depends on that table is likely to be broken.

  • Sometimes changes to data schema are intentional, and other times a change to a table schema is an accidental result of a seemingly innocuous change to the SQL query that generates it.

  • Either way, it's important for pipeline developers and their stakeholders to be able to learn about these changes.

  • Again, Dagster's reliability stack helps out with this.

  • Dagster's metadata system can store table column schemas, and with Dagster's dbt integration, these are captured automatically.

  • Then Dagster offers built-in asset checks for catching changes to column schema.
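
A sketch of wiring up those built-in checks, assuming the helper available around the 1.7 release and an `orders` asset whose column schema is recorded as metadata:

```python
import dagster as dg


@dg.asset
def orders():
    ...


# Compares each materialization's recorded column schema against the previous
# one and fails when columns are added, removed, or change type.
schema_change_checks = dg.build_column_schema_change_checks(assets=[orders])

defs = dg.Definitions(assets=[orders], asset_checks=schema_change_checks)
```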

  • And like all asset checks, they can be used as a basis for alerting, reporting, and control flow.

  • So stepping back, all of these different capabilities allow Dagster to function as a single pane of glass for the health of your data assets and pipelines.

  • Dagster's upcoming Asset Health dashboard manifests this by integrating pipeline failures, data quality, and freshness into a single view.

  • By handling data reliability in the same system that you use to develop and operate data pipelines, it becomes a first-class concern instead of something that's tacked on at the end.

  • Dagster's data reliability features allow you to standardize across your entire data platform the monitoring that you need to deliver trustworthy data to your stakeholders.

  • Hi, I'm Jamie, and I'm an engineer at Dagster Labs.

  • I'm going to talk about branch deployments in Dagster Plus and how they can enable teams to move faster, improve quality, and increase autonomy.

  • Branch deployments have been a core feature of our commercial product since its launch two years ago. They are lightweight staging environments created with every pull request and reduce the friction of testing, reviewing, and collaborating on your data pipelines.

  • Let's take a look at how these work.

  • Most users have a main branch that contains their code and a production deployment where that code is deployed.

  • Branch deployments follow a workflow that should feel familiar.

  • You create a branch, make your changes, and create a pull request.

  • Then Dagster takes over and creates a lightweight deployment where you can run and test the code in your pull request.

  • These deployments can be configured so that they interact with staging resources, which allows you to materialize your assets without affecting production data.
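
One common way to wire this up is to key resource configuration off the environment variable that Dagster Plus sets in branch deployments; the Snowflake database names here are hypothetical:

```python
import os

import dagster as dg
from dagster_snowflake import SnowflakeResource

# Dagster Plus sets this to "1" inside branch deployments, so we can swap
# the production database for a staging one.
is_branch = os.getenv("DAGSTER_CLOUD_IS_BRANCH_DEPLOYMENT") == "1"

defs = dg.Definitions(
    resources={
        "snowflake": SnowflakeResource(
            account=dg.EnvVar("SNOWFLAKE_ACCOUNT"),
            user=dg.EnvVar("SNOWFLAKE_USER"),
            password=dg.EnvVar("SNOWFLAKE_PASSWORD"),
            database="STAGING" if is_branch else "PROD",
        )
    },
)
```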

  • With the launch of Dagster Plus, we are releasing a new feature in branch deployments, change tracking.

  • Change tracking makes it easy to see exactly what assets have changed in your branch and launch materializations of those assets.

  • When a branch deployment is created, it is compared to the main production deployment and assets that have been changed in the branch are marked in the UI.

  • You can filter down to just these assets so that you can quickly see what has changed and launch materializations to test those assets.

  • This also makes branch deployments a great tool for collaboration and code review.

  • Rather than sifting through lines of code to determine which assets will be affected by a particular pull request, you can use the UI to quickly see the scope of code changes.

  • Here we have a PR where we've modified some dbt models.

  • We've also updated the start date of one of our partitions, but we aren't sure exactly how many assets this change is going to affect.

  • Let's take a look at the branch deployment.

  • Here we can see that the dbt models we've modified in our branch have been marked in the UI.

  • We can see how these assets have changed, and we can filter the graph down to just the modified assets.

  • In the global asset graph, we can also apply filters to see the assets that our pull request changes.

  • For example, we can apply a filter to show only the assets that are modified by the partition change we made in our branch.

  • This lets us see exactly which assets will be affected by that code change.

  • Finally, while we have primarily shown this feature for assets made from dbt models, change tracking can be used with any kind of asset.

  • Branch deployments are a feature unique to Dagster Plus that brings a truly modern developer experience to data engineering and enables teams to move faster, improve quality, and increase autonomy.

  • We continue to hear how much value they add to the day-to-day process of building and testing data pipelines.

  • We have lots of ideas on how to make branch deployments even more powerful, so you can expect to see further enhancements in the coming months.

  • Most companies' data platforms are large, complex, and mission-critical.

  • They integrate dozens of systems and serve hundreds of stakeholders.

  • Understanding the health of the data platform as it grows is difficult.

  • Getting key information, like how reliable the data is and how much money is being spent, is either impossible at scale or requires jumping through multiple tools and complex multi-quarter integration projects.

  • In short, platform owners lack a single pane of glass for understanding the state of their data platform.

  • Today, we are launching that single pane of glass for operational observability: Dagster Plus Insights.

  • Dagster Plus Insights allows everyone from data platform owners to individual practitioners to understand and optimize reliability, cost, and freshness.

  • Before we show some examples of what Dagster Insights allows you to do, let's talk briefly about why we believe the orchestrator is the right tool for building this kind of observability, as opposed to a dedicated point solution that sits outside of the execution flow.

  • All of the operational data is in one place, so you no longer have to jump between tools to get a complete picture of the platform's health at a glance.

  • By surfacing the real cost of computation in the orchestrator, your team becomes more aware of, and sensitive to, how much things cost.

  • If you want to double-click, you do so at the asset level or the asset group level, which is the logical way of exploring your assets so you can quickly pinpoint areas for operational improvement.

  • And you get all of this bundled in Dagster, with no additional work.

  • In most cases, the hard work of identifying what assets to track and how to structure your metadata is already done in Dagster.

  • We are simply building on what's already there.

  • Okay, so what can Dagster Insights help you to track and observe?

  • Well, Dagster Insights is a powerful capability that supports many possible use cases.

  • Let's look at a couple of them, and I'll highlight some key features along the way.

  • A common example is optimizing spend on data movement tools, such as Fivetran.

  • Let's say you're using Dagster to orchestrate some Fivetran jobs to replicate data from your application's Postgres database into Snowflake, and you've noticed that the number of rows on each set of assets is ever-increasing, along with your bill.

  • With Dagster Insights, you're able to look at the global level to see which assets and processes might be the most costly.

  • And you can see how those trends are changing over time.

  • To save money, you might look at ways to reduce your reliance on Fivetran for supporting the biggest and most expensive of your assets.

  • Thankfully, Dagster's embedded ELT functionality can handle this and help you save tens of thousands of dollars by moving a data-intensive process to a cheaper option.

  • Another example is reducing compute and query costs.

  • Let's say your primary storage layer is BigQuery, and GCP bills are getting larger.

  • Sadly, not everyone on your team is sensitive to the cost of their workloads because so far, nobody has visibility into which workloads are the most compute-intensive and what that means in terms of the bill.

  • By setting up Dagster's new integration for BigQuery, you're now able to emit and visualize billable bytes, both as raw bytes and as dollars.

  • Now your team can see which assets or jobs are the most compute-intensive and can translate that into dollars and cents.

  • You can sort by percentage change over time to see the biggest increases or decreases in cost for a particular process.

  • With our new deployment-level aggregation, you can start tracking expenses across all Dagster-controlled processes.

  • Furthermore, you can set alerts for when your team is spending more than usual.

  • We currently support Snowflake and BigQuery out of the box, but users can add custom cost metadata using Dagster's flexible metadata system.
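
As a sketch of that custom route, an asset can emit numeric metadata on each materialization, which Insights can then aggregate; the metadata key and cost figure here are hypothetical:

```python
import dagster as dg


@dg.asset
def enriched_events() -> dg.MaterializeResult:
    # ... run the workload, then record what it cost. Numeric metadata
    # emitted like this can be charted and aggregated in Insights.
    cost_usd = 0.42  # hypothetical: computed from the provider's billing API
    return dg.MaterializeResult(metadata={"cost_usd": cost_usd})
```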

  • Dagster Insights also allows you to monitor and optimize your spend on AI workloads.

  • With our new OpenAI integration, your AI engineers can observe and manage their calls to OpenAI and visualize their token consumption, all right from within Dagster.
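
A minimal sketch using the dagster-openai integration; the asset name and prompt are hypothetical:

```python
import dagster as dg
from dagster_openai import OpenAIResource


@dg.asset(compute_kind="OpenAI")
def summaries(context: dg.AssetExecutionContext, openai: OpenAIResource):
    # The client context manager records token usage as asset metadata,
    # which Insights can visualize over time.
    with openai.get_client(context) as client:
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Summarize yesterday's sales."}],
        )


defs = dg.Definitions(
    assets=[summaries],
    resources={"openai": OpenAIResource(api_key=dg.EnvVar("OPENAI_API_KEY"))},
)
```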

  • And for a final example, tracking the health of your dbt transformations.

  • Let's say you're using Dagster and dbt to manage a set of tables and you want to track the health of those processes.

  • So you've written a Dagster asset check, which counts the number of null values on the primary key.
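
A sketch of that check, with the warehouse query stubbed out; reporting the failing row count as metadata is what lets Insights chart it over time:

```python
import dagster as dg


def count_null_primary_keys(table: str) -> int:
    # Hypothetical stand-in for a warehouse query such as:
    #   SELECT COUNT(*) FROM <table> WHERE customer_id IS NULL
    return 0


@dg.asset_check(asset=dg.AssetKey("dim_customers"))  # hypothetical dbt model
def primary_key_not_null() -> dg.AssetCheckResult:
    null_rows = count_null_primary_keys("dim_customers")
    return dg.AssetCheckResult(
        passed=null_rows == 0,
        metadata={"null_rows": null_rows},
    )
```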

  • Now, you're able to visualize this quality in two ways.

  • First, with Dagster's reliability features, you know each time this asset check fails, as shown in our asset graph.

  • But in Dagster Plus, with the addition of Insights, you can view, over time, both the number of instances where the asset check failed and the number of rows that failed, because the difference between one row violating the primary key and a thousand can be the difference between an expected bad row and a major pipeline issue.

  • Now that you have a solid grasp on the health of your pipelines with Insights, you want to ensure that they stay healthy.

  • With alerts, Dagster Plus makes it easy for you to get ahead of point-in-time failures, ensuring your stakeholders aren't the ones who find errors in the data.

  • And when a pipeline inevitably does fail, you can jump back into Insights to get a deeper understanding of the failure and whether it's part of a larger trend or an isolated issue.

  • With Dagster Insights, your team can truly understand what's going on with the data platform, all from within Dagster Plus.

  • Make better use of your budget and identify areas of inefficient spending.

  • Visualize trends in both quality and cost to identify problems earlier and spend less time troubleshooting.

  • And reduce the need for new standalone tools, which reduces context switching, breaks down data silos, and reduces the total cost of ownership.

  • Going forward, we will be adding features like tags and owner metadata to help you visualize data that reflects the schema of your organization.

  • We hope this overview of Insights in Dagster Plus has sparked some ideas on how you can get value from this great observability feature.

  • Hi, I'm Eric Chernoff, Head of Partnerships at Dagster Labs.

  • As the number of organizations using Dagster continues to grow, so does our network of partners. As part of the Dagster Plus launch, here is an update on how we are building out the broader ecosystem to help support you in your adoption of Dagster.

  • First of all, for organizations looking for consulting help on initial build-out or migration, we have a strong network of independent implementation partners.

  • A big shout-out to Slalom, Rittman Analytics, Infostrux, 4Mile, Analytics 8, Brooklyn Data, Bytecode, Evantum.ai, and FortiSoft.

  • These nine partners have been supporting Dagster implementations ever since our early GA.

  • We are now expanding our partner program.

  • If you are either looking for an implementation partner, or you are part of a consultancy and would like to be part of the Dagster Implementation Network, you can find more information at dagster.io/partners and sign up with the Become a Partner link at the bottom of the page.

  • Next, we are making it easy for you to buy Dagster Plus as part of your larger cloud platform transactions.

  • You can now transact for Dagster Plus via the AWS Marketplace, and we will be in all the major cloud marketplaces by the end of Q2 2024.

  • This often makes it easier to procure Dagster Plus as part of your existing cloud platform commitments and allocated budgets.

  • Finally, we continue our technology partnerships to make it easier for you to integrate key tools from across the ecosystem.

  • We continue to add new integrations for Dagster, including OpenAI, which we announced recently.

  • We are also deepening our integrations with partners like Snowflake, so look out for more news on that.

  • You can learn more about Dagster's integrations at dagster.io/integrations.

  • As we scale, you can expect Dagster Labs to continue growing the ecosystem so you have the right resources to support you in building the most productive and valuable data platform possible.

  • As you've just seen, Dagster Plus is an exciting redefinition of the data orchestration category.

  • We believe that expanding the orchestrator's scope can help your data team reduce spend through centralized, asset-oriented cost observability; eliminate the need for many point solutions and their costly, error-prone integrations; and enable your data teams to operate autonomously while simultaneously standardizing best practices and delivering a single pane of glass that supports centralized data platform objectives.

  • All of this is made possible by Dagster's asset-oriented approach to data orchestration and the efforts of our amazing team, whom I'm fortunate to work with every day.

  • We hope you try out Dagster Plus and see how all the pieces fit together.

  • I think you'll find that this is truly a category-redefining product that will change the way you think about your data platform.
