Data Engineering from Ingestion to AI-Ready — BUILD 2025 Keynote — Full Transcript (June 7, 2026)

[00:00:27] Please welcome Snowflake's Vice President of Product Management, Chris Child. [00:00:34] Hello, everyone, and welcome to Day 3 of Build. [00:00:38] Over the last two days, you heard from Christian about the incredible advancements all across Snowflake that are powering AI. [00:00:45] And yesterday, you heard from Jeff about the ways that customers are using Snowflake Intelligence to pretty fundamentally change the way that they make decisions. [00:00:54] And a lot of companies are investing massive amounts of time and energy into their AI investments. [00:01:00] But without the right foundation in place, these projects are failing, and executives are coming to their data engineering teams asking them to fix it. [00:01:08] We have to fix this by rethinking not just data engineering, but our whole data foundation. [00:01:14] In the next hour, we're going to show you not just what kind of foundation you need to build to make your AI initiatives a success, but how to go about doing it. [00:01:24] Successful AI is powered by a strong data foundation. [00:01:29] If you know where your data is, who can access it, what it can be used for, and that it's trustworthy and easy to use, AI gets a lot easier. [00:01:39] And we're going to show you how many of our thousands of customers are already using Snowflake to build that foundation and how that foundation is enabling them to drive AI outcomes right now. [00:01:53] It starts with having all of your data stored and cataloged in a single place, with the ability to connect and operate multiple complex pipelines, and to do all of this at varying sizes, complexity, and scale. [00:02:07] Let's start with one problem that we hear from many of you consistently. [00:02:13] Data isn't available when and where the business needs it. [00:02:18] Analysts and AI tool builders are often spending a ton of time just figuring out where the data that they need lives and how to bring it all together.
[00:02:27] Simplifying these end-to-end pipelines allows you to get data from more sources and put it into your data team's hands more quickly. [00:02:35] So to solve this problem, let's start at the data sources. [00:02:41] This is where we have focused on helping you deliver faster data across ETL, across data sharing, and across streaming capabilities. [00:02:50] Earlier this year, you heard us announce OpenFlow. [00:02:53] OpenFlow automates data movement from virtually any source. [00:02:58] The destination can be storage in Snowflake or in a cloud provider. [00:03:04] And OpenFlow has been generally available since last summer. [00:03:08] But that's required you to run it in a bring your own cloud environment, where you deploy it to your own AWS environment. [00:03:14] We now have a Snowflake deployment option through Snowpark Container Services that allows you to run OpenFlow in an entirely integrated managed experience. [00:03:24] And this is generally available on both AWS and Microsoft Azure. [00:03:28] This fully integrated experience removes the need for data engineers to manage infrastructure, configure networking, or worry about security boundaries between systems. [00:03:39] OpenFlow has more than 26 connectors, and that number is rapidly growing. [00:03:44] We continue to add new options on a regular basis. [00:03:47] And we came to a lot of you, and we asked, what sources are the most important for you? [00:03:52] And where did you really need an easy button? [00:03:54] What data was hard for you to collect, but incredibly valuable for you to use? [00:03:59] And consistently, we got the same four answers. [00:04:03] Now, these aren't exactly tiny companies that we can just go quickly build a connector with. [00:04:09] But what we've done is over the last year, we've gone and built deep partnerships with each of these four companies, [00:04:15] so that we can connect them all directly to Snowflake. 
[00:04:18] We've provided zero copy integrations now across a wide variety of the most important companies and data sources that you want to integrate with. [00:04:26] With Salesforce, we launched our zero ETL bi-directional integration, which allows you to access critical business data across the two systems. [00:04:34] We've also partnered with Oracle and launched a new change data capture capability for high-speed data replication that works across on-premise and cloud environments. [00:04:44] This is in private preview and will be in public preview soon. [00:04:47] We partnered with Workday and announced a partnership earlier this year that will allow you to unlock your HR and finance data directly with Snowflake. [00:04:55] This is in development, but will be available soon. [00:04:57] And we're incredibly excited to announce a bi-directional integration that extends SAP's business data cloud with fully managed data and AI capabilities powered by Snowflake. [00:05:08] This is in private preview right now. [00:05:14] Accessing the data from all of your different systems is just the first hurdle. [00:05:18] The next is speed. [00:05:20] Good agents rely on immediate and continuous access to the freshest data. [00:05:25] Snowpipe streaming went through a massive architectural overhaul over the last year. [00:05:30] That means that it's now faster with data ready to query within 10 seconds. [00:05:34] We call this architecture V2. [00:05:36] And this is generally available on AWS and Azure with GCP coming soon. [00:05:42] One of our financial customers, CBOE Global Markets, is already using this next generation architecture to achieve outstanding results. [00:05:50] Now, they needed to replace a legacy system that was limited to end of day batch processing, but their teams required access to near real-time market data. 
[00:06:00] With Snowpipe streaming V2, they're now able to process more than 100 terabytes of uncompressed data, which comprises more than 190 billion rows every single day. [00:06:11] And using our new pre-clustering feature, this ingested data is optimized for querying in-flight. [00:06:17] This means that data is now available to query at CBOE immediately. [00:06:22] That sounds great, right? [00:06:24] But what about the cost? [00:06:26] So, in addition to being significantly faster, Snowpipe streaming V2 is also much more efficient. [00:06:32] And in Snowflake fashion, this has allowed us to pass the savings on to you. [00:06:37] We've switched to a simpler price that's just per gigabyte ingested. [00:06:42] And for many customers, this leads to a cost savings of 50% or more on their ingestion costs. [00:06:49] Now, I've been talking about a lot of this, and I think what's better for this audience is to actually see it come to life in a demo. [00:06:55] So, yesterday morning, you saw a demo of Snowflake Intelligence and how easy it was to create an agent. [00:07:02] But let's go dig a little bit deeper and start to understand what the data engineering team had to do to prepare that data for those agents to be able to run. [00:07:12] I'd like to invite up Michael Koes, who's a product manager on OpenFlow, who's going to walk you through how to connect to the different data sources. [00:07:18] Michael, thanks for joining me. [00:07:20] And Vino, take it away. [00:07:22] Thank you so much, Chris. [00:07:24] Hello, everyone. [00:07:25] My name is Michael Koes. [00:07:26] I'm a product manager in the OpenFlow team. [00:07:29] And with me on stage, we have Vino Duraisamy, our developer advocate for data engineering. [00:07:34] Yesterday, you saw how you can talk directly to your data using Snowflake Intelligence to gain insights. [00:07:41] Today, we're excited to show you how you can get your data ready to build a foundation that Snowflake Intelligence needs. 
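The CBOE figures quoted above (more than 100 terabytes and more than 190 billion rows per day) can be turned into per-second rates with some quick arithmetic. The sketch below is only a back-of-the-envelope check; it assumes decimal terabytes and a load spread evenly over 24 hours, neither of which is stated in the talk.

```python
# Back-of-the-envelope check of the daily volumes quoted in the talk.
# Assumptions (not from the talk): 1 TB = 10**12 bytes, and the load is
# spread evenly across a 24-hour day.

BYTES_PER_DAY = 100 * 10**12   # "more than 100 terabytes" of uncompressed data
ROWS_PER_DAY = 190 * 10**9     # "more than 190 billion rows"
SECONDS_PER_DAY = 24 * 60 * 60

avg_row_bytes = BYTES_PER_DAY / ROWS_PER_DAY
rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY

print(f"average row size: {avg_row_bytes:.0f} bytes")
print(f"sustained rate:   {rows_per_second / 1e6:.2f} million rows/s")
print(f"sustained volume: {bytes_per_second / 1e9:.2f} GB/s")
```

Under those assumptions the numbers work out to roughly 526 bytes per row, about 2.2 million rows per second, and about 1.16 GB per second sustained. Real market data is bursty, so peak rates would be higher than these averages.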
[00:07:48] Here at Stratos Dynamics, we're dealing with a problem. [00:07:54] We're dealing with a fragmented data estate, and we have to unify various data sources for business-critical analysis. [00:08:01] We have to analyze structured purchase order transactions in MySQL, unstructured contract data stored as PDFs in SharePoint, and also historic logistics operations data in Iceberg tables in S3. [00:08:15] So we're going to use Snowflake OpenFlow to get our structured and unstructured data ingested. [00:08:22] And then we're going to see how we can use catalog-linked databases to access any data stored in an external Iceberg REST catalog. [00:08:30] So with that, let's switch over to our demonstration. [00:08:37] In Snowflake, we've created two databases. [00:08:40] The first database is for our structured data, and the second database is for unstructured SharePoint data. [00:08:47] Both of these databases are empty. [00:08:49] Our entire goal is to populate these databases without having to write any complex ETL code and only using one single integrated tool, which is OpenFlow. [00:08:59] So let's head over to OpenFlow and explore the connectivity that it provides out of the box. [00:09:05] OpenFlow provides connectivity to databases, streaming sources, SaaS systems, etc. [00:09:12] And it's not just about bringing the data into Snowflake. [00:09:15] You can also use it to send data to external systems. [00:09:18] And if you don't find the connector that you need, you can easily build your own custom connector by using OpenFlow's hundreds of composable processors. [00:09:27] But for our scenario, let's go back to MySQL and let's install our MySQL connector to get the structured purchase order transactions in. [00:09:37] So we're going to install this to one of our runtimes. [00:09:40] And runtimes provide the compute infrastructure for our connectors. [00:09:45] They automatically scale up and scale down as needed.
[00:09:48] And generally you can think of them like warehouses, but for your connectors. [00:09:53] Once we've installed the connector, all we have to do is configure it with our destination database. [00:10:00] And most crucially, the tables that we want to replicate from our source databases. [00:10:05] After we've provided the connection information for our MySQL databases, we're good to go. [00:10:12] We save our configuration and we can start the connector. [00:10:15] Now, once we start the connector, it will perform the initial data load from our MySQL database sources. [00:10:22] Once the initial load is done, it'll automatically transition into the change data tracking mode, from which it will continue to pick up change events. [00:10:32] So in Snowflake, we can now see the tables that were replicated by OpenFlow, and we can see the actual data showing up. [00:10:41] Now that we've ingested our structured data, let's move on to our unstructured SharePoint data. [00:10:47] We've already installed the SharePoint connector, and now we're going to configure the parameters for SharePoint. [00:10:53] SharePoint gives us some additional configuration options, like it allows us to filter for the file type that we want to ingest, [00:11:01] and also exposes advanced configurations like OCR mode that will parse images found in those PDF files into text that is more easily readable by our AI tools. [00:11:14] So after providing all that information and connectivity, we can go ahead again and start that connector. [00:11:20] Back in Snowflake, let's see what the data we've ingested looks like. [00:11:25] We can see that the connector again created several tables, and it didn't just dump the files into Snowflake, but it actually intelligently parsed them. [00:11:37] If we look at the chunks table here, we can see that this is now the actual text that was extracted from the PDF files and is AI ready.
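The two phases described for the MySQL connector, an initial load followed by ongoing change tracking, follow a common replication pattern. The sketch below illustrates that pattern in plain Python with in-memory dictionaries; it is not OpenFlow code, and the `ChangeEvent` type, the function names, and the sample purchase orders are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChangeEvent:
    """One hypothetical change record read from the source after the snapshot."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row image; not needed for deletes


def initial_load(source_rows: Dict[int, dict]) -> Dict[int, dict]:
    """Phase 1: copy a full snapshot of the source table into the replica."""
    return {key: dict(row) for key, row in source_rows.items()}


def apply_changes(replica: Dict[int, dict], events: List[ChangeEvent]) -> None:
    """Phase 2: replay change events, in order, against the replica."""
    for event in events:
        if event.op in ("insert", "update"):
            replica[event.key] = dict(event.row)
        elif event.op == "delete":
            replica.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")


# Illustrative data only.
source = {1: {"po": "PO-1", "amount": 100}, 2: {"po": "PO-2", "amount": 250}}
replica = initial_load(source)
apply_changes(replica, [
    ChangeEvent("update", 1, {"po": "PO-1", "amount": 120}),
    ChangeEvent("delete", 2),
    ChangeEvent("insert", 3, {"po": "PO-3", "amount": 75}),
])
print(replica)
```

The key property is that events are applied in source order, so the replica converges to the source's current state without re-reading the whole table.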
[00:11:46] In addition to that, the connector also brings in metadata like access control structures to protect your data once it's landed in Snowflake. [00:11:56] So you might be wondering, well, where are all these connectors running and how can I get started with OpenFlow? [00:12:01] So as Chris said, we're very excited today to share with you that OpenFlow deployments are now generally available in fully hosted Snowflake environments. [00:12:13] So all you need to get started is a name for your OpenFlow deployment. [00:12:17] That's literally it. [00:12:18] You need a name and then you click go and that's it. [00:12:21] So with bring your own cloud and Snowflake hosted deployment options available now, you can really choose the deployment form factor that works best for your use case and scenario. [00:12:34] So now that we've ingested our structured and unstructured data with OpenFlow, we're going to move forward and we're going to take care of our data stored in Iceberg. [00:12:42] At Stratos Dynamics, that data is managed by a different team and it would typically be very hard for us to get access to the data to ingest it and process it. [00:12:52] But not anymore, because now we can just create a catalog-linked database that allows us to connect to any third-party Iceberg REST catalog. [00:13:02] Like in this case, it's an AWS Glue catalog. [00:13:05] And then what we can do is we can query the data without actually having to move it into Snowflake. [00:13:12] As new source databases and source tables are added to the AWS Glue catalog, they're automatically tracked and refreshed in Snowflake. [00:13:21] So your data always stays in sync. [00:13:24] So to recap the demonstration, in just a few minutes, you've seen how Snowflake OpenFlow eliminates all complexities from ingestion and processing by not requiring you to write any complex ETL code.
[00:13:38] We've also seen how catalog-linked databases allow you to connect to external third-party Iceberg REST catalogs and query data without moving it. [00:13:48] So from empty databases to ready-to-analyze data, structured and unstructured, Snowflake gets your data ready for AI with simplicity and control and no trade-offs. [00:14:00] And with that, I'll hand it back to you, Chris. [00:14:03] Thanks, Michael. That was awesome. [00:14:06] So it was great to see how you were able to quickly connect to multiple data sources and ingest structured and unstructured data. [00:14:15] Now, as you connect more data and scale your AI initiatives, the demands for governance only increase. [00:14:22] Just like analysts do, AI needs clean, accurate data to be able to deliver the answers that you want. [00:14:29] And we hear, though, that customers and organizations are struggling with too many different systems. [00:14:35] You have some data in a lake, you have some data in a warehouse, you have some other data in a lakehouse, and none of them are talking to each other. [00:14:42] They each have their own governance system, their own lineage, and their own access rules. [00:14:47] This old way of stitching together a dozen different proprietary systems is fundamentally broken. [00:14:53] It creates lock-in, and it creates complexity. [00:14:57] What you need is a single catalog that works with all of your data to manage an open lakehouse. [00:15:03] With Snowflake Horizon Catalog, you can manage across all the data in your organization, collaborate across departments and across companies, [00:15:12] and deliver mission-critical workloads with business continuity and disaster recovery across clouds and regions. [00:15:19] This means that you have a single place that can catalog data in open formats like Apache Iceberg. [00:15:25] It can also support reads and writes from any engine that supports the Iceberg REST Catalog (IRC) protocol.
[00:15:32] And it gives you consistent governance across all of that data, no matter where it lives or how you're using it. [00:15:38] And you can now manage secure multi-engine access to Snowflake-managed Iceberg tables directly in Horizon Catalog via embedded open APIs like Apache Polaris and the Iceberg REST Catalog. [00:15:51] This makes it easy for you to centralize in one catalog all of your Snowflake and Iceberg data. [00:15:57] This also makes it significantly easier to access Snowflake-managed Iceberg tables from external query engines that support the Iceberg REST protocol. [00:16:06] So instead of setting up separate Apache Polaris accounts, configuring the integration, managing a separate set of users and roles, and setting up different security configs depending on where you're accessing from, [00:16:17] you can now simply access the tables directly from Horizon Catalog in your Snowflake account. [00:16:22] External engines can read from or write to Snowflake-managed Iceberg tables. [00:16:26] Reads will be in public preview soon, and writes will be in private preview soon. [00:16:30] Now Snowflake remains committed to the open source community and continues to be a heavy contributor to both Apache Iceberg and Apache Polaris. [00:16:38] We're also extending our zero ETL data sharing capabilities to open table formats. [00:16:43] And this includes Apache Iceberg as well as Delta Lake tables, regardless of which metadata catalog the data is in. [00:16:50] This is generally available now. [00:16:52] Support for the latest Apache Iceberg v3 capabilities, many of which were contributed by Snowflake, such as new variant and geospatial data types, will open up even more use cases for your Iceberg tables. [00:17:03] And these are all in private preview today. [00:17:07] One of our customers, Indeed, is finding great success getting data into their organization's hands fast by using the combination of Iceberg and Horizon Catalog.
[00:17:16] It's really accelerated the way that they deliver value to their users. [00:17:21] Fast, though, isn't enough in delivering value. [00:17:24] Fast needs to be applied to transformation logic too. [00:17:27] We're hearing from many data engineers like you that building anything from simple to complex transformations, [00:17:34] it requires you to do a tremendous amount of work to get the right performance at the best price. [00:17:41] And some of you have even called this a bit of a dark art. [00:17:44] That's why in our end-to-end data engineering solution, we're building solutions that remove and streamline much of the development process, [00:17:52] so that you can focus on the outcomes you want and getting the data in the shape that you need it. [00:17:57] We want you to be able to work in the language of your choice, with Snowflake giving you huge boosts in performance, [00:18:03] without all of the tinkering that's required in many other solutions. [00:18:07] Now, let's start with a very simple form of this, declarative definitions and dynamic tables. [00:18:13] When you build a declarative pipeline, you simply define the desired state of your data using a standard SQL statement, [00:18:21] and Snowflake manages updating that table incrementally and all of the orchestration. [00:18:26] There's no work for you to do. [00:18:28] And these pipelines are fast. [00:18:30] They pair incredibly well with the Snowpipe streaming architecture we talked about earlier, [00:18:34] to deliver data for queries quickly. [00:18:36] And we've been working to improve the performance of new different types of use cases, [00:18:41] and pass those savings on to you. [00:18:43] Dynamic tables also now work on Apache Iceberg tables. [00:18:47] They can lock different portions of the output, which allows you to control costs with an immutable clause. 
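The idea of updating a table incrementally, rather than recomputing it from scratch, can be illustrated without any Snowflake-specific code. The sketch below is a toy model in plain Python: it keeps per-carrier cost totals and, on each refresh, folds in only the rows that arrived since the previous refresh. The class name, the append-only source, and the sample data are assumptions for illustration and do not describe how dynamic tables are implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class IncrementalTotals:
    """Toy incrementally maintained aggregate over an append-only source."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.watermark = 0  # number of source rows already folded in

    def refresh(self, source: List[Tuple[str, float]]) -> int:
        """Process only rows past the watermark; return how many were read."""
        new_rows = source[self.watermark:]
        for carrier, cost in new_rows:
            self.totals[carrier] += cost
        self.watermark = len(source)
        return len(new_rows)


# Illustrative data only: (carrier, delivery cost) pairs.
source = [("acme", 10.0), ("zenith", 5.0)]
view = IncrementalTotals()
first = view.refresh(source)    # reads both existing rows

source.append(("acme", 2.5))
second = view.refresh(source)   # reads only the one new row

print(first, second, dict(view.totals))
```

The second refresh touches one row instead of three, which is the cost advantage an incremental refresh aims for. A real system also has to handle updates and deletes in the source, which this sketch deliberately leaves out.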
[00:18:53] You can use Snowflake Cortex AI SQL, including LLM functions, in the select clauses for dynamic tables, [00:18:59] when you're doing incremental refreshes. [00:19:01] And all of this is available in dynamic tables, and in general availability right now. [00:19:09] Dynamic tables can be an incredibly powerful accelerator for your teams. [00:19:13] Customers like Travel Pass are using them to speed up not just development, [00:19:17] but also to lower the costs of running and operating their pipelines. [00:19:22] Now, often you need to go beyond a declarative pipeline. [00:19:25] But writing code to process data is hard. [00:19:28] And there are many languages and tool specific nuances you need to consider. [00:19:32] This is why we built and continue to invest in Snowpark. [00:19:35] With Snowpark, your code, whether it's Python, Scala, or Java, [00:19:39] it executes directly within a Snowflake warehouse. [00:19:42] We take care to make sure that your code executes efficiently, [00:19:45] right where your data lives. [00:19:47] This makes it ultimately a better experience for you, [00:19:50] while also saving time and money for your organization. [00:19:53] And we heard from a lot of organizations who had existing [00:19:57] Apache Spark data frame workloads. [00:19:59] And they wanted a more direct route to run that code on Snowflake [00:20:02] and get the benefits of the improved execution engine. [00:20:05] And so we built Snowpark Connect for Apache Spark. [00:20:08] This allows you to run your existing Spark data frames and Spark SQL code [00:20:13] directly on Snowflake without making any changes to the code. [00:20:16] You don't have to worry about the complexity of maintaining or tuning separate Spark environments. [00:20:21] You don't have to manage dependencies. [00:20:23] You don't have to worry about version compatibility and upgrades. 
[00:20:26] In fact, running your Spark code on our elastic Snowflake engine [00:20:30] delivers massive performance gains and lower and more predictable costs [00:20:34] than any managed Spark environment. [00:20:36] And Snowpark Connect for Apache Spark is now generally available to all of our customers. [00:20:41] We're doing a lot of work behind the scenes to ensure that the Snowflake engine [00:20:46] delivers on best-in-class price performance and ease of use. [00:20:50] Now, Tredence is a popular solution integrator [00:20:53] who has experience with many different data engineering platform vendors. [00:20:57] And they ran a comparison of Snowpark Connect with managed Spark and other vendors. [00:21:02] And across hundreds of sample queries, [00:21:04] they found that Snowflake was more than eight times faster [00:21:08] and three times less expensive than running the exact same jobs on managed Spark. [00:21:13] Which is a pretty incredible capability. [00:21:17] And these performance advantages [00:21:19] are starting to play out with many of our early users. [00:21:22] As an example, Booking.com took a large number of workloads [00:21:28] where they were running Spark jobs that took over an hour to run. [00:21:31] And they're seeing these run at least 70% faster [00:21:34] since they moved them over to Snowpark Connect. [00:21:36] That's a huge performance and savings boost. [00:21:39] And they found this to be a smooth transition, [00:21:42] even though they were one of the very first customers to adopt it. [00:21:47] It's hard to spark joy and accelerate your work [00:21:51] when you're bogged down by lots and lots of manual tasks. [00:21:54] And this is one of the other large problems that we hear from customers. [00:21:58] AI SQL is one of the most exciting new features we've introduced.
[00:22:03] And it makes it as simple as writing a SQL function [00:22:06] to tap into a lot of capabilities of powerful purpose-built models [00:22:10] and common large language models to use for common data tasks. [00:22:14] These AI SQL capabilities allow you to streamline workflows [00:22:17] and remove many different processing steps that you normally had to do manually. [00:22:22] We've also brought a lot of time-saving updates to our new IDE, Workspaces. [00:22:27] This file-based development environment provides an inline text-to-SQL copilot [00:22:32] along with Git integration for code management [00:22:34] and the ability to organize your workspace to your preferences. [00:22:38] And Workspaces is generally available. [00:22:43] In addition, we've added Cortex Code directly in Workspaces. [00:22:47] Cortex Code can help you understand your Snowflake usage, [00:22:50] optimize your complex queries, and fine-tune your results. [00:22:54] Cortex Code is in private preview. [00:22:57] And we've talked a lot about how to build some of these pipelines [00:23:00] and how these tools are going to help you move faster. [00:23:02] But again, let's see it for real. [00:23:04] This is the best way to go understand exactly what's happening. [00:23:07] I'm incredibly excited to welcome Shruti to the stage [00:23:10] to show us how this all works. [00:23:11] Shruti. [00:23:12] Hello, everyone. [00:23:16] And thank you, Chris. [00:23:18] My name is Shruti Anand. [00:23:19] I'm a product manager at Snowflake working on Snowpark Connect. [00:23:23] For the purpose of this demo, I am going to be a data engineer [00:23:26] working at Stratos Dynamics, helping my business insights teams [00:23:29] to understand which logistical delays are impacted by weather conditions. [00:23:33] I'm joined here by my partner, Vino, who is going to help run the demo. [00:23:38] So let's get into it.
[00:23:41] So as my friend Michael showed you how to unify data for your AI use cases, [00:23:48] I am a data engineer who is going to build on that data [00:23:51] and some of the real-time insights that have been gathered from OpenFlow [00:23:55] to run a bunch of my pipelines. [00:23:59] Before I do that, I'm also aware that some of this historical data [00:24:03] was already present in the lakehouse, which Michael already showed you [00:24:06] how to connect using your catalog-linked databases. [00:24:09] But the same team also owned a bunch of Spark pipelines. [00:24:13] Now, let's get into the demo. [00:24:16] As you can see, I have a Snowflake notebook open right here. [00:24:20] And I'm doing a bunch of already available aggregation functions [00:24:24] using Snowpark Connect on my Snowflake notebook. [00:24:27] You can see I'm looking at the liability risk. [00:24:30] I'm looking at destination facility. [00:24:32] All of this different information and trying to combine all of that [00:24:36] and creating a data frame, which basically shows me what destination facility, [00:24:41] what source was the delivery coming from, what is the cost of these deliveries, [00:24:46] all of that information. [00:24:48] And then finally, writing it to a table that will be called Deliveries at Risk. [00:24:57] So as you can see, all of this information is available. [00:25:00] The final part of the PySpark code that we're using Snowpark Connect for [00:25:04] is writing to a Snowflake table, Deliveries at Risk. [00:25:07] And you can see that table is already available in my object explorer [00:25:12] with all the information that I gathered through my analysis, [00:25:16] doing transformation using Snowpark Connect. [00:25:20] Now, once this information is available, I want to make sure that my business team [00:25:24] also has some real-time insights that are flowing in through my Slack messages [00:25:29] that I've already ingested using OpenFlow.
[00:25:32] But I want to make sure that my transformations are also appended [00:25:36] with those real-time vendor insights that are coming through these Slack messages. [00:25:42] So how do I go about it? [00:25:43] First off, I want to make sure that we use AI Redact in order to redact any sensitive information [00:25:50] that might be present in these vendor insights. [00:25:54] For example, email, name, all of these are very sensitive information. [00:25:58] So I'm going to use AI Redact in order to redact sensitive information [00:26:02] before I can run any analysis on these messages that are coming in via Slack. [00:26:07] Then I'm going to use the AI Extract function, which is a newly available AI SQL function, to extract real, meaningful, and structured insights. [00:26:18] As you can see on the screen, I'm already using that to gather a bunch of information that's related to carrier and vendor as well as delays. [00:26:26] I'm going to run this SQL quickly to join this with my previously available operational insights information. [00:26:35] Then I'm going to use dynamic tables with a very simple single line of SQL code to ensure that my data pipeline is running on a near real-time basis. [00:26:48] Because I want to make sure that my end users have information at their fingertips nearly every five minutes so that they can make the right decisions. [00:26:57] You can see right on the screen that I was able to create dynamic tables that join Slack insights from OpenFlow with historic logistical data from Iceberg tables. [00:27:06] Our data pipeline is a single line of SQL code. [00:27:09] Deliveries at risk enriched with the Slack insight message. [00:27:13] So, what did we learn from the demo? [00:27:15] We just demonstrated the flexibility and advanced capabilities of Snowflake for data transformation and intelligence.
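The redact-before-analysis step described in the demo can be illustrated with a much simpler stand-in. The sketch below is not the AI Redact function and makes no claim about how it works; it is a plain regular-expression filter that masks only email addresses, and the pattern, function name, and sample message are hypothetical. Names and other free-form sensitive values, which the demo also mentions, would need a model-based approach rather than a regex.

```python
import re

# Simplified email pattern for illustration; it will not match every valid
# address and is not a substitute for a model-based redaction service.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


# Hypothetical vendor message of the kind the demo ingests from Slack.
message = "Carrier delay on route 7, contact dana@example.com for details."
redacted = redact_emails(message)
print(redacted)
```

The ordering is the point of the example: redaction runs first, so every downstream step (extraction, joins, the dynamic table) only ever sees the masked text.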
[00:27:22] We successfully ran existing historic Apache Spark code against data in Iceberg without needing to manage any additional infrastructure, all reusing your existing Spark code. [00:27:34] And we were able to create dynamic tables to make sure that my business insight team always had the right information available on a near real-time basis. [00:27:44] We also used AI SQL tools like AI Redact and AI Extract in order to make sure that I'm not having any kind of sensitive information in my data. [00:27:56] I'm also extracting and appending any historical and real-time insights to the information and making sure my business team can make the right decisions. [00:28:05] Back to you, Chris. [00:28:08] Thank you, Shruti. [00:28:10] And we saw how easy it is to build a pipeline. [00:28:13] But what happens when you go from one pipeline to a hundred or a hundred to a thousand? [00:28:20] We are hearing from more and more of our customers about how complex their data engineering infrastructure is getting and how complex the world is that they have to operate in to feed all of these different AI models. [00:28:32] And that's why we're thinking about and building features that help you operate at scale. [00:28:39] Extending governance into observability, automating pipeline development with CI/CD, and we've been ensuring that the synchronized orchestration that you need is easy so that you can deliver trusted data for your consumers. [00:28:53] Snowflake is dedicated to continuously investing in an end-to-end data engineering platform to address all of these challenges. [00:29:00] From ingestion to AI and applications, Snowflake is simplifying your workflows and providing a scalable and cost-effective data engineering solution. [00:29:09] As one example, you can now run dbt Core natively inside Snowflake. [00:29:14] This means that you don't have to host and manage dbt yourself. [00:29:18] We host and maintain dbt Core for you.
[00:29:21] And this consolidation of systems makes it easier to debug by having a single platform. [00:29:26] And we've done this in partnership with dbt Labs, which means that we're working with them already to extend this beyond dbt Core and to integrate the new Fusion engine. [00:29:35] But dbt Core is now generally available with dbt Fusion to come. [00:29:40] There's a ton of excitement from the community about how this is going to streamline their collaboration and streamline their development. [00:29:47] This not only makes it faster for you to build and debug your pipelines, but also to deploy them and understand exactly what's going on across your entire infrastructure. [00:29:56] Now coming back to Horizon Catalog, it's also important for you to understand not just the governance of your data, but the quality of your data as well. [00:30:05] With our data metric functions, you can now monitor the state and integrity of your data directly within the platform. [00:30:11] This will help you understand and drill down into your data quality. [00:30:15] And to make this even easier, we've created a simple tab in Snowsight to show you exactly where your data quality issues are and what to do about them. [00:30:23] And this is now in public preview. [00:30:26] Now let's see how these tools can help you take your pipelines into production faster and with more confidence. [00:30:33] So please join me in welcoming Jeremiah Hanson to the stage. [00:30:36] Jeremiah. [00:30:37] Hey, thanks Chris. [00:30:38] All right. [00:30:39] Great to be here with everybody. [00:30:41] Thanks for having me. [00:30:42] My name is Jeremiah Hanson. [00:30:43] I'm on our applied field engineering team. [00:30:45] And I get the chance to work with customers on some of their most challenging data engineering tasks. [00:30:51] And with me, as you know, we have Vino running the demo for us.
[00:30:54] So going back to the use case here, Michael and Shruti from Stratos Dynamics showed us earlier how to ingest and transform the data. [00:31:05] But now we actually need to do the work of governing it, making sure it's accurate and up to date. [00:31:11] And to do that, we're going to leverage features in Horizon Catalog along with our new AI-powered coding agent, Cortex Code. [00:31:19] All right. [00:31:20] So with that, let's get into the demo, Vino. [00:31:23] Are you ready? [00:31:24] Let's go. [00:31:25] So the first thing we need to do is talk about governance. [00:31:29] We have sensitive shipment data here in the data set that we only want the analyst to be able to see the summarized version of. [00:31:37] We don't want them to get access to the raw tables. [00:31:39] So to do that, we're going to use Snowflake's RBAC or role-based access control. [00:31:44] But I don't always remember, if you're like me, the exact SQL syntax to use. [00:31:48] So we're going to use Cortex Code here to help us. [00:31:51] So you can see that Vino has put in a prompt. [00:31:53] And just like that, it's able to suggest the code that we can use to do this. [00:31:57] And we can review the code, and then we can accept it here into the Workspaces editor. [00:32:03] Cortex Code, like we talked about, is a new AI-powered coding assistant built right into Snowsight. [00:32:11] And with just a natural language prompt, we were able to generate all this code. [00:32:15] So let's actually go in now and test it. [00:32:17] The analyst should only be able to see the curated data, not the raw data. [00:32:21] And here you can see that they got an error querying the raw data, which is good. [00:32:27] The next thing we want to talk about is data quality, making sure our data is accurate. [00:32:31] So within Horizon Catalog, we have the new data quality page. [00:32:35] And there's a couple sections to this page. 
[00:32:37] The first one we're going to look at is the actual data profile page. [00:32:41] So here you're able to see your standard profiling information. [00:32:45] These are things like your null counts and your min/max values, top values, things like that. [00:32:50] So here we can understand the structure of our table and even find some things to look at at this point. [00:32:56] But we want to go a step further. [00:32:57] We actually want to enforce some policies around our data. [00:33:00] So to do that, we're going to create a data metric function or DMF. [00:33:04] And here Vino's got the SQL query for it. [00:33:07] So we're going to go ahead and create this. [00:33:09] We're going to look for a referential integrity constraint to make sure that we have data consistency between a couple of tables in our data model. [00:33:17] So here we've got that created. [00:33:22] Perfect. [00:33:23] And then we're showing here now the final piece, which is actually the monitoring part. [00:33:28] So now that we've got all of our data metrics created, we can actually come in and see how they're doing. [00:33:32] And here we're seeing the referential integrity check. [00:33:34] And we can actually see that it has failed over time. [00:33:37] So this is great. [00:33:38] We're able to see what's happening. [00:33:40] We could programmatically create processes around this as well, if we choose. [00:33:46] The next piece is alerts. [00:33:48] So we want to be able to be alerted if a dynamic table would happen to fail a refresh by chance. [00:33:54] And to do that, we're going to use, within Snowflake Trail, the alerts capability. [00:33:58] So we can do that. [00:33:59] And again, if you're like me, you often forget the exact SQL syntax to use. [00:34:04] So here again, we're going to leverage Cortex Code to help us generate the code for creating the alert and sending an email notification. 
[00:34:10] And here you can see, like that, it's suggesting all this code that we didn't have to create. [00:34:15] We can review it and accept it to our file. [00:34:20] And with that, then, we now have an alert set up. [00:34:23] This is an email alert, but we can easily extend this to support other things, like if you wanted to send a message to a cloud storage queue, or a Slack or Teams integration. [00:34:33] All right, the final piece is Git integration. [00:34:37] We have to talk about developer best practices. [00:34:40] And to do that, we really have to have our code checked into source control. [00:34:43] And we've been working the whole time here in Workspaces. [00:34:46] And so you're able to see that this has native Git integration. [00:34:49] It was built from the ground up with that in mind. [00:34:51] And so here, Vino created this workspace from an existing Git repository. [00:34:56] And you can see that here. [00:34:57] And this interface also lets you do things with your Git repo, like push and pull changes. [00:35:02] View diffs, look at all the -- we're going to do a push here. [00:35:08] So those files that we created through this process, Vino's going to be able to push those into the Git repo. [00:35:13] So she clicked on push, added a message, and now she's able to check that into her GitHub repository. [00:35:20] And we'll actually go out to GitHub and just confirm that that got added correctly. [00:35:25] And it did. [00:35:27] So that's great. [00:35:29] That allows us -- that's the foundation really for any automations we want to do with our code. [00:35:35] And actually, for learning more about DevOps and how to build CI/CD pipelines, join Vino and me this afternoon at the data engineering boot camp. [00:35:43] We'll go into that in detail. [00:35:44] All right. [00:35:46] So we saw a lot there. [00:35:48] That was quick. 
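The SQL for the referential integrity check at [00:33:09] was shown on screen but is not captured in the transcript. As a rough illustration of the idea only, and not Snowflake's data metric function syntax, the sketch below counts "orphaned" rows: child rows whose foreign key has no match in the parent table. It uses Python's built-in sqlite3 module, and the table names, column names, and the count_orphans helper are all hypothetical.

```python
import sqlite3


def count_orphans(conn, child, fk, parent, pk):
    """Count child rows whose non-null foreign key has no matching parent row.

    The table and column names are interpolated directly into the query, so
    they must be trusted identifiers (hypothetical names in this sketch).
    """
    query = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


# Hypothetical sample data: shipment 12 points at a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO shipments VALUES (10, 1), (11, 2), (12, 99);
    """
)

orphans = count_orphans(conn, "shipments", "customer_id", "customers", "customer_id")
print(f"orphaned shipment rows: {orphans}")
```

A metric like this returns a single number per run, which is what makes it easy to track over time and to alert on when it rises above zero.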
[00:35:49] We learned how to operate production-scale pipelines, how to integrate governance, observability, and DevOps into those all in a single platform. [00:35:57] And we learned how to use Cortex Code to help us do a lot of the heavy lifting and not have to write a lot of that SQL ourselves. [00:36:04] All right. [00:36:05] Back to you, Chris. [00:36:06] All right. [00:36:07] Thanks, Jeremiah. [00:36:08] That was fantastic. [00:36:09] Now, we opened today with the fact that many organizations are struggling to get AI into production and that your data foundation built to help data engineers is really critical to that success. [00:36:26] We saw how data engineers are critical to actually getting these pipelines in place and getting your organization ready for AI. [00:36:35] So I thought it would be helpful to actually bring someone out here who's part of the 4% that have shipped real AI products that are changing their business so they can talk a little bit about how they made that happen. [00:36:48] So please join me in welcoming to the stage Lakshman Pindakur, senior director of the Zealous Data Cloud at Zealous. [00:36:55] Lakshman, thank you for joining me today. [00:36:57] Hi, Chris. [00:36:58] Great to be here. [00:36:59] I'm very happy to be here and share some of the successes that we have, and hopefully this can inspire others to also successfully put it into production. [00:37:10] That's fantastic. [00:37:11] So let me start off with really why was it so important to build AI agents at Zealous? [00:37:19] Great question. [00:37:21] As the leader of Zealous Data Cloud, a modern data platform where we have been working for the last few years to centralize a lot of data into this data cloud. [00:37:32] This also means that a lot more questions are coming in to our insights team because of all this data that is now visible and available for a lot more users. [00:37:45] This means you have two options. 
[00:37:47] One, hey, I can have an army of analysts on our insights team. [00:37:51] Or two, can I use some of these AI features and some of the native features that are now available in Snowflake to be able to provide these insights more real time? [00:38:01] Getting the agent was the first stop, and this has now dramatically reduced the number of questions that come to our insights team. [00:38:11] So this is why we feel this is very impactful. [00:38:15] That's fantastic. [00:38:16] What was it that made you able to get to this point successfully? [00:38:21] Because we hear from lots of customers who struggle to get to exactly where you are. [00:38:26] First, we had a strong foundation because of the effort that we had put in to migrate all of our legacy warehouses, all the structured data that we had in our transformation and in our curated datasets. [00:38:41] It was so easy for us to, as soon as the semantic view was available, and we already had built the semantic models or layer in our BI tools, it was so easy for us to take that and create the semantic view. [00:38:58] And as soon as Snowflake Intelligence and Cortex Agents were available, I think it came out probably on a Tuesday, and then we pinged our account team and said, "Hey, can you please enable this for us?" [00:39:10] And they enabled it in a day, and we then were able to create the semantic view and the agent, provide it to our insights team by Wednesday, and then they tested it out. [00:39:22] They were looking at it and said, "Hey, this is awesome. [00:39:25] This is very helpful. [00:39:26] What are we waiting for? [00:39:27] Let's move it to production." [00:39:29] And we were in production on Friday. [00:39:31] That was the speed with which we were able to move, and that's also because of all the native features and not a heavy lift that we had to do. 
[00:39:42] And once we had this, what we then did was, we actually have a Snowflake office hour every Friday, and the following Friday, we actually did a demo of this saying, "Hey, this is now available for all the other departments." [00:39:57] And now another, you know, pinged us and said, "Hey, can we create one?" [00:40:02] And we actually, you know, one of our engineers got into a call, and he said, "Hey, do you want to see something?" [00:40:09] And they were thinking, "All right, it'll be a long discussion about requirements and things like that." [00:40:14] And actually, the agent was already available. [00:40:16] So they just almost fell off of the chair. [00:40:18] And they are now able to test it and move that also into production. [00:40:24] This is how we have been able to scale up because of all the effort that we are putting in to put all the structured data. [00:40:29] And we now were able to move from one to two agents to the 20-plus agents that we have now. [00:40:36] There's something incredible in what you're saying, too, around the data team moving from having to respond to requests that you're sort of getting constantly from business teams to actually proactively going out and demoing them now what's possible and what they can do. [00:40:50] And I'm curious about, you know, this required, as you mentioned, having the right foundation in place. [00:40:55] That's really what enabled you to move so quickly. [00:40:58] So I'm curious about how are you evolving your data engineering teams, your data engineering workflows, and how are you kind of operating on that to be able to scale your team? [00:41:06] Again, that's a great question. [00:41:08] So this is all good for what we already have. [00:41:12] And now, you know, because we, you know, this started from our data warehousing modernization. [00:41:17] It was, again, the batch, as you were mentioning in, you know, your earlier, you know, segment. [00:41:21] We wanted to like, it was great. 
[00:41:24] Now we wanted to go real time. [00:41:25] And this is where we're looking at all the different tools and like, you know, and here comes Openflow. [00:41:30] And we, you know, started using Openflow on our AWS Postgres, and we were able to get real time of around 40 data marts. [00:41:39] And it was so easy. [00:41:40] Of course, we worked with the, you know, Snowflake product team, Openflow product team to, you know, work, you know, great for us. [00:41:48] And with this, it was so easy that our engineer just said, it's very difficult to go back to how we were doing it before. [00:41:56] And now we are moving to, you know, what we call as Ingestion Framework 3.0, which is how Openflow is quickly becoming our, you know, chosen tool for bringing all the data into Snowflake. [00:42:08] We now, you know, we can add Horizon to it. [00:42:12] And this Ingestion 3.0, again, it's like, you know, get all the data, all the different patterns or velocity, and, you know, use Horizon, you know, catalog it, categorize it, and anonymize whether masking or encryption. [00:42:25] And it's all there. [00:42:26] And we can now also build data contracts with RBAC. [00:42:29] And now we can quickly enable our users. [00:42:32] Amazing. [00:42:33] Absolutely amazing. [00:42:34] So, you've talked a lot about building this, and, you know, we've talked a lot about building this. [00:42:37] And building the pipelines and running this. [00:42:39] But how are you actually helping your developers improve their productivity? [00:42:43] This is, again, a great question. [00:42:45] So, what we did was we actually used, you know, Snowflake, you know, Copilot for, and one of the, you know, we had a use case where it's like, you know, we had three meetings to talk about what we had to do. [00:42:55] And our engineer gets on, fourth meeting, uses Copilot, ingestion, done. [00:43:01] So, that's the speed with which we were able to move. 
[00:43:04] And then we also started using SnowConvert for migrating some of our legacy workloads, which also is, you know, is great. [00:43:11] And the third one, we also looked at Snowflake documentation, which is now available in Snowflake Intelligence. [00:43:17] And not a lot of people realize that actually that is Cortex, you know, that's actually, it's on steroids. [00:43:27] Copilot on steroids is what we call. [00:43:29] And we also have named it as Snowmate. [00:43:31] It's like a buddy that you'll have for your engineer. [00:43:35] You ask a question, it looks at documentation, not only provides whether this feature is available, it also gives you the code. [00:43:41] And we, again, use this for the same example I had, not only for the ingestion, we were also able to use this for all the transformations. [00:43:48] And we were done with the migration in a very short time. [00:43:51] Absolutely amazing. [00:43:52] And I love how this is stitched together, not just how do you help the developers be more productive, how do you build the right foundation? [00:43:58] And then really how that's helped you enable and activate AI. [00:44:01] So my last question for you is, now that you've gotten to this place that you're clearly rightfully proud of, what advice would you give to the audience here about what they should do to really be successful at bringing AI into their organizations? [00:44:16] Yeah, I realize that each one will be in a different, you know, space in where they are in their development lifecycle. [00:44:23] What I would say is start small, you know, work with your business users. [00:44:27] You want to make them successful. [00:44:29] And, you know, as soon as you get the stakeholder buy-in, it's very easy to scale from there. [00:44:34] That is what I would say. [00:44:35] Start small, you know, small successes, stepping stone to bigger successes. [00:44:39] Fantastic. [00:44:40] Well, hopefully we'll see more big successes like yours. 
[00:44:42] So, Lakshman, thank you so much for joining us today. [00:44:44] Thank you, Chris. [00:44:45] I really appreciate it. [00:44:46] Thank you. [00:44:47] I really love how it kind of all comes together like that and allows us to really build this in a single way. [00:44:53] One of the things I want to do as we sort of come together to close is we've spent a lot of time talking about how data engineers are becoming more visible, are becoming more important, are driving bigger business outcomes. [00:45:06] And I realize as we've gone through this whole day, we've had a data engineer up here who's been hiding a little bit back in the shadows. [00:45:12] So, what I'd like to do is bring Vino up to the stage, out from behind the computer, but to come up and join me here. [00:45:20] And I want to talk a little bit about your actual job as a data engineer and how a lot of the things that we're doing here have impacted you directly. [00:45:29] So, maybe we can start, Vino, I know that you've worked as a data engineer at a variety of companies, at Fortune 500s, at startups, at a lot of different places. [00:45:38] You also spent a lot of time with these features and you built every single demo that we saw today. [00:45:43] So, I think maybe we can start by just asking a little bit about what do you see as kind of some of the most exciting things and what really helped you kind of tackle and go after a lot of what we built today? [00:45:54] Firstly, excited to be here and also shout out to all data engineers for coming out of the shadows and getting the limelight they all deserve. [00:46:01] And talking about, you know, what I built today, listening to you talk about how AI is powered by data. [00:46:09] And I want to say that at the end of the day, as a data engineer, AI also helps me a lot, starting with the AI SQL functions we used today. [00:46:19] Can you believe if I said during my grad school, I spent an entire internship building a data pipeline to anonymize data. 
[00:46:26] And today, all I had to do was write that one single SQL function that can do that for me. [00:46:32] And that, I absolutely love how we're bringing out the AI capabilities over to data engineers and making our lives easier and less chaos, like one SQL function at a time. [00:46:43] And I also want to definitely shout out to CortexCode. [00:46:46] And I know many of our data engineers here would relate to it because there are some functions irrespective of if you use it 100 times a day, you could never remember the syntax, right? [00:46:56] What argument does it take and what order do they go in? [00:46:59] Just the fact that I could talk to CortexCode in natural language and it would generate the code for me and I could just review it. [00:47:05] I don't have to blindly accept and run everything, but still have the option to just take a quick peek and make sure if it is what I want it to be and then just run it. [00:47:13] Absolutely loved it. [00:47:15] So I really want to highlight that as much as we talk about how data powers AI, as data engineers, we also want to embrace AI and use that in our workflows to make it more powerful and easier. [00:47:27] Fantastic. [00:47:28] So maybe I'm curious also about the operational side. [00:47:31] So we've talked a lot about zero ops and kind of minimizing the work that data engineers have to do to keep these running. [00:47:36] Can you tell a little bit about how that's kind of impacted you and how you've taken advantage of it? [00:47:40] Yep. [00:47:41] Today, for the example of Stratos Dynamics, the data was everywhere. [00:47:45] We had SharePoint data, like contracts in SharePoint. [00:47:49] We had MySQL, like transactional data coming in. [00:47:52] And we also had Slack messages and emails and customer call transcripts and voice. [00:47:56] Data literally was everywhere. [00:47:59] And you saw all the demos, right? 
[00:48:01] Did you see me at any point go under the hood, touch anything on infrastructure or like really tune the job or like do anything at all on the infrastructure level? [00:48:11] All the complexities were abstracted away for me. [00:48:14] So as a data engineer, I think I am 100% in for zero ops data engineering and let's do more of it. [00:48:22] Fantastic. [00:48:23] One of the ones I wanted, I was curious about too, is we've talked a bunch about interoperability, both on the Iceberg side, but also on the Apache Spark side as well. [00:48:32] And I'm curious about how that, how you think that's kind of changed the role of data engineers and the work that folks are doing. [00:48:37] I think most of the time when we talk about data, we almost always think about ingesting or moving data from one place to another. [00:48:45] And data gravity is real. [00:48:47] You do not want to be just moving things around. [00:48:49] It's not fun. [00:48:50] Ask a data engineer, they will tell you. [00:48:52] So what I really loved about today was that we had data everywhere, but I don't have to necessarily move all of the data into Snowflake. [00:48:59] Thanks to the power of Apache Iceberg tables and catalog-linked databases by Snowflake, I could connect to the data anywhere [00:49:06] and everywhere that is supported by Iceberg table format, of course, and just be able to use the data from across all of the data estate and still really leverage Snowflake's powerful compute engine to really derive meaningful insights and help my businesses is absolutely fantastic. [00:49:24] And did I even tell you that I was running a bunch of Spark code out there? [00:49:27] I didn't even have to write new Snowpark or Snowflake functions. [00:49:31] Take your existing Spark code, Snowflake handles the rest of it. [00:49:35] Absolutely loved it. [00:49:36] It has been a delight building all of that. [00:49:38] It's been amazing. [00:49:39] And the demos are very great. 
[00:49:40] So we talked about how AI is improving your life as a data engineer. [00:49:43] We talked about how the interoperability has been helping a lot. [00:49:45] And we talked about how it's kind of removed the operational burden. [00:49:48] Anything else that you want to make sure that you say to all the data engineers who are paying attention figuring out how to do their jobs and their careers? [00:49:55] I think this is it. [00:49:56] I think one probable thing is, as data engineers, I don't necessarily think we all follow the software engineering best practices. [00:50:03] I am guilty of working with production data and maybe deleted a couple of tables here and there, but I'm sure the other data engineers out there too. [00:50:10] So if you want to use Snowflake workspaces, that probably is like unsung hero in today's demo. [00:50:16] I would want to call out, let's embrace more and more software engineering best practices and make our lives more structured and less chaotic and embrace AI and really go for it. [00:50:26] Thank you, Vino. I really appreciate you coming up and talking about this and for all the amazing demos today. [00:50:30] So thank you. [00:50:31] Of course. [00:50:34] Bringing this all together, I think the most important part is to remember that there really is no AI strategy without a data strategy. [00:50:42] Data engineers are critical to enabling successful AI projects across every organization. [00:50:49] And at Snowflake, we are committed to providing an end-to-end platform that simplifies the path all the way from raw data through to AI-powered outcomes. [00:51:00] Please enjoy the rest of this day of content. [00:51:03] We have a ton of really exciting sessions for you and we can't wait to help build some incredible experiences, pipelines, and AI insights with every single one of you. [00:51:12] Thank you and have a great day. [00:51:19] Thank you.
